CN116110437A - Pathological voice quality evaluation method based on fusion of voice characteristics and speaker characteristics - Google Patents

Pathological voice quality evaluation method based on fusion of voice characteristics and speaker characteristics

Info

Publication number
CN116110437A
Authority
CN
China
Prior art keywords
voice
layer
pathological
fusion
speaker
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310395720.1A
Other languages
Chinese (zh)
Other versions
CN116110437B (en)
Inventor
张涛
侯晓慧
刘赣俊
赵鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202310395720.1A priority Critical patent/CN116110437B/en
Publication of CN116110437A publication Critical patent/CN116110437A/en
Application granted granted Critical
Publication of CN116110437B publication Critical patent/CN116110437B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/66 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/48 Other medical applications
    • A61B5/4803 Speech analysis specially adapted for diagnostic purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Public Health (AREA)
  • Theoretical Computer Science (AREA)
  • Molecular Biology (AREA)
  • Surgery (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Animal Behavior & Ethology (AREA)
  • Veterinary Medicine (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Epidemiology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a pathological voice quality evaluation method based on the fusion of voice features and speaker features. Pathological voice is input, two kinds of voice features, a spectrogram and mel-frequency cepstral coefficients, are extracted and fused; the fused voice features are taken as input for temporal-information extraction and frame-level score prediction; the mel-frequency cepstral coefficient features are taken as input for speaker feature extraction; and the voice features obtained after the temporal-information extraction and the speaker features are taken as input and fused to predict the utterance-level quality score. By extracting voice features and speaker features from pathological voice, fusing them and finally predicting the score, the invention finds the mapping between pathological voice and its corresponding subjective quality score and achieves an objective, quantitative evaluation of pathological voice quality.

Description

Pathological voice quality evaluation method based on fusion of voice characteristics and speaker characteristics
Technical Field
The invention belongs to the technical field of pathological voice quality evaluation, and particularly relates to a pathological voice quality evaluation method based on fusion of voice characteristics and speaker characteristics.
Background
With the accelerating pace of modern life, poor vocal habits, unhealthy lifestyles and vocal abuse have become increasingly common, raising the incidence of voice disorders. Voice disorders hinder communication, isolate individuals from society and can even lead to conditions such as depression, greatly affecting both physical and mental health. Effective diagnosis and treatment of voice disorders is therefore receiving growing attention, and accurate, quantitative evaluation of pathological voice quality plays an important role in that diagnosis and treatment.
Pathological voice quality evaluation is a new direction in the field of voice disorder research: a quality score is derived from analysis of the pathological voice signal. The main approaches fall into subjective perceptual evaluation and objective acoustic analysis. Clinically, pathological voice diagnosis currently relies mainly on subjective perceptual evaluation, in which several clinicians listen to the voice produced by the patient and assign mean opinion score (MOS) ratings according to a subjective rating standard in order to measure the pathological voice quality. Such evaluation is, however, subjective and inconsistent, because clinicians differ in experience and auditory perception, and it suffers from poor repeatability and high cost.
Objective acoustic evaluation can be further divided into reference-based and reference-free modes. A reference-based evaluation requires both the original voice and the distorted voice and demands strict alignment between them, which limits its use in real life, whereas a reference-free evaluation obtains the quality score from the distorted voice alone. In recent years, advances in deep learning have made accurate, end-to-end reference-free evaluation feasible, but research in this direction has so far focused on normal voices.
Current research on pathological voice concentrates on distinguishing healthy from pathological voices and classifying pathological symptoms; work on the perceived quality of pathological voice is scarce. Some studies have explored the relationship between objective parameters and pathological voice quality, but such indicators support only qualitative analysis. Few researchers have quantitatively evaluated pathological voice quality with deep learning techniques.
Therefore, exploring with deep learning the mapping between pathological voice and its subjective MOS score, and building a deep-learning-based pathological voice quality evaluation model, has significant research value and practical application value.
Disclosure of Invention
In view of the above, the invention provides a pathological voice quality evaluation method based on the fusion of voice features and speaker features, in order to overcome the poor repeatability and high cost of subjective quality evaluation and to fill the gap left by objective methods in the quantitative evaluation of pathological voice quality.
In order to achieve the above purpose, the technical scheme of the invention is realized as follows:
a pathological voice quality evaluation method based on the fusion of voice characteristics and speaker characteristics comprises the following steps:
step 1: inputting pathological voice, extracting two voice features of a spectrogram and a Mel frequency cepstrum coefficient, and carrying out feature fusion;
step 2: taking the fused voice characteristics as input, extracting time information and predicting frame-level scores;
step 3: taking the voice characteristics of the mel frequency cepstrum coefficient as input to extract the characteristics of a speaker;
step 4: and (3) taking the voice characteristics obtained after the time information extraction in the step (2) and the speaker characteristics obtained in the step (3) as inputs, and carrying out characteristic fusion to obtain the prediction of the speech-level quality score.
Further, in the step 1,
extracting the spectrogram voice feature comprises framing the pathological voice, applying a Hamming window and performing a short-time Fourier transform to obtain the magnitude spectrogram of the voice signal;
extracting the mel-frequency cepstral coefficient voice feature comprises applying p mel filters to the pathological voice x to obtain m mel-frequency cepstral coefficients per frame.
Further, in the step 1, the feature fusion of the two kinds of voice features includes:
inputting the obtained spectrogram into a convolution module formed by stacking k convolution layers to finally obtain a feature vector C_k for each frame; the process is expressed by the following formulas:

C_1 = φ( W_1 * U + b_1 )
C_l = φ( W_l * C_{l-1} + b_l ),   l = 2, ..., k

wherein U is the input spectrogram of the pathological voice, φ denotes the ReLU activation function, W_1 and b_1 are the weight and bias parameters of the first convolution layer, whose output is C_1; W_l and b_l are the weight and bias parameters of the l-th convolution layer, C_{l-1} is the output of the (l-1)-th convolution layer, and C_k is the output of the k-th convolution layer;
C_k is concatenated with the m-dimensional mel-frequency cepstral coefficients obtained for the corresponding frame to obtain the preliminarily fused frame-level feature F.
Further, the step 2 specifically includes:
Step 201: the fused frame-level feature F is first taken as input and passed through an input embedding layer to obtain E, which has three dimensions [n, u, v], meaning: the number of vectors in a batch n, the number of values in each vector u and the embedding dimension of each value v;
Step 202: E is then position-encoded to obtain PE; the position-encoding formulas are

PE(pos, 2t) = sin( pos / 10000^(2t/v) )
PE(pos, 2t+1) = cos( pos / 10000^(2t/v) )

wherein pos is the position of each value in the input vector and 2t is the dimension index within the encoding vector of each value;
E and PE are then added dimension by dimension to obtain a new vector E_p carrying the position-encoding information;
Step 203: E_p is sent to an M-head self-attention module to obtain A, whose dimensions are the same as those of E_p, according to the following formulas:

Q_t = E_p W_t^Q,   K_t = E_p W_t^K,   V_t = E_p W_t^V

head_t = softmax( Q_t K_t^T / sqrt(d_k) ) V_t
A = Concat( head_1, ..., head_M )

wherein W_t^Q, W_t^K and W_t^V are the weight matrices of the t-th head of the M-head attention mechanism, d_k is the dimension of the resulting K vectors, head_t is the output of the attention module of the t-th head, and A is the total output of the M-head attention mechanism;
Step 204: A and E_p are combined through a residual connection to obtain R_1, and R_1 is then batch-normalized to obtain B_1; the process is computed as

R_1 = E_p + A
B_1 = BatchNorm( R_1 )

Step 205: B_1 is passed through a feed-forward computation to obtain D, computed as

D = φ( B_1 W_F )

wherein W_F is a weight matrix and φ denotes the ReLU activation function;
Step 206: D and B_1 are combined through a residual connection to obtain R_2, and R_2 is then batch-normalized to obtain B_2; the process is computed as

R_2 = B_1 + D
B_2 = BatchNorm( R_2 )

Step 207: B_2 is passed through the above encoding repeatedly, N encodings in total, to obtain the frame-level feature H;
Step 208: H is passed through a fully connected layer to obtain the frame-level scores p:

p = FC( H )

wherein FC denotes a fully connected layer.
Further, the step 3 includes:
the m-dimensional mel-frequency cepstral coefficient features of each frame are input into a multi-layer time-delay neural network (TDNN); while taking the context information between adjacent frames into account, the TDNN aggregates the frame-level features into one utterance-level feature; the mean and variance of the outputs of the last TDNN layer are computed and concatenated, and the result is then passed through further network layers to obtain the speaker feature vector x_v.
Further, in the step 4, feature fusion is performed on the voice feature obtained after the temporal-information extraction in the step 2 and the speaker feature obtained in the step 3 as inputs, including:
the frame-level feature H is subjected in turn to dimension permutation, adaptive average pooling and dimension permutation to obtain q, computed as

q = Permute( AvgPool( Permute( H ) ) )

wherein Permute denotes the exchange of dimensions of H and AvgPool denotes the adaptive average-pooling operation;
the feature vector q and the speaker feature x_v are concatenated to obtain the preliminarily fused utterance-level feature vector s:

s = Concat( q, x_v )

s is then taken as input and fed into a convolution module formed by stacking k convolution layers to obtain the high-dimensional utterance-level feature G_k; the process is expressed by the following formulas:

G_1 = φ( W_1 * s + b_1 )
G_l = φ( W_l * G_{l-1} + b_l ),   l = 2, ..., k

wherein φ denotes the ReLU activation function, W_1 and b_1 are the weight and bias parameters of the first convolution layer, whose output is G_1; W_l and b_l are the weight and bias parameters of the l-th convolution layer, G_{l-1} is the output of the (l-1)-th convolution layer, and G_k is the output of the k-th convolution layer.
Further, in the step 4, the prediction of the utterance-level quality score includes:
G_k is subjected in turn to dimension permutation and adaptive average pooling to obtain the final score P, computed as

P = AvgPool( Permute( G_k ) )

wherein Permute denotes the exchange of dimensions and AvgPool denotes the adaptive average-pooling operation.
Compared with the prior art, the pathological voice quality evaluation method based on the fusion of voice features and speaker features has the following advantages: voice features and speaker features are extracted from the pathological voice and fused, and the score is then predicted, so that the mapping between the pathological voice and its corresponding subjective quality score is found, achieving an objective, quantitative evaluation of pathological voice quality.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention. In the drawings:
FIG. 1 is a schematic diagram of the overall process of the present invention;
FIG. 2 is a schematic diagram of an X-Vector extraction according to the present invention;
FIG. 3 is a schematic diagram of an encoder network architecture according to the present invention;
FIG. 4 is a graph comparing the fitting results of the present invention and of MOSNet to the subjective MOS scores;
FIG. 5 is a graph of the fitting results of the present invention to the subjective quality scores.
Detailed Description
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other.
In the description of the present invention, it should be understood that the terms "center", "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first", "a second", etc. may explicitly or implicitly include one or more such feature. In the description of the present invention, unless otherwise indicated, the meaning of "a plurality" is two or more.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art in a specific case.
The invention will be described in detail below with reference to the drawings in connection with embodiments.
As shown in FIG. 1, the invention provides a pathological voice quality evaluation method based on the fusion of voice features and speaker features. Compared with previous methods, it extracts both voice features and speaker features from the pathological voice and fuses them, so that the network can better extract the effective information contained in the pathological voice. The invention takes pathological voice as input and outputs the corresponding quality score, i.e. the MOS score; it is end-to-end and accurate and has great practical significance.
The method of the invention generally comprises the following steps:
and step 1, extracting and fusing voice features. And inputting pathological voice, extracting various voice features, and carrying out feature fusion. The invention takes spectrograms and Mel Frequency Cepstrum Coefficients (MFCC) as voice features which need to be extracted from original pathological voice, and takes a multi-layer convolution structure as an example to perform feature fusion.
And 2, predicting the frame level score. And (3) taking the fused voice characteristics obtained in the step (1) as input, and extracting time information and predicting frame fraction. The invention takes the encoder as a time sequence processing model to extract the time sequence information of the fused voice characteristics, thereby realizing the prediction of pathological voice frame fraction.
And 3, extracting the speaker characteristics. And (3) taking the Mel Frequency Cepstrum Coefficient (MFCC) extracted in the step (1) as input to extract the speaker characteristics. The invention uses X-Vector as the speaker characteristic to increase the information representing the speaker identity.
And 4, predicting the speech level score. And (3) taking the voice characteristics obtained after the time information is extracted in the step (2) and the speaker characteristics obtained in the step (3) as inputs, and carrying out characteristic fusion to finally obtain the speech-level quality score. The invention takes a multilayer convolution structure as an example to perform feature fusion; taking the adaptive pooling operation as an example, the speech-level quality score is predicted.
In one embodiment, the voice feature extraction and fusion in step 1 of the present invention specifically includes:
1. Extraction of voice features
The input pathological voice x is processed in the following two ways to obtain two voice features: the spectrogram and the mel-frequency cepstral coefficients (MFCC).
1) Spectrogram: the pathological voice x is framed, windowed with a Hamming window and transformed with a short-time Fourier transform to obtain the magnitude spectrogram U of the voice signal.
2) Mel-frequency cepstral coefficients: for the pathological voice x, p mel filters are used to obtain m MFCC coefficients per frame.
2. Fusion of voice features
The obtained spectrogram U is input into a convolution module formed by stacking k convolution layers, finally yielding a feature vector C_k for each frame. The process can be expressed by the following formulas:

C_1 = φ( W_1 * U + b_1 )
C_l = φ( W_l * C_{l-1} + b_l ),   l = 2, ..., k

where U is the input spectrogram of the pathological voice, φ denotes the ReLU activation function, W_1 and b_1 are the weight and bias parameters of the first convolution layer, whose output is C_1; W_l and b_l are the weight and bias parameters of the l-th convolution layer, C_{l-1} is the output of the (l-1)-th convolution layer, and C_k is the output of the k-th convolution layer.

C_k is concatenated with the m-dimensional MFCC of the corresponding frame to obtain the preliminarily fused frame-level feature F.
In one embodiment, in step 2 of the present invention, the prediction of the frame level score specifically includes:
The fused frame-level feature F is taken as input to a temporal processing module, the encoder, which extracts temporal information from F to obtain the frame-level feature H. The encoder network architecture is shown in FIG. 3. The detailed description of this process is as follows:
1) F is first taken as input and passed through an input embedding layer to obtain E, which has three dimensions [n, u, v]: the number of vectors in a batch n, the number of values in each vector u and the embedding dimension of each value v.
2) E is then position-encoded to obtain PE. The position-encoding formulas are

PE(pos, 2t) = sin( pos / 10000^(2t/v) )
PE(pos, 2t+1) = cos( pos / 10000^(2t/v) )

where pos is the position of each value in the input vector and 2t is the dimension index within the encoding vector of each value.
E and PE are then added dimension by dimension to obtain a new vector E_p carrying the position-encoding information.
3) E_p is sent to an M-head self-attention module to obtain A, whose dimensions are the same as those of E_p. The formulas are:

Q_t = E_p W_t^Q,   K_t = E_p W_t^K,   V_t = E_p W_t^V

where W_t^Q, W_t^K and W_t^V are the weight matrices of the t-th head of the M-head attention mechanism.

head_t = softmax( Q_t K_t^T / sqrt(d_k) ) V_t
A = Concat( head_1, ..., head_M )

where d_k is the dimension of the resulting K vectors, head_t is the output of the attention module of the t-th head, and A is the total output of the M-head attention mechanism.
4) A and E_p are combined through a residual connection to obtain R_1, and R_1 is then batch-normalized to obtain B_1. The process is computed as

R_1 = E_p + A
B_1 = BatchNorm( R_1 )

5) B_1 is passed through a feed-forward computation to obtain D, computed as

D = φ( B_1 W_F )

where W_F is a weight matrix and φ denotes the ReLU activation function.
6) D and B_1 are combined through a residual connection to obtain R_2, and R_2 is then batch-normalized to obtain B_2. The process is computed as

R_2 = B_1 + D
B_2 = BatchNorm( R_2 )

7) B_2 is passed through the above encoding repeatedly, N encodings in total, to obtain H.
8) H is passed through a fully connected layer to obtain the frame-level scores p:

p = FC( H )

where FC denotes a fully connected layer.
In one embodiment, in step 3 of the present invention, the extracting of the speaker characteristic includes:
For the input pathological voice x, the speaker identity feature, the X-Vector, is obtained by the following method, as shown in FIG. 2.
The m-dimensional MFCC features of each frame are input into a multi-layer time-delay neural network (TDNN), which aggregates the frame-level features into one utterance-level feature while taking the context information between adjacent frames into account; the mean and variance of the outputs of the last TDNN layer are computed and concatenated, and the result is then passed through further network layers to obtain the speaker feature vector x_v.
In one embodiment, in step 4 of the present invention, the utterance-level score prediction includes:
1. Fusion of the voice features and the speaker features
The frame-level feature H is subjected in turn to dimension permutation, adaptive average pooling and dimension permutation to obtain q, computed as

q = Permute( AvgPool( Permute( H ) ) )

where Permute denotes the exchange of dimensions of H and AvgPool denotes the adaptive average-pooling operation.

The feature vector q and the speaker feature x_v are concatenated to obtain the preliminarily fused utterance-level feature vector s:

s = Concat( q, x_v )

s is then fed into a convolution module formed by stacking k convolution layers to obtain the high-dimensional utterance-level feature G_k. The process can be expressed by the following formulas:

G_1 = φ( W_1 * s + b_1 )
G_l = φ( W_l * G_{l-1} + b_l ),   l = 2, ..., k

where φ denotes the ReLU activation function, W_1 and b_1 are the weight and bias parameters of the first convolution layer, whose output is G_1; W_l and b_l are the weight and bias parameters of the l-th convolution layer, G_{l-1} is the output of the (l-1)-th convolution layer, and G_k is the output of the k-th convolution layer.

2. Prediction of the utterance-level score
G_k is subjected in turn to dimension permutation and adaptive average pooling to obtain the final score P, computed as

P = AvgPool( Permute( G_k ) )

where Permute denotes the exchange of dimensions and AvgPool denotes the adaptive average-pooling operation.
The implementation of the invention is illustrated by specific examples.
1. Speech feature extraction and fusion
1. Extraction of speech features
1) Extracting the spectrogram of the pathological voice: the pathological voice data are read at a sampling rate of 16 kHz, framed with a frame shift of 256 samples, and windowed with a Hamming window of length 512. A 512-point short-time Fourier transform is performed on each windowed segment of the pathological voice data to obtain the magnitude spectrum of each frame.
2) Extracting the mel-frequency cepstral coefficients of the pathological voice: following the MFCC extraction procedure described above, 13 mel-frequency cepstral coefficients are extracted per frame using 40 mel filters.
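The following is a minimal sketch of this feature-extraction step, assuming librosa is used; the function name and the STFT/MFCC options not stated in the patent (for example frame centering and magnitude scaling) are assumptions.

```python
# Sketch of step 1 feature extraction (magnitude spectrogram + MFCC), assuming librosa.
# Parameters follow the embodiment: 16 kHz audio, 512-point FFT/window,
# frame shift (hop) of 256 samples, Hamming window, 40 mel filters, 13 MFCCs.
import librosa
import numpy as np

def extract_features(wav_path: str):
    y, sr = librosa.load(wav_path, sr=16000)           # read pathological voice at 16 kHz
    stft = librosa.stft(y, n_fft=512, hop_length=256,
                        win_length=512, window="hamming")
    spectrogram = np.abs(stft)                          # magnitude spectrogram, shape (257, T)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_mels=40, n_fft=512,
                                hop_length=256, window="hamming")
    return spectrogram, mfcc                            # mfcc shape (13, T)
```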
2. Fusion of speech features
1) The obtained magnitude spectrogram is input into a convolution module composed of 4 convolution layers. The parameter size of layer 1 is 3×3×16 (3×3 convolution kernel, 16 output channels); layer 2 is 3×3×32 (3×3 kernel, 32 output channels); layer 3 is 3×3×64 (3×3 kernel, 64 output channels); layer 4 is 3×3×128 (3×3 kernel, 128 output channels). The activation function of each of the 4 convolution layers is the ReLU function.
2) The output of the convolution module is reshaped to obtain a 512-dimensional vector for each frame, which is concatenated with the 13-dimensional MFCC of the corresponding frame to obtain a 525-dimensional vector per frame.
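A possible PyTorch sketch of this fusion module is given below. The patent does not state how the 128-channel convolution output is reduced to 512 dimensions per frame, so the pooling of the frequency axis to 4 bins (128 x 4 = 512), the padding and the class and variable names are assumptions.

```python
import torch
import torch.nn as nn

class SpectroFusion(nn.Module):
    """Four 3x3 conv layers over the spectrogram (treated as a 1-channel image of
    shape freq x time), followed by frame-wise concatenation with the MFCC."""
    def __init__(self):
        super().__init__()
        chans = [1, 16, 32, 64, 128]                      # output channels per layer (embodiment)
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(cin, cout, kernel_size=3, padding=1), nn.ReLU()]
        self.convs = nn.Sequential(*layers)
        self.freq_pool = nn.AdaptiveAvgPool2d((4, None))  # assumption: 128 ch x 4 freq bins = 512

    def forward(self, spec, mfcc):
        # spec: (B, freq, T); mfcc: (B, 13, T)
        x = self.convs(spec.unsqueeze(1))                 # (B, 128, freq, T)
        x = self.freq_pool(x)                             # (B, 128, 4, T)
        x = x.flatten(1, 2).transpose(1, 2)               # (B, T, 512) per-frame vectors
        return torch.cat([x, mfcc.transpose(1, 2)], dim=-1)  # (B, T, 525) fused features

# Example: fused = SpectroFusion()(torch.randn(2, 257, 100), torch.randn(2, 13, 100))
```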
2. Extraction of temporal information and prediction of frame-level scores
1. Extraction of time information
The 525-dimensional feature vector of each frame is input into the encoder, and temporal information is extracted to obtain a 512-dimensional vector for each frame, where the number of heads of the multi-head attention mechanism in the encoder is M = 8 and the number of stacked encoder layers is N = 6.
2. Prediction of frame level scores
The 512-dimensional per-frame feature vectors H output by the encoder are mapped to one-dimensional values, i.e. frame-level MOS scores, through a fully connected layer.
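A sketch of this stage built from standard PyTorch modules follows. It is a simplification rather than the patent's exact encoder: nn.TransformerEncoderLayer applies layer normalization where the description above specifies batch normalization, and the sinusoidal positional encoding is added manually; all names and the maximum sequence length are illustrative.

```python
import math
import torch
import torch.nn as nn

class FrameScorer(nn.Module):
    """Input embedding (525 -> 512), sinusoidal positional encoding, a 6-layer
    8-head transformer encoder, and a fully connected layer giving one score per frame."""
    def __init__(self, in_dim=525, d_model=512, heads=8, layers=6, max_len=2000):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_model)
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        enc_layer = nn.TransformerEncoderLayer(d_model, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.fc = nn.Linear(d_model, 1)

    def forward(self, fused):                       # fused: (B, T, 525)
        x = self.embed(fused) + self.pe[: fused.size(1)]
        h = self.encoder(x)                         # (B, T, 512) frame-level features
        return h, self.fc(h).squeeze(-1)            # frame-level scores: (B, T)
```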
3. Speaker feature extraction
The 13-dimensional MFCC parameters extracted from each frame are input into a five-layer TDNN network for feature extraction. The first TDNN layer takes the 13-dimensional vectors as input, and the network outputs 512-dimensional vectors; the mean and the variance of the outputs of the fifth TDNN layer are computed and concatenated into a 1024-dimensional vector, which is then mapped to a 512-dimensional X-Vector through a fully connected layer.
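The sketch below approximates the five-layer TDNN with dilated 1-D convolutions followed by statistics pooling, in the spirit of the usual x-vector recipe; the kernel sizes and dilations are assumptions, since the text fixes only the input (13), hidden and output (512) and pooled (1024) dimensions.

```python
import torch
import torch.nn as nn

class XVectorTDNN(nn.Module):
    """Five TDNN (dilated Conv1d) layers over 13-dim MFCC frames, statistics pooling
    (mean and variance concatenated to 1024 dims), and a linear map to a 512-dim x-vector."""
    def __init__(self, in_dim=13, hid=512, out_dim=512):
        super().__init__()
        # (kernel_size, dilation) per layer are assumptions, not taken from the patent.
        cfg = [(5, 1), (3, 2), (3, 3), (1, 1), (1, 1)]
        layers, prev = [], in_dim
        for k, d in cfg:
            layers += [nn.Conv1d(prev, hid, kernel_size=k, dilation=d), nn.ReLU()]
            prev = hid
        self.tdnn = nn.Sequential(*layers)
        self.fc = nn.Linear(2 * hid, out_dim)

    def forward(self, mfcc):                        # mfcc: (B, 13, T)
        h = self.tdnn(mfcc)                         # (B, 512, T')
        stats = torch.cat([h.mean(dim=2), h.var(dim=2)], dim=1)   # (B, 1024)
        return self.fc(stats)                       # (B, 512) x-vector
```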
4. Fusion of voice features and speaker features and utterance-level score prediction
1. Fusion of the voice features and the speaker features
The encoder output H is subjected in turn to dimension permutation and adaptive average pooling, aggregating the frame-level features into an utterance-level feature, to obtain a 128-dimensional feature vector q for each pathological voice.
The 128-dimensional q and the 512-dimensional X-Vector are concatenated to obtain a preliminarily fused 640-dimensional utterance-level feature vector, from which deeper features are then extracted by a convolution module. The convolution module has 4 layers: the parameter size of layer 1 is 3×3×8 (3×3 convolution kernel, 8 output channels); layer 2 is 3×3×16 (3×3 kernel, 16 output channels); layer 3 is 3×3×32 (3×3 kernel, 32 output channels); layer 4 is 3×3×64 (3×3 kernel, 64 output channels). The activation function of each of the 4 convolution layers is the ReLU function. The result is then reshaped to obtain a 512-dimensional feature vector for each pathological voice.
2. Prediction of speech-level scores
The resulting 512-dimensional feature vector is mapped to a one-dimensional MOS score through an adaptive average-pooling operation to obtain the utterance-level score.
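A hedged sketch of the utterance-level fusion and scoring stage is given below. The exact pooling that yields the 128-dimensional vector and the reshaping fed to the 3x3 convolutions are not spelled out in the text, so those choices, along with all names, are assumptions; only the dimensions (128 + 512 = 640 in, 512 hidden, 1 out) and the channel counts 8/16/32/64 follow the embodiment.

```python
import torch
import torch.nn as nn

class UtteranceScorer(nn.Module):
    """Fuses the encoder output with the x-vector and predicts one MOS score per utterance.
    Pooling and reshaping choices are assumptions made for this sketch."""
    def __init__(self):
        super().__init__()
        self.to_128 = nn.AdaptiveAvgPool2d((1, 128))       # (B, T, 512) -> 128-dim utterance vector
        chans = [1, 8, 16, 32, 64]
        convs = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            convs += [nn.Conv2d(cin, cout, kernel_size=3, padding=1), nn.ReLU()]
        self.convs = nn.Sequential(*convs)
        self.to_512 = nn.AdaptiveAvgPool2d((1, 8))          # 64 channels x 8 = 512-dim feature
        self.to_score = nn.AdaptiveAvgPool1d(1)             # 512 -> 1 (utterance MOS)

    def forward(self, enc_out, xvec):                        # enc_out: (B, T, 512), xvec: (B, 512)
        q = self.to_128(enc_out.unsqueeze(1)).flatten(1)     # (B, 128)
        s = torch.cat([q, xvec], dim=1)                      # (B, 640) fused utterance feature
        g = self.convs(s.view(-1, 1, 16, 40))                # reshape 640 -> 16 x 40 map (assumption)
        g = self.to_512(g).flatten(1)                        # (B, 512)
        return self.to_score(g.unsqueeze(1)).squeeze(-1).squeeze(-1)  # (B,) utterance score
```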
5. Loss function
1. Frame-level score network loss function:
To ensure the accuracy of the frame-level scores of the pathological voice output by the network, the mean-square loss between the frame-level scores p_t predicted by the network and the true frame-level scores y_t is used to optimize the frame-level score prediction network:

L_f = (1/T) Σ_{t=1}^{T} ( p_t - y_t )^2

where T is the number of frames.
2. Utterance-level score network loss function:
To ensure the accuracy of the utterance-level score of the pathological voice output by the network, the mean-square loss between the utterance-level score P predicted by the network and the true utterance-level score Y is used to optimize the utterance-level score prediction network:

L_u = ( P - Y )^2
3. Total loss function
The total loss function for objective quality evaluation of pathological voice can be expressed as:

L = L_f + L_u
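A short sketch of the training objective follows. It assumes, as is common in frame-plus-utterance MOS predictors but not stated explicitly here, that the frame-level target is the utterance MOS broadcast to every frame, and that the two mean-square losses are summed with equal weight.

```python
import torch
import torch.nn.functional as F

def total_loss(frame_pred, utt_pred, mos):
    """frame_pred: (B, T) frame-level scores, utt_pred: (B,) utterance scores,
    mos: (B,) ground-truth MOS labels."""
    frame_target = mos.unsqueeze(1).expand_as(frame_pred)   # assumption: broadcast MOS per frame
    l_frame = F.mse_loss(frame_pred, frame_target)          # frame-level MSE loss L_f
    l_utt = F.mse_loss(utt_pred, mos)                       # utterance-level MSE loss L_u
    return l_frame + l_utt                                  # equal weighting assumed
```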
FIG. 4 compares the performance of the MOSNet model and the proposed method. As can be seen from FIG. 4, the predicted scores of both methods are distributed around the identity line, i.e. the predictions correlate strongly with the true values, but the predictions of the invention cluster more tightly around the identity line and fit better. This shows that the method based on the fusion of voice features and speaker features achieves effective feature representation and feature extraction for pathological voice.
FIG. 5 shows how well the invention fits the ground-truth MOS labels. The pathological voice quality scores predicted by the invention are drawn as a curve, and the ground-truth MOS scores are drawn as a histogram. As can be seen from FIG. 5, although the ground-truth scores of the voice data are widely dispersed, the invention still makes reliable quality score predictions for pathological voice: the prediction curve can be regarded as a fit to the histogram of the ground-truth data. This demonstrates that the invention, as an objective and automatic evaluation method, can fit human subjective evaluation and make reliable score predictions for pathological voice.
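As a usage note, the kind of comparison shown in FIG. 4 and FIG. 5 can be reproduced with a short evaluation script such as the one below; the correlation measures and the plot are illustrative additions, not taken from the patent.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
import matplotlib.pyplot as plt

def evaluate(pred: np.ndarray, true: np.ndarray):
    """Scatter the predicted utterance scores against the subjective MOS labels
    and report linear (Pearson) and rank (Spearman) correlations."""
    lcc, _ = pearsonr(pred, true)
    srcc, _ = spearmanr(pred, true)
    plt.scatter(true, pred, s=8)
    plt.plot([true.min(), true.max()], [true.min(), true.max()])  # identity line
    plt.xlabel("subjective MOS"); plt.ylabel("predicted MOS")
    plt.title(f"LCC={lcc:.3f}, SRCC={srcc:.3f}")
    plt.show()
    return lcc, srcc
```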
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (7)

1. A pathological voice quality evaluation method based on the fusion of voice characteristics and speaker characteristics, characterized in that the method comprises the following steps:
step 1: inputting pathological voice, extracting two kinds of voice features, a spectrogram and mel-frequency cepstral coefficients, and performing feature fusion;
step 2: taking the fused voice features as input, extracting temporal information and predicting frame-level scores;
step 3: taking the mel-frequency cepstral coefficient voice features as input and extracting speaker features;
step 4: taking the voice features obtained after the temporal-information extraction in step 2 and the speaker features obtained in step 3 as input, and performing feature fusion to obtain the prediction of the utterance-level quality score.
2. The pathological voice quality evaluation method based on the fusion of voice characteristics and speaker characteristics according to claim 1, wherein, in the step 1,
extracting the spectrogram voice feature comprises framing the pathological voice, applying a Hamming window and performing a short-time Fourier transform to obtain the magnitude spectrogram of the voice signal;
extracting the mel-frequency cepstral coefficient voice feature comprises applying p mel filters to the pathological voice x to obtain m mel-frequency cepstral coefficients per frame.
3. The pathological voice quality evaluation method based on the fusion of voice characteristics and speaker characteristics according to claim 2, wherein, in the step 1, the feature fusion of the two kinds of voice features includes:
inputting the obtained spectrogram into a convolution module formed by stacking k convolution layers to finally obtain a feature vector C_k for each frame; the process is expressed by the following formulas:

C_1 = φ( W_1 * U + b_1 )
C_l = φ( W_l * C_{l-1} + b_l ),   l = 2, ..., k

wherein U is the input spectrogram of the pathological voice, φ denotes the ReLU activation function, W_1 and b_1 are the weight and bias parameters of the first convolution layer, whose output is C_1; W_l and b_l are the weight and bias parameters of the l-th convolution layer, C_{l-1} is the output of the (l-1)-th convolution layer, and C_k is the output of the k-th convolution layer;
C_k is concatenated with the m-dimensional mel-frequency cepstral coefficients obtained for the corresponding frame to obtain the preliminarily fused frame-level feature F.
4. The pathological voice quality evaluation method based on the fusion of voice features and speaker features according to claim 3, wherein the step 2 specifically includes:
Step 201: the fused frame-level feature F is first taken as input and passed through an input embedding layer to obtain E, which has three dimensions [n, u, v], meaning: the number of vectors in a batch n, the number of values in each vector u and the embedding dimension of each value v;
Step 202: E is then position-encoded to obtain PE; the position-encoding formulas are

PE(pos, 2t) = sin( pos / 10000^(2t/v) )
PE(pos, 2t+1) = cos( pos / 10000^(2t/v) )

wherein pos is the position of each value in the input vector and 2t is the dimension index within the encoding vector of each value; E and PE are then added dimension by dimension to obtain a new vector E_p carrying the position-encoding information;
Step 203: E_p is sent to an M-head self-attention module to obtain A, whose dimensions are the same as those of E_p, according to the following formulas:

Q_t = E_p W_t^Q,   K_t = E_p W_t^K,   V_t = E_p W_t^V

head_t = softmax( Q_t K_t^T / sqrt(d_k) ) V_t
A = Concat( head_1, ..., head_M )

wherein W_t^Q, W_t^K and W_t^V are the weight matrices of the t-th head of the M-head attention mechanism, d_k is the dimension of the resulting K vectors, head_t is the output of the attention module of the t-th head, and A is the total output of the M-head attention mechanism;
Step 204: A and E_p are combined through a residual connection to obtain R_1, and R_1 is then batch-normalized to obtain B_1; the process is computed as

R_1 = E_p + A
B_1 = BatchNorm( R_1 )

Step 205: B_1 is passed through a feed-forward computation to obtain D, computed as

D = φ( B_1 W_F )

wherein W_F is a weight matrix and φ denotes the ReLU activation function;
Step 206: D and B_1 are combined through a residual connection to obtain R_2, and R_2 is then batch-normalized to obtain B_2; the process is computed as

R_2 = B_1 + D
B_2 = BatchNorm( R_2 )

Step 207: B_2 is passed through the above encoding repeatedly, N encodings in total, to obtain the frame-level feature H;
Step 208: H is passed through a fully connected layer to obtain the frame-level scores p:

p = FC( H )

wherein FC denotes a fully connected layer.
5. The pathological voice quality evaluation method based on the fusion of voice features and speaker features according to claim 4, wherein the step 3 includes:
the m-dimensional mel-frequency cepstral coefficient features of each frame are input into a multi-layer time-delay neural network (TDNN); while taking the context information between adjacent frames into account, the TDNN aggregates the frame-level features into one utterance-level feature; the mean and variance of the outputs of the last TDNN layer are computed and concatenated, and the result is then passed through further network layers to obtain the speaker feature vector x_v.
6. The pathological voice quality evaluation method based on the fusion of voice features and speaker features according to claim 5, wherein, in the step 4, feature fusion is performed on the voice feature obtained after the temporal-information extraction in the step 2 and the speaker feature obtained in the step 3 as inputs, including:
the frame-level feature H is subjected in turn to dimension permutation, adaptive average pooling and dimension permutation to obtain q, computed as

q = Permute( AvgPool( Permute( H ) ) )

wherein Permute denotes the exchange of dimensions of H and AvgPool denotes the adaptive average-pooling operation;
the feature vector q and the speaker feature x_v are concatenated to obtain the preliminarily fused utterance-level feature vector s:

s = Concat( q, x_v )

s is then taken as input and fed into a convolution module formed by stacking k convolution layers to obtain the high-dimensional utterance-level feature G_k; the process is expressed by the following formulas:

G_1 = φ( W_1 * s + b_1 )
G_l = φ( W_l * G_{l-1} + b_l ),   l = 2, ..., k

wherein φ denotes the ReLU activation function, W_1 and b_1 are the weight and bias parameters of the first convolution layer, whose output is G_1; W_l and b_l are the weight and bias parameters of the l-th convolution layer, G_{l-1} is the output of the (l-1)-th convolution layer, and G_k is the output of the k-th convolution layer.
7. The pathological voice quality evaluation method based on the fusion of voice features and speaker features according to claim 6, wherein, in the step 4, the prediction of the utterance-level quality score includes:
G_k is subjected in turn to dimension permutation and adaptive average pooling to obtain the final score P, computed as

P = AvgPool( Permute( G_k ) )

wherein Permute denotes the exchange of dimensions and AvgPool denotes the adaptive average-pooling operation.
CN202310395720.1A 2023-04-14 2023-04-14 Pathological voice quality evaluation method based on fusion of voice characteristics and speaker characteristics Active CN116110437B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310395720.1A CN116110437B (en) 2023-04-14 2023-04-14 Pathological voice quality evaluation method based on fusion of voice characteristics and speaker characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310395720.1A CN116110437B (en) 2023-04-14 2023-04-14 Pathological voice quality evaluation method based on fusion of voice characteristics and speaker characteristics

Publications (2)

Publication Number Publication Date
CN116110437A true CN116110437A (en) 2023-05-12
CN116110437B CN116110437B (en) 2023-06-13

Family

ID=86260107

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310395720.1A Active CN116110437B (en) 2023-04-14 2023-04-14 Pathological voice quality evaluation method based on fusion of voice characteristics and speaker characteristics

Country Status (1)

Country Link
CN (1) CN116110437B (en)

Citations

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103093759A (en) * 2013-01-16 2013-05-08 东北大学 Device and method of voice detection and evaluation based on mobile terminal
CN103730130A (en) * 2013-12-20 2014-04-16 中国科学院深圳先进技术研究院 Detection method and system for pathological voice
CN107068167A (en) * 2017-03-13 2017-08-18 广东顺德中山大学卡内基梅隆大学国际联合研究院 Merge speaker's cold symptoms recognition methods of a variety of end-to-end neural network structures
CN109727608A (en) * 2017-10-25 2019-05-07 香港中文大学深圳研究院 A kind of ill voice appraisal procedure based on Chinese speech
US20210319804A1 (en) * 2020-04-01 2021-10-14 University Of Washington Systems and methods using neural networks to identify producers of health sounds
AU2020102516A4 (en) * 2020-09-30 2020-11-19 Du, Jiahui Mr Health status monitoring system based on speech analysis
CN112820279A (en) * 2021-03-12 2021-05-18 深圳市臻络科技有限公司 Parkinson disease detection method based on voice context dynamic characteristics
CN114724589A (en) * 2022-04-14 2022-07-08 标贝(北京)科技有限公司 Voice quality inspection method and device, electronic equipment and storage medium
CN115294970A (en) * 2022-10-09 2022-11-04 苏州大学 Voice conversion method, device and storage medium for pathological voice

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A. Gómez et al.: "Acoustic to kinematic projection in Parkinson's disease dysarthria", Biomedical Signal Processing and Control, pages 1-13 *
Zou Jiacheng: "Research and application of deep-learning-based auscultation of respiratory and lung diseases", China Master's Theses Full-text Database, Medicine and Health Sciences, no. 01, pages 1-65 *

Also Published As

Publication number Publication date
CN116110437B (en) 2023-06-13

Similar Documents

Publication Publication Date Title
EP4002362A1 (en) Method and apparatus for training speech separation model, storage medium, and computer device
Moran et al. Telephony-based voice pathology assessment using automated speech analysis
Jahangir et al. Deep learning approaches for speech emotion recognition: state of the art and research challenges
Muhammad et al. Convergence of artificial intelligence and internet of things in smart healthcare: a case study of voice pathology detection
CN112818892A (en) Multi-modal depression detection method and system based on time convolution neural network
US20040002853A1 (en) Method and device for speech analysis
Seneviratne et al. Multi-Corpus Acoustic-to-Articulatory Speech Inversion.
WO2023139559A1 (en) Multi-modal systems and methods for voice-based mental health assessment with emotion stimulation
CN110946554A (en) Cough type identification method, device and system
CN112397074A (en) Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning
Tripathi et al. A novel approach for intelligibility assessment in dysarthric subjects
CN111489763A (en) Adaptive method for speaker recognition in complex environment based on GMM model
US20240057936A1 (en) Speech-analysis based automated physiological and pathological assessment
Avila et al. Automatic speaker verification from affective speech using Gaussian mixture model based estimation of neutral speech characteristics
Dibazar et al. A system for automatic detection of pathological speech
CN115910097A (en) Audible signal identification method and system for latent fault of high-voltage circuit breaker
CN116110437B (en) Pathological voice quality evaluation method based on fusion of voice characteristics and speaker characteristics
Lee et al. Assessment of dysarthria using one-word speech recognition with hidden markov models
Ribeiro et al. Towards the prediction of the vocal tract shape from the sequence of phonemes to be articulated
Debnath et al. Study of speech enabled healthcare technology
CN116013371A (en) Neurodegenerative disease monitoring method, system, device and storage medium
Xu et al. Voiceprint recognition of Parkinson patients based on deep learning
Amami et al. A robust voice pathology detection system based on the combined bilstm–cnn architecture
Suwannakhun et al. Characterizing Depressive Related Speech with MFCC
Naikare et al. Classification of voice disorders using i-vector analysis

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant