CN114141366A - Cerebral apoplexy rehabilitation assessment auxiliary analysis method based on voice multitask learning - Google Patents
- Publication number
- CN114141366A CN114141366A CN202111665085.1A CN202111665085A CN114141366A CN 114141366 A CN114141366 A CN 114141366A CN 202111665085 A CN202111665085 A CN 202111665085A CN 114141366 A CN114141366 A CN 114141366A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/66—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/30—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
Abstract
The invention discloses a stroke rehabilitation assessment auxiliary analysis method based on voice multitask learning. A multi-task learning model is provided in which the main task is a regression task that predicts a stroke voice-function-damage evaluation score and the auxiliary task is a classification task that grades the severity of the damage. The bottom-layer model consists of a feature extraction model based on a deep residual network (Resnet50) applied to Mel spectrograms and a time-series prediction model based on a long short-term memory network (LSTM); the top-layer model is a fully connected neural network for each of the main and auxiliary tasks. The loss function adopted is a weighted superposition of a mean-square-error loss and a cross-entropy loss. The multi-task learning mechanism adopted by the invention reduces the probability of model overfitting, effectively reduces prediction error, and makes the patient's current rehabilitation status clear from the predicted score.
Description
Technical Field
The invention belongs to the field of voice signal processing and intelligent medical auxiliary analysis, and relates to a stroke rehabilitation assessment auxiliary analysis method based on voice multitask learning with a deep residual network (Resnet50) and a long short-term memory network (LSTM).
Background
Stroke is an acute cerebrovascular disease: cranial nerve injury caused by rupture of a cerebral vessel, or by blood failing to reach the brain because a vessel is occluded, with high morbidity and disability rates. Surveys show that stroke is the leading cause of death among Chinese residents, and stroke deaths in China account for about one third of stroke deaths worldwide. Stroke patients typically show symptoms such as unclear articulation and incoherent speech, which seriously affect their normal lives.
Existing voice-based stroke detection methods fall mainly into two categories. Traditional methods are based on feature engineering: a pre-trained speech recognition model generates time-alignment information between audio and text, from which pronunciation accuracy and fluency features such as goodness of pronunciation and the number of syllables per unit time are computed. To these are added articulation-difficulty features extracted from the raw speech signal, such as jitter, shimmer, pitch period entropy, glottal entropy and signal-to-noise ratio, and a machine learning classifier performs the classification. Deep learning methods instead design a neural network that takes the raw speech signal or a speech time-frequency representation as input, so that the network automatically learns features related to speech impairment without complex hand-crafted feature computation. However, these methods tend to have the following disadvantages:
1. In the traditional method, because the patient's speech is unclear, the speech recognition model makes large errors in recognizing it and the extracted features lack robustness. The classifiers are weak, placing high demands on the representational power of the extracted features, so algorithms based on feature engineering cannot meet engineering requirements;
2. Existing deep learning methods mainly perform binary classification on the presence or absence of stroke; they cannot quantify data across different severity levels and therefore cannot give a reasonable assessment of the patient's current rehabilitation status.
To address these problems, the invention provides a multi-task learning model in which the main task is a regression task that predicts an evaluation score for stroke voice function damage, and the auxiliary task is a classification task that grades the severity of the damage. The bottom-layer model consists of a feature extraction model based on a deep residual network (Resnet50) applied to Mel spectrograms and a time-series prediction model based on a long short-term memory network (LSTM); the top-layer model is a fully connected neural network for each of the main and auxiliary tasks. The bottom-layer parameters are shared across tasks while the top-layer parameters are task-specific, and the single mean-square-error loss is replaced by a weighted superposition of the mean-square-error loss and the cross-entropy loss to train the network. The method reduces the probability of model overfitting; after the auxiliary classification task is added, abnormal predicted values are effectively suppressed and prediction accuracy improves.
Disclosure of Invention
Aiming at the shortcomings of existing voice-based stroke rehabilitation assessment algorithms, the invention provides a multitask-learning stroke rehabilitation assessment auxiliary analysis method. It adopts a multi-task learning mechanism whose main task is a regression task that predicts the stroke voice-function-damage evaluation score and whose auxiliary task classifies the severity of the damage. The mechanism automatically learns depth features of the voice Mel frequency spectrogram and predicts scores from time-series information, effectively suppressing prediction errors and thereby realizing automatic voice-based stroke rehabilitation assessment.
The technical scheme of the invention mainly comprises the following steps:
step 1, preprocessing the original voice signal and extracting segment-level Mel frequency spectrograms, as detailed in steps 1-1 to 1-4 below;
step 2, the labels of the existing data set are the physicians' evaluation scores of voice function damage; the existing data are divided into four severity grades according to evaluation-score intervals, which serve as labels for the auxiliary classification task;
step 3, for the segment-level Mel frequency spectrograms extracted in step 1, an improved Resnet50 deep convolutional neural network is used; with a hard parameter sharing mechanism, an auxiliary classification task that grades the severity of stroke voice function damage is added alongside the main regression task that predicts the damage score; pre-trained network weights are loaded, the labels from step 2 are attached, the loss function is modified, the model is trained, and 100-dimensional depth features are extracted;
and step 4, the 100-dimensional depth features of the segment-level Mel frequency spectrograms obtained in step 3 are assembled into utterance-level features in time order; a three-layer LSTM network with the same hard parameter sharing mechanism adds the auxiliary severity-classification task alongside the main score-regression task; the loss function is modified, the model is trained, and the final evaluation score of the voice function damage is obtained.
Further, the step 1 is specifically realized as follows:
1-1, the original voice signal is cut into fixed four-second lengths: the portion of a segment beyond four seconds is discarded, and segments shorter than four seconds are extended to four seconds by repeating their existing content;
1-2, passing the speech signal through a high-pass pre-emphasis filter H(z) = 1 − μ·z^(−1) to enhance the high-frequency part of the signal; then framing the signal with a frame length of 25 milliseconds and a frame shift of 10 milliseconds; then multiplying each frame by a Hamming window;
1-3, performing a fast Fourier transform on each framed and windowed frame to obtain its short-time amplitude spectrum, taking the squared modulus to obtain the power spectrum, and passing it through a Mel filter bank of 64 filters to obtain the Mel frequency spectrogram, where the Mel scale is defined by Mel(f) = 2595·log10(1 + f/700);
the final 4 seconds of audio is processed to obtain a mel frequency spectrum of 400 x 64 pixels;
1-4, the 400 x 64 pixel Mel frequency spectrogram is sliced with a window length of 64 pixels and a shift of 30 pixels to obtain static images of the Mel spectrogram; the first-order and second-order differences are then computed and superimposed with the static image to form a three-channel picture analogous to RGB; each final 4 s segment of audio thus yields 13 segment-level Mel frequency spectrograms of 64 x 64 pixels.
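The preprocessing pipeline of steps 1-1 to 1-4 can be sketched as follows. This is an illustrative NumPy reconstruction, not the patent's code; the 16 kHz sampling rate, 512-point FFT size and pre-emphasis coefficient μ = 0.97 are assumptions the patent does not state, and the exact number of 64 x 64 segments per utterance depends on padding conventions.

```python
import numpy as np

SR = 16000                  # assumed sampling rate (not given in the patent)
FRAME = int(0.025 * SR)     # 25 ms frame length -> 400 samples
HOP = int(0.010 * SR)       # 10 ms frame shift  -> 160 samples
N_MELS = 64
N_FFT = 512                 # assumed FFT size

def fix_length(y, seconds=4):
    """Step 1-1: clip to 4 s; repeat shorter clips until they reach 4 s."""
    n = seconds * SR
    if len(y) >= n:
        return y[:n]
    return np.tile(y, int(np.ceil(n / len(y))))[:n]

def pre_emphasis(y, mu=0.97):
    """Step 1-2: high-pass filter H(z) = 1 - mu * z^-1 (mu value assumed)."""
    return np.append(y[0], y[1:] - mu * y[:-1])

def mel_filterbank(n_mels=N_MELS, n_fft=N_FFT, sr=SR):
    """Step 1-3: triangular filters spaced on the Mel scale."""
    to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = to_hz(np.linspace(0.0, to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mel_spectrogram(y):
    """Steps 1-2/1-3: frame, Hamming window, FFT, power, Mel filter, log."""
    frames = np.lib.stride_tricks.sliding_window_view(y, FRAME)[::HOP]
    frames = frames * np.hamming(FRAME)
    power = np.abs(np.fft.rfft(frames, N_FFT)) ** 2
    return np.log(power @ mel_filterbank().T + 1e-10)   # (n_frames, 64)

def to_segments(spec, win=64, shift=30):
    """Step 1-4: 64-frame windows with shift 30; stack the static image with
    its first- and second-order differences as three RGB-like channels."""
    d1 = np.gradient(spec, axis=0)
    d2 = np.gradient(d1, axis=0)
    starts = range(0, spec.shape[0] - win + 1, shift)
    return np.stack([np.stack([a[s:s + win] for a in (spec, d1, d2)], axis=-1)
                     for s in starts])                   # (n_seg, 64, 64, 3)
```

Each returned segment is a 64 x 64 x 3 array ready to be fed to an image network such as Resnet50.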
Further, the step 2 is specifically realized as follows:
2-1, samples with evaluation scores in the interval 85-100 are labeled mild, those in 75-84 moderate, those in 65-74 severe, and those in 60-64 very severe.
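The interval-to-class mapping of step 2-1 can be written directly; the string class names are an illustrative encoding, not something the patent prescribes:

```python
def severity_label(score):
    """Map a physician's 0-100 voice-function-damage evaluation score to one
    of the four severity classes of step 2-1."""
    if score >= 85:            # interval 85-100
        return "mild"
    if score >= 75:            # interval 75-84
        return "moderate"
    if score >= 65:            # interval 65-74
        return "severe"
    return "very severe"       # interval 60-64
```

Applied to the embodiment's example scores (78, 91, 87, ...), this reproduces the four-way labeling used for the auxiliary classification task.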
Further, the step 3 is specifically realized as follows:
the 3-1 improved Resnet50 network structure is as follows: the Resnet50 network output layer originally used for 1000 classes of ImageNet has 1000 neurons, and the neurons are modified into 100 neurons; then adding respective network output layers for the two tasks respectively, wherein the output layer of the regression task is 1 neuron, and the output layer of the classification task is 4 neurons;
3-2, training by adopting a multi-task learning mechanism, wherein the model applies a hard parameter sharing mechanism, namely a network layer before two task output layers shares parameters, and only the output layers correspond to respective network parameters; the regression task corresponds to a mean-square loss function MSELoss, and the classification task corresponds to a cross entropy loss function CrossEntropyLoss, so that the used loss function TotalLoss is the weighted superposition of the mean-square loss function and the cross entropy loss function; by loading the weight parameters of the pre-trained Resnet50 network in a transfer learning mode, the training speed of the network can be effectively accelerated;
the total loss is TotalLoss = α·MSELoss + β·CrossEntropyLoss, where MSELoss = (1/n)·Σ_{i=1..n}(x_i − x̂_i)^2 and CrossEntropyLoss = −(1/n)·Σ_{i=1..n}Σ_{j=1..m} y_ij·log(ŷ_ij); x_i and x̂_i respectively represent the predicted value and the label value corresponding to the regression task, ŷ_ij and y_ij respectively represent a predicted value and a label value of the classification task, n represents the number of samples in one training batch, and m represents the number of classes of the auxiliary classification task; the invention uses α = 1 and β = 0.5;
3-3, inputting a segment level Mel frequency spectrogram after the model is trained, and taking the output of the penultimate layer of the modified Resnet50 network as a characteristic; since the penultimate layer has 100 neurons, the feature dimension is 100 dimensions.
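A minimal PyTorch sketch of the hard-parameter-sharing arrangement of steps 3-1 to 3-2: a shared trunk standing in for the modified Resnet50 (its 1000-neuron ImageNet output layer replaced by a 100-neuron layer), two task-specific output layers, and the weighted loss with α = 1, β = 0.5. This is an illustrative reconstruction, not the patent's code; in practice the trunk would be `torchvision.models.resnet50` loaded with pre-trained weights via transfer learning.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskNet(nn.Module):
    """Hard parameter sharing: every layer before the two output layers is
    shared; only the output layers are task-specific (step 3-2)."""
    def __init__(self, feat_dim=2048):
        super().__init__()
        # Stand-in for Resnet50 with its output layer cut down to 100 neurons.
        self.shared = nn.Linear(feat_dim, 100)
        self.reg_head = nn.Linear(100, 1)   # regression: impairment score
        self.cls_head = nn.Linear(100, 4)   # classification: 4 severity levels

    def forward(self, x):
        h = F.relu(self.shared(x))          # the 100-dim activation of step 3-3
        return self.reg_head(h), self.cls_head(h)

def total_loss(reg_out, cls_out, score, label, alpha=1.0, beta=0.5):
    """TotalLoss = alpha * MSELoss + beta * CrossEntropyLoss (step 3-2)."""
    return alpha * F.mse_loss(reg_out.squeeze(1), score) + \
           beta * F.cross_entropy(cls_out, label)
```

The 100-dimensional hidden activation before the heads corresponds to the penultimate-layer output that step 3-3 extracts as the segment-level depth feature.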
Further, the step 4 is specifically realized as follows:
4-1, combining the obtained 100-dimensional segment level features into utterance level features according to a time sequence intercepted by a Mel frequency spectrogram, and processing each 4s voice segment to obtain 13 x 100-dimensional utterance level features;
4-2, a three-layer LSTM network predicts from the input utterance-level features, with 64 neurons per layer and dropout of 0.5 to reduce network overfitting;
4-3, training by adopting a multi-task learning mechanism, wherein one neuron is arranged at an output layer of an LSTM regression task, four neurons are arranged at an output layer of a classification task, a hard parameter sharing mechanism is applied to a model, and network layers in front of the output layers of the two tasks share parameters, and only the output layers correspond to respective network parameters; the used loss function TotalLoss is the weighted superposition of a mean square loss function and a cross entropy loss function;
the total loss is TotalLoss = α·MSELoss + β·CrossEntropyLoss, where MSELoss = (1/n)·Σ_{i=1..n}(x_i − x̂_i)^2 and CrossEntropyLoss = −(1/n)·Σ_{i=1..n}Σ_{j=1..m} y_ij·log(ŷ_ij); x_i and x̂_i respectively represent the predicted value and the label value corresponding to the regression task, ŷ_ij and y_ij respectively represent a predicted value and a label value of the classification task, n represents the number of samples in one training batch, and m represents the number of classes of the auxiliary classification task; here α = 1 and β = 0.5;
4-4, the output of the neuron in the LSTM network's regression-task output layer is the final prediction result;
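The utterance-level predictor of steps 4-1 to 4-4 can be sketched the same way, again as an assumed PyTorch reconstruction rather than the patent's code: a three-layer, 64-unit LSTM with dropout 0.5 over the 13 x 100 feature sequence, with one regression neuron and four classification neurons sharing all layers before the heads.

```python
import torch
import torch.nn as nn

class LSTMMultiTask(nn.Module):
    """Three-layer LSTM (64 neurons per layer, dropout 0.5) with hard
    parameter sharing and two task-specific heads (steps 4-2/4-3)."""
    def __init__(self, feat_dim=100, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=3,
                            dropout=0.5, batch_first=True)
        self.reg_head = nn.Linear(hidden, 1)   # final evaluation score (4-4)
        self.cls_head = nn.Linear(hidden, 4)   # severity class

    def forward(self, x):                      # x: (batch, 13, 100)
        out, _ = self.lstm(x)
        h = out[:, -1]                         # hidden state at the last step
        return self.reg_head(h), self.cls_head(h)
```

Training would reuse the same weighted TotalLoss as in step 3; at inference, the single regression neuron gives the final rehabilitation score.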
the evaluation indexes of the 4-5 model adopt root mean square error RMSE, a decision coefficient R-square and average absolute error MAE, and the calculation formulas of all parameters are as follows:
RMSE = sqrt((1/n)·Σ_i (y_i − ŷ_i)^2), MAE = (1/n)·Σ_i |y_i − ŷ_i|, R² = 1 − Σ_i (y_i − ŷ_i)^2 / Σ_i (y_i − y_mean)^2, where y_i denotes the true value of a sample, ŷ_i denotes its predicted value, and y_mean denotes the mean of the true values of all samples. RMSE is the square root of the mean squared error between the fitted and original data; the smaller the value, the better the fit. MAE averages the absolute differences between predicted and true values for each sample to measure how close the predictions are to the real data; again, the smaller the value, the better the fit. R² lies between 0 and 1: the closer to 1, the better the model's prediction, and the closer to 0, the worse.
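The three evaluation indices of step 4-5 are standard and can be computed directly; a plain NumPy illustration:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error: smaller means a better fit."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: smaller means predictions closer to the truth."""
    return float(np.mean(np.abs(y_true - y_pred)))

def r_square(y_true, y_pred):
    """Coefficient of determination: the closer to 1, the better."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)
```

A perfect prediction gives RMSE = 0, MAE = 0 and R² = 1; a constant offset of 0.1 in every prediction gives RMSE = MAE = 0.1.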
The invention has the following beneficial effects:
the stroke rehabilitation assessment auxiliary analysis method based on the voice multitask learning has the advantages that: 1) the standard Mel cepstral coefficients (MFCCs) reflect only the static characteristics of the speech parameters, and the dynamic characteristics of speech can be described by the difference spectrum of these static characteristics. The adopted Mel frequency spectrogram can combine dynamic and static characteristics to effectively improve the identification performance of the system. 2) Compared with the manually designed features related to the voice function damage, the depth features of the Mel frequency spectrum map extracted by using Resnet have better generalization performance and certain robustness to noise. 3) For single-task learning, the error of partial samples in the prediction result of a single regression network is large, and the samples with large prediction errors can be punished after the auxiliary classification task is added, so that the precision is effectively improved.
The invention can establish an accurate and efficient voice-based rehabilitation assessment and diagnosis framework for stroke patients, overcoming the shortcoming that existing rehabilitation assessment relies solely on manual work and lacks scientific rigor and objectivity. It can substantially assist physicians' diagnoses and is expected to reduce the burden on medical staff and improve medical efficiency.
Drawings
FIG. 1: flow chart of the invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
As shown in fig. 1, the implementation steps of the stroke rehabilitation assessment auxiliary analysis method based on voice multitask learning are described in detail in the summary of the invention. The main innovations are: (1) standard Mel-frequency cepstral coefficients (MFCCs) reflect only the static characteristics of the speech parameters, while the dynamic characteristics of speech can be described by the difference spectra of these static features; the Mel frequency spectrogram adopted by the invention combines dynamic and static characteristics and effectively improves the recognition performance of the system; (2) the Resnet50 network overcomes the performance degradation that accompanies increasing network depth and can mine deep features of the voice signal, while the LSTM network learns relations across the feature-sequence context and better models symptoms such as the unclear and incoherent speech of stroke patients; (3) under single-task learning, some samples in the predictions of a lone regression network have large errors; adding the auxiliary classification task penalizes samples with large prediction errors, effectively reducing prediction error and improving model accuracy.
The technical scheme of the invention mainly comprises the following steps:
Step 1, the original voice signal is preprocessed and segment-level Mel frequency spectrograms are extracted, as detailed in steps 1-1 to 1-4 below.
Step 2, the labels of the existing data set are the physicians' evaluation scores of voice function damage; the existing data are divided into four severity grades according to evaluation-score intervals, which serve as labels for the auxiliary classification task.
Step 3, for the segment-level Mel frequency spectrograms extracted in step 1, a Resnet50 deep convolutional neural network is used; with a hard parameter sharing mechanism, an auxiliary classification task that grades the severity of stroke voice function damage is added alongside the main regression task that predicts the damage score; pre-trained network weights are loaded, the labels from step 2 are attached, the loss function is modified, the model is trained, and 100-dimensional depth features are extracted.
Step 4, the 100-dimensional depth features of the segment-level Mel frequency spectrograms obtained in step 3 are assembled into utterance-level features in time order; a three-layer LSTM network with the same hard parameter sharing mechanism adds the auxiliary severity-classification task alongside the main score-regression task; the loss function is modified, the model is trained, and the final evaluation score of the voice function damage is obtained.
The specific implementation of the step 1 is as follows:
1-1, the original voice signal is cut into fixed four-second lengths: the portion of a segment beyond four seconds is discarded, and segments shorter than four seconds are extended to four seconds by repeating their existing content.
1-2, the speech signal is passed through a high-pass pre-emphasis filter H(z) = 1 − μ·z^(−1) to enhance its high-frequency part. The signal is then framed with a frame length of 25 milliseconds and a frame shift of 10 milliseconds, and each frame is multiplied by a Hamming window.
1-3, a fast Fourier transform is applied to each framed and windowed frame to obtain its short-time amplitude spectrum; the squared modulus gives the power spectrum, which is passed through a Mel filter bank of 64 filters to obtain the Mel frequency spectrogram, where the Mel scale is defined by Mel(f) = 2595·log10(1 + f/700).
the final 4 seconds of speech signal was processed to obtain a mel spectrum of 400 x 64 pixels.
1-4, the obtained 400 x 64 pixel Mel frequency spectrogram is sliced with a window length of 64 pixels and a shift of 30 pixels to obtain static images of the Mel spectrogram; the first-order and second-order differences are then computed and superimposed with the static image to form a three-channel picture analogous to RGB. Each final 4 s segment of audio thus yields 13 segment-level Mel frequency spectrograms of 64 x 64 pixels.
The step 2 is specifically realized as follows:
2-1, there are eleven sets of data with aphasia evaluation scores of 78, 91, 87, 68, 91, 80, 92, 74, 71, 81 and 61; the scores range from 0 to 100 and, for convenience of model training, are mapped to decimals between 0 and 1: 0.78, 0.91, 0.87, 0.68, 0.91, 0.80, 0.92, 0.74, 0.71, 0.81, 0.61. Severity classes are then assigned by interval: 0.91, 0.87, 0.91 and 0.92 are mild; 0.78, 0.80 and 0.81 are moderate; 0.74, 0.71 and 0.68 are severe; 0.61 is very severe. These four classes serve as the labels of the classification task.
The specific implementation of step 3 is as follows:
3-1, modifying a network structure of Resnet50, wherein the Resnet50 network output layers originally used for 1000 class classifications of ImageNet have 1000 neurons in total, modifying the neurons into 100 neurons, and then adding respective network output layers for two tasks respectively, wherein the output layer of a regression task is 1 neuron, and the output layer of a classification task is 4 neurons.
3-2, training adopts a multi-task learning mechanism with hard parameter sharing: the network layers before the two task output layers share parameters, and only the output layers have task-specific parameters. Since the regression task uses a mean square loss function (MSELoss) and the classification task uses a cross entropy loss function (CrossEntropyLoss), the loss function TotalLoss used is a weighted superposition of the two. Loading the weight parameters of the pre-trained model by transfer learning effectively accelerates network training.
The total loss is TotalLoss = α·MSELoss + β·CrossEntropyLoss, where MSELoss = (1/n)·Σ_{i=1..n}(x_i − x̂_i)^2 and CrossEntropyLoss = −(1/n)·Σ_{i=1..n}Σ_{j=1..m} y_ij·log(ŷ_ij); x_i and x̂_i respectively represent the predicted value and the label value corresponding to the regression task, ŷ_ij and y_ij respectively represent a predicted value and a label value of the classification task, n represents the number of samples in one training batch, and m represents the number of classes of the auxiliary classification task; the invention uses α = 1 and β = 0.5;
and after the 3-3 model is trained, inputting the segment level Mel frequency spectrogram, and extracting the output of the penultimate layer of the modified Resnet50 network as the characteristic. Since the penultimate layer has 100 neurons, the feature dimension is 100 dimensions.
The specific implementation of the step 4 is as follows:
4-1, the obtained 100-dimensional segment-level features are combined into utterance-level features in the time order in which the Mel spectrogram segments were cut, so each 4 s speech segment yields a 13 × 100-dimensional utterance-level feature after processing.
4-2, a three-layer LSTM network with 64 neurons per layer is used to predict from the input utterance-level features, with a dropout of 0.5 to reduce network overfitting.
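A sketch of this recurrent predictor, again assuming PyTorch (the class name and the use of the final time step's output are assumptions, as the text does not specify how the sequence output is reduced):

```python
import torch
import torch.nn as nn

class MultiTaskLSTM(nn.Module):
    """Three-layer LSTM (64 neurons per layer, dropout 0.5) with one
    regression head and one four-way classification head sharing the
    recurrent layers (hard parameter sharing)."""
    def __init__(self, feat_dim=100):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, 64, num_layers=3,
                            batch_first=True, dropout=0.5)
        self.reg_head = nn.Linear(64, 1)   # regression output: one neuron
        self.cls_head = nn.Linear(64, 4)   # classification output: four neurons

    def forward(self, x):                  # x: (batch, 13, 100) utterance-level features
        out, _ = self.lstm(x)
        last = out[:, -1, :]               # state after the final time step
        return self.reg_head(last), self.cls_head(last)
```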
4-3, training again adopts the multi-task learning mechanism: the output layer of the LSTM regression task has one neuron and the output layer of the classification task has four neurons; the model applies hard parameter sharing, so the network layers before the two task output layers share parameters and only the output layers have task-specific parameters. The loss function TotalLoss used is the weighted superposition of the mean square loss function and the cross entropy loss function.
TotalLoss = α·MSELoss + β·CrossEntropyLoss, where MSELoss = (1/n)·Σᵢ(xᵢ − x̂ᵢ)² and CrossEntropyLoss = −(1/n)·ΣᵢΣⱼ yᵢⱼ·log(ŷᵢⱼ). Here xᵢ and x̂ᵢ respectively represent the predicted value and the label value of the regression task, ŷᵢⱼ and yᵢⱼ respectively represent the predicted value and the label value of the classification task, n represents the number of samples in one training batch, and m represents the number of classes of the auxiliary classification task; α = 1 and β = 0.5.
4-4, the output of the neuron in the output layer of the LSTM network regression task is the final prediction result;
4-5, the model is evaluated with the root mean square error (RMSE), the coefficient of determination (R-square) and the mean absolute error (MAE), calculated as follows: RMSE = √((1/n)·Σᵢ(yᵢ − ŷᵢ)²), MAE = (1/n)·Σᵢ|yᵢ − ŷᵢ|, R-square = 1 − Σᵢ(yᵢ − ŷᵢ)² / Σᵢ(yᵢ − ȳ)², where yᵢ is the true value, ŷᵢ the predicted value and ȳ the mean of the true values.
the RMSE calculates the mean value re-evolution of the square sum of the error of the corresponding sample points of the fitting data and the original data, and the smaller the value, the better the fitting effect. And calculating the absolute value of the difference between the predicted value and the real value of each sample by the MAE, and then summing and averaging the absolute values to evaluate the closeness degree of the prediction result and the real data set, wherein the smaller the value is, the better the fitting effect is. The R-square is between 0 and 1, the closer to 1, the better the prediction effect of the model is, and the closer to 0, the worse the prediction effect of the model is.
The invention also provides a stroke rehabilitation assessment auxiliary analysis system based on voice multitask learning, which comprises a data preprocessing module, a voice function damage level module, an improved Resnet50 network model and an improved three-layer LSTM network model.
The data preprocessing module is specifically realized as follows: intercepting input voice data into a fixed length of 4 seconds, performing pre-emphasis, framing and windowing on the voice signals, performing short-time Fourier transform on each frame of signal, and obtaining a Mel frequency spectrogram through a Mel filter bank; then intercepting the Mel spectrogram according to a frame length of 64 frames and a frame shift of 30 frames to obtain a static fragment level Mel spectrum, calculating a first order difference and a second order difference of the static fragment level Mel spectrum, and overlapping the static fragment level Mel spectrum, the first order difference and the second order difference to finally obtain a 64 x 64 pixel fragment level Mel spectrum;
the voice function damage level module is specifically realized as follows: the label of the existing data set is an evaluation score of a doctor on the voice function damage, and the existing data is divided into four severity grades according to the interval of the evaluation score to be used as a label of an auxiliary classification task;
the improved Resnet50 network model is implemented as follows: for the segment level Mel frequency spectrogram extracted by the data preprocessing module, an improved Resnet50 deep convolution neural network is used, and a hard parameter sharing mechanism is utilized, and an auxiliary classification task for classifying the severity of stroke voice function damage is added on the basis that a main task is a regression task for predicting the stroke voice function damage score; adding a label of a voice function damage level module by using a pre-training network weight, modifying a loss function, training a model, and extracting a 100-dimensional depth feature;
the improved three-layer LSTM network model is specifically realized as follows: the method comprises the steps of forming speech level features by using 100-dimensional depth features of a segment level Mel frequency spectrum diagram obtained by an improved Resnet50 network model according to a time sequence, adopting a three-layer LSTM network, adding an auxiliary classification task for classifying the severity of stroke voice function damage on the basis of a regression task with a main task of stroke voice function damage score prediction by using a hard parameter sharing mechanism, modifying a loss function, training the model, and finally obtaining an evaluation score of the voice function damage.
In order to achieve a better stroke speech rehabilitation assessment and prediction effect, the following aspects of parameter selection and design in practical application are introduced as a reference for other applications of the invention:
the invention adopts fixed 4 s voice data only to facilitate model training; once the model is trained, voice data of any length can be processed in practical application.
In practical application, the obtained voice data is processed through step 1 to extract the Mel spectrogram, which after segmentation yields N segment-level Mel spectrograms of 64 × 64 pixels. Segment-level features are then obtained through the Resnet feature extraction module of step 3 and stacked in time order to form an N × 100-dimensional utterance-level feature. Finally, the three-layer LSTM processes the N time steps to obtain the final score.
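The segment count N follows from the windowing parameters of step 1. One way to arrive at the 13 segments quoted for a 4 s clip (a 400-frame spectrogram) is to count a trailing partial window, which is an assumption here since the text does not spell out the boundary handling:

```python
import math

def num_segments(n_frames, win=64, hop=30):
    """Number of segment-level spectrograms cut from an n_frames-long mel
    spectrogram with a 64-frame window and 30-frame shift. A trailing
    partial window is counted (assumed padded), which reproduces the
    13 segments quoted for 400 frames."""
    return math.ceil((n_frames - win) / hop) + 1

print(num_segments(400))  # 13
```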
In the invention, when only the regression task is used, the evaluation indexes of the model are RMSE = 0.036, MAE = 0.027 and R-square = 0.778; comparing the predicted values with the actual true values shows that some samples have large prediction errors. With multi-task learning, the evaluation indexes are RMSE = 0.029, MAE = 0.022 and R-square = 0.837, and the number of samples with large prediction errors is significantly reduced. In conclusion, the voice-based multi-task learning rehabilitation assessment auxiliary analysis method for stroke patients can provide scientific and objective results for voice rehabilitation assessment work, addressing the current reliance of rehabilitation assessment on manual evaluation alone.
Claims (5)
1. The stroke rehabilitation assessment auxiliary analysis method based on the voice multitask learning is characterized by comprising the following steps of:
step 1, intercepting input voice data into a fixed length of 4 seconds, performing pre-emphasis, framing and windowing on the voice signal, performing short-time Fourier transform on each frame of signal, and obtaining a Mel spectrogram through a Mel filter bank; then intercepting the Mel spectrogram according to a frame length of 64 frames and a frame shift of 30 frames to obtain a static fragment level Mel spectrum, calculating a first order difference and a second order difference of the static fragment level Mel spectrum, and overlapping the static fragment level Mel spectrum, the first order difference and the second order difference to finally obtain a 64 x 64 pixel fragment level Mel spectrum;
step 2, the labels of the existing data sets are evaluation scores of the voice function damage by doctors, and the existing data are divided into four severity grades according to the evaluation score intervals to be used as labels of auxiliary classification tasks;
step 3, for the segment level Mel frequency spectrogram extracted in the step 1, an improved Resnet50 deep convolution neural network is used, and an auxiliary classification task for classifying the severity of stroke voice function damage is added on the basis that a main task is a regression task for predicting the stroke voice function damage score by using a hard parameter sharing mechanism; adding the label in the step 2 by using the pre-training network weight, modifying the loss function, training the model, and extracting 100-dimensional depth features;
and 4, forming the 100-dimensional depth features of the segment-level Mel frequency spectrogram obtained in the step 3 into speech-level features according to a time sequence, adopting a three-layer LSTM network, utilizing a hard parameter sharing mechanism, adding an auxiliary classification task for classifying the severity of the stroke speech function damage on the basis that a main task is a regression task for predicting the stroke speech function damage score, modifying a loss function, training a model, and finally obtaining the evaluation score of the speech function damage.
2. The stroke rehabilitation assessment auxiliary analysis method based on voice multitask learning according to claim 1, characterized in that the step 1 is realized as follows:
1-1, the original voice signal is cut to a fixed length of four seconds: the part of a segment exceeding four seconds is discarded, and segments shorter than four seconds are padded to four seconds by copying the existing segment;
1-2, the speech signal is passed through a high-pass filter H(z) = 1 − μz⁻¹ to enhance the high-frequency part of the signal; the signal is then framed with a frame length of 25 milliseconds and a frame shift of 10 milliseconds; each frame is then multiplied by a Hamming window;
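This front end can be sketched with NumPy (assumed library; μ = 0.97 and the 16 kHz sample rate are assumed typical values not stated in the text):

```python
import numpy as np

def preemphasis_and_frame(signal, sr=16000, mu=0.97,
                          frame_ms=25, shift_ms=10):
    """Apply pre-emphasis H(z) = 1 - mu*z^-1, then cut into 25 ms frames
    with a 10 ms shift and apply a Hamming window per frame."""
    emphasized = np.append(signal[0], signal[1:] - mu * signal[:-1])
    frame_len = int(sr * frame_ms / 1000)
    frame_shift = int(sr * shift_ms / 1000)
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    idx = (np.arange(frame_len)[None, :] +
           frame_shift * np.arange(n_frames)[:, None])
    return emphasized[idx] * np.hamming(frame_len)

frames = preemphasis_and_frame(np.random.randn(16000 * 4))
print(frames.shape)  # (398, 400) for 4 s at 16 kHz; padding to 400 frames, as in the text, is not shown
```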
1-3, a fast Fourier transform is performed on each framed and windowed frame signal to obtain its short-time amplitude spectrum; the squared magnitude of the amplitude spectrum is then passed through a Mel filter bank with 64 filters, where the Mel scale is defined by Mel(f) = 2595·log₁₀(1 + f/700);
the final 4 seconds of audio is thus processed into a Mel spectrogram of 400 × 64 pixels;
1-4, the 400 × 64 pixel Mel spectrogram is cut with a window length of 64 pixels and a shift of 30 pixels to obtain static images of the Mel spectrogram; the first-order and second-order differences are then calculated and superposed with the static image to form a picture similar to the three RGB channels; each 4 s segment of audio is thus cut into a total of 13 segment-level Mel spectrograms of 64 × 64 pixels.
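The channel stacking of step 1-4 can be sketched as follows (NumPy assumed; `np.gradient` stands in for the delta computation, whose exact formula the text does not give):

```python
import numpy as np

def to_segments(mel, win=64, hop=30):
    """Cut a (frames, 64) mel spectrogram into 64-frame windows with a
    30-frame shift; stack the static spectrum with its first- and
    second-order time differences as RGB-like channels."""
    delta1 = np.gradient(mel, axis=0)
    delta2 = np.gradient(delta1, axis=0)
    stacked = np.stack([mel, delta1, delta2], axis=-1)   # (frames, 64, 3)
    segments = [stacked[s:s + win]
                for s in range(0, mel.shape[0] - win + 1, hop)]
    return np.array(segments)

segs = to_segments(np.random.randn(400, 64))
print(segs.shape)  # (12, 64, 64, 3); counting a final padded window would give the 13 quoted in the text
```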
3. The stroke rehabilitation assessment auxiliary analysis method based on voice multitask learning according to claim 1 or 2, characterized in that the step 2 is realized as follows:
2-1, samples with an evaluation score in the interval 85-100 are set as the mild type, samples in the interval 75-84 as the medium type, samples in the interval 65-74 as the severe type, and samples in the interval 60-64 as the very severe type.
4. The stroke rehabilitation assessment auxiliary analysis method based on voice multitask learning according to claim 1, characterized in that the step 3 is realized as follows:
3-1, the improved Resnet50 network structure is as follows: the Resnet50 output layer originally used for the 1000-class ImageNet classification has 1000 neurons and is modified to 100 neurons; then a separate output layer is added for each of the two tasks, with 1 neuron for the regression task and 4 neurons for the classification task;
3-2, training adopts a multi-task learning mechanism; the model applies hard parameter sharing, that is, the network layers before the two task output layers share parameters, and only the output layers have their own task-specific parameters; the regression task uses the mean square loss function MSELoss and the classification task uses the cross entropy loss function CrossEntropyLoss, so the loss function TotalLoss used is the weighted superposition of the mean square loss function and the cross entropy loss function; loading the weight parameters of the pre-trained Resnet50 network by transfer learning effectively accelerates the training of the network;
TotalLoss = α·MSELoss + β·CrossEntropyLoss, where MSELoss = (1/n)·Σᵢ(xᵢ − x̂ᵢ)² and CrossEntropyLoss = −(1/n)·ΣᵢΣⱼ yᵢⱼ·log(ŷᵢⱼ); xᵢ and x̂ᵢ respectively represent the predicted value and the label value of the regression task, ŷᵢⱼ and yᵢⱼ respectively represent the predicted value and the label value of the classification task, n represents the number of samples in one training batch, and m represents the number of classes of the auxiliary classification task; α = 1 and β = 0.5 are used in the present invention;
3-3, after the model is trained, the segment-level Mel spectrograms are input and the output of the penultimate layer of the modified Resnet50 network is taken as the feature; since the penultimate layer has 100 neurons, the feature is 100-dimensional.
5. The stroke rehabilitation assessment auxiliary analysis method based on voice multitask learning according to claim 1, characterized in that the step 4 is realized as follows:
4-1, the obtained 100-dimensional segment-level features are combined into utterance-level features in the time order in which the Mel spectrogram segments were cut, so each 4 s speech segment yields a 13 × 100-dimensional utterance-level feature after processing;
4-2, a three-layer LSTM network with 64 neurons per layer is used to predict from the input utterance-level features, with a dropout of 0.5 to reduce network overfitting;
4-3, training adopts the multi-task learning mechanism: the output layer of the LSTM regression task has one neuron and the output layer of the classification task has four neurons; the model applies hard parameter sharing, so the network layers before the two task output layers share parameters and only the output layers have task-specific parameters; the loss function TotalLoss used is the weighted superposition of the mean square loss function and the cross entropy loss function;
TotalLoss = α·MSELoss + β·CrossEntropyLoss, where MSELoss = (1/n)·Σᵢ(xᵢ − x̂ᵢ)² and CrossEntropyLoss = −(1/n)·ΣᵢΣⱼ yᵢⱼ·log(ŷᵢⱼ); xᵢ and x̂ᵢ respectively represent the predicted value and the label value of the regression task, ŷᵢⱼ and yᵢⱼ respectively represent the predicted value and the label value of the classification task, n represents the number of samples in one training batch, and m represents the number of classes of the auxiliary classification task; α = 1 and β = 0.5;
4-4, the output of the neuron in the output layer of the LSTM network regression task is the final prediction result;
4-5, the model is evaluated with the root mean square error (RMSE), the coefficient of determination (R-square) and the mean absolute error (MAE), calculated as follows: RMSE = √((1/n)·Σᵢ(yᵢ − ŷᵢ)²), MAE = (1/n)·Σᵢ|yᵢ − ŷᵢ|, R-square = 1 − Σᵢ(yᵢ − ŷᵢ)² / Σᵢ(yᵢ − ȳ)², where yᵢ is the true value, ŷᵢ the predicted value and ȳ the mean of the true values.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111665085.1A CN114141366B (en) | 2021-12-31 | 2021-12-31 | Auxiliary analysis method for cerebral apoplexy rehabilitation evaluation based on voice multitasking learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114141366A true CN114141366A (en) | 2022-03-04 |
CN114141366B CN114141366B (en) | 2024-03-26 |
Family
ID=80384123
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111665085.1A Active CN114141366B (en) | 2021-12-31 | 2021-12-31 | Auxiliary analysis method for cerebral apoplexy rehabilitation evaluation based on voice multitasking learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114141366B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3273387A1 (en) * | 2016-07-19 | 2018-01-24 | Siemens Healthcare GmbH | Medical image segmentation with a multi-task neural network system |
CN113436726A (en) * | 2021-06-29 | 2021-09-24 | 南开大学 | Automatic lung pathological sound analysis method based on multi-task classification |
WO2021203796A1 (en) * | 2020-04-09 | 2021-10-14 | 之江实验室 | Disease prognosis prediction system based on deep semi-supervised multi-task learning survival analysis |
CN113782184A (en) * | 2021-08-11 | 2021-12-10 | 杭州电子科技大学 | Cerebral apoplexy auxiliary evaluation system based on facial key point and feature pre-learning |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114882996A (en) * | 2022-03-17 | 2022-08-09 | 深圳大学 | Hepatocellular carcinoma CK19 and MVI prediction method based on multitask learning |
CN114882996B (en) * | 2022-03-17 | 2023-04-07 | 深圳大学 | Hepatocellular carcinoma CK19 and MVI prediction method based on multitask learning |
CN117219265A (en) * | 2023-10-07 | 2023-12-12 | 东北大学秦皇岛分校 | Multi-mode data analysis method, device, storage medium and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||