CN114141366A - Cerebral apoplexy rehabilitation assessment auxiliary analysis method based on voice multitask learning - Google Patents
- Publication number
- CN114141366A CN114141366A CN202111665085.1A CN202111665085A CN114141366A CN 114141366 A CN114141366 A CN 114141366A CN 202111665085 A CN202111665085 A CN 202111665085A CN 114141366 A CN114141366 A CN 114141366A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/66—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/30—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
Abstract
The invention discloses a stroke rehabilitation assessment auxiliary analysis method based on voice multitask learning. A multi-task learning model is provided in which the main task is a regression task that predicts a stroke voice-function-damage evaluation score and the auxiliary task is a classification task that grades the severity of the damage. The bottom-layer model consists of a feature extraction model based on a deep residual network (Resnet50) applied to Mel spectrograms and a time-series prediction model based on a long short-term memory network (LSTM); the top-layer model is a fully connected neural network for each of the main and auxiliary tasks. The loss function adopted is a weighted superposition of a mean-square-error loss and a cross-entropy loss. The multi-task learning mechanism adopted by the invention reduces the probability of model overfitting, effectively reduces prediction error, and makes the patient's current rehabilitation status clear from the predicted score.
Description
Technical Field
The invention belongs to the field of voice signal processing and intelligent medical auxiliary analysis, and relates to a stroke rehabilitation assessment auxiliary analysis method based on voice multitask learning with a deep residual network (Resnet50) and a long short-term memory network (LSTM).
Background
Stroke is an acute cerebrovascular disease: cranial nerve injury caused by rupture of a cerebral vessel, or by blood failing to reach the brain because a vessel is occluded, with high morbidity and disability rates. Surveys show that stroke is the leading cause of death among Chinese residents, and stroke deaths in China account for about one third of stroke deaths worldwide. Stroke patients typically show symptoms such as unclear articulation and incoherent speech, which seriously affect their normal lives.
Existing voice-based stroke detection methods fall mainly into two categories. Traditional methods are based on feature engineering: a pre-trained speech recognition model generates time-alignment information between audio and text, from which pronunciation accuracy and fluency features such as goodness of pronunciation and the number of syllables per unit time are computed. To these are added articulation-difficulty features extracted from the raw speech signal, such as jitter, shimmer, pitch period entropy, glottal entropy and signal-to-noise ratio, and a machine learning classifier performs the classification. Deep learning methods instead design a neural network that takes the raw speech signal or a speech time-frequency representation as input, so that the network automatically learns features related to speech impairment without complex hand-crafted feature computation. However, these methods tend to have the following disadvantages:
1. In the traditional method, because the patient's speech is unclear, the speech recognition model makes large errors in recognizing it and the extracted features lack robustness. The classifiers are weak, placing high demands on the representational power of the extracted features, so algorithms based on feature engineering cannot meet engineering requirements;
2. Existing deep learning methods mainly perform binary classification on the presence or absence of stroke; they cannot quantify data across different severity levels and therefore cannot give a reasonable assessment of the patient's current rehabilitation status.
To address these problems, the invention provides a multi-task learning model in which the main task is a regression task that predicts an evaluation score for stroke voice function damage, and the auxiliary task is a classification task that grades the severity of the damage. The bottom-layer model consists of a feature extraction model based on a deep residual network (Resnet50) applied to Mel spectrograms and a time-series prediction model based on a long short-term memory network (LSTM); the top-layer model is a fully connected neural network for each of the main and auxiliary tasks. The bottom-layer parameters are shared across tasks while the top-layer parameters are task-specific, and the single mean-square-error loss is replaced by a weighted superposition of the mean-square-error loss and the cross-entropy loss to train the network. The method reduces the probability of model overfitting; after the auxiliary classification task is added, abnormal predicted values are effectively suppressed and prediction accuracy improves.
Disclosure of Invention
Aiming at the shortcomings of existing voice-based stroke rehabilitation assessment algorithms, the invention provides a multitask-learning stroke rehabilitation assessment auxiliary analysis method. It adopts a multi-task learning mechanism whose main task is a regression task that predicts the stroke voice-function-damage evaluation score and whose auxiliary task classifies the severity of the damage. The mechanism automatically learns depth features of the voice Mel frequency spectrogram and predicts scores from time-series information, effectively suppressing prediction errors and thereby realizing automatic voice-based stroke rehabilitation assessment.
The technical scheme of the invention mainly comprises the following steps:
step 1, preprocessing the original voice signal and extracting segment-level Mel frequency spectrograms, as detailed in steps 1-1 to 1-4 below;
step 2, the labels of the existing data set are the physicians' evaluation scores of voice function damage; the existing data are divided into four severity grades according to evaluation-score intervals, which serve as labels for the auxiliary classification task;
step 3, for the segment-level Mel frequency spectrograms extracted in step 1, an improved Resnet50 deep convolutional neural network is used; with a hard parameter sharing mechanism, an auxiliary classification task that grades the severity of stroke voice function damage is added alongside the main regression task that predicts the damage score; pre-trained network weights are loaded, the labels from step 2 are attached, the loss function is modified, the model is trained, and 100-dimensional depth features are extracted;
and step 4, the 100-dimensional depth features of the segment-level Mel frequency spectrograms obtained in step 3 are assembled into utterance-level features in time order; a three-layer LSTM network with the same hard parameter sharing mechanism adds the auxiliary severity-classification task alongside the main score-regression task; the loss function is modified, the model is trained, and the final evaluation score of the voice function damage is obtained.
Further, the step 1 is specifically realized as follows:
1-1, the original voice signal is cut into fixed four-second lengths: the portion of a segment beyond four seconds is discarded, and segments shorter than four seconds are extended to four seconds by repeating their existing content;
1-2, passing the speech signal through a high-pass pre-emphasis filter H(z) = 1 − μ·z^(−1) to enhance the high-frequency part of the signal; then framing the signal with a frame length of 25 milliseconds and a frame shift of 10 milliseconds; then multiplying each frame by a Hamming window;
1-3, performing a fast Fourier transform on each framed and windowed frame to obtain its short-time amplitude spectrum, taking the squared modulus to obtain the power spectrum, and passing it through a Mel filter bank of 64 filters to obtain the Mel frequency spectrogram, where the Mel scale is defined by Mel(f) = 2595·log10(1 + f/700);
the final 4 seconds of audio is processed to obtain a mel frequency spectrum of 400 x 64 pixels;
1-4, the 400 x 64 pixel Mel frequency spectrogram is sliced with a window length of 64 pixels and a shift of 30 pixels to obtain static images of the Mel spectrogram; the first-order and second-order differences are then computed and superimposed with the static image to form a three-channel picture analogous to RGB; each final 4 s segment of audio thus yields 13 segment-level Mel frequency spectrograms of 64 x 64 pixels.
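The preprocessing pipeline of steps 1-1 to 1-4 can be sketched as follows. This is an illustrative NumPy reconstruction, not the patent's code; the 16 kHz sampling rate, 512-point FFT size and pre-emphasis coefficient μ = 0.97 are assumptions the patent does not state, and the exact number of 64 x 64 segments per utterance depends on padding conventions.

```python
import numpy as np

SR = 16000                  # assumed sampling rate (not given in the patent)
FRAME = int(0.025 * SR)     # 25 ms frame length -> 400 samples
HOP = int(0.010 * SR)       # 10 ms frame shift  -> 160 samples
N_MELS = 64
N_FFT = 512                 # assumed FFT size

def fix_length(y, seconds=4):
    """Step 1-1: clip to 4 s; repeat shorter clips until they reach 4 s."""
    n = seconds * SR
    if len(y) >= n:
        return y[:n]
    return np.tile(y, int(np.ceil(n / len(y))))[:n]

def pre_emphasis(y, mu=0.97):
    """Step 1-2: high-pass filter H(z) = 1 - mu * z^-1 (mu value assumed)."""
    return np.append(y[0], y[1:] - mu * y[:-1])

def mel_filterbank(n_mels=N_MELS, n_fft=N_FFT, sr=SR):
    """Step 1-3: triangular filters spaced on the Mel scale."""
    to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = to_hz(np.linspace(0.0, to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mel_spectrogram(y):
    """Steps 1-2/1-3: frame, Hamming window, FFT, power, Mel filter, log."""
    frames = np.lib.stride_tricks.sliding_window_view(y, FRAME)[::HOP]
    frames = frames * np.hamming(FRAME)
    power = np.abs(np.fft.rfft(frames, N_FFT)) ** 2
    return np.log(power @ mel_filterbank().T + 1e-10)   # (n_frames, 64)

def to_segments(spec, win=64, shift=30):
    """Step 1-4: 64-frame windows with shift 30; stack the static image with
    its first- and second-order differences as three RGB-like channels."""
    d1 = np.gradient(spec, axis=0)
    d2 = np.gradient(d1, axis=0)
    starts = range(0, spec.shape[0] - win + 1, shift)
    return np.stack([np.stack([a[s:s + win] for a in (spec, d1, d2)], axis=-1)
                     for s in starts])                   # (n_seg, 64, 64, 3)
```

Each returned segment is a 64 x 64 x 3 array ready to be fed to an image network such as Resnet50.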
Further, the step 2 is specifically realized as follows:
2-1, samples with evaluation scores in the interval 85-100 are labeled mild, those in 75-84 moderate, those in 65-74 severe, and those in 60-64 very severe.
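The interval-to-class mapping of step 2-1 can be written directly; the string class names are an illustrative encoding, not something the patent prescribes:

```python
def severity_label(score):
    """Map a physician's 0-100 voice-function-damage evaluation score to one
    of the four severity classes of step 2-1."""
    if score >= 85:            # interval 85-100
        return "mild"
    if score >= 75:            # interval 75-84
        return "moderate"
    if score >= 65:            # interval 65-74
        return "severe"
    return "very severe"       # interval 60-64
```

Applied to the embodiment's example scores (78, 91, 87, ...), this reproduces the four-way labeling used for the auxiliary classification task.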
Further, the step 3 is specifically realized as follows:
the 3-1 improved Resnet50 network structure is as follows: the Resnet50 network output layer originally used for 1000 classes of ImageNet has 1000 neurons, and the neurons are modified into 100 neurons; then adding respective network output layers for the two tasks respectively, wherein the output layer of the regression task is 1 neuron, and the output layer of the classification task is 4 neurons;
3-2, training by adopting a multi-task learning mechanism, wherein the model applies a hard parameter sharing mechanism, namely a network layer before two task output layers shares parameters, and only the output layers correspond to respective network parameters; the regression task corresponds to a mean-square loss function MSELoss, and the classification task corresponds to a cross entropy loss function CrossEntropyLoss, so that the used loss function TotalLoss is the weighted superposition of the mean-square loss function and the cross entropy loss function; by loading the weight parameters of the pre-trained Resnet50 network in a transfer learning mode, the training speed of the network can be effectively accelerated;
the total loss is TotalLoss = α·MSELoss + β·CrossEntropyLoss, where MSELoss = (1/n)·Σ_{i=1..n}(x_i − x̂_i)^2 and CrossEntropyLoss = −(1/n)·Σ_{i=1..n}Σ_{j=1..m} y_ij·log(ŷ_ij); x_i and x̂_i respectively represent the predicted value and the label value corresponding to the regression task, ŷ_ij and y_ij respectively represent a predicted value and a label value of the classification task, n represents the number of samples in one training batch, and m represents the number of classes of the auxiliary classification task; the invention uses α = 1 and β = 0.5;
3-3, inputting a segment level Mel frequency spectrogram after the model is trained, and taking the output of the penultimate layer of the modified Resnet50 network as a characteristic; since the penultimate layer has 100 neurons, the feature dimension is 100 dimensions.
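A minimal PyTorch sketch of the hard-parameter-sharing arrangement of steps 3-1 to 3-2: a shared trunk standing in for the modified Resnet50 (its 1000-neuron ImageNet output layer replaced by a 100-neuron layer), two task-specific output layers, and the weighted loss with α = 1, β = 0.5. This is an illustrative reconstruction, not the patent's code; in practice the trunk would be `torchvision.models.resnet50` loaded with pre-trained weights via transfer learning.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskNet(nn.Module):
    """Hard parameter sharing: every layer before the two output layers is
    shared; only the output layers are task-specific (step 3-2)."""
    def __init__(self, feat_dim=2048):
        super().__init__()
        # Stand-in for Resnet50 with its output layer cut down to 100 neurons.
        self.shared = nn.Linear(feat_dim, 100)
        self.reg_head = nn.Linear(100, 1)   # regression: impairment score
        self.cls_head = nn.Linear(100, 4)   # classification: 4 severity levels

    def forward(self, x):
        h = F.relu(self.shared(x))          # the 100-dim activation of step 3-3
        return self.reg_head(h), self.cls_head(h)

def total_loss(reg_out, cls_out, score, label, alpha=1.0, beta=0.5):
    """TotalLoss = alpha * MSELoss + beta * CrossEntropyLoss (step 3-2)."""
    return alpha * F.mse_loss(reg_out.squeeze(1), score) + \
           beta * F.cross_entropy(cls_out, label)
```

The 100-dimensional hidden activation before the heads corresponds to the penultimate-layer output that step 3-3 extracts as the segment-level depth feature.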
Further, the step 4 is specifically realized as follows:
4-1, combining the obtained 100-dimensional segment level features into utterance level features according to a time sequence intercepted by a Mel frequency spectrogram, and processing each 4s voice segment to obtain 13 x 100-dimensional utterance level features;
4-2, a three-layer LSTM network predicts from the input utterance-level features, with 64 neurons per layer and dropout of 0.5 to reduce network overfitting;
4-3, training by adopting a multi-task learning mechanism, wherein one neuron is arranged at an output layer of an LSTM regression task, four neurons are arranged at an output layer of a classification task, a hard parameter sharing mechanism is applied to a model, and network layers in front of the output layers of the two tasks share parameters, and only the output layers correspond to respective network parameters; the used loss function TotalLoss is the weighted superposition of a mean square loss function and a cross entropy loss function;
the total loss is TotalLoss = α·MSELoss + β·CrossEntropyLoss, where MSELoss = (1/n)·Σ_{i=1..n}(x_i − x̂_i)^2 and CrossEntropyLoss = −(1/n)·Σ_{i=1..n}Σ_{j=1..m} y_ij·log(ŷ_ij); x_i and x̂_i respectively represent the predicted value and the label value corresponding to the regression task, ŷ_ij and y_ij respectively represent a predicted value and a label value of the classification task, n represents the number of samples in one training batch, and m represents the number of classes of the auxiliary classification task; here α = 1 and β = 0.5;
4-4, the output of the neuron in the LSTM network's regression-task output layer is the final prediction result;
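The utterance-level predictor of steps 4-1 to 4-4 can be sketched the same way, again as an assumed PyTorch reconstruction rather than the patent's code: a three-layer, 64-unit LSTM with dropout 0.5 over the 13 x 100 feature sequence, with one regression neuron and four classification neurons sharing all layers before the heads.

```python
import torch
import torch.nn as nn

class LSTMMultiTask(nn.Module):
    """Three-layer LSTM (64 neurons per layer, dropout 0.5) with hard
    parameter sharing and two task-specific heads (steps 4-2/4-3)."""
    def __init__(self, feat_dim=100, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=3,
                            dropout=0.5, batch_first=True)
        self.reg_head = nn.Linear(hidden, 1)   # final evaluation score (4-4)
        self.cls_head = nn.Linear(hidden, 4)   # severity class

    def forward(self, x):                      # x: (batch, 13, 100)
        out, _ = self.lstm(x)
        h = out[:, -1]                         # hidden state at the last step
        return self.reg_head(h), self.cls_head(h)
```

Training would reuse the same weighted TotalLoss as in step 3; at inference, the single regression neuron gives the final rehabilitation score.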
the evaluation indexes of the 4-5 model adopt root mean square error RMSE, a decision coefficient R-square and average absolute error MAE, and the calculation formulas of all parameters are as follows:
RMSE = sqrt((1/n)·Σ_i (y_i − ŷ_i)^2), MAE = (1/n)·Σ_i |y_i − ŷ_i|, R² = 1 − Σ_i (y_i − ŷ_i)^2 / Σ_i (y_i − y_mean)^2, where y_i denotes the true value of a sample, ŷ_i denotes its predicted value, and y_mean denotes the mean of the true values of all samples. RMSE is the square root of the mean squared error between the fitted and original data; the smaller the value, the better the fit. MAE averages the absolute differences between predicted and true values for each sample to measure how close the predictions are to the real data; again, the smaller the value, the better the fit. R² lies between 0 and 1: the closer to 1, the better the model's prediction, and the closer to 0, the worse.
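The three evaluation indices of step 4-5 are standard and can be computed directly; a plain NumPy illustration:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error: smaller means a better fit."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: smaller means predictions closer to the truth."""
    return float(np.mean(np.abs(y_true - y_pred)))

def r_square(y_true, y_pred):
    """Coefficient of determination: the closer to 1, the better."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)
```

A perfect prediction gives RMSE = 0, MAE = 0 and R² = 1; a constant offset of 0.1 in every prediction gives RMSE = MAE = 0.1.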
The invention has the following beneficial effects:
the stroke rehabilitation assessment auxiliary analysis method based on the voice multitask learning has the advantages that: 1) the standard Mel cepstral coefficients (MFCCs) reflect only the static characteristics of the speech parameters, and the dynamic characteristics of speech can be described by the difference spectrum of these static characteristics. The adopted Mel frequency spectrogram can combine dynamic and static characteristics to effectively improve the identification performance of the system. 2) Compared with the manually designed features related to the voice function damage, the depth features of the Mel frequency spectrum map extracted by using Resnet have better generalization performance and certain robustness to noise. 3) For single-task learning, the error of partial samples in the prediction result of a single regression network is large, and the samples with large prediction errors can be punished after the auxiliary classification task is added, so that the precision is effectively improved.
The invention can establish an accurate and efficient voice-based rehabilitation assessment and diagnosis framework for stroke patients, overcoming the shortcoming that existing rehabilitation assessment relies solely on manual work and lacks scientific rigor and objectivity. It can substantially assist physicians' diagnoses and is expected to reduce the burden on medical staff and improve medical efficiency.
Drawings
FIG. 1: flow chart of the invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
As shown in fig. 1, the implementation steps of the stroke rehabilitation assessment auxiliary analysis method based on voice multitask learning are described in detail in the summary of the invention. The main innovations are: (1) standard Mel-frequency cepstral coefficients (MFCCs) reflect only the static characteristics of the speech parameters, while the dynamic characteristics of speech can be described by the difference spectra of these static features; the Mel frequency spectrogram adopted by the invention combines dynamic and static characteristics and effectively improves the recognition performance of the system; (2) the Resnet50 network overcomes the performance degradation that accompanies increasing network depth and can mine deep features of the voice signal, while the LSTM network learns relations across the feature-sequence context and better models symptoms such as the unclear and incoherent speech of stroke patients; (3) under single-task learning, some samples in the predictions of a lone regression network have large errors; adding the auxiliary classification task penalizes samples with large prediction errors, effectively reducing prediction error and improving model accuracy.
The technical scheme of the invention mainly comprises the following steps:
Step 1, the original voice signal is preprocessed and segment-level Mel frequency spectrograms are extracted, as detailed in steps 1-1 to 1-4 below.
Step 2, the labels of the existing data set are the physicians' evaluation scores of voice function damage; the existing data are divided into four severity grades according to evaluation-score intervals, which serve as labels for the auxiliary classification task.
Step 3, for the segment-level Mel frequency spectrograms extracted in step 1, a Resnet50 deep convolutional neural network is used; with a hard parameter sharing mechanism, an auxiliary classification task that grades the severity of stroke voice function damage is added alongside the main regression task that predicts the damage score; pre-trained network weights are loaded, the labels from step 2 are attached, the loss function is modified, the model is trained, and 100-dimensional depth features are extracted.
Step 4, the 100-dimensional depth features of the segment-level Mel frequency spectrograms obtained in step 3 are assembled into utterance-level features in time order; a three-layer LSTM network with the same hard parameter sharing mechanism adds the auxiliary severity-classification task alongside the main score-regression task; the loss function is modified, the model is trained, and the final evaluation score of the voice function damage is obtained.
The specific implementation of the step 1 is as follows:
1-1, the original voice signal is cut into fixed four-second lengths: the portion of a segment beyond four seconds is discarded, and segments shorter than four seconds are extended to four seconds by repeating their existing content.
1-2, the speech signal is passed through a high-pass pre-emphasis filter H(z) = 1 − μ·z^(−1) to enhance its high-frequency part. The signal is then framed with a frame length of 25 milliseconds and a frame shift of 10 milliseconds, and each frame is multiplied by a Hamming window.
1-3, a fast Fourier transform is applied to each framed and windowed frame to obtain its short-time amplitude spectrum; the squared modulus gives the power spectrum, which is passed through a Mel filter bank of 64 filters to obtain the Mel frequency spectrogram, where the Mel scale is defined by Mel(f) = 2595·log10(1 + f/700).
the final 4 seconds of speech signal was processed to obtain a mel spectrum of 400 x 64 pixels.
1-4, the obtained 400 x 64 pixel Mel frequency spectrogram is sliced with a window length of 64 pixels and a shift of 30 pixels to obtain static images of the Mel spectrogram; the first-order and second-order differences are then computed and superimposed with the static image to form a three-channel picture analogous to RGB. Each final 4 s segment of audio thus yields 13 segment-level Mel frequency spectrograms of 64 x 64 pixels.
The step 2 is specifically realized as follows:
2-1, there are eleven sets of data with aphasia evaluation scores of 78, 91, 87, 68, 91, 80, 92, 74, 71, 81 and 61; the scores range from 0 to 100 and, for convenience of model training, are mapped to decimals between 0 and 1: 0.78, 0.91, 0.87, 0.68, 0.91, 0.80, 0.92, 0.74, 0.71, 0.81, 0.61. Severity classes are then assigned by interval: 0.91, 0.87, 0.91 and 0.92 are mild; 0.78, 0.80 and 0.81 are moderate; 0.74, 0.71 and 0.68 are severe; 0.61 is very severe. These four classes serve as the labels of the classification task.
The specific implementation of step 3 is as follows:
3-1, modifying a network structure of Resnet50, wherein the Resnet50 network output layers originally used for 1000 class classifications of ImageNet have 1000 neurons in total, modifying the neurons into 100 neurons, and then adding respective network output layers for two tasks respectively, wherein the output layer of a regression task is 1 neuron, and the output layer of a classification task is 4 neurons.
3-2, training adopts a multi-task learning mechanism with hard parameter sharing: the network layers before the two task output layers share parameters, and only the output layers have task-specific parameters. Since the regression task uses a mean square loss function (MSELoss) and the classification task uses a cross entropy loss function (CrossEntropyLoss), the loss function TotalLoss used is a weighted superposition of the two. Loading the weight parameters of the pre-trained model by transfer learning effectively accelerates network training.
The total loss is TotalLoss = α·MSELoss + β·CrossEntropyLoss, where MSELoss = (1/n)·Σ_{i=1..n}(x_i − x̂_i)^2 and CrossEntropyLoss = −(1/n)·Σ_{i=1..n}Σ_{j=1..m} y_ij·log(ŷ_ij); x_i and x̂_i respectively represent the predicted value and the label value corresponding to the regression task, ŷ_ij and y_ij respectively represent a predicted value and a label value of the classification task, n represents the number of samples in one training batch, and m represents the number of classes of the auxiliary classification task; the invention uses α = 1 and β = 0.5;
and after the 3-3 model is trained, inputting the segment level Mel frequency spectrogram, and extracting the output of the penultimate layer of the modified Resnet50 network as the characteristic. Since the penultimate layer has 100 neurons, the feature dimension is 100 dimensions.
The specific implementation of the step 4 is as follows:
4-1, the obtained 100-dimensional segment-level features are combined into utterance-level features in the time order in which the Mel spectrogram segments were cut, so each 4 s speech segment yields a 13 × 100-dimensional utterance-level feature after processing.
4-2, a three-layer LSTM network with 64 neurons per layer is used to predict from the input utterance-level features, with a dropout of 0.5 to reduce network overfitting.
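A sketch of this recurrent predictor, again assuming PyTorch (the class name and the use of the final time step's output are assumptions, as the text does not specify how the sequence output is reduced):

```python
import torch
import torch.nn as nn

class MultiTaskLSTM(nn.Module):
    """Three-layer LSTM (64 neurons per layer, dropout 0.5) with one
    regression head and one four-way classification head sharing the
    recurrent layers (hard parameter sharing)."""
    def __init__(self, feat_dim=100):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, 64, num_layers=3,
                            batch_first=True, dropout=0.5)
        self.reg_head = nn.Linear(64, 1)   # regression output: one neuron
        self.cls_head = nn.Linear(64, 4)   # classification output: four neurons

    def forward(self, x):                  # x: (batch, 13, 100) utterance-level features
        out, _ = self.lstm(x)
        last = out[:, -1, :]               # state after the final time step
        return self.reg_head(last), self.cls_head(last)
```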
4-3, training again adopts the multi-task learning mechanism: the output layer of the LSTM regression task has one neuron and the output layer of the classification task has four neurons; the model applies hard parameter sharing, so the network layers before the two task output layers share parameters and only the output layers have task-specific parameters. The loss function TotalLoss used is the weighted superposition of the mean square loss function and the cross entropy loss function.
TotalLoss = α·MSELoss + β·CrossEntropyLoss, where MSELoss = (1/n)·Σᵢ(xᵢ − x̂ᵢ)² and CrossEntropyLoss = −(1/n)·ΣᵢΣⱼ yᵢⱼ·log(ŷᵢⱼ). Here xᵢ and x̂ᵢ respectively represent the predicted value and the label value of the regression task, ŷᵢⱼ and yᵢⱼ respectively represent the predicted value and the label value of the classification task, n represents the number of samples in one training batch, and m represents the number of classes of the auxiliary classification task; α = 1 and β = 0.5.
4-4, the output of the neuron in the output layer of the LSTM network regression task is the final prediction result;
4-5, the model is evaluated with the root mean square error (RMSE), the coefficient of determination (R-square) and the mean absolute error (MAE), calculated as follows: RMSE = √((1/n)·Σᵢ(yᵢ − ŷᵢ)²), MAE = (1/n)·Σᵢ|yᵢ − ŷᵢ|, R-square = 1 − Σᵢ(yᵢ − ŷᵢ)² / Σᵢ(yᵢ − ȳ)², where yᵢ is the true value, ŷᵢ the predicted value and ȳ the mean of the true values.
the RMSE calculates the mean value re-evolution of the square sum of the error of the corresponding sample points of the fitting data and the original data, and the smaller the value, the better the fitting effect. And calculating the absolute value of the difference between the predicted value and the real value of each sample by the MAE, and then summing and averaging the absolute values to evaluate the closeness degree of the prediction result and the real data set, wherein the smaller the value is, the better the fitting effect is. The R-square is between 0 and 1, the closer to 1, the better the prediction effect of the model is, and the closer to 0, the worse the prediction effect of the model is.
The invention also provides a stroke rehabilitation assessment auxiliary analysis system based on voice multitask learning, which comprises a data preprocessing module, a voice function damage level module, an improved Resnet50 network model and an improved three-layer LSTM network model.
The data preprocessing module is specifically realized as follows: intercepting input voice data into a fixed length of 4 seconds, performing pre-emphasis, framing and windowing on the voice signals, performing short-time Fourier transform on each frame of signal, and obtaining a Mel frequency spectrogram through a Mel filter bank; then intercepting the Mel spectrogram according to a frame length of 64 frames and a frame shift of 30 frames to obtain a static fragment level Mel spectrum, calculating a first order difference and a second order difference of the static fragment level Mel spectrum, and overlapping the static fragment level Mel spectrum, the first order difference and the second order difference to finally obtain a 64 x 64 pixel fragment level Mel spectrum;
the voice function damage level module is specifically realized as follows: the label of the existing data set is an evaluation score of a doctor on the voice function damage, and the existing data is divided into four severity grades according to the interval of the evaluation score to be used as a label of an auxiliary classification task;
the improved Resnet50 network model is implemented as follows: for the segment level Mel frequency spectrogram extracted by the data preprocessing module, an improved Resnet50 deep convolution neural network is used, and a hard parameter sharing mechanism is utilized, and an auxiliary classification task for classifying the severity of stroke voice function damage is added on the basis that a main task is a regression task for predicting the stroke voice function damage score; adding a label of a voice function damage level module by using a pre-training network weight, modifying a loss function, training a model, and extracting a 100-dimensional depth feature;
the improved three-layer LSTM network model is specifically realized as follows: the method comprises the steps of forming speech level features by using 100-dimensional depth features of a segment level Mel frequency spectrum diagram obtained by an improved Resnet50 network model according to a time sequence, adopting a three-layer LSTM network, adding an auxiliary classification task for classifying the severity of stroke voice function damage on the basis of a regression task with a main task of stroke voice function damage score prediction by using a hard parameter sharing mechanism, modifying a loss function, training the model, and finally obtaining an evaluation score of the voice function damage.
In order to achieve a better stroke speech rehabilitation assessment and prediction effect, the following aspects of parameter selection and design in practical application are introduced as a reference for other applications of the invention:
the invention adopts fixed 4 s voice data only to facilitate model training; once the model is trained, voice data of any length can be processed in practical application.
In practical application, the obtained voice data is processed through step 1 to extract the Mel spectrogram, which after segmentation yields N segment-level Mel spectrograms of 64 × 64 pixels. Segment-level features are then obtained through the Resnet feature extraction module of step 3 and stacked in time order to form an N × 100-dimensional utterance-level feature. Finally, the three-layer LSTM processes the N time steps to obtain the final score.
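The segment count N follows from the windowing parameters of step 1. One way to arrive at the 13 segments quoted for a 4 s clip (a 400-frame spectrogram) is to count a trailing partial window, which is an assumption here since the text does not spell out the boundary handling:

```python
import math

def num_segments(n_frames, win=64, hop=30):
    """Number of segment-level spectrograms cut from an n_frames-long mel
    spectrogram with a 64-frame window and 30-frame shift. A trailing
    partial window is counted (assumed padded), which reproduces the
    13 segments quoted for 400 frames."""
    return math.ceil((n_frames - win) / hop) + 1

print(num_segments(400))  # 13
```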
In the invention, when only the regression task is used, the evaluation indexes of the model are RMSE = 0.036, MAE = 0.027 and R-square = 0.778; comparing the predicted values with the actual true values shows that some samples have large prediction errors. With multi-task learning, the evaluation indexes are RMSE = 0.029, MAE = 0.022 and R-square = 0.837, and the number of samples with large prediction errors is significantly reduced. In conclusion, the voice-based multi-task learning rehabilitation assessment auxiliary analysis method for stroke patients can provide scientific and objective results for voice rehabilitation assessment work, addressing the current reliance of rehabilitation assessment on manual evaluation alone.
Claims (5)
1. The stroke rehabilitation assessment auxiliary analysis method based on the voice multitask learning is characterized by comprising the following steps of:
step 1, intercepting input voice data into a fixed length of 4 seconds, performing pre-emphasis, framing and windowing on the voice signal, performing short-time Fourier transform on each frame of signal, and obtaining a Mel spectrogram through a Mel filter bank; then intercepting the Mel spectrogram according to a frame length of 64 frames and a frame shift of 30 frames to obtain a static fragment level Mel spectrum, calculating a first order difference and a second order difference of the static fragment level Mel spectrum, and overlapping the static fragment level Mel spectrum, the first order difference and the second order difference to finally obtain a 64 x 64 pixel fragment level Mel spectrum;
step 2, the labels of the existing data sets are evaluation scores of the voice function damage by doctors, and the existing data are divided into four severity grades according to the evaluation score intervals to be used as labels of auxiliary classification tasks;
step 3, for the segment level Mel frequency spectrogram extracted in the step 1, an improved Resnet50 deep convolution neural network is used, and an auxiliary classification task for classifying the severity of stroke voice function damage is added on the basis that a main task is a regression task for predicting the stroke voice function damage score by using a hard parameter sharing mechanism; adding the label in the step 2 by using the pre-training network weight, modifying the loss function, training the model, and extracting 100-dimensional depth features;
and 4, forming the 100-dimensional depth features of the segment-level Mel frequency spectrogram obtained in the step 3 into speech-level features according to a time sequence, adopting a three-layer LSTM network, utilizing a hard parameter sharing mechanism, adding an auxiliary classification task for classifying the severity of the stroke speech function damage on the basis that a main task is a regression task for predicting the stroke speech function damage score, modifying a loss function, training a model, and finally obtaining the evaluation score of the speech function damage.
2. The stroke rehabilitation assessment auxiliary analysis method based on voice multitask learning according to claim 1, characterized in that the step 1 is realized as follows:
1-1, the original voice signal is cut to a fixed length of four seconds: the part of a segment exceeding four seconds is discarded, and segments shorter than four seconds are padded to four seconds by copying the existing segment;
1-2, the speech signal is passed through a high-pass filter H(z) = 1 − μz⁻¹ to enhance the high-frequency part of the signal; the signal is then framed with a frame length of 25 milliseconds and a frame shift of 10 milliseconds; each frame is then multiplied by a Hamming window;
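This front end can be sketched with NumPy (assumed library; μ = 0.97 and the 16 kHz sample rate are assumed typical values not stated in the text):

```python
import numpy as np

def preemphasis_and_frame(signal, sr=16000, mu=0.97,
                          frame_ms=25, shift_ms=10):
    """Apply pre-emphasis H(z) = 1 - mu*z^-1, then cut into 25 ms frames
    with a 10 ms shift and apply a Hamming window per frame."""
    emphasized = np.append(signal[0], signal[1:] - mu * signal[:-1])
    frame_len = int(sr * frame_ms / 1000)
    frame_shift = int(sr * shift_ms / 1000)
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    idx = (np.arange(frame_len)[None, :] +
           frame_shift * np.arange(n_frames)[:, None])
    return emphasized[idx] * np.hamming(frame_len)

frames = preemphasis_and_frame(np.random.randn(16000 * 4))
print(frames.shape)  # (398, 400) for 4 s at 16 kHz; padding to 400 frames, as in the text, is not shown
```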
1-3, a fast Fourier transform is performed on each framed and windowed frame signal to obtain its short-time amplitude spectrum; the squared magnitude of the amplitude spectrum is then passed through a Mel filter bank with 64 filters, where the Mel scale is defined by Mel(f) = 2595·log₁₀(1 + f/700);
the final 4 seconds of audio is thus processed into a Mel spectrogram of 400 × 64 pixels;
1-4, the 400 × 64 pixel Mel spectrogram is cut with a window length of 64 pixels and a shift of 30 pixels to obtain static images of the Mel spectrogram; the first-order and second-order differences are then calculated and superposed with the static image to form a picture similar to the three RGB channels; each 4 s segment of audio is thus cut into a total of 13 segment-level Mel spectrograms of 64 × 64 pixels.
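The channel stacking of step 1-4 can be sketched as follows (NumPy assumed; `np.gradient` stands in for the delta computation, whose exact formula the text does not give):

```python
import numpy as np

def to_segments(mel, win=64, hop=30):
    """Cut a (frames, 64) mel spectrogram into 64-frame windows with a
    30-frame shift; stack the static spectrum with its first- and
    second-order time differences as RGB-like channels."""
    delta1 = np.gradient(mel, axis=0)
    delta2 = np.gradient(delta1, axis=0)
    stacked = np.stack([mel, delta1, delta2], axis=-1)   # (frames, 64, 3)
    segments = [stacked[s:s + win]
                for s in range(0, mel.shape[0] - win + 1, hop)]
    return np.array(segments)

segs = to_segments(np.random.randn(400, 64))
print(segs.shape)  # (12, 64, 64, 3); counting a final padded window would give the 13 quoted in the text
```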
3. The stroke rehabilitation assessment auxiliary analysis method based on voice multitask learning according to claim 1 or 2, characterized in that the step 2 is realized as follows:
2-1, samples with an evaluation score in the interval 85-100 are set as the mild type, samples in the interval 75-84 as the medium type, samples in the interval 65-74 as the severe type, and samples in the interval 60-64 as the very severe type.
4. The stroke rehabilitation assessment auxiliary analysis method based on voice multitask learning according to claim 1, characterized in that the step 3 is realized as follows:
3-1, the improved Resnet50 network structure is as follows: the Resnet50 output layer originally used for the 1000-class ImageNet classification has 1000 neurons and is modified to 100 neurons; then a separate output layer is added for each of the two tasks, with 1 neuron for the regression task and 4 neurons for the classification task;
3-2, training adopts a multi-task learning mechanism; the model applies hard parameter sharing, that is, the network layers before the two task output layers share parameters, and only the output layers have their own task-specific parameters; the regression task uses the mean square loss function MSELoss and the classification task uses the cross entropy loss function CrossEntropyLoss, so the loss function TotalLoss used is the weighted superposition of the mean square loss function and the cross entropy loss function; loading the weight parameters of the pre-trained Resnet50 network by transfer learning effectively accelerates the training of the network;
TotalLoss = α·MSELoss + β·CrossEntropyLoss, where MSELoss = (1/n)·Σᵢ(xᵢ − x̂ᵢ)² and CrossEntropyLoss = −(1/n)·ΣᵢΣⱼ yᵢⱼ·log(ŷᵢⱼ); xᵢ and x̂ᵢ respectively represent the predicted value and the label value of the regression task, ŷᵢⱼ and yᵢⱼ respectively represent the predicted value and the label value of the classification task, n represents the number of samples in one training batch, and m represents the number of classes of the auxiliary classification task; α = 1 and β = 0.5 are used in the present invention;
3-3, after the model is trained, the segment-level Mel spectrograms are input and the output of the penultimate layer of the modified Resnet50 network is taken as the feature; since the penultimate layer has 100 neurons, the feature is 100-dimensional.
5. The stroke rehabilitation assessment auxiliary analysis method based on voice multitask learning according to claim 1, characterized in that the step 4 is realized as follows:
4-1, the obtained 100-dimensional segment-level features are combined into utterance-level features in the time order in which the Mel spectrogram segments were cut, so each 4 s speech segment yields a 13 × 100-dimensional utterance-level feature after processing;
4-2, a three-layer LSTM network with 64 neurons per layer is used to predict from the input utterance-level features, with a dropout of 0.5 to reduce network overfitting;
4-3, training adopts the multi-task learning mechanism: the output layer of the LSTM regression task has one neuron and the output layer of the classification task has four neurons; the model applies hard parameter sharing, so the network layers before the two task output layers share parameters and only the output layers have task-specific parameters; the loss function TotalLoss used is the weighted superposition of the mean square loss function and the cross entropy loss function;
TotalLoss = α·MSELoss + β·CrossEntropyLoss, where MSELoss = (1/n)·Σᵢ(xᵢ − x̂ᵢ)² and CrossEntropyLoss = −(1/n)·ΣᵢΣⱼ yᵢⱼ·log(ŷᵢⱼ); xᵢ and x̂ᵢ respectively represent the predicted value and the label value of the regression task, ŷᵢⱼ and yᵢⱼ respectively represent the predicted value and the label value of the classification task, n represents the number of samples in one training batch, and m represents the number of classes of the auxiliary classification task; α = 1 and β = 0.5;
4-4, the output of the neuron in the output layer of the LSTM network regression task is the final prediction result;
4-5, the model is evaluated with the root mean square error (RMSE), the coefficient of determination (R-square) and the mean absolute error (MAE), calculated as follows: RMSE = √((1/n)·Σᵢ(yᵢ − ŷᵢ)²), MAE = (1/n)·Σᵢ|yᵢ − ŷᵢ|, R-square = 1 − Σᵢ(yᵢ − ŷᵢ)² / Σᵢ(yᵢ − ȳ)², where yᵢ is the true value, ŷᵢ the predicted value and ȳ the mean of the true values.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111665085.1A CN114141366B (en) | 2021-12-31 | 2021-12-31 | Auxiliary analysis method for cerebral apoplexy rehabilitation evaluation based on voice multitasking learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114141366A true CN114141366A (en) | 2022-03-04 |
CN114141366B CN114141366B (en) | 2024-03-26 |
Family
ID=80384123
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111665085.1A Active CN114141366B (en) | 2021-12-31 | 2021-12-31 | Auxiliary analysis method for cerebral apoplexy rehabilitation evaluation based on voice multitasking learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114141366B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3273387A1 (en) * | 2016-07-19 | 2018-01-24 | Siemens Healthcare GmbH | Medical image segmentation with a multi-task neural network system |
CN113436726A (en) * | 2021-06-29 | 2021-09-24 | 南开大学 | Automatic lung pathological sound analysis method based on multi-task classification |
WO2021203796A1 (en) * | 2020-04-09 | 2021-10-14 | 之江实验室 | Disease prognosis prediction system based on deep semi-supervised multi-task learning survival analysis |
CN113782184A (en) * | 2021-08-11 | 2021-12-10 | 杭州电子科技大学 | Cerebral apoplexy auxiliary evaluation system based on facial key point and feature pre-learning |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114882996A (en) * | 2022-03-17 | 2022-08-09 | 深圳大学 | Hepatocellular carcinoma CK19 and MVI prediction method based on multitask learning |
CN114882996B (en) * | 2022-03-17 | 2023-04-07 | 深圳大学 | Hepatocellular carcinoma CK19 and MVI prediction method based on multitask learning |
CN117219265A (en) * | 2023-10-07 | 2023-12-12 | 东北大学秦皇岛分校 | Multi-mode data analysis method, device, storage medium and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||