CN114141366A - Cerebral apoplexy rehabilitation assessment auxiliary analysis method based on voice multitask learning - Google Patents


Info

Publication number
CN114141366A
CN114141366A
Authority
CN
China
Prior art keywords
task
mel
voice
network
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111665085.1A
Other languages
Chinese (zh)
Other versions
CN114141366B (en)
Inventor
曹九稳
葛宇
王天磊
赖晓平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202111665085.1A priority Critical patent/CN114141366B/en
Publication of CN114141366A publication Critical patent/CN114141366A/en
Application granted granted Critical
Publication of CN114141366B publication Critical patent/CN114141366B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 50/00 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H 50/20 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/66 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 50/00 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H 50/30 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Public Health (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Epidemiology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Pathology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Primary Health Care (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a stroke rehabilitation assessment auxiliary analysis method based on voice multitask learning. A multi-task learning model is proposed whose main task is a regression task that predicts an assessment score for stroke voice-function impairment and whose auxiliary task is a classification task that grades the severity of that impairment. The bottom-layer model is a feature extraction model based on a deep residual network (ResNet50) applied to Mel spectrograms together with a temporal prediction model based on a long short-term memory (LSTM) network; the top-layer model is a fully connected neural network for each of the main and auxiliary tasks. The loss function is a weighted sum of a mean-square-error loss and a cross-entropy loss. The multi-task learning mechanism adopted by the invention reduces the probability of model overfitting, effectively reduces the prediction error, and the predicted score gives a clear view of the patient's current rehabilitation status.

Description

Cerebral apoplexy rehabilitation assessment auxiliary analysis method based on voice multitask learning
Technical Field
The invention belongs to the fields of speech signal processing and intelligent medical auxiliary analysis, and relates to a stroke rehabilitation assessment auxiliary analysis method based on voice multitask learning with a deep residual network (ResNet50) and a long short-term memory (LSTM) network.
Background
Stroke is an acute cerebrovascular disease: a cranial nerve injury caused by the rupture of cerebral blood vessels, or by blood failing to reach the brain because of vessel occlusion, with high morbidity and disability rates. Surveys show that stroke is the leading cause of death among residents of China, and stroke deaths in China account for roughly one third of stroke deaths worldwide. Stroke patients typically show symptoms such as unclear articulation and incoherent speech, which severely affect their normal life.
Existing voice-based stroke detection methods fall into two categories. Traditional methods rely mainly on feature engineering: a pre-trained speech recognition model produces time-alignment information between audio and text, from which pronunciation accuracy and fluency features are computed, such as goodness of pronunciation and the number of syllables per unit time. These are supplemented with dysarthria-related features extracted from the raw speech signal, such as jitter, shimmer, pitch-period entropy, glottal entropy and signal-to-noise ratio, and a machine learning classifier performs the final classification. Deep learning methods instead design a neural network that takes the raw speech signal or a speech time-frequency representation as input, so that the network automatically learns features related to impaired speech without complex feature computation. These methods, however, tend to have the following shortcomings:
1. In traditional methods, the speech recognition model makes large errors when recognizing a patient's unclear speech, so the extracted features lack robustness. The classifiers are weak and place high demands on the representational power of the extracted features, so feature-engineering-based algorithms cannot meet engineering requirements;
2. Existing deep learning methods mainly perform binary classification on the presence or absence of stroke; data of different severity levels cannot be quantified, so no reasonable assessment of a patient's current rehabilitation status can be given.
To address these problems, the invention proposes a multi-task learning model whose main task is a regression task that predicts an assessment score for stroke voice-function impairment and whose auxiliary task is a classification task that grades impairment severity. The bottom-layer model is a feature extraction model based on a deep residual network (ResNet50) applied to Mel spectrograms together with a temporal prediction model based on a long short-term memory (LSTM) network; the top-layer model is a fully connected neural network for each of the main and auxiliary tasks. The bottom-layer parameters are shared across tasks while the top-layer parameters are task-specific, and the single mean-square-error loss is replaced with a weighted sum of mean-square-error and cross-entropy losses to train the network. The method reduces the probability of model overfitting; the added auxiliary classification task effectively suppresses outlier predictions and improves prediction accuracy.
Disclosure of Invention
The invention provides a multitask-learning stroke rehabilitation assessment auxiliary analysis method that addresses the shortcomings of existing voice-based stroke rehabilitation assessment algorithms. It adopts a multi-task learning mechanism whose main task is a regression task that predicts the stroke voice-function-impairment assessment score and whose auxiliary task is a classification task that grades impairment severity. It automatically learns deep features of the speech Mel spectrogram and predicts scores from the temporal information, effectively suppressing prediction errors and thereby achieving automatic voice-based stroke rehabilitation assessment.
The technical scheme of the invention mainly comprises the following steps:
Step 1: truncate the input voice data to a fixed length of 4 seconds; apply pre-emphasis, framing and windowing to the speech signal; apply a short-time Fourier transform to each frame and pass the result through a Mel filter bank to obtain a Mel spectrogram. Then cut the Mel spectrogram with a window length of 64 frames and a hop of 30 frames to obtain static segment-level Mel spectra, compute their first-order and second-order differences, and stack the static spectra with the two differences to finally obtain 64 x 64-pixel segment-level Mel spectrograms;
Step 2: the labels of the existing data set are physicians' assessment scores for voice-function impairment; divide the existing data into four severity levels according to assessment-score intervals and use them as labels for the auxiliary classification task;
Step 3: for the segment-level Mel spectrograms extracted in step 1, use an improved ResNet50 deep convolutional neural network with a hard parameter sharing mechanism, adding an auxiliary classification task that grades stroke voice-function-impairment severity on top of the main regression task that predicts the impairment score; load pre-trained network weights, add the labels from step 2, modify the loss function, train the model, and extract 100-dimensional deep features;
Step 4: assemble the 100-dimensional deep features of the segment-level Mel spectrograms obtained in step 3 into utterance-level features in temporal order; use a three-layer LSTM network with the same hard parameter sharing mechanism, again adding the auxiliary severity-classification task on top of the main score-prediction regression task; modify the loss function, train the model, and finally obtain the assessment score of the voice-function impairment.
Further, the step 1 is specifically realized as follows:
1-1. Truncate the original speech signal to a fixed length of four seconds: the portion beyond four seconds is discarded, and clips shorter than four seconds are padded to four seconds by copying the existing clip;
1-2. Pass the speech signal through a high-pass pre-emphasis filter $H(z)=1-\mu z^{-1}$ to boost the high-frequency part of the signal; then frame the signal with a frame length of 25 milliseconds and a frame shift of 10 milliseconds; then multiply each frame by a Hamming window;
1-3. Apply a fast Fourier transform to each framed, windowed signal to obtain its short-time magnitude spectrum; square the magnitude spectrum and pass it through a Mel filter bank with 64 filters to obtain the Mel spectrogram. The triangular Mel filter bank is

$$H_m(k)=\begin{cases} 0, & k<f(m-1)\\ \dfrac{k-f(m-1)}{f(m)-f(m-1)}, & f(m-1)\le k\le f(m)\\ \dfrac{f(m+1)-k}{f(m+1)-f(m)}, & f(m)<k\le f(m+1)\\ 0, & k>f(m+1) \end{cases}$$

and the Mel spectrum is

$$S(m)=\sum_{k=0}^{N-1}\lvert X(k)\rvert^{2}\,H_m(k),\qquad 0\le m<64,$$

where $X(k)$ is the FFT of one frame and $f(m)$ is the boundary frequency bin of the $m$-th triangular filter. Each 4 seconds of audio is thus processed into a 400 x 64-pixel Mel spectrogram;
1-4. Cut the 400 x 64-pixel Mel spectrogram with a window length of 64 pixels and a hop of 30 pixels to obtain static Mel spectrogram images; then compute their first-order and second-order differences and stack the static image with the two differences to form a picture analogous to a three-channel RGB image. Each 4 s audio clip is finally cut into 13 segment-level Mel spectrograms of 64 x 64 pixels.
Further, the step 2 is specifically realized as follows:
2-1. Samples with assessment scores in the interval 85-100 are set as the mild class, 75-84 as the moderate class, 65-74 as the severe class, and 60-64 as the very severe class.
Further, the step 3 is specifically realized as follows:
3-1. The improved ResNet50 network structure is as follows: the ResNet50 output layer originally used for 1000-class ImageNet classification has 1000 neurons; it is modified to 100 neurons. Separate output layers are then added for the two tasks: 1 neuron for the regression task and 4 neurons for the classification task;
3-2. Training uses a multi-task learning mechanism with hard parameter sharing: all network layers before the two task output layers share parameters, and only the output layers have task-specific parameters. The regression task uses the mean-square-error loss MSELoss and the classification task uses the cross-entropy loss CrossEntropyLoss, so the loss function TotalLoss is a weighted sum of the two. Loading the weight parameters of a pre-trained ResNet50 network by transfer learning effectively accelerates training;
$$\mathrm{TotalLoss}=\alpha\cdot\mathrm{MSELoss}+\beta\cdot\mathrm{CrossEntropyLoss}$$

$$\mathrm{MSELoss}=\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\hat{x}_i\right)^{2}$$

$$\mathrm{CrossEntropyLoss}=-\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m}y_{ij}\log\hat{y}_{ij}$$

where $x_i$ and $\hat{x}_i$ denote the predicted value and the label value of the regression task, $\hat{y}_{ij}$ and $y_{ij}$ denote the predicted value and the label value of the classification task, $n$ is the number of samples in one training batch, and $m$ is the number of classes of the auxiliary classification task; $\alpha=1$ and $\beta=0.5$ are used in the present invention;
3-3. After the model is trained, a segment-level Mel spectrogram is fed in and the output of the penultimate layer of the modified ResNet50 network is taken as the feature; since the penultimate layer has 100 neurons, the feature dimension is 100.
Further, the step 4 is specifically realized as follows:
4-1. Assemble the obtained 100-dimensional segment-level features into utterance-level features in the temporal order in which the Mel spectrogram was cut; each 4 s speech clip is thus processed into a 13 x 100-dimensional utterance-level feature;
4-2. Predict from the input utterance-level features with a three-layer LSTM network of 64 neurons per layer, using dropout = 0.5 to reduce network overfitting;
4-3. Train with the multi-task learning mechanism: the output layer of the LSTM regression task has one neuron and the output layer of the classification task has four neurons; the model applies hard parameter sharing, with the network layers before the two output layers sharing parameters and only the output layers having task-specific parameters; the loss function TotalLoss is the weighted sum of the mean-square-error loss and the cross-entropy loss;
$$\mathrm{TotalLoss}=\alpha\cdot\mathrm{MSELoss}+\beta\cdot\mathrm{CrossEntropyLoss}$$

$$\mathrm{MSELoss}=\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\hat{x}_i\right)^{2}$$

$$\mathrm{CrossEntropyLoss}=-\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m}y_{ij}\log\hat{y}_{ij}$$

where $x_i$ and $\hat{x}_i$ denote the predicted value and the label value of the regression task, $\hat{y}_{ij}$ and $y_{ij}$ denote the predicted value and the label value of the classification task, $n$ is the number of samples in one training batch, and $m$ is the number of classes of the auxiliary classification task; wherein $\alpha=1$ and $\beta=0.5$;
4-4. The output of the neuron in the LSTM regression-task output layer is the final prediction result;
4-5. The model is evaluated with the root-mean-square error (RMSE), the coefficient of determination (R-square) and the mean absolute error (MAE), computed as:

$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^{2}}$$

$$R^{2}=1-\frac{\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^{2}}{\sum_{i=1}^{n}\left(y_i-y_{\mathrm{mean}}\right)^{2}}$$

$$\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left\lvert y_i-\hat{y}_i\right\rvert$$

where $y_i$ is the true value of a sample, $\hat{y}_i$ its predicted value, and $y_{\mathrm{mean}}$ the mean of the true values of all samples. RMSE is the square root of the mean of the squared errors between the fitted and original data; the smaller the value, the better the fit. MAE sums and averages the absolute differences between each sample's predicted and true values to measure how close the predictions are to the real data; again, smaller is better. R-square lies between 0 and 1: the closer to 1, the better the model's predictions, and the closer to 0, the worse.
The invention has the following beneficial effects:
the stroke rehabilitation assessment auxiliary analysis method based on the voice multitask learning has the advantages that: 1) the standard Mel cepstral coefficients (MFCCs) reflect only the static characteristics of the speech parameters, and the dynamic characteristics of speech can be described by the difference spectrum of these static characteristics. The adopted Mel frequency spectrogram can combine dynamic and static characteristics to effectively improve the identification performance of the system. 2) Compared with the manually designed features related to the voice function damage, the depth features of the Mel frequency spectrum map extracted by using Resnet have better generalization performance and certain robustness to noise. 3) For single-task learning, the error of partial samples in the prediction result of a single regression network is large, and the samples with large prediction errors can be punished after the auxiliary classification task is added, so that the precision is effectively improved.
The invention can establish a voice-based accurate and efficient rehabilitation assessment diagnosis framework for stroke patients, and overcomes the defects that the prior rehabilitation assessment work only depends on manual work and lacks scientificity and objectivity. Can provide good help for the diagnosis of doctors, and is expected to reduce the medical pressure and improve the medical efficiency.
Drawings
FIG. 1: flow chart of the invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 1, the implementation steps of the stroke rehabilitation assessment auxiliary analysis method based on voice multitask learning are described in detail in the Disclosure of Invention; the main innovations of the invention are as follows: (1) Standard Mel-frequency cepstral coefficients (MFCCs) reflect only the static characteristics of speech parameters, while the dynamic characteristics of speech can be described by the difference spectra of those static features; the Mel spectrogram adopted by the invention combines dynamic and static characteristics and effectively improves the recognition performance of the system. (2) The ResNet50 network overcomes the performance degradation that otherwise accompanies increasing network depth and can mine deep features of the speech signal, while the LSTM network learns the relationships between contexts in the feature sequence and better models symptoms such as the unclear articulation and incoherence of stroke patients' speech. (3) Under single-task learning, part of the samples predicted by a single regression network have large errors; adding the auxiliary classification task penalizes samples with large prediction errors, effectively reducing prediction error and improving model accuracy.
The technical scheme of the invention mainly comprises the following steps:
Step 1: truncate the input voice data to a fixed length of 4 seconds; apply pre-emphasis, framing and windowing to the speech signal; apply a short-time Fourier transform (STFT) to each frame to obtain a short-time magnitude spectrum, square its modulus, and pass it through a Mel filter bank to obtain a Mel spectrogram. Then cut the Mel spectrogram with a window length of 64 frames and a hop of 30 frames to obtain static segment-level Mel spectra, compute their first-order and second-order differences, and stack the static spectra with the two differences to finally obtain 64 x 64-pixel segment-level Mel spectrograms.
Step 2: the labels of the existing data set are physicians' assessment scores for voice-function impairment; the existing data are divided into four severity levels according to assessment-score intervals and used as labels for the auxiliary classification task.
Step 3: for the segment-level Mel spectrograms extracted in step 1, use a ResNet50 deep convolutional neural network with a hard parameter sharing mechanism, adding an auxiliary classification task that grades stroke voice-function-impairment severity on top of the main regression task that predicts the impairment score; load pre-trained network weights, add the labels from step 2, modify the loss function, train the model, and extract 100-dimensional deep features.
Step 4: assemble the 100-dimensional deep features of the segment-level Mel spectrograms obtained in step 3 into utterance-level features in temporal order; use a three-layer LSTM network with the same hard parameter sharing mechanism, again adding the auxiliary severity-classification task on top of the main score-prediction regression task; modify the loss function, train the model, and finally obtain the assessment score of the voice-function impairment.
The specific implementation of the step 1 is as follows:
1-1. The original speech signal is truncated to a fixed length of four seconds: the portion beyond four seconds is discarded, and clips shorter than four seconds are padded to four seconds by copying the existing clip.
1-2. The speech signal is passed through a high-pass pre-emphasis filter $H(z)=1-\mu z^{-1}$ to boost the high-frequency part of the signal. The signal is then framed with a frame length of 25 milliseconds and a frame shift of 10 milliseconds, and each frame is multiplied by a Hamming window.
1-3. A fast Fourier transform is applied to each framed, windowed signal to obtain its short-time magnitude spectrum; the squared magnitude spectrum is then passed through a Mel filter bank with 64 filters to obtain the Mel spectrogram. The triangular Mel filter bank is

$$H_m(k)=\begin{cases} 0, & k<f(m-1)\\ \dfrac{k-f(m-1)}{f(m)-f(m-1)}, & f(m-1)\le k\le f(m)\\ \dfrac{f(m+1)-k}{f(m+1)-f(m)}, & f(m)<k\le f(m+1)\\ 0, & k>f(m+1) \end{cases}$$

and the Mel spectrum is

$$S(m)=\sum_{k=0}^{N-1}\lvert X(k)\rvert^{2}\,H_m(k),\qquad 0\le m<64,$$

where $X(k)$ is the FFT of one frame and $f(m)$ is the boundary frequency bin of the $m$-th triangular filter. Each 4 seconds of speech signal is thus processed into a 400 x 64-pixel Mel spectrogram.
1-4. The obtained 400 x 64-pixel Mel spectrogram is cut with a window length of 64 pixels and a hop of 30 pixels to obtain static Mel spectrogram images; first-order and second-order differences are then computed and stacked with the static image to form a picture analogous to a three-channel RGB image. Each 4 s audio clip is finally cut into 13 segment-level Mel spectrograms of 64 x 64 pixels.
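As a concrete illustration of steps 1-1 through 1-4, the preprocessing chain can be sketched in Python as follows. This is a minimal sketch, not the patent's reference implementation: the sampling rate, FFT size and pre-emphasis coefficient μ are not fixed by the description, so the values below (16 kHz, 25 ms FFT window, μ = 0.97) are assumptions, and `segment_level_mel` is an illustrative name.

```python
import numpy as np
import librosa

def segment_level_mel(path, sr=16000, mu=0.97):
    """One recording -> stack of 64x64x3 segment-level Mel spectrograms."""
    y, _ = librosa.load(path, sr=sr)
    target = 4 * sr                               # fixed 4-second length
    if len(y) >= target:
        y = y[:target]                            # discard the part beyond 4 s
    else:                                         # pad short clips by copying
        y = np.tile(y, int(np.ceil(target / len(y))))[:target]
    y = np.append(y[0], y[1:] - mu * y[:-1])      # pre-emphasis H(z) = 1 - mu*z^-1
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=int(0.025 * sr),                    # 25 ms frame length
        hop_length=int(0.010 * sr),               # 10 ms frame shift
        window="hamming", n_mels=64, power=2.0)   # 64-filter Mel bank, |X(k)|^2
    logmel = librosa.power_to_db(mel)             # shape: (64 bands, ~400 frames)
    segments = []
    for start in range(0, logmel.shape[1] - 63, 30):   # 64-frame window, 30-frame hop
        static = logmel[:, start:start + 64]
        d1 = librosa.feature.delta(static)             # first-order difference
        d2 = librosa.feature.delta(static, order=2)    # second-order difference
        segments.append(np.stack([static, d1, d2], axis=-1))  # RGB-like channels
    return np.array(segments)                     # (n_segments, 64, 64, 3)
```

At a 10 ms hop, a 4 s clip gives roughly 400 frames, which the 64-frame window with a 30-frame hop then cuts into the segment-level spectrograms described above.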
The step 2 is specifically realized as follows:
2-1. Suppose there are eleven items of data with assessment scores (Aphasia Quotient): 78, 91, 87, 68, 91, 80, 92, 74, 71, 81, 61; the score range is 0 to 100. For convenience of model training the scores are mapped to decimals between 0 and 1: 0.78, 0.91, 0.87, 0.68, 0.91, 0.80, 0.92, 0.74, 0.71, 0.81, 0.61, and the severity-level classes are set according to the corresponding intervals: 0.91, 0.87, 0.91 and 0.92 are the mild class; 0.78, 0.80 and 0.81 the moderate class; 0.71, 0.74 and 0.68 the severe class; and 0.61 the very severe class; these four classes serve as the labels of the classification task.
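The score normalization and interval-to-class mapping of section 2-1 are simple enough to state directly in code; a sketch follows (the function names are illustrative):

```python
def normalize_score(score):
    """Map a 0-100 assessment score to the 0-1 regression label."""
    return score / 100.0

def severity_class(score):
    """Map a 0-100 assessment score to the four-level auxiliary label."""
    if score >= 85:
        return 0   # mild
    if score >= 75:
        return 1   # moderate
    if score >= 65:
        return 2   # severe
    return 3       # very severe (60-64 in the data described)

scores = [78, 91, 87, 68, 91, 80, 92, 74, 71, 81, 61]
labels = [(normalize_score(s), severity_class(s)) for s in scores]
```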
The specific implementation of step 3 is as follows:
3-1. The network structure of ResNet50 is modified: the ResNet50 output layer originally used for 1000-class ImageNet classification has 1000 neurons in total; it is modified to 100 neurons. Separate output layers are then added for the two tasks: 1 neuron for the regression task and 4 neurons for the classification task.
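A minimal PyTorch sketch of this two-headed ResNet50, assuming torchvision's ImageNet weights stand in for the pre-training mentioned in section 3-2; the class name and the choice to also return the 100-dimensional shared feature (used later in section 3-3) are illustrative:

```python
import torch.nn as nn
from torchvision import models

class MultiTaskResNet50(nn.Module):
    """ResNet50 trunk shared by both tasks (hard parameter sharing)."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        backbone.fc = nn.Linear(backbone.fc.in_features, 100)  # 1000 -> 100 neurons
        self.shared = backbone                 # all layers before the heads are shared
        self.reg_head = nn.Linear(100, 1)      # regression: impairment score
        self.cls_head = nn.Linear(100, 4)      # classification: 4 severity levels

    def forward(self, x):                      # x: (batch, 3, 64, 64) Mel segments
        feat = self.shared(x)                  # 100-dim deep feature
        return self.reg_head(feat), self.cls_head(feat), feat
```

After training, the 100-dimensional `feat` output is exactly the penultimate-layer feature extracted in section 3-3.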
3-2. Training uses a multi-task learning mechanism with hard parameter sharing: all network layers before the two task output layers share parameters, and only the output layers have task-specific parameters. Since the regression task uses the mean-square-error loss (MSELoss) and the classification task uses the cross-entropy loss (CrossEntropyLoss), the loss function TotalLoss is a weighted sum of the two; a sketch of this combined loss is given after the formulas below. Loading the weight parameters of the pre-trained model by transfer learning effectively accelerates network training.
$$\mathrm{TotalLoss}=\alpha\cdot\mathrm{MSELoss}+\beta\cdot\mathrm{CrossEntropyLoss}$$

$$\mathrm{MSELoss}=\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\hat{x}_i\right)^{2}$$

$$\mathrm{CrossEntropyLoss}=-\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m}y_{ij}\log\hat{y}_{ij}$$

where $x_i$ and $\hat{x}_i$ denote the predicted value and the label value of the regression task, $\hat{y}_{ij}$ and $y_{ij}$ denote the predicted value and the label value of the classification task, $n$ is the number of samples in one training batch, and $m$ is the number of classes of the auxiliary classification task; $\alpha=1$ and $\beta=0.5$ are used in the present invention;
3-3. After the model is trained, the segment-level Mel spectrogram is fed in and the output of the penultimate layer of the modified ResNet50 network is extracted as the feature. Since the penultimate layer has 100 neurons, the feature dimension is 100.
The specific implementation of the step 4 is as follows:
4-1. The obtained 100-dimensional segment-level features are assembled into utterance-level features in the temporal order in which the Mel spectrogram was cut, so each 4 s speech clip is processed into a 13 x 100-dimensional utterance-level feature.
4-2. A three-layer LSTM network with 64 neurons per layer predicts from the input utterance-level features, using dropout = 0.5 to reduce network overfitting.
4-3. Training uses the multi-task learning mechanism: the output layer of the LSTM regression task has one neuron and the output layer of the classification task has four neurons; the model applies hard parameter sharing, with the network layers before the two output layers sharing parameters and only the output layers having task-specific parameters (a sketch of this architecture is given after the formulas below). The loss function TotalLoss is the weighted sum of the mean-square-error loss and the cross-entropy loss.
$$\mathrm{TotalLoss}=\alpha\cdot\mathrm{MSELoss}+\beta\cdot\mathrm{CrossEntropyLoss}$$

$$\mathrm{MSELoss}=\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\hat{x}_i\right)^{2}$$

$$\mathrm{CrossEntropyLoss}=-\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m}y_{ij}\log\hat{y}_{ij}$$

where $x_i$ and $\hat{x}_i$ denote the predicted value and the label value of the regression task, $\hat{y}_{ij}$ and $y_{ij}$ denote the predicted value and the label value of the classification task, $n$ is the number of samples in one training batch, and $m$ is the number of classes of the auxiliary classification task; wherein $\alpha=1$ and $\beta=0.5$;
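A sketch of the three-layer LSTM of sections 4-2 and 4-3 under the same hard-parameter-sharing scheme. The description does not state which LSTM output feeds the two heads; taking the hidden state of the last time step, as below, is a conventional assumption:

```python
import torch.nn as nn

class MultiTaskLSTM(nn.Module):
    """Three-layer LSTM trunk shared by the regression and classification heads."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=100, hidden_size=64, num_layers=3,
                            batch_first=True, dropout=0.5)  # dropout between layers
        self.reg_head = nn.Linear(64, 1)   # one neuron: final score
        self.cls_head = nn.Linear(64, 4)   # four neurons: severity class

    def forward(self, x):                  # x: (batch, 13, 100) utterance features
        out, _ = self.lstm(x)
        last = out[:, -1, :]               # hidden state at the final time step
        return self.reg_head(last), self.cls_head(last)
```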
4-4. The output of the neuron in the LSTM regression-task output layer is the final prediction result;
4-5. The model is evaluated with the root-mean-square error (RMSE), the coefficient of determination (R-square) and the mean absolute error (MAE), computed as:

$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^{2}}$$

$$R^{2}=1-\frac{\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^{2}}{\sum_{i=1}^{n}\left(y_i-y_{\mathrm{mean}}\right)^{2}}$$

$$\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left\lvert y_i-\hat{y}_i\right\rvert$$

where $y_i$ is the true value of a sample, $\hat{y}_i$ its predicted value, and $y_{\mathrm{mean}}$ the mean of the true values of all samples. RMSE is the square root of the mean of the squared errors between the fitted and original data; the smaller the value, the better the fit. MAE sums and averages the absolute differences between each sample's predicted and true values to measure how close the predictions are to the real data; again, smaller is better. R-square lies between 0 and 1: the closer to 1, the better the model's predictions, and the closer to 0, the worse.
The invention also provides a stroke rehabilitation assessment auxiliary analysis system based on voice multitask learning, comprising a data preprocessing module, a voice-function-impairment level module, an improved ResNet50 network model and an improved three-layer LSTM network model.
The data preprocessing module is implemented as follows: the input voice data are truncated to a fixed length of 4 seconds; pre-emphasis, framing and windowing are applied to the speech signal; a short-time Fourier transform is applied to each frame and the result is passed through a Mel filter bank to obtain a Mel spectrogram; the Mel spectrogram is then cut with a window length of 64 frames and a hop of 30 frames to obtain static segment-level Mel spectra, whose first-order and second-order differences are computed and stacked with the static spectra to finally obtain 64 x 64-pixel segment-level Mel spectrograms;
The voice-function-impairment level module is implemented as follows: the labels of the existing data set are physicians' assessment scores for voice-function impairment; the existing data are divided into four severity levels according to assessment-score intervals and used as labels for the auxiliary classification task;
The improved ResNet50 network model is implemented as follows: for the segment-level Mel spectrograms extracted by the data preprocessing module, an improved ResNet50 deep convolutional neural network with a hard parameter sharing mechanism is used, adding an auxiliary classification task that grades stroke voice-function-impairment severity on top of the main regression task that predicts the impairment score; pre-trained network weights are loaded, the labels from the voice-function-impairment level module are added, the loss function is modified, the model is trained, and 100-dimensional deep features are extracted;
The improved three-layer LSTM network model is implemented as follows: the 100-dimensional deep features of the segment-level Mel spectrograms obtained by the improved ResNet50 network model are assembled into utterance-level features in temporal order; a three-layer LSTM network with hard parameter sharing is adopted, adding the auxiliary severity-classification task on top of the main score-prediction regression task; the loss function is modified, the model is trained, and the assessment score of the voice-function impairment is finally obtained.
To obtain better stroke speech rehabilitation assessment and prediction results, the following notes on parameter selection and design in practical applications may serve as a reference for other uses of the invention:
The fixed 4 s voice clips are adopted only to facilitate model training; once the model is trained, voice data of any length can be processed in practical applications.
In practical applications, voice data are acquired and processed through step 1 to extract the Mel spectrogram, which after segmentation finally yields N segment-level Mel spectrograms of 64 x 64 pixels. Segment-level features are then obtained through the ResNet feature extraction module of step 3 and stacked in temporal order to form an N x 100-dimensional utterance-level feature. Finally, the three-layer LSTM processes the N time steps to obtain the final score.
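Putting the illustrative pieces above together, variable-length inference could look like the following sketch; it assumes a variant of `segment_level_mel` without the fixed 4 s truncation, and the tensor shapes and `torch.no_grad` evaluation mode are assumptions rather than details fixed by the patent:

```python
import torch

@torch.no_grad()
def predict_score(path, resnet, lstm):
    """Recording of arbitrary length -> rehabilitation score in [0, 1]."""
    segs = segment_level_mel(path)                           # (N, 64, 64, 3)
    x = torch.from_numpy(segs).float().permute(0, 3, 1, 2)   # (N, 3, 64, 64)
    _, _, feats = resnet(x)                                  # (N, 100) segment features
    utterance = feats.unsqueeze(0)                           # (1, N, 100) time series
    score, _ = lstm(utterance)                               # LSTM over N time steps
    return score.item()
```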
In the invention, with only the single regression task the model's evaluation indices are RMSE = 0.036, MAE = 0.027 and R-square = 0.778; comparing the model's predicted values with the actual true values shows that part of the samples have large prediction errors. With multitask learning the evaluation indices are RMSE = 0.029, MAE = 0.022 and R-square = 0.837, and the number of samples with large prediction errors is clearly reduced. In summary, the voice-based multitask-learning rehabilitation assessment auxiliary analysis method for stroke patients can provide scientific and objective results for voice rehabilitation assessment, filling the gap left by rehabilitation therapy that relies on manual work alone.

Claims (5)

1. A stroke rehabilitation assessment auxiliary analysis method based on voice multitask learning, characterized by comprising the following steps:
Step 1: truncating the input voice data to a fixed length of 4 seconds; applying pre-emphasis, framing and windowing to the speech signal; applying a short-time Fourier transform to each frame and passing the result through a Mel filter bank to obtain a Mel spectrogram; then cutting the Mel spectrogram with a window length of 64 frames and a hop of 30 frames to obtain static segment-level Mel spectra, computing their first-order and second-order differences, and stacking the static spectra with the two differences to finally obtain 64 x 64-pixel segment-level Mel spectrograms;
Step 2: the labels of the existing data set being physicians' assessment scores for voice-function impairment, dividing the existing data into four severity levels according to assessment-score intervals as labels for the auxiliary classification task;
Step 3: for the segment-level Mel spectrograms extracted in step 1, using an improved ResNet50 deep convolutional neural network with a hard parameter sharing mechanism, adding an auxiliary classification task that grades stroke voice-function-impairment severity on top of the main regression task that predicts the impairment score; loading pre-trained network weights, adding the labels from step 2, modifying the loss function, training the model, and extracting 100-dimensional deep features;
Step 4: assembling the 100-dimensional deep features of the segment-level Mel spectrograms obtained in step 3 into utterance-level features in temporal order; adopting a three-layer LSTM network with the same hard parameter sharing mechanism, again adding the auxiliary severity-classification task on top of the main score-prediction regression task; modifying the loss function, training the model, and finally obtaining the assessment score of the voice-function impairment.
2. The stroke rehabilitation assessment auxiliary analysis method based on voice multitask learning according to claim 1, characterized in that the step 1 is realized as follows:
1-1. truncating the original speech signal to a fixed length of four seconds, discarding the portion beyond four seconds, and padding clips shorter than four seconds to four seconds by copying the existing clip;
1-2. passing the speech signal through a high-pass pre-emphasis filter $H(z)=1-\mu z^{-1}$ to boost the high-frequency part of the signal; then framing the signal with a frame length of 25 milliseconds and a frame shift of 10 milliseconds; then multiplying each frame by a Hamming window;
1-3. applying a fast Fourier transform to each framed, windowed signal to obtain its short-time magnitude spectrum, squaring the magnitude spectrum, and passing it through a Mel filter bank with 64 filters to obtain the Mel spectrogram, the triangular Mel filter bank being

$$H_m(k)=\begin{cases} 0, & k<f(m-1)\\ \dfrac{k-f(m-1)}{f(m)-f(m-1)}, & f(m-1)\le k\le f(m)\\ \dfrac{f(m+1)-k}{f(m+1)-f(m)}, & f(m)<k\le f(m+1)\\ 0, & k>f(m+1) \end{cases}$$

and the Mel spectrum being

$$S(m)=\sum_{k=0}^{N-1}\lvert X(k)\rvert^{2}\,H_m(k),\qquad 0\le m<64,$$

where $X(k)$ is the FFT of one frame and $f(m)$ is the boundary frequency bin of the $m$-th triangular filter; each 4 seconds of audio is thus processed into a 400 x 64-pixel Mel spectrogram;
1-4. cutting the 400 x 64-pixel Mel spectrogram with a window length of 64 pixels and a hop of 30 pixels to obtain static Mel spectrogram images; then computing their first-order and second-order differences and stacking the static image with the two differences to form a picture analogous to a three-channel RGB image; each 4 s audio clip is finally cut into 13 segment-level Mel spectrograms of 64 x 64 pixels.
3. The stroke rehabilitation assessment auxiliary analysis method based on voice multitask learning according to claim 1 or 2, characterized in that the step 2 is realized as follows:
2-1. setting samples with assessment scores in the interval 85-100 as the mild class, 75-84 as the moderate class, 65-74 as the severe class, and 60-64 as the very severe class.
4. The stroke rehabilitation assessment auxiliary analysis method based on voice multitask learning according to claim 1, characterized in that the step 3 is realized as follows:
3-1. the improved ResNet50 network structure is as follows: the ResNet50 output layer originally used for 1000-class ImageNet classification has 1000 neurons; it is modified to 100 neurons; separate output layers are then added for the two tasks, with 1 neuron for the regression task and 4 neurons for the classification task;
3-2. training uses a multi-task learning mechanism with hard parameter sharing: all network layers before the two task output layers share parameters, and only the output layers have task-specific parameters; the regression task uses the mean-square-error loss MSELoss and the classification task uses the cross-entropy loss CrossEntropyLoss, so the loss function TotalLoss is a weighted sum of the two; loading the weight parameters of a pre-trained ResNet50 network by transfer learning effectively accelerates training;
$$\mathrm{TotalLoss}=\alpha\cdot\mathrm{MSELoss}+\beta\cdot\mathrm{CrossEntropyLoss}$$

$$\mathrm{MSELoss}=\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\hat{x}_i\right)^{2}$$

$$\mathrm{CrossEntropyLoss}=-\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m}y_{ij}\log\hat{y}_{ij}$$

where $x_i$ and $\hat{x}_i$ denote the predicted value and the label value of the regression task, $\hat{y}_{ij}$ and $y_{ij}$ denote the predicted value and the label value of the classification task, $n$ is the number of samples in one training batch, and $m$ is the number of classes of the auxiliary classification task; wherein $\alpha=1$ and $\beta=0.5$ are used in the present invention;
3-3. after the model is trained, a segment-level Mel spectrogram is fed in and the output of the penultimate layer of the modified ResNet50 network is taken as the feature; since the penultimate layer has 100 neurons, the feature dimension is 100.
5. The stroke rehabilitation assessment auxiliary analysis method based on voice multitask learning according to claim 1, characterized in that the step 4 is realized as follows:
4-1. assembling the obtained 100-dimensional segment-level features into utterance-level features in the temporal order in which the Mel spectrogram was cut, so that each 4 s speech clip is processed into a 13 x 100-dimensional utterance-level feature;
4-2. predicting from the input utterance-level features with a three-layer LSTM network of 64 neurons per layer, using dropout = 0.5 to reduce network overfitting;
4-3. training with the multi-task learning mechanism: the output layer of the LSTM regression task has one neuron and the output layer of the classification task has four neurons; the model applies hard parameter sharing, with the network layers before the two output layers sharing parameters and only the output layers having task-specific parameters; the loss function TotalLoss is the weighted sum of the mean-square-error loss and the cross-entropy loss;
$$\mathrm{TotalLoss}=\alpha\cdot\mathrm{MSELoss}+\beta\cdot\mathrm{CrossEntropyLoss}$$

$$\mathrm{MSELoss}=\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\hat{x}_i\right)^{2}$$

$$\mathrm{CrossEntropyLoss}=-\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m}y_{ij}\log\hat{y}_{ij}$$

where $x_i$ and $\hat{x}_i$ denote the predicted value and the label value of the regression task, $\hat{y}_{ij}$ and $y_{ij}$ denote the predicted value and the label value of the classification task, $n$ is the number of samples in one training batch, and $m$ is the number of classes of the auxiliary classification task; wherein $\alpha=1$ and $\beta=0.5$;
4-4. the output of the neuron in the LSTM regression-task output layer is the final prediction result;
4-5. the model is evaluated with the root-mean-square error (RMSE), the coefficient of determination (R-square) and the mean absolute error (MAE), computed as:

$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^{2}}$$

$$R^{2}=1-\frac{\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^{2}}{\sum_{i=1}^{n}\left(y_i-y_{\mathrm{mean}}\right)^{2}}$$

$$\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left\lvert y_i-\hat{y}_i\right\rvert$$

where $y_i$ denotes the true value of a sample, $\hat{y}_i$ the predicted value of the sample, and $y_{\mathrm{mean}}$ the mean of the true values of all samples.
CN202111665085.1A 2021-12-31 2021-12-31 Auxiliary analysis method for cerebral apoplexy rehabilitation evaluation based on voice multitasking learning Active CN114141366B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111665085.1A CN114141366B (en) 2021-12-31 2021-12-31 Auxiliary analysis method for cerebral apoplexy rehabilitation evaluation based on voice multitasking learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111665085.1A CN114141366B (en) 2021-12-31 2021-12-31 Auxiliary analysis method for cerebral apoplexy rehabilitation evaluation based on voice multitasking learning

Publications (2)

Publication Number Publication Date
CN114141366A true CN114141366A (en) 2022-03-04
CN114141366B CN114141366B (en) 2024-03-26

Family

ID=80384123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111665085.1A Active CN114141366B (en) 2021-12-31 2021-12-31 Auxiliary analysis method for cerebral apoplexy rehabilitation evaluation based on voice multitasking learning

Country Status (1)

Country Link
CN (1) CN114141366B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3273387A1 (en) * 2016-07-19 2018-01-24 Siemens Healthcare GmbH Medical image segmentation with a multi-task neural network system
WO2021203796A1 (en) * 2020-04-09 2021-10-14 之江实验室 Disease prognosis prediction system based on deep semi-supervised multi-task learning survival analysis
CN113436726A (en) * 2021-06-29 2021-09-24 南开大学 Automatic lung pathological sound analysis method based on multi-task classification
CN113782184A (en) * 2021-08-11 2021-12-10 杭州电子科技大学 Cerebral apoplexy auxiliary evaluation system based on facial key point and feature pre-learning

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114882996A (en) * 2022-03-17 2022-08-09 深圳大学 Hepatocellular carcinoma CK19 and MVI prediction method based on multitask learning
CN114882996B (en) * 2022-03-17 2023-04-07 深圳大学 Hepatocellular carcinoma CK19 and MVI prediction method based on multitask learning
CN117219265A (en) * 2023-10-07 2023-12-12 东北大学秦皇岛分校 Multi-mode data analysis method, device, storage medium and equipment

Also Published As

Publication number Publication date
CN114141366B (en) 2024-03-26

Similar Documents

Publication Publication Date Title
CN109599129B (en) Voice depression recognition system based on attention mechanism and convolutional neural network
CN112818892B (en) Multi-modal depression detection method and system based on time convolution neural network
CN108564942A (en) One kind being based on the adjustable speech-emotion recognition method of susceptibility and system
Wang et al. LSTM-convolutional-BLSTM encoder-decoder network for minimum mean-square error approach to speech enhancement
CN114141366A (en) Cerebral apoplexy rehabilitation assessment auxiliary analysis method based on voice multitask learning
CN112331216A (en) Speaker recognition system and method based on composite acoustic features and low-rank decomposition TDNN
Juvela et al. Speaker-independent raw waveform model for glottal excitation
Vásquez-Correa et al. A Multitask Learning Approach to Assess the Dysarthria Severity in Patients with Parkinson's Disease.
US7617101B2 (en) Method and system for utterance verification
CN113012720A (en) Depression detection method by multi-voice characteristic fusion under spectral subtraction noise reduction
CN113111151A (en) Cross-modal depression detection method based on intelligent voice question answering
Xian et al. A multi-scale feature recalibration network for end-to-end single channel speech enhancement
CN113488063A (en) Audio separation method based on mixed features and coding and decoding
Joshy et al. Dysarthria severity classification using multi-head attention and multi-task learning
CN114898779A (en) Multi-mode fused speech emotion recognition method and system
Fan et al. CSENET: Complex squeeze-and-excitation network for speech depression level prediction
Chauhan et al. Emotion recognition using LP residual
CN116570284A (en) Depression recognition method and system based on voice characterization
Matoušek et al. A comparison of convolutional neural networks for glottal closure instant detection from raw speech
CN116230018A (en) Synthetic voice quality evaluation method for voice synthesis system
Wang et al. Unsupervised domain adaptation for dysarthric speech detection via domain adversarial training and mutual information minimization
CN113963718A (en) Voice session segmentation method based on deep learning
CN113450830A (en) Voice emotion recognition method of convolution cyclic neural network with multiple attention mechanisms
Alimuradov et al. A method for automated segmentation of speech signals to determine temporal patterns of naturally expressed psycho-emotional states
Abderrazek et al. Interpretable Assessment of Speech Intelligibility Using Deep Learning: A Case Study on Speech Disorders Due to Head and Neck Cancers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant