CN114141366B - Auxiliary analysis method for cerebral apoplexy rehabilitation evaluation based on voice multitasking learning

Info

Publication number: CN114141366B
Authority: CN (China)
Prior art keywords: task, voice, mel, network, loss function
Legal status: Active
Application number: CN202111665085.1A
Other languages: Chinese (zh)
Other versions: CN114141366A
Inventors: 曹九稳 (Cao Jiuwen), 葛宇 (Ge Yu), 王天磊 (Wang Tianlei), 赖晓平 (Lai Xiaoping)
Original and current assignee: Hangzhou Dianzi University
Filing and priority date: 2021-12-31 (application filed by Hangzhou Dianzi University)
Application publication: CN114141366A, 2022-03-04
Grant publication: CN114141366B, 2024-03-26

Classifications

    • G16H 50/20: ICT specially adapted for medical diagnosis, medical simulation or medical data mining, e.g. computer-aided diagnosis based on medical expert systems
    • G16H 50/30: ICT for calculating health indices; for individual health risk assessment
    • G06F 18/241: Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/044: Computing arrangements based on biological models; neural networks; recurrent networks, e.g. Hopfield networks
    • G06N 3/08: Neural networks; learning methods
    • G10L 25/24: Speech or voice analysis characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L 25/30: Speech or voice analysis characterised by the analysis technique, using neural networks
    • G10L 25/66: Speech or voice analysis specially adapted for comparison or discrimination, for extracting parameters related to health condition


Abstract

The invention discloses an auxiliary analysis method for cerebral apoplexy (stroke) rehabilitation evaluation based on voice multi-task learning. The multi-task learning model's main task is a regression task that predicts a score evaluating stroke-induced speech-function impairment, and its auxiliary task classifies the severity of that impairment. The bottom model comprises a feature extraction model based on a deep residual network (ResNet50) over Mel spectrograms and a temporal prediction model based on a long short-term memory (LSTM) network; the top model is a fully connected neural network for each of the main and auxiliary tasks. The loss function is a weighted superposition of a mean-square-error loss and a cross-entropy loss. The multi-task learning mechanism adopted by the invention reduces the probability of model overfitting, effectively reduces prediction error, and lets the predicted score directly reflect the patient's current rehabilitation status.

Description

Auxiliary analysis method for cerebral apoplexy rehabilitation evaluation based on voice multitasking learning
Technical Field
The invention belongs to the fields of speech signal processing and intelligent medical auxiliary analysis, and relates to an auxiliary analysis method for stroke rehabilitation evaluation based on voice multi-task learning with a deep residual network (ResNet50) and a long short-term memory (LSTM) network.
Background
Cerebral apoplexy (stroke) is an acute cerebrovascular disease in which brain nerve tissue is injured because a cerebral blood vessel ruptures or is occluded so that blood cannot reach the brain; it has high morbidity and disability rates. Surveys show that stroke is the leading cause of death among Chinese residents, and stroke deaths in China account for roughly one third of stroke deaths worldwide. Stroke patients usually show speech symptoms such as slurred articulation and discontinuous speech, which seriously affect their normal life.
Existing voice-based stroke detection methods fall into two categories. Traditional methods rest on feature engineering: a pre-trained speech recognition model generates time-alignment information between audio and text, from which features related to pronunciation accuracy and fluency, such as pronunciation quality and syllable count per unit time, are computed. These are combined with pronunciation-difficulty features extracted from the raw speech signal, such as jitter, shimmer, pitch-period entropy, glottal entropy, and signal-to-noise ratio, and a machine learning classifier performs the classification. Deep learning methods instead design a neural network that takes the raw speech signal or a speech time-frequency representation as input, so that the network automatically learns features related to impaired speech without complex feature computation. However, these methods tend to suffer from the following drawbacks:
1. In traditional methods, the patient's unclear articulation introduces large errors into the speech recognition model, so the extracted features lack robustness; moreover, the classifiers are weak and place high demands on the representational power of the extracted features, so feature-engineering-based algorithms cannot meet engineering requirements;
2. Existing deep learning methods mainly perform binary classification on the presence or absence of stroke; they cannot quantify data of different severity levels and therefore cannot give a reasonable evaluation of the patient's current rehabilitation status.
To address these problems, the invention proposes a multi-task learning model whose main task is a regression task that predicts a score evaluating stroke-induced speech-function impairment and whose auxiliary task classifies the severity of that impairment. The bottom model comprises a feature extraction model based on a deep residual network (ResNet50) over Mel spectrograms and a temporal prediction model based on a long short-term memory (LSTM) network; the top model is a fully connected neural network for each task. The bottom-model parameters are uniformly shared while the top-model tasks are independent, and the single mean-square-error loss is replaced by a weighted superposition of a mean-square-error loss and a cross-entropy loss for training the network. The method reduces the probability of model overfitting, and the added auxiliary classification task effectively suppresses abnormal predictions, improving prediction accuracy.
Disclosure of Invention
Aiming at the shortcomings of existing voice-based stroke rehabilitation evaluation algorithms, the invention provides an auxiliary analysis method for stroke rehabilitation evaluation based on multi-task learning. The invention adopts a multi-task learning mechanism whose main task is a regression task predicting a score for stroke-induced speech-function impairment and whose auxiliary task classifies the severity of that impairment; it automatically learns deep features of the voice Mel spectrogram, predicts scores from temporal information, and effectively suppresses prediction errors, thereby realizing automatic voice-based stroke rehabilitation evaluation.
The technical scheme of the invention mainly comprises the following steps:
Step 1. Truncate the input voice data to a fixed length of 4 seconds; pre-emphasize, frame, and window the speech signal; apply a short-time Fourier transform to each frame; and pass the result through a Mel filter bank to obtain a Mel spectrogram. Then slice the Mel spectrogram with a window of 64 frames and a shift of 30 frames to obtain static segment-level Mel spectra, compute their first- and second-order differences, and stack the static spectra with the two differences, finally obtaining segment-level Mel spectrograms of 64 × 64 pixels;
Step 2. The labels of the existing data set are physicians' evaluation scores of speech-function impairment; divide the data into four severity levels according to the score intervals and use these levels as the labels of the auxiliary classification task;
Step 3. Feed the segment-level Mel spectrograms extracted in step 1 to an improved ResNet50 deep convolutional neural network and, using a hard parameter sharing mechanism, add an auxiliary task that classifies the severity of stroke-induced speech-function impairment on top of the main task, a regression task that predicts the impairment score; load pre-trained network weights, attach the labels from step 2, modify the loss function, train the model, and extract 100-dimensional deep features;
Step 4. Arrange the 100-dimensional deep features of the segment-level Mel spectrograms obtained in step 3 into utterance-level features in temporal order; adopt a three-layer LSTM network with the same hard parameter sharing mechanism and auxiliary severity-classification task on top of the main score-regression task, modify the loss function, train the model, and finally obtain the evaluation score of speech-function impairment.
Further, the specific implementation of step 1 is as follows:
1-1. Truncate the original speech signal to a fixed length of four seconds: segments longer than four seconds are cut off, and segments shorter than four seconds are extended to four seconds by copying the existing samples;
1-2. Pass the speech signal through the high-pass pre-emphasis filter H(z) = 1 − μz^(−1) to enhance the high-frequency part of the signal; then frame the signal with a frame length of 25 ms and a frame shift of 10 ms, and multiply each frame by a Hamming window;
1-3. Apply a fast Fourier transform to each framed, windowed signal to obtain its short-time magnitude spectrum; take the squared modulus and pass it through a Mel filter bank with 64 filters, where the Mel scale used by the filter bank is the standard mapping
Mel(f) = 2595 · log10(1 + f / 700)
Processing the full 4-second audio yields a Mel spectrogram of 400 × 64 pixels;
1-4. Slice the 400 × 64-pixel Mel spectrogram with a window length of 64 pixels and a shift of 30 pixels to obtain static Mel-spectrum images, compute their first- and second-order differences, and stack the static image with the two differences to form a picture analogous to a three-channel RGB image; the final 4 s audio is thus sliced into 13 segment-level Mel spectrograms of 64 × 64 pixels.
Further, the specific implementation of step 2 is as follows:
2-1. Samples with evaluation scores in [85, 100] are set to the mild type, samples with scores in [75, 84] to the moderate type, samples with scores in [65, 74] to the severe type, and samples with scores in [60, 64] to the very severe type.
Further, the specific implementation of step 3 is as follows:
3-1. The improved ResNet50 network structure is as follows: the ResNet50 output layer originally built for the 1000-class ImageNet task, with 1000 neurons in total, is replaced by a layer of 100 neurons; separate network output layers are then added for the two tasks, with 1 neuron in the output layer of the regression task and 4 neurons in the output layer of the classification task;
3-2. Training adopts a multi-task learning mechanism with hard parameter sharing, i.e., the network layers before the two task output layers share parameters and only the output layers carry task-specific parameters; the regression task corresponds to the mean-square-error loss MSELoss and the classification task to the cross-entropy loss CrossEntropyLoss, so the loss function TotalLoss used is their weighted superposition; loading the weight parameters of a pre-trained ResNet50 by transfer learning effectively accelerates network training:
MSELoss = (1/n) · Σ_{i=1}^{n} (x_i − x̂_i)²
CrossEntropyLoss = −(1/n) · Σ_{i=1}^{n} Σ_{j=1}^{m} y_ij · log(ŷ_ij)
TotalLoss = α · MSELoss + β · CrossEntropyLoss
where x_i and x̂_i respectively denote the predicted value and the label value of the regression task, ŷ_ij and y_ij respectively denote the predicted value and the label value of the classification task, n denotes the number of samples in each training batch, and m denotes the number of classes of the auxiliary classification task; the invention uses α = 1 and β = 0.5;
3-3. After the model is trained, input a segment-level Mel spectrogram and take the output of the penultimate layer of the modified ResNet50 network as the feature; since the penultimate layer has 100 neurons, the feature dimension is 100.
Further, the specific implementation of step 4 is as follows:
4-1. Arrange the obtained 100-dimensional segment-level features into utterance-level features following the temporal order in which the Mel spectrogram was sliced, so that each 4 s voice clip yields a 13 × 100-dimensional utterance-level feature;
4-2. Predict from the input utterance-level features with a three-layer LSTM network, 64 neurons per layer, using dropout = 0.5 to reduce network overfitting;
4-3. Training adopts the multi-task learning mechanism: the LSTM regression output layer has one neuron and the classification output layer has four; the model uses hard parameter sharing, with the network layers before the two output layers sharing parameters and only the output layers carrying task-specific parameters; the loss function TotalLoss is the weighted superposition of the mean-square-error loss and the cross-entropy loss, with the same definitions and weights (α = 1, β = 0.5) as in step 3-2;
4-4. The output of the neuron in the LSTM network's regression output layer is the final prediction result;
4-5. The evaluation indices of the model are the root mean square error (RMSE), the coefficient of determination R-square, and the mean absolute error (MAE), calculated as:
RMSE = sqrt( (1/n) · Σ_{i=1}^{n} (y_i − ŷ_i)² )
MAE = (1/n) · Σ_{i=1}^{n} |y_i − ŷ_i|
R-square = 1 − [ Σ_{i=1}^{n} (y_i − ŷ_i)² ] / [ Σ_{i=1}^{n} (y_i − y_mean)² ]
where y_i denotes the true value of a sample, ŷ_i its predicted value, and y_mean the mean of the true values of all samples. RMSE is the square root of the mean squared error between the fitted and the original data; the smaller the value, the better the fit. MAE averages the absolute differences between predicted and true values to measure how close the predictions are to the real data; smaller is likewise better. R-square lies between 0 and 1: the closer to 1, the better the model's prediction; the closer to 0, the worse.
The invention has the following beneficial effects:
the auxiliary analysis method for cerebral apoplexy rehabilitation evaluation based on voice multitasking learning mainly comprises the following steps: 1) The standard mel-frequency cepstral coefficient (MFCC) reflects only the static characteristics of speech parameters, and the dynamic characteristics of speech can be described by the differential spectrum of these static characteristics. The adopted mel spectrogram can combine dynamic and static characteristics, so that the identification performance of the system can be effectively improved. 2) Compared with the manually designed characteristics related to the voice function damage, the Mel spectrogram depth characteristics extracted by using Resnet have better generalization performance and certain robustness to noise. 3) For single-task learning, the prediction result of a single regression network has larger errors of partial samples, and samples with larger prediction errors can be punished after the auxiliary classification task is added, so that the accuracy is effectively improved.
The invention can establish a precise and efficient rehabilitation evaluation diagnosis framework based on voice for cerebral apoplexy patients, and overcomes the defect that the prior rehabilitation evaluation work only depends on manpower and lacks scientificity and objectivity. Can provide good help for the diagnosis of doctors, and is expected to reduce the medical pressure and improve the medical efficiency.
Drawings
Fig. 1: the flow chart of the invention.
Detailed Description
The invention is described in detail below with reference to the drawing and a specific embodiment.
As shown in fig. 1, the implementation steps of the auxiliary analysis method for stroke rehabilitation evaluation based on voice multi-task learning are detailed in the Disclosure above. The main innovations of the invention are: (1) standard Mel-frequency cepstral coefficients (MFCCs) reflect only the static characteristics of speech parameters, while the dynamic characteristics can be described by the differential spectra of those static features; the Mel spectrogram adopted by the invention combines dynamic and static characteristics and thus effectively improves the recognition performance of the system; (2) the ResNet50 network overcomes the performance degradation that accompanies increasing network depth and can mine deep features of the speech signal, while the LSTM network learns relations across the feature-sequence context, better modeling symptoms such as slurred and incoherent speech in stroke patients; (3) under single-task learning, a lone regression network yields large prediction errors on some samples; after the auxiliary classification task is added, such samples are penalized, effectively reducing prediction error and improving model accuracy.
The technical scheme of the invention mainly comprises the following steps:
Step 1. Truncate the input voice data to a fixed length of 4 seconds; pre-emphasize, frame, and window the speech signal; apply a short-time Fourier transform (STFT) to each frame to obtain its short-time magnitude spectrum; take the squared modulus and pass it through a Mel filter bank to obtain a Mel spectrogram. Then slice the Mel spectrogram with a window of 64 frames and a shift of 30 frames to obtain static segment-level Mel spectra, compute their first- and second-order differences, and stack the static spectra with the two differences, finally obtaining segment-level Mel spectrograms of 64 × 64 pixels.
Step 2. The labels of the existing data set are physicians' evaluation scores of speech-function impairment; divide the data into four severity levels according to the score intervals and use these levels as the labels of the auxiliary classification task.
Step 3. Feed the segment-level Mel spectrograms extracted in step 1 to an improved ResNet50 deep convolutional neural network and, using a hard parameter sharing mechanism, add an auxiliary task that classifies the severity of stroke-induced speech-function impairment on top of the main task, a regression task that predicts the impairment score; load pre-trained network weights, attach the labels from step 2, modify the loss function, train the model, and extract 100-dimensional deep features.
Step 4. Arrange the 100-dimensional deep features of the segment-level Mel spectrograms obtained in step 3 into utterance-level features in temporal order; adopt a three-layer LSTM network with the same hard parameter sharing mechanism and auxiliary severity-classification task on top of the main score-regression task, modify the loss function, train the model, and finally obtain the evaluation score of speech-function impairment.
The specific implementation of step 1 is as follows:
1-1. Truncate the original speech signal to a fixed length of four seconds: segments longer than four seconds are cut off, and segments shorter than four seconds are extended to four seconds by copying the existing samples.
1-2. Pass the speech signal through the high-pass pre-emphasis filter H(z) = 1 − μz^(−1) to enhance the high-frequency part of the signal; then frame the signal with a frame length of 25 ms and a frame shift of 10 ms, and multiply each frame by a Hamming window.
1-3. Apply a fast Fourier transform to each framed, windowed signal to obtain its short-time magnitude spectrum; take the squared modulus and pass it through a Mel filter bank with 64 filters, where the Mel scale used by the filter bank is the standard mapping
Mel(f) = 2595 · log10(1 + f / 700)
Processing the full 4-second speech signal yields a Mel spectrogram of 400 × 64 pixels.
1-4. Slice the obtained 400 × 64-pixel Mel spectrogram with a window length of 64 pixels and a shift of 30 pixels to obtain static Mel-spectrum images, compute their first- and second-order differences, and stack the static image with the two differences to form a picture analogous to a three-channel RGB image; the final 4 s audio is thus sliced into 13 segment-level Mel spectrograms of 64 × 64 pixels.
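To make the preprocessing concrete, the minimal Python sketch below reproduces steps 1-1 to 1-4. The patent names no toolkit, so librosa, a 16 kHz sampling rate, and a pre-emphasis coefficient μ = 0.97 are assumptions of the sketch; the 25 ms / 10 ms framing, Hamming window, 64-filter Mel bank, 64-frame window with 30-frame shift, and static + first/second-difference stacking follow the text.

```python
import numpy as np
import librosa

def segment_level_mel(path, sr=16000, mu=0.97):
    """Steps 1-1 to 1-4: fixed 4 s clip -> 3-channel segment-level Mel images."""
    y, _ = librosa.load(path, sr=sr)
    target = 4 * sr
    # Step 1-1: cut long clips at 4 s; tile short clips up to 4 s.
    y = y[:target] if len(y) >= target else np.tile(y, int(np.ceil(target / len(y))))[:target]
    # Step 1-2: pre-emphasis H(z) = 1 - mu * z^(-1).
    y = librosa.effects.preemphasis(y, coef=mu)
    # Steps 1-2/1-3: 25 ms Hamming frames, 10 ms shift, |FFT|^2, 64 Mel filters.
    S = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=512, win_length=int(0.025 * sr),
        hop_length=int(0.010 * sr), window="hamming", n_mels=64, power=2.0)
    logS = librosa.power_to_db(S)              # (64 bands, ~400 frames)
    d1 = librosa.feature.delta(logS, order=1)  # first-order difference
    d2 = librosa.feature.delta(logS, order=2)  # second-order difference
    # Step 1-4: 64-frame windows with a 30-frame shift; stack static/d1/d2
    # like the three channels of an RGB image.
    segs = [np.stack([m[:, s:s + 64] for m in (logS, d1, d2)])
            for s in range(0, logS.shape[1] - 63, 30)]
    return np.asarray(segs)                    # (n_segments, 3, 64, 64)
```

With a 10 ms shift a 4 s clip gives roughly 400 frames, so the slicing yields about a dozen segments per clip; the exact count (the patent reports 13) depends on the library's framing and padding conventions.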
The specific implementation of step 2 is as follows:
2-1. The evaluation scores are eleven groups of data, namely aphasia quotient (AQ) scores of 78, 91, 87, 68, 91, 80, 92, 74, 71, 81, and 61. The scores range from 0 to 100 and, to facilitate model training, are mapped to decimals between 0 and 1: 0.78, 0.91, 0.87, 0.68, 0.91, 0.80, 0.92, 0.74, 0.71, 0.81, 0.61. Severity levels are then set according to the corresponding intervals: 0.91, 0.87, 0.91, 0.92 are the mild type; 0.78, 0.80, 0.81 the moderate type; 0.71, 0.74, 0.68 the severe type; and 0.61 the very severe type, giving four classes in total as the labels of the classification task.
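As an illustration of this step, the score normalization and interval-to-class mapping can be written directly in plain Python, using the eleven scores listed above:

```python
def severity_class(score):
    """Map a 0-100 evaluation score to the four severity levels of step 2."""
    if score >= 85:
        return 0  # mild
    if score >= 75:
        return 1  # moderate
    if score >= 65:
        return 2  # severe
    return 3      # very severe (scores 60-64 in this dataset)

scores = [78, 91, 87, 68, 91, 80, 92, 74, 71, 81, 61]
regression_labels = [s / 100 for s in scores]        # targets for the main task
class_labels = [severity_class(s) for s in scores]   # targets for the auxiliary task
```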
The specific implementation of step 3 is as follows:
3-1. Modify the ResNet50 network structure: the ResNet50 output layer originally built for the 1000-class ImageNet task, with 1000 neurons in total, is replaced by a layer of 100 neurons; network output layers are then added for the two tasks, with 1 neuron in the output layer of the regression task and 4 neurons in the output layer of the classification task.
3-2. Training adopts a multi-task learning mechanism with hard parameter sharing, i.e., the network layers before the two task output layers share parameters and only the output layers carry task-specific parameters. Since the regression task corresponds to the mean-square-error loss (MSELoss) and the classification task to the cross-entropy loss (CrossEntropyLoss), the loss function TotalLoss used is their weighted superposition; loading the weight parameters of the pre-trained model by transfer learning effectively accelerates network training:
MSELoss = (1/n) · Σ_{i=1}^{n} (x_i − x̂_i)²
CrossEntropyLoss = −(1/n) · Σ_{i=1}^{n} Σ_{j=1}^{m} y_ij · log(ŷ_ij)
TotalLoss = α · MSELoss + β · CrossEntropyLoss
where x_i and x̂_i respectively denote the predicted value and the label value of the regression task, ŷ_ij and y_ij respectively denote the predicted value and the label value of the classification task, n denotes the number of samples in each training batch, and m denotes the number of classes of the auxiliary classification task; the invention uses α = 1 and β = 0.5.
3-3. After the model is trained, input a segment-level Mel spectrogram and extract the output of the penultimate layer of the modified ResNet50 network as the feature; since the penultimate layer has 100 neurons, the feature dimension is 100.
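The following is a minimal sketch of the modified ResNet50 of steps 3-1 to 3-3, assuming PyTorch and torchvision (the patent names no framework). The 100-neuron replacement head, the 1-neuron regression and 4-neuron classification outputs, the hard-shared trunk, the ImageNet pre-trained weights, and the weighting α = 1, β = 0.5 follow the text.

```python
import torch
import torch.nn as nn
from torchvision import models

class MultiTaskResNet50(nn.Module):
    def __init__(self):
        super().__init__()
        # Pre-trained ResNet50 trunk (transfer learning, step 3-2); its 1000-way
        # ImageNet output layer is replaced by 100 neurons (step 3-1).
        self.trunk = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        self.trunk.fc = nn.Linear(self.trunk.fc.in_features, 100)
        self.reg_head = nn.Linear(100, 1)   # main task: impairment score
        self.cls_head = nn.Linear(100, 4)   # auxiliary task: 4 severity levels

    def forward(self, x):                   # x: (batch, 3, 64, 64) Mel images
        feat = self.trunk(x)                # shared 100-dim deep feature (step 3-3)
        return self.reg_head(feat).squeeze(-1), self.cls_head(feat), feat

alpha, beta = 1.0, 0.5
mse, ce = nn.MSELoss(), nn.CrossEntropyLoss()

def total_loss(score_pred, logits, score_label, class_label):
    # TotalLoss = alpha * MSELoss + beta * CrossEntropyLoss (step 3-2)
    return alpha * mse(score_pred, score_label) + beta * ce(logits, class_label)
```

After training, the 100-dimensional feat returned by forward plays the role of the segment-level deep feature of step 3-3.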
The specific implementation of step 4 is as follows:
4-1. Arrange the obtained 100-dimensional segment-level features into utterance-level features following the temporal order in which the Mel spectrogram was sliced, so that each 4 s voice clip yields a 13 × 100-dimensional utterance-level feature.
4-2. Predict from the input utterance-level features with a three-layer LSTM network, 64 neurons per layer, using dropout = 0.5 to reduce network overfitting.
4-3. Training adopts the multi-task learning mechanism: the LSTM regression output layer has one neuron and the classification output layer has four; the model uses hard parameter sharing, with the network layers before the two output layers sharing parameters and only the output layers carrying task-specific parameters. The loss function TotalLoss is again the weighted superposition of the mean-square-error loss and the cross-entropy loss, with the same definitions and weights (α = 1, β = 0.5) as in step 3-2.
4-4. The output of the neuron in the LSTM regression output layer is the final prediction result, as sketched below.
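Under the same PyTorch assumption, the three-layer LSTM of steps 4-2 to 4-4 can be sketched as follows; interpreting the dropout = 0.5 of step 4-2 as inter-layer dropout (nn.LSTM's dropout argument) is one plausible reading. The network is trained with the same total_loss as the ResNet50 sketch above.

```python
import torch.nn as nn

class MultiTaskLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        # Three stacked LSTM layers, 64 neurons each, dropout 0.5 between layers.
        self.lstm = nn.LSTM(input_size=100, hidden_size=64,
                            num_layers=3, batch_first=True, dropout=0.5)
        self.reg_head = nn.Linear(64, 1)   # one neuron: final score (step 4-4)
        self.cls_head = nn.Linear(64, 4)   # four neurons: severity class

    def forward(self, x):                  # x: (batch, n_segments, 100)
        out, _ = self.lstm(x)
        last = out[:, -1, :]               # hidden state after the last time step
        return self.reg_head(last).squeeze(-1), self.cls_head(last)
```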
4-5. The evaluation indices of the model are the root mean square error (RMSE), the coefficient of determination R-square, and the mean absolute error (MAE), calculated as:
RMSE = sqrt( (1/n) · Σ_{i=1}^{n} (y_i − ŷ_i)² )
MAE = (1/n) · Σ_{i=1}^{n} |y_i − ŷ_i|
R-square = 1 − [ Σ_{i=1}^{n} (y_i − ŷ_i)² ] / [ Σ_{i=1}^{n} (y_i − y_mean)² ]
where y_i denotes the true value of a sample, ŷ_i its predicted value, and y_mean the mean of the true values of all samples. RMSE is the square root of the mean squared error between the fitted and the original data; the smaller the value, the better the fit. MAE averages the absolute differences between predicted and true values to measure how close the predictions are to the real data; smaller is likewise better. R-square lies between 0 and 1: the closer to 1, the better the model's prediction; the closer to 0, the worse.
The invention also provides an auxiliary analysis system for cerebral apoplexy rehabilitation evaluation based on voice multi-task learning, comprising a data preprocessing module, a speech-function impairment level module, an improved ResNet50 network model, and an improved three-layer LSTM network model.
The data preprocessing module is implemented as follows: truncate the input voice data to a fixed length of 4 seconds; pre-emphasize, frame, and window the speech signal; apply a short-time Fourier transform to each frame; and pass the result through a Mel filter bank to obtain a Mel spectrogram. Then slice the Mel spectrogram with a window of 64 frames and a shift of 30 frames to obtain static segment-level Mel spectra, compute their first- and second-order differences, and stack the static spectra with the two differences, finally obtaining segment-level Mel spectrograms of 64 × 64 pixels.
The speech-function impairment level module is implemented as follows: the labels of the existing data set are physicians' evaluation scores of speech-function impairment, and the data are divided into four severity levels according to the score intervals, which serve as the labels of the auxiliary classification task.
The improved ResNet50 network model is implemented as follows: the segment-level Mel spectrograms extracted by the data preprocessing module are fed to an improved ResNet50 deep convolutional neural network that, via hard parameter sharing, adds an auxiliary task classifying the severity of stroke-induced speech-function impairment on top of the main regression task predicting the impairment score; pre-trained network weights are loaded, the labels from the speech-function impairment level module are attached, the loss function is modified, the model is trained, and 100-dimensional deep features are extracted.
The improved three-layer LSTM network model is implemented as follows: the 100-dimensional deep features of the segment-level Mel spectrograms obtained from the improved ResNet50 network model are arranged into utterance-level features in temporal order and fed to a three-layer LSTM network that, via hard parameter sharing, adds the same auxiliary severity-classification task on top of the main score-regression task; the loss function is modified, the model is trained, and the evaluation score of speech-function impairment is finally obtained.
To achieve a better prediction effect in stroke voice rehabilitation evaluation, the selection and design of parameters in practical application are developed below, as a reference for other applications of the invention:
The invention fixes the voice data at 4 s only to facilitate model training; once the model is trained, voice data of any length can be processed in practical application.
In practical application, the acquired voice data are processed as in step 1 to extract a Mel spectrogram, which after slicing yields N segment-level Mel spectrograms of 64 × 64 pixels. The ResNet feature extraction module of step 3 then produces segment-level features, which are stacked in temporal order into an N × 100-dimensional utterance-level feature; the final score is obtained by running the three-layer LSTM over the N time steps, as sketched below.
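Chaining the earlier sketches gives a hypothetical end-to-end inference routine. segment_level_mel, MultiTaskResNet50, and MultiTaskLSTM are the illustrative helpers defined above, not names from the patent; for variable-length input, the fixed 4 s truncation inside segment_level_mel would be dropped so that N can vary.

```python
import torch

def predict_rehab_score(wav_path, resnet, lstm):
    """N segment-level Mel images -> N x 100 utterance feature -> final score."""
    segs = segment_level_mel(wav_path)             # (N, 3, 64, 64)
    x = torch.from_numpy(segs).float()
    resnet.eval()
    lstm.eval()
    with torch.no_grad():
        _, _, feats = resnet(x)                    # (N, 100) segment features
        score, _ = lstm(feats.unsqueeze(0))        # one utterance: (1, N, 100)
    return float(score)                            # predicted score in [0, 1]
```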
When only the single regression-task model is used, the evaluation indices are RMSE = 0.036, MAE = 0.027, and R-square = 0.778, and comparing the model's predictions with the ground truth reveals large errors on some samples. With multi-task learning, the indices improve to RMSE = 0.029, MAE = 0.022, and R-square = 0.837, and the number of samples with large prediction errors drops markedly. In conclusion, the voice-multi-task-learning-based auxiliary analysis method for stroke rehabilitation evaluation provides scientific and objective results for voice rehabilitation assessment and fills the gap left by purely manual rehabilitation evaluation.

Claims (5)

1. An auxiliary analysis method for cerebral apoplexy rehabilitation evaluation based on voice multitasking learning, characterized by comprising the following steps:
step 1, truncating the input voice data to a fixed length of 4 seconds; pre-emphasizing, framing, and windowing the speech signal; applying a short-time Fourier transform to each frame; and passing the result through a Mel filter bank to obtain a Mel spectrogram; then slicing the Mel spectrogram with a window of 64 frames and a shift of 30 frames to obtain static segment-level Mel spectra, computing their first- and second-order differences, and stacking the static spectra with the two differences, finally obtaining segment-level Mel spectrograms of 64 × 64 pixels;
step 2, taking the labels of the existing data set, which are physicians' evaluation scores of speech-function impairment, and dividing the data into four severity levels according to the score intervals to serve as the labels of an auxiliary classification task;
step 3, feeding the segment-level Mel spectrograms extracted in step 1 to an improved ResNet50 deep convolutional neural network and, using a hard parameter sharing mechanism, adding an auxiliary task that classifies the severity of stroke-induced speech-function impairment on top of the main task, a regression task that predicts the impairment score; loading pre-trained network weights, attaching the labels from step 2, modifying the loss function, training the model, and extracting 100-dimensional deep features;
and step 4, arranging the 100-dimensional deep features of the segment-level Mel spectrograms obtained in step 3 into utterance-level features in temporal order; adopting a three-layer LSTM network with the same hard parameter sharing mechanism and auxiliary severity-classification task on top of the main score-regression task, modifying the loss function, training the model, and finally obtaining the evaluation score of speech-function impairment.
2. The auxiliary analysis method for cerebral apoplexy rehabilitation evaluation based on voice multitasking learning according to claim 1, wherein step 1 is specifically implemented as follows:
1-1, truncating the original speech signal to a fixed length of four seconds, discarding the portion of segments exceeding four seconds and extending segments shorter than four seconds to four seconds by copying the existing samples;
1-2, passing the speech signal through the high-pass pre-emphasis filter H(z) = 1 − μz^(−1) to enhance the high-frequency part of the signal; then framing the signal with a frame length of 25 ms and a frame shift of 10 ms, and multiplying each frame by a Hamming window;
1-3, applying a fast Fourier transform to each framed, windowed signal to obtain its short-time magnitude spectrum, taking the squared modulus, and passing it through a Mel filter bank with 64 filters, the Mel scale of the filter bank being the standard mapping
Mel(f) = 2595 · log10(1 + f / 700)
whereby the full 4-second audio yields a Mel spectrogram of 400 × 64 pixels;
1-4, slicing the 400 × 64-pixel Mel spectrogram with a window length of 64 pixels and a shift of 30 pixels to obtain static Mel-spectrum images, computing their first- and second-order differences, and stacking the static image with the two differences to form a picture analogous to a three-channel RGB image, the final 4 s audio being sliced into 13 segment-level Mel spectrograms of 64 × 64 pixels.
3. The auxiliary analysis method for cerebral apoplexy rehabilitation evaluation based on voice multitasking learning according to claim 1 or 2, wherein step 2 is specifically implemented as follows:
2-1, setting samples with evaluation scores in [85, 100] to the mild type, samples with scores in [75, 84] to the moderate type, samples with scores in [65, 74] to the severe type, and samples with scores in [60, 64] to the very severe type.
4. The auxiliary analysis method for cerebral apoplexy rehabilitation evaluation based on voice multitasking learning according to claim 1, wherein step 3 is specifically implemented as follows:
3-1, the improved ResNet50 network structure being as follows: the ResNet50 output layer originally built for the 1000-class ImageNet task, with 1000 neurons in total, is replaced by a layer of 100 neurons, and separate network output layers are then added for the two tasks, with 1 neuron in the output layer of the regression task and 4 neurons in the output layer of the classification task;
3-2, training with a multi-task learning mechanism, the model using hard parameter sharing, i.e., the network layers before the two task output layers share parameters and only the output layers carry task-specific parameters; the regression task corresponding to the mean-square-error loss MSELoss and the classification task to the cross-entropy loss CrossEntropyLoss, the loss function TotalLoss used is their weighted superposition, and loading the weight parameters of a pre-trained ResNet50 network by transfer learning effectively accelerates network training:
MSELoss = (1/n) · Σ_{i=1}^{n} (x_i − x̂_i)²
CrossEntropyLoss = −(1/n) · Σ_{i=1}^{n} Σ_{j=1}^{m} y_ij · log(ŷ_ij)
TotalLoss = α · MSELoss + β · CrossEntropyLoss
wherein x_i and x̂_i respectively denote the predicted value and the label value of the regression task, ŷ_ij and y_ij respectively denote the predicted value and the label value of the classification task, n denotes the number of samples in each training batch, and m denotes the number of classes of the auxiliary classification task, with α = 1 and β = 0.5;
3-3, after the model is trained, inputting a segment-level Mel spectrogram and taking the output of the penultimate layer of the modified ResNet50 network as the feature; since the penultimate layer has 100 neurons, the feature dimension is 100.
5. The auxiliary analysis method for cerebral apoplexy rehabilitation evaluation based on voice multitasking learning according to claim 1, wherein step 4 is specifically implemented as follows:
4-1, arranging the obtained 100-dimensional segment-level features into utterance-level features following the temporal order in which the Mel spectrogram was sliced, so that each 4 s voice clip yields a 13 × 100-dimensional utterance-level feature;
4-2, predicting from the input utterance-level features with a three-layer LSTM network, 64 neurons per layer, using dropout = 0.5 to reduce network overfitting;
4-3, training with a multi-task learning mechanism, the LSTM regression output layer having one neuron and the classification output layer four, the model using hard parameter sharing with the network layers before the two output layers sharing parameters and only the output layers carrying task-specific parameters; the loss function TotalLoss being the weighted superposition of the mean-square-error loss and the cross-entropy loss:
MSELoss = (1/n) · Σ_{i=1}^{n} (x_i − x̂_i)²
CrossEntropyLoss = −(1/n) · Σ_{i=1}^{n} Σ_{j=1}^{m} y_ij · log(ŷ_ij)
TotalLoss = α · MSELoss + β · CrossEntropyLoss
wherein x_i and x̂_i respectively denote the predicted value and the label value of the regression task, ŷ_ij and y_ij respectively denote the predicted value and the label value of the classification task, n denotes the number of samples in each training batch, and m denotes the number of classes of the auxiliary classification task, with α = 1 and β = 0.5;
4-4, the output of the neuron in the LSTM network's regression output layer being the final prediction result;
4-5, the evaluation indices of the model being the root mean square error (RMSE), the coefficient of determination R-square, and the mean absolute error (MAE), calculated as:
RMSE = sqrt( (1/n) · Σ_{i=1}^{n} (y_i − ŷ_i)² )
MAE = (1/n) · Σ_{i=1}^{n} |y_i − ŷ_i|
R-square = 1 − [ Σ_{i=1}^{n} (y_i − ŷ_i)² ] / [ Σ_{i=1}^{n} (y_i − y_mean)² ]
wherein y_i denotes the true value of a sample, ŷ_i its predicted value, and y_mean the mean of the true values of all samples.


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant