CN115376560A - Voice feature coding model for early screening of mild cognitive impairment and training method thereof - Google Patents
- Publication number
- CN115376560A CN115376560A CN202211010852.XA CN202211010852A CN115376560A CN 115376560 A CN115376560 A CN 115376560A CN 202211010852 A CN202211010852 A CN 202211010852A CN 115376560 A CN115376560 A CN 115376560A
- Authority
- CN
- China
- Prior art keywords
- audio
- feature
- cognitive impairment
- samples
- mild cognitive
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 208000010877 cognitive disease Diseases 0.000 title claims abstract description 53
- 208000027061 mild cognitive impairment Diseases 0.000 title claims abstract description 51
- 238000012549 training Methods 0.000 title claims abstract description 43
- 238000000034 method Methods 0.000 title claims abstract description 40
- 238000012216 screening Methods 0.000 title claims abstract description 36
- 238000000605 extraction Methods 0.000 claims abstract description 44
- 208000024827 Alzheimer disease Diseases 0.000 claims abstract description 23
- 238000001228 spectrum Methods 0.000 claims description 29
- 239000013598 vector Substances 0.000 claims description 29
- 230000006870 function Effects 0.000 claims description 11
- 238000012360 testing method Methods 0.000 claims description 8
- 238000012545 processing Methods 0.000 claims description 6
- 230000003595 spectral effect Effects 0.000 claims description 6
- 230000009467 reduction Effects 0.000 claims description 5
- 230000011218 segmentation Effects 0.000 claims description 5
- 230000004927 fusion Effects 0.000 claims description 4
- 238000005259 measurement Methods 0.000 claims description 4
- 230000001149 cognitive effect Effects 0.000 claims description 3
- 230000000694 effects Effects 0.000 claims description 3
- 238000002474 experimental method Methods 0.000 claims description 3
- 238000001914 filtration Methods 0.000 claims description 3
- 230000002779 inactivation Effects 0.000 claims description 3
- 230000007246 mechanism Effects 0.000 claims description 3
- 238000011176 pooling Methods 0.000 claims description 3
- 230000002269 spontaneous effect Effects 0.000 claims description 3
- 238000007781 pre-processing Methods 0.000 claims description 2
- 230000000717 retained effect Effects 0.000 claims description 2
- 238000011410 subtraction method Methods 0.000 claims description 2
- 238000010586 diagram Methods 0.000 description 5
- 238000011160 research Methods 0.000 description 4
- 230000006735 deficit Effects 0.000 description 3
- 230000008569 process Effects 0.000 description 3
- 230000019771 cognition Effects 0.000 description 2
- 230000006835 compression Effects 0.000 description 2
- 238000007906 compression Methods 0.000 description 2
- 239000000284 extract Substances 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 230000008447 perception Effects 0.000 description 2
- 208000024891 symptom Diseases 0.000 description 2
- 208000000044 Amnesia Diseases 0.000 description 1
- 208000028698 Cognitive impairment Diseases 0.000 description 1
- 208000026139 Memory disease Diseases 0.000 description 1
- 208000012902 Nervous system disease Diseases 0.000 description 1
- 238000013528 artificial neural network Methods 0.000 description 1
- 230000006399 behavior Effects 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000002457 bidirectional effect Effects 0.000 description 1
- 239000000090 biomarker Substances 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 239000008358 core component Substances 0.000 description 1
- 230000003111 delayed effect Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 201000010099 disease Diseases 0.000 description 1
- 208000037265 diseases, disorders, signs and symptoms Diseases 0.000 description 1
- 238000013399 early diagnosis Methods 0.000 description 1
- 210000005069 ears Anatomy 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 230000006984 memory degeneration Effects 0.000 description 1
- 208000023060 memory loss Diseases 0.000 description 1
- 238000002610 neuroimaging Methods 0.000 description 1
- 230000003557 neuropsychological effect Effects 0.000 description 1
- 230000002085 persistent effect Effects 0.000 description 1
- 230000009897 systematic effect Effects 0.000 description 1
- 230000007704 transition Effects 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
- 230000003313 weakening effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/66—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Public Health (AREA)
- General Health & Medical Sciences (AREA)
- Epidemiology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The invention relates to a speech feature coding model for early screening of mild cognitive impairment and a training method thereof, using a two-stage feature coding generation scheme in which each stage has its own feature coding extraction network and classifier. The first-stage feature extraction network converts audio into an Alzheimer's-disease-oriented feature coding, from which the classifier separates Alzheimer's audio from non-Alzheimer's audio. The second-stage feature extraction network converts the non-Alzheimer's audio into PLP feature form and finally generates a speech feature code for early screening of mild cognitive impairment. The code shows excellent performance in classifying mild cognitive impairment against healthy controls, and can effectively improve the accuracy of early screening for Alzheimer's disease.
Description
Technical Field
The invention relates to a computer audio technology, in particular to a speech feature coding model for early screening of mild cognitive impairment and a training method thereof.
Background
Alzheimer's disease is a persistent neurological disorder, usually occurring in the elderly or presenile stages, whose symptoms include memory loss, impairment of visuospatial ability, impairment of abstract thinking and computational power, and changes of personality and behavior. These symptoms progress irreversibly until the ability to live independently is lost completely. The early stage of Alzheimer's disease is called mild cognitive impairment; treatment and intervention at this stage can effectively delay the development of the disease, improving patients' quality of life and prolonging their lives. Early screening for mild cognitive impairment is therefore critical.
Conventional screening methods for mild cognitive impairment include neuroimaging examinations, neuropsychological assessments, biomarker examinations, and the like. However, these methods require the intervention of specialized physicians, are time-consuming and costly, and some are invasive, which in practice has left most Alzheimer's patients without an early diagnosis. Judging a patient's cognitive status from voice characteristics therefore effectively lowers the threshold for early screening of mild cognitive impairment and has scientific, social, and economic value.
At present, little research exists on screening cognitive impairment with voice characteristics. Most studies consider only the classification of Alzheimer's patients versus normal people, drawing a blurred boundary that serves early screening poorly. A few studies recognize the need to screen for mild cognitive impairment but keep the habits of earlier work, using some unified or fused features to classify Alzheimer's disease, mild cognitive impairment, and normal people directly into three categories, without considering the correlation between more refined speech features and the patient's speech. Taking Chinese patent application CN114333911A as an example: its early Alzheimer's recognition system chains a voice signal acquisition module, a voice feature extraction module, a local feature modeling module, a global relationship modeling module, and a recognition module; it slices the speech, extracts logarithmic Mel spectrum features, and performs three-way classification with a bidirectional long short-term memory network to judge the patient's state. However, this approach ignores that mild cognitive impairment is the intermediate, transitional stage between normal cognition and Alzheimer's disease, so its logarithmic Mel spectrum features share many similarities with both of the other classes and are easily confused with them. The final result is that the classification accuracy for mild cognitive impairment is far lower than for the other two classes, which has little practical clinical significance.
Therefore, the prior art needs an audio feature learning algorithm that acquires deeper acoustic information, and a speech feature coding method oriented to mild cognitive impairment, to improve the accuracy of early screening for Alzheimer's disease.
Disclosure of Invention
Aiming at the problems in screening for mild cognitive impairment, a speech feature coding model for early screening of mild cognitive impairment and a training method thereof are provided, acquiring deeper acoustic information and improving the accuracy of early screening for Alzheimer's disease.
The technical scheme of the invention is as follows: a speech feature coding model for early screening of mild cognitive impairment comprises two feature extraction networks and two classifiers. All data are sent to the first feature extraction network, which extracts the audio features of Alzheimer's patients; the resulting speech feature codes are sent to the first classifier, which separates the audio of Alzheimer's patients from that of non-Alzheimer's patients. The non-Alzheimer's audio is then sent to the second feature extraction network, which extracts the audio features of patients with mild cognitive impairment, and the resulting early-stage mild-cognitive-impairment feature codes are sent to the second classifier for classification and recognition;
the first feature extraction network is a 1D multi-convolution fusion network combined with attention among channels and comprises three branch networks, the tail end of each convolution network branch is connected with a high-efficiency channel attention module, so that the network adaptively distributes different weights to different channel features, splicing operation is carried out on three vectors extracted by the three branches, and finally one-dimensional feature codes, namely voice feature codes of Alzheimer patients, are obtained through a full connection layer;
the second feature extraction network is a feature extraction network based on 2D convolution, a continuous convolution pooling structure is adopted, a random inactivation layer is added to reduce overfitting, feature codes are output, high-dimensional audio features obtained after non-Alzheimer type audio is input and processed are input, the sample input format is mxnxc, wherein m and n respectively correspond to the width and height of two-dimensional features, and c corresponds to the number of channels of the features.
A speech feature coding model training method for early screening of mild cognitive impairment comprises the following steps:
1) Collecting audio data and preprocessing an audio data set:
collecting spontaneous speech, each segment about 30 to 60 seconds long, divided into three classes: from subjects with normal cognition, Alzheimer's patients, and patients with mild cognitive impairment;
all audio files used as training data and test data undergo unified noise reduction with the Berouti spectral subtraction method; the audio is then expanded in quantity by segmentation, and screened and filtered, so that the resulting data set serves for training and testing the first feature extraction network;
2) The preprocessed audio data set is fed into the first feature extraction network for training. During each training pass, the audio data is processed in stages by metric learning modules based on a decision mechanism, each stage adopting different acoustic features and classification schemes: the audio or its corresponding acoustic features are input into the metric learning module to generate feature vectors, which are then classified. The two metric learning modules perform self-supervised clustering of the feature vectors, guaranteeing the usability of the features generated by the feature extraction networks;
3) The second feature extraction network performs the second stage of acoustic feature extraction on the non-Alzheimer's speech recognized by the classifier of the first feature extraction network trained in step 2): the recognized non-Alzheimer's speech is converted into PLP feature form and used as the input of the second feature extraction network, which finally produces, end to end, the speech feature code for early screening of mild cognitive impairment.
Further, the specific method for denoising with the Berouti spectral subtraction in step 1) is as follows: the power spectrum of the clean audio is obtained by formula (1):

|X(ω)|² = |Y(ω)|² − α|D(ω)|², if |Y(ω)|² − α|D(ω)|² > β|D(ω)|²; otherwise |X(ω)|² = β|D(ω)|²  (1)

where ω is one frame of the audio vector; |X(ω)|² is the clean power spectrum, |Y(ω)|² the noisy power spectrum, and |D(ω)|² the noise power spectrum; α is an over-subtraction factor governed by the current audio signal-to-noise ratio. The average noise power spectrum of the first few frames of the original audio replaces the unknown |D(ω)|²; β is set to 0.02;

the audio frame length is determined by formula (2) with a window overlap of 50%, taking the average noise spectrum of the first 5 frames as the noise spectrum, where the frame rate is 16000;
the method comprises the steps of expanding the segmentation quantity of audio in an audio slicing mode, dividing the audio into 2-second short audio segments, wherein an overlapping window between the short audio segments is 1 second; considering that an audio vector is too short and may not contain valid audio information, the audio samples are filtered by the following formula:
where w is a complete audio vector, ω is a 2 second slice of audio, γ is a threshold set to 0.3, l is the length of w, and l is the length of ω, which represents the removal of audio samples if False is found, otherwise it is retained, and finally a data set for comparison experiments is generated.
Further, in the training of step 2), a triplet loss function serves as the loss function of the metric learning module. Samples are input in triplet form (anchor sample, positive sample, negative sample), where the anchor and positive samples belong to the same class and the negative sample to a different class. The loss function is formula (4):

L = max(d(a, p) − d(a, n) + margin, 0)  (4)

where d(·, ·) is the L2 distance between feature vectors, used as the metric distance, and margin is an interval parameter that widens the gap between the anchor-positive pair and the anchor-negative pair. Continual training drives the L2 distance between anchor and positive below the interval parameter and that between anchor and negative above it, achieving the clustering effect on the feature vectors.
Further, in the training of step 2), the triplet samples are generated dynamically from memory by a decorrelation sample generator. In each training round the generator selects samples as anchors in a random order and guarantees that every sample serves as the anchor of one triplet per round, achieving decorrelation in the triplet sequence; it also constructs the triplets randomly, so the same triplet rarely reappears across different training iterations, achieving decorrelation in the tuple structure.
Further, step 3) obtains 13-dimensional PLP features as the acoustic features for the speech feature coding extraction oriented to early mild cognitive impairment; the features undergo equal-loudness pre-emphasis and cubic-root compression, and a linear-prediction autoregressive model finally yields the cepstral coefficients.
A method for applying the trained speech feature coding model for early screening of mild cognitive impairment: a long speech recording is divided into multiple 2 s short audio segments, each 2 s segment is sent to the trained model, and a final vote over the classification results of the short segments yields the category of the long audio.
The invention has the beneficial effects that: the speech feature coding model for early screening of mild cognitive impairment and its training method address the difficulty, common in audio classification research on early Alzheimer's screening, of separating mild cognitive impairment from healthy controls, and provide a new idea for early screening research on Alzheimer's disease.
Drawings
FIG. 1 is a diagram of a model architecture for use in the method of the present invention;
FIG. 2 is a schematic diagram of a 1D multi-convolution fusion network structure incorporating inter-channel attention in the method of the present invention;
FIG. 3 is a schematic diagram of a 2D convolution-based feature extraction network structure in the method of the present invention;
FIG. 4 is a schematic diagram of a metric learning module in the method of the present invention;
fig. 5 is a schematic diagram of a decorrelated sample generator configuration in accordance with the method of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. The present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, but the scope of the present invention is not limited to the following embodiments.
The core of the invention is the way the feature code is generated; the feature code generated from a 2 s short audio segment is evaluated with a classifier inside the model. In practical applications, long speech is usually classified to improve accuracy. With the invention, long speech can be divided into multiple 2 s short audio segments, and a final vote over the classification result of each short segment yields the category of the long audio. The detailed voting scheme may be chosen freely and is not elaborated here.
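The slice-level voting just described can be sketched as follows; plurality voting is one reasonable choice, since the text leaves the exact voting scheme open:

```python
from collections import Counter

def vote_long_audio(slice_labels):
    """Aggregate per-2s-slice predictions into one label for the long
    audio by plurality vote (one possible scheme; the patent leaves the
    exact voting rule to the implementer)."""
    return Counter(slice_labels).most_common(1)[0][0]

print(vote_long_audio(["MCI", "HC", "MCI", "MCI", "HC"]))  # MCI wins 3-2
```

Any rule that maps the list of slice labels to one label would fit here, e.g. weighting slices by classifier confidence.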
The invention relates to a speech feature coding method for early screening of mild cognitive impairment, using the model architecture shown in fig. 1, in which the two feature extraction networks are the core components of the two modules. Note that each module pairs a feature extraction network (feature generation net) with an XGBoost-based classifier. XGBoost is an ensemble learning model with high training speed and excellent performance. The first classifier (XGBoost classifier one) uses the Alzheimer's-oriented speech feature coding to divide all data into the audio of Alzheimer's patients and that of non-Alzheimer's patients; the second classifier (XGBoost classifier two) uses the audio features oriented to patients with mild cognitive impairment to divide the non-Alzheimer's audio into cognitively normal audio and the audio of patients with mild cognitive impairment. The classification performance of the feature codes produced by each feature extraction network is tested with the XGBoost-based classifier, the feature extraction network is fine-tuned according to the classification results, and a speech feature code for early screening of mild cognitive impairment with excellent classification performance is finally obtained.
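The two-stage cascade of fig. 1 can be sketched as follows. The `ThresholdClassifier` is a hypothetical stand-in for the XGBoost classifiers (any object with a `predict` method fits the cascade), and the scores and labels are illustrative only:

```python
class ThresholdClassifier:
    """Stand-in for the XGBoost classifiers in fig. 1: anything with a
    predict() method fits the cascade.  Real code would use xgboost."""
    def __init__(self, label_if_high, label_if_low, threshold=0.5):
        self.hi, self.lo, self.t = label_if_high, label_if_low, threshold

    def predict(self, score):
        return self.hi if score > self.t else self.lo

def two_stage_screen(codes, clf_ad, clf_mci):
    """First stage separates AD from non-AD; only non-AD audio reaches
    the second stage, which separates MCI from healthy controls (HC)."""
    results = []
    for ad_score, mci_score in codes:
        label = clf_ad.predict(ad_score)
        if label != "AD":
            label = clf_mci.predict(mci_score)
        results.append(label)
    return results

clf1 = ThresholdClassifier("AD", "non-AD")
clf2 = ThresholdClassifier("MCI", "HC")
print(two_stage_screen([(0.9, 0.1), (0.2, 0.8), (0.1, 0.3)], clf1, clf2))
# ['AD', 'MCI', 'HC']
```

The point of the cascade is that the second classifier never sees Alzheimer's audio, so it can specialize in the harder MCI-versus-healthy boundary.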
The method specifically comprises the following steps:
step 1: the audio data set is preprocessed. Firstly, the audio files as training data and test data are subjected to unified noise reduction processing by adopting a technique of Berouti spectral subtraction. The spectrum subtraction is to subtract the spectrum of the noise signal from the spectrum of the noise signal, and the Berouti spectrum subtraction is improved on the basis, so that the condition that the power returns to zero after the spectrum subtraction is avoided. Secondly, the audio file after noise reduction is segmented into short audio sets with 2 seconds, the overlapping part of the segments is 1 second, and the data sets are expanded under the condition that original audio characteristic information is kept. And finally, the sliced and expanded short audio set is filtered out, so that some meaningless or low-value audio sections are removed, and the data set is used for the feature extraction network training and testing in the step 2.
Step 2: as shown in figs. 1 and 2, a 1D multi-convolution fusion network combining inter-channel attention is constructed for feature extraction from the audio slice vectors. The network consists of three branch networks whose convolution kernel sizes are 3, 5, and 7 respectively, giving the network multi-view perception and letting it combine information from different receptive fields for comprehensive feature extraction. The tail end of each convolutional branch is connected to an Efficient Channel Attention module, so that the network adaptively assigns different weights to different channel features, highlighting the more important features and weakening the influence of the others. Finally, the three vectors extracted by the three branches are spliced, and a full connection layer yields a one-dimensional feature code of length 32, namely the speech feature code for Alzheimer's disease. A classifier is attached at the network output; speech whose feature code is classified as Alzheimer's is filtered out, while the speech of normal people and patients with mild cognitive impairment is retained for the further generation and discrimination of speech feature codes oriented to early mild cognitive impairment in step 3.
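A rough numpy sketch of the branch-and-attention idea follows. It is a simplification, not the patented network: the kernels are random rather than trained, the channel count and the branch pooling are arbitrary choices, and the ECA gate uses a fixed averaging kernel where the real module learns a small 1-D convolution over the pooled channel descriptor:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, kernels):
    """x: (length,) signal; kernels: (n_ch, k).  Valid 1-D convolution
    producing an (n_ch, length - k + 1) multi-channel feature map."""
    n_ch, k = kernels.shape
    L = len(x) - k + 1
    return np.stack([[x[i:i + k] @ kernels[c] for i in range(L)]
                     for c in range(n_ch)])

def eca(feat, k=3):
    """Efficient Channel Attention, simplified: gate each channel with a
    sigmoid of a small 1-D filter over the globally-pooled channel vector.
    (A fixed averaging kernel stands in for the learned conv.)"""
    desc = feat.mean(axis=1)                      # global average pool per channel
    pad = np.pad(desc, k // 2, mode="edge")
    w = np.array([pad[i:i + k].mean() for i in range(len(desc))])
    return feat * (1.0 / (1.0 + np.exp(-w)))[:, None]   # sigmoid gate

x = rng.standard_normal(320)                      # toy audio slice vector
branches = [eca(np.maximum(conv1d(x, rng.standard_normal((8, ks))), 0))
            for ks in (3, 5, 7)]                  # kernel sizes 3, 5, 7 as in the text
fused = np.concatenate([b.mean(axis=1) for b in branches])  # splice the branches
code = rng.standard_normal((32, fused.size)) @ fused        # full connection -> 32
print(code.shape)
```

The structural point survives the simplification: three receptive-field sizes in parallel, per-channel reweighting at each branch end, then concatenation into one length-32 code.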
Step 3: as shown in figs. 1 and 3, a feature extraction network based on 2D convolution is constructed. The high-dimensional audio features obtained by processing the audio judged non-Alzheimer's in step 2 are input to the network as samples. The sample input format is m × n × c, where m and n correspond to the width and height of the two-dimensional feature respectively, and c to the number of feature channels. The whole network adopts a traditional continuous convolution-pooling structure, with a random inactivation (dropout) layer added at the end to reduce overfitting, and outputs feature codes of length 32. The invention thus obtains the speech feature code oriented to the early stage of mild cognitive impairment.
For the audio data set of step 1, the invention adopts spontaneous speech such as picture description or free chat, which is often accompanied by pauses, repetitions, meaningless filler words, or incomplete passages and is closer to people's daily conversation. Each speech segment is about 30 to 60 seconds long. All speech divides into three classes: from subjects with normal cognition, Alzheimer's patients, and patients with mild cognitive impairment.
After the data set is collected, noise reduction is required. The invention uses Berouti spectral subtraction; the power spectrum of the clean audio is obtained by formula (1):

|X(ω)|² = |Y(ω)|² − α|D(ω)|², if |Y(ω)|² − α|D(ω)|² > β|D(ω)|²; otherwise |X(ω)|² = β|D(ω)|²  (1)

where ω is one frame of the audio vector, |X(ω)|² the clean power spectrum, |Y(ω)|² the noisy power spectrum, |D(ω)|² the noise power spectrum, and α an over-subtraction factor governed by the current audio signal-to-noise ratio. Since |D(ω)|² cannot be known, the average noise power spectrum of the first few frames of the original audio is generally substituted in the calculation. In the invention β is set to 0.02, the audio frame length is obtained by formula (2), the window overlap is 50%, the average noise spectrum of the first 5 frames is taken as the noise spectrum, and the frame rate in the formula is 16000.
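The spectral floor that distinguishes Berouti spectral subtraction from the plain method can be sketched per frame as follows. Here α is a fixed over-subtraction factor for illustration, whereas the patent ties it to the frame's signal-to-noise ratio:

```python
import numpy as np

def berouti_subtract(noisy_power, noise_power, alpha=4.0, beta=0.02):
    """Berouti spectral subtraction on one frame's power spectrum:
    subtract alpha times the noise estimate, but never let a bin fall
    below the spectral floor beta * noise_power.  That floor is exactly
    the improvement over plain spectral subtraction, which would zero
    (or negate) over-subtracted bins."""
    clean = noisy_power - alpha * noise_power
    floor = beta * noise_power
    return np.where(clean > floor, clean, floor)

noisy = np.array([10.0, 3.0, 0.5])   # toy noisy power spectrum
noise = np.array([1.0, 1.0, 1.0])    # estimated noise power spectrum
print(berouti_subtract(noisy, noise))  # [6.   0.02 0.02]
```

In the full pipeline this runs per frame on windowed FFT power spectra, with the noise estimate taken from the average of the first 5 frames as described above.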
Because the total number of audio samples is small, the invention expands the quantity by audio slicing: the audio is divided into 2-second short segments with a 1-second overlapping window between them. Considering that an audio vector that is too short may not contain valid audio information, the audio samples are filtered by formula (3), where w is a complete audio vector, ω a 2-second slice of it, γ a threshold set to 0.3, L the length of w, and l the length of ω; a sample for which formula (3) yields False is removed, otherwise it is retained, finally producing the data set for the comparison experiments.
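A sketch of the slicing and filtering step. The exact retention criterion of formula (3) is not reproduced in the text, so the sketch uses one plausible reading: a (tail) slice of length l is kept only if l reaches γ times the 2-second window:

```python
import numpy as np

SR = 16000  # sampling rate, matching the frame rate stated in the patent

def slice_audio(w, win_s=2.0, hop_s=1.0, gamma=0.3):
    """Cut audio w into 2 s slices with a 1 s hop (1 s overlap window).
    Filtering is our reading of formula (3): a slice of length l is
    retained only if l >= gamma * win, i.e. at least 0.6 s of audio
    with gamma = 0.3, so near-empty tail slices are removed."""
    win, hop = int(win_s * SR), int(hop_s * SR)
    slices = []
    for start in range(0, len(w), hop):
        seg = w[start:start + win]
        if len(seg) >= gamma * win:
            slices.append(seg)
    return slices

# a 5.2 s recording yields 5 slices; the 0.2 s tail slice is discarded
print(len(slice_audio(np.zeros(int(5.2 * SR)))))  # 5
```

An energy-based criterion (keep a slice only if its mean energy reaches γ times the whole recording's mean energy) would be an equally plausible reading and drops silent slices too.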
For the speech feature extraction of step 2, the invention proposes staged processing by metric learning modules based on a decision mechanism, shown in fig. 1. The method comprises two different speech feature extraction models adopting different acoustic features and classification schemes, so that the classification accuracy at each stage can reach a high level, relieving the problem of low recognition accuracy for mild cognitive impairment. The workflow of the two classifications is similar: the audio or its corresponding acoustic features are input into a metric learning module, which generates feature vectors that are then classified. The two metric learning modules perform self-supervised clustering of the feature vectors, guaranteeing the usability of the features generated by the feature extraction networks.
In the above method, to discriminate different types of speech features effectively, the invention uses a triplet loss as the loss function of the metric learning module, as shown in fig. 4. During training, samples are input in triplet form (anchor sample, positive sample, negative sample), where the anchor and positive samples belong to the same class and the anchor and negative samples to different classes. The loss function takes the form of formula (4):

L = max(d(a, p) − d(a, n) + margin, 0)  (4)

where margin is an interval parameter used to widen the gap between the anchor-positive pair and the anchor-negative pair, and d(·, ·) is the L2 distance between feature vectors, used as the metric distance. Continual training drives the L2 distance between anchor and positive below the interval parameter and that between anchor and negative above it, achieving the clustering effect on the feature vectors; the interval parameters of the first and second metric learning modules are set to 3 and 2 respectively. The relationship between the feature generation network and the metric learning module is shown in fig. 4. Because the current sample size is small, classifying the feature vectors with an artificial neural network would overfit severely, so the classification is completed by a machine learning model, trained and tested on the feature vectors produced by the feature extraction network.
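Formula (4) can be written directly in numpy; the length-32 vectors and distances below are illustrative:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=3.0):
    """Formula (4): L = max(d(a, p) - d(a, n) + margin, 0), with d the
    L2 distance between feature vectors.  margin = 3 for the first
    metric learning module and 2 for the second."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(d_ap - d_an + margin, 0.0)

a = np.zeros(32)              # anchor feature code (length 32)
p = np.zeros(32); p[0] = 1.0  # same-class sample at distance 1
n = np.zeros(32); n[0] = 2.0  # other-class sample at distance 2
print(triplet_loss(a, p, n))  # max(1 - 2 + 3, 0) = 2.0
```

The loss is zero only once the negative sits at least `margin` farther from the anchor than the positive does, which is what produces the clustering behavior described above.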
The invention also provides a decorrelation sample generator. Conventional triplet sample sets suffer from several drawbacks. First, once generated, the triplet samples never change: the set is built by combining the existing non-tuple samples of different classes randomly or by some rule, so some invalid triplets sit in memory from which the network can learn nothing new. Second, because the samples are never fully decorrelated, the network keeps learning the ordering information of the training set, which is useless and redundant here. Finally, the samples cannot be fully utilized: with N classes of M samples each, the number of non-repeating triplets that can be constructed is NM · (M − 1) · (N − 1)M, since each of the NM samples can act as the anchor, paired with any of the M − 1 remaining same-class samples and any of the (N − 1)M samples of other classes.
Obviously, storing so many triplet samples in computer memory is difficult, and only a portion can be loaded at a time, so the samples are not fully utilized. The decorrelated sample generator proposed by the invention solves these problems well; its structure is shown in fig. 5. First, it trains the network by dynamically generating triplets in memory, making the fullest possible use of the samples. Second, in each training round the generator selects samples as anchors in a random order and guarantees that every sample serves as the anchor of a triplet once per round, achieving decorrelation in triplet order. Finally, the generator builds triplets randomly, so the same triplet rarely reappears across training iterations, achieving decorrelation in tuple composition and, to some extent, preventing the same invalid triplet from repeatedly influencing training.
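A minimal sketch of such a generator, assuming the behavior described above (every sample anchors exactly one triplet per epoch, partners drawn at random); the data layout and function name are hypothetical, not taken from the patent:

```python
import random

def triplet_generator(samples_by_class, seed=None):
    """Dynamically yield (anchor, positive, negative) triplets for one epoch.

    Every sample serves as an anchor exactly once per epoch, in random order;
    the positive and negative partners are drawn at random each time, so the
    same triplet rarely recurs across epochs (decorrelation).
    """
    rng = random.Random(seed)
    # Flatten to (class_label, sample) pairs and shuffle the anchor order.
    anchors = [(c, s) for c, group in samples_by_class.items() for s in group]
    rng.shuffle(anchors)
    for c, anchor in anchors:
        positive = rng.choice([s for s in samples_by_class[c] if s is not anchor])
        neg_class = rng.choice([k for k in samples_by_class if k != c])
        negative = rng.choice(samples_by_class[neg_class])
        yield anchor, positive, negative

# Toy usage: 2 classes with 3 samples each gives 6 triplets per epoch.
data = {"AD": ["a1", "a2", "a3"], "nonAD": ["b1", "b2", "b3"]}
epoch = list(triplet_generator(data, seed=0))
print(len(epoch))  # 6
```

Because triplets are produced lazily, nothing beyond the raw samples needs to be held in memory, addressing the storage problem noted above.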
For the acoustic features in step 3, the invention systematically compares acoustic features that perform well across various fields: 13-dimensional MFCC features, 13-dimensional PLP features, 39-dimensional MFCC-delta-delta features, and 26-dimensional RASTA-PLP features. The final results show that, for classifying mildly language-impaired audio against normal control audio, the ranking is PLP > RASTA-PLP > MFCC-delta-delta > MFCC. Therefore, the 13-dimensional PLP features are selected as the acoustic features for speech feature coding extraction for early screening of mild cognitive impairment.
The MFCC features extract discriminative information from audio, closely match the auditory characteristics of the human ear, and offer high generalization and robustness. The invention uses 26 triangular filters to map the audio onto 26 frequency bands and extracts 26-dimensional MFCC features. Since the human ear is insensitive to high-frequency information, the first 13 MFCC dimensions are kept as usable features. On this basis, the first and second derivatives of the original MFCC are concatenated to the original MFCC features, yielding 39-dimensional MFCC-delta-delta features. The invention also adopts PLP features, which apply equal-loudness pre-emphasis and cube-root compression and finally obtain cepstral coefficients with a linear-prediction autoregressive model; 26 filters are again used, and the first 13 PLP dimensions are kept as usable features. On this basis, the invention further uses RASTA-PLP features, a modified form of linear-prediction cepstral coefficients: the power spectrum is modified according to human auditory perception to obtain RASTA features, which are superimposed on the original PLP features to give 26-dimensional RASTA-PLP features.
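The delta stacking step can be illustrated with a small NumPy sketch; the MFCC matrix here is dummy data and the simple first-difference `delta` is an assumed stand-in for whatever regression formula the invention actually uses:

```python
import numpy as np

def delta(feat):
    """First-order temporal difference of a (frames x dims) feature matrix."""
    # Prepending the first row keeps the output the same length as the input.
    return np.diff(feat, axis=0, prepend=feat[:1])

# Dummy 13-dimensional MFCC matrix: 100 frames x 13 coefficients.
mfcc = np.random.randn(100, 13)
d1 = delta(mfcc)   # first derivative (delta)
d2 = delta(d1)     # second derivative (delta-delta)
mfcc_dd = np.concatenate([mfcc, d1, d2], axis=1)
print(mfcc_dd.shape)  # (100, 39)
```

Concatenating the static coefficients with their first and second derivatives is what turns the 13-dimensional MFCC into the 39-dimensional MFCC-delta-delta feature named in the text.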
The above-mentioned embodiments express only several embodiments of the present invention, and while their description is specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention, and all such changes and modifications fall within the scope of the invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (7)
1. A speech feature coding model for early screening of mild cognitive impairment, characterized by comprising two feature extraction networks and two classifiers, wherein all data are fed into the first feature extraction network to extract Alzheimer-type audio features; the resulting Alzheimer-type speech feature codes are fed into the first classifier to separate the audio of Alzheimer's disease patients from non-Alzheimer-type audio; the non-Alzheimer-type audio is fed into the second feature extraction network to extract the audio features of mild cognitive impairment patients; and the resulting audio feature codes for early screening of mild cognitive impairment are fed to the second classifier for classification and recognition;
the first feature extraction network is a 1D multi-convolution fusion network combined with inter-channel attention and comprises three branch networks; the end of each convolution branch is connected to an efficient channel attention module so that the network adaptively assigns different weights to different channel features; the three vectors extracted by the three branches are spliced together, and a one-dimensional feature code, namely the speech feature code of Alzheimer's disease patients, is finally obtained through a fully connected layer;
and the second feature extraction network is a 2D-convolution-based feature extraction network that adopts a stacked convolution-pooling structure, adds a random inactivation (dropout) layer to reduce overfitting, and outputs feature codes; high-dimensional audio features are obtained after the non-Alzheimer-type audio is input and processed; the sample input format is m × n × c, wherein m and n correspond to the width and height of the two-dimensional features, and c corresponds to the number of feature channels.
2. A speech feature coding model training method for early screening of mild cognitive impairment is characterized by comprising the following steps:
1) Collecting audio data and preprocessing an audio data set:
collecting spontaneous speech, wherein each speech segment is about 30 to 60 seconds long and the speech falls into three classes, respectively from normal-cognition subjects, Alzheimer's disease patients, and mild cognitive impairment patients;
applying unified noise reduction to all audio files used as training and test data with the Berouti spectral subtraction method, expanding the number of samples by audio segmentation, and then screening and filtering, so that the resulting data set is used for training and testing the first feature extraction network;
2) Feeding the preprocessed audio data set into the first feature extraction network for training, wherein during each training pass the audio data are processed in segments on the basis of a decision-mechanism metric learning module; different acoustic features and classification schemes are adopted, and the audio or its corresponding acoustic features are input to generate corresponding feature vectors and classify them; the two metric learning modules perform self-supervised clustering of the feature vectors to guarantee the usability of the features generated by the feature extraction network;
3) Performing, with the second feature extraction network, a second stage of acoustic feature extraction on the non-Alzheimer-type speech recognized by the classifier of the first feature extraction network trained in step 2); that is, the recognized non-Alzheimer-type speech is converted into PLP feature form and used as the input of the second feature extraction network, finally yielding, end to end, the speech feature codes for early screening of mild cognitive impairment.
3. The method for training the speech feature coding model for early screening of mild cognitive impairment as claimed in claim 2, wherein the specific method of denoising with the Berouti spectral subtraction in step 1) is as follows:
the power spectrum of the clean audio is found by:
wherein ω is a one-frame audio vector; |X(ω)|² is the clean power spectrum; |Y(ω)|² is the noisy power spectrum; |D(ω)|² is the noise power spectrum; α is an over-subtraction factor determined by the current audio signal-to-noise ratio; the average noise power spectrum of the first few frames of the original audio is used in place of |D(ω)|²; β is set to 0.02;
the audio frame length is determined by taking the average noise spectrum of the first 5 frames as the noise spectrum, with a window overlap of 50%, wherein the frame rate is 16000,
the method comprises the steps of expanding the segmentation quantity of audio in an audio slicing mode, dividing the audio into 2-second short audio segments, wherein an overlapping window between the short audio segments is 1 second; considering that an audio vector is too short and may not contain valid audio information, the audio samples are filtered by the following formula:
where w is the complete audio vector, ω is a 2-second audio slice, γ is a threshold set to 0.3, l_w is the length of w, and l_ω is the length of ω; an audio sample is removed if the formula evaluates to False and retained otherwise, finally producing the data set for the comparison experiments.
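A hypothetical sketch of the slicing and filtering steps just described (2 s slices, 1 s overlap, γ = 0.3); since the filtering formula itself is not reproduced above, the mean-absolute-amplitude ratio used in `keep_slice` is an assumption, not the patent's exact criterion:

```python
import numpy as np

SR = 16000  # sampling rate in Hz

def slice_audio(w, sr=SR, win_s=2, hop_s=1):
    """Cut audio into 2-second slices with a 1-second overlap window."""
    win, hop = win_s * sr, hop_s * sr
    return [w[i:i + win] for i in range(0, len(w) - win + 1, hop)]

def keep_slice(w, omega, gamma=0.3):
    """Assumed filter: keep a slice whose mean |amplitude| is at least
    gamma times that of the full audio."""
    return np.abs(omega).mean() >= gamma * np.abs(w).mean()

# Toy 5-second signal: 3 s of activity followed by 2 s of silence.
w = np.concatenate([np.ones(SR * 3), np.zeros(SR * 2)])
slices = slice_audio(w)
kept = [s for s in slices if keep_slice(w, s)]
print(len(slices), len(kept))  # 4 3 (the all-silent final slice is dropped)
```

Slicing with overlap multiplies the number of training samples, while the filter discards slices that are mostly silence and would carry no usable speech information.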
4. The method as claimed in claim 2, wherein in the training of step 2) a triplet loss function is used as the loss function of the metric learning module; the samples are input in triplet form as anchor sample, positive sample, and negative sample, wherein the anchor sample and the positive sample are similar samples and the negative sample is a dissimilar sample, and the loss function is as follows:
wherein margin is an interval parameter for widening the gap between anchor-positive pairs and anchor-negative pairs; the loss function takes the L2 distance between feature vectors as the metric, and through continued training of the network, the L2 distance between an anchor sample and its positive sample becomes smaller than the interval parameter while the L2 distance between the anchor sample and its negative sample becomes larger than it, thereby achieving a clustering effect on the feature vectors.
5. The method as claimed in claim 4, wherein the triplet samples in the training of step 2) are dynamically generated in memory by the decorrelated sample generator; in each training round, the decorrelated sample generator selects samples as anchors in a random order and ensures that every sample in the round serves as the anchor of one triplet, thereby achieving decorrelation in triplet order; and the decorrelated sample generator constructs the triplets randomly, so that the same triplet rarely reappears across training iterations, thereby achieving decorrelation in tuple composition.
6. The method for training the speech feature coding model for early screening of mild cognitive impairment as claimed in claim 2, wherein step 3) obtains 13-dimensional PLP features as the acoustic features for speech feature coding extraction for early screening of mild cognitive impairment; the features apply equal-loudness pre-emphasis and cube-root compression, and cepstral coefficients are finally obtained using a linear-prediction autoregressive model.
7. A method for applying the trained speech feature coding model for early screening of mild cognitive impairment, characterized in that long audio is divided into multiple 2-second short audio segments; each 2-second segment is fed into the trained speech feature coding model for early screening of mild cognitive impairment, and a final vote over the classification results of the short segments yields the category of the long audio.
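The voting step of claim 7 can be sketched as a simple majority vote over per-segment predictions; the labels below are illustrative, not taken from the patent:

```python
from collections import Counter

def vote(slice_labels):
    """Majority vote over the per-slice classification results."""
    return Counter(slice_labels).most_common(1)[0][0]

# Example: hypothetical per-slice predictions for one long recording.
labels = ["MCI", "normal", "MCI", "MCI", "normal"]
print(vote(labels))  # MCI
```

Aggregating many 2-second decisions makes the long-audio label robust to occasional misclassified slices.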
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211010852.XA CN115376560A (en) | 2022-08-23 | 2022-08-23 | Voice feature coding model for early screening of mild cognitive impairment and training method thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115376560A true CN115376560A (en) | 2022-11-22 |
Family
ID=84067157
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211010852.XA Pending CN115376560A (en) | 2022-08-23 | 2022-08-23 | Voice feature coding model for early screening of mild cognitive impairment and training method thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115376560A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116189668A (en) * | 2023-04-24 | 2023-05-30 | 科大讯飞股份有限公司 | Voice classification and cognitive disorder detection method, device, equipment and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11961533B2 (en) | Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments | |
Luo et al. | Investigation on Joint Representation Learning for Robust Feature Extraction in Speech Emotion Recognition. | |
EP3469584B1 (en) | Neural decoding of attentional selection in multi-speaker environments | |
Umamaheswari et al. | An enhanced human speech emotion recognition using hybrid of PRNN and KNN | |
Yang et al. | Feature augmenting networks for improving depression severity estimation from speech signals | |
CN114041795A (en) | Emotion recognition method and system based on multi-modal physiological information and deep learning | |
CN111329494A (en) | Depression detection method based on voice keyword retrieval and voice emotion recognition | |
Renjith et al. | Speech based emotion recognition in Tamil and Telugu using LPCC and hurst parameters—A comparitive study using KNN and ANN classifiers | |
CN113257406A (en) | Disaster rescue triage and auxiliary diagnosis method based on intelligent glasses | |
Deperlioglu | Classification of segmented phonocardiograms by convolutional neural networks | |
KR20170064960A (en) | Disease diagnosis apparatus and method using a wave signal | |
Hammami et al. | Pathological voices detection using support vector machine | |
Gallardo-Antolín et al. | On combining acoustic and modulation spectrograms in an attention LSTM-based system for speech intelligibility level classification | |
Majda-Zdancewicz et al. | Deep learning vs feature engineering in the assessment of voice signals for diagnosis in Parkinson’s disease | |
CN115376560A (en) | Voice feature coding model for early screening of mild cognitive impairment and training method thereof | |
CN112466284B (en) | Mask voice identification method | |
Rusnac et al. | Convolutional Neural Network applied in EEG imagined phoneme recognition system | |
Zhu et al. | Emotion Recognition of College Students Based on Audio and Video Image. | |
Rusnac et al. | Generalized brain computer interface system for EEG imaginary speech recognition | |
CN112699236B (en) | Deepfake detection method based on emotion recognition and pupil size calculation | |
CN114881668A (en) | Multi-mode-based deception detection method | |
CN114492579A (en) | Emotion recognition method, camera device, emotion recognition device and storage device | |
Sakthi et al. | Keyword-spotting and speech onset detection in EEG-based Brain Computer Interfaces | |
Rao et al. | Automatic classification of healthy subjects and patients with essential vocal tremor using probabilistic source-filter model based noise robust pitch estimation | |
Bhavya et al. | Machine learning applied to speech emotion analysis for depression recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||