CN115376560A - Voice feature coding model for early screening of mild cognitive impairment and training method thereof - Google Patents


Info

Publication number
CN115376560A
CN115376560A
Authority
CN
China
Prior art keywords
audio
feature
cognitive impairment
samples
mild cognitive
Prior art date
Legal status
Pending
Application number
CN202211010852.XA
Other languages
Chinese (zh)
Inventor
钱辰
狄靖凯
李继云
黄鹏
Current Assignee
Donghua University
Original Assignee
Donghua University
Priority date
Filing date
Publication date
Application filed by Donghua University filed Critical Donghua University
Priority to CN202211010852.XA priority Critical patent/CN115376560A/en
Publication of CN115376560A publication Critical patent/CN115376560A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/66 - Speech or voice analysis techniques for extracting parameters related to health condition
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Public Health (AREA)
  • General Health & Medical Sciences (AREA)
  • Epidemiology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention relates to a speech feature coding model for early screening of mild cognitive impairment and a training method thereof. A two-stage feature code generation scheme is used, each stage being provided with a feature extraction network and a classifier. The first-stage feature extraction network converts audio into a feature coding form oriented to Alzheimer's disease, and its classifier divides the audio into Alzheimer's and non-Alzheimer's classes according to this coding. The second-stage feature extraction network converts the non-Alzheimer's audio into PLP feature form and finally generates a speech feature code for early screening of mild cognitive impairment. This code shows excellent performance in classifying mild cognitive impairment versus healthy controls, and can effectively improve the accuracy of early screening for Alzheimer's disease.

Description

Voice feature coding model for early screening of mild cognitive impairment and training method thereof
Technical Field
The invention relates to computer audio technology, in particular to a speech feature coding model for early screening of mild cognitive impairment and a training method thereof.
Background
Alzheimer's disease is a progressive neurological disorder, typically with onset in old age or the presenium; its symptoms include memory loss, impairment of visuospatial ability, impairment of abstract thinking and calculation, and changes in personality and behavior. These symptoms progress irreversibly until the ability to live independently is lost completely. The early stage of Alzheimer's disease is called mild cognitive impairment; treatment and intervention at this stage can effectively delay disease progression, improving patients' quality of life and prolonging their lives. Early screening for mild cognitive impairment is therefore critical.
Conventional screening methods for mild cognitive impairment include neuroimaging examinations, neuropsychological assessments, biomarker examinations, and the like. However, these methods require specialized physicians, are time-consuming and costly, and some examinations are invasive, which in practice leaves most Alzheimer's patients without an early diagnosis. Judging a patient's cognitive status from voice characteristics therefore effectively lowers the threshold of early screening for mild cognitive impairment and has scientific, social, and economic value.
At present, there is little research on screening cognitive impairment from voice characteristics. Most existing studies consider only the binary classification of Alzheimer's patients versus normal subjects, whose boundary is blurred, and so cannot serve the early screening of Alzheimer's disease well. A few studies recognize the necessity of screening for mild cognitive impairment but continue the habits of earlier work, using unified or fused features to classify Alzheimer's disease, mild cognitive impairment, and normal subjects directly into three classes, without considering the correlation between more refined speech features and the patient's speech. Taking the Chinese patent application CN114333911A as an example, its early Alzheimer's disease recognition system comprises a voice signal acquisition module, a voice feature extraction module, a local feature modeling module, a global relationship modeling module, and a recognition module connected in sequence; log-Mel spectrum features are extracted after the speech is sliced, and a bidirectional long short-term memory network then performs three-way classification to judge the patient's state. However, that method ignores the fact that mild cognitive impairment is an intermediate, transitional stage between normal cognition and Alzheimer's disease, and its log-Mel spectrum features share many similarities with both, so it is easily confused with them. The final classification accuracy for mild cognitive impairment is consequently far lower than for the other two classes, and such results have little practical clinical significance.
Therefore, the prior art needs an audio feature learning algorithm that acquires deeper acoustic information, and a speech feature coding method oriented to mild cognitive impairment, to improve the accuracy of early screening for Alzheimer's disease.
Disclosure of Invention
Aiming at the problems in screening for mild cognitive impairment, a speech feature coding model for early screening of mild cognitive impairment and a training method thereof are provided, which acquire deeper acoustic information and improve the accuracy of early screening for Alzheimer's disease.
The technical scheme of the invention is as follows: a speech feature coding model for early screening of mild cognitive impairment comprises two feature extraction networks and two classifiers. All data are sent into the first feature extraction network to extract the audio features of Alzheimer's patients; the resulting speech feature codes are sent into the first classifier, which separates the audio of Alzheimer's patients from the audio of non-Alzheimer's patients. The non-Alzheimer's audio is then sent into the second feature extraction network to extract the audio features of mild cognitive impairment patients, and the resulting feature codes for early mild cognitive impairment are sent to the second classifier for classification and recognition;
the first feature extraction network is a 1D multi-convolution fusion network combined with attention among channels and comprises three branch networks, the tail end of each convolution network branch is connected with a high-efficiency channel attention module, so that the network adaptively distributes different weights to different channel features, splicing operation is carried out on three vectors extracted by the three branches, and finally one-dimensional feature codes, namely voice feature codes of Alzheimer patients, are obtained through a full connection layer;
the second feature extraction network is a feature extraction network based on 2D convolution, a continuous convolution pooling structure is adopted, a random inactivation layer is added to reduce overfitting, feature codes are output, high-dimensional audio features obtained after non-Alzheimer type audio is input and processed are input, the sample input format is mxnxc, wherein m and n respectively correspond to the width and height of two-dimensional features, and c corresponds to the number of channels of the features.
A speech feature coding model training method for early screening of mild cognitive impairment comprises the following steps:
1) Collecting audio data and preprocessing an audio data set:
collecting spontaneous speech, wherein each speech segment is about 30 to 60 seconds long, and the speech falls into three classes, respectively from subjects with normal cognition, Alzheimer's disease patients, and mild cognitive impairment patients;
subjecting all audio files used as training and test data to unified noise reduction with the Berouti spectral subtraction method, expanding the segmentation quantity of the audio by slicing, and then screening and filtering, the resulting data set being used for training and testing the first feature extraction network;
2) The preprocessed audio data set is sent into the first feature extraction network for training; during each training pass, the audio data is processed in stages by metric learning modules based on a decision mechanism, with different acoustic features and classification schemes at each stage; the audio, or the corresponding acoustic features, is input into a metric learning module to generate corresponding feature vectors and classify them; the two metric learning modules perform self-supervised clustering of the feature vectors to guarantee the usability of the features generated by the feature extraction networks;
3) The second feature extraction network performs the second-stage acoustic feature extraction on the non-Alzheimer's speech recognized by the classifier of the first feature extraction network trained in step 2): the recognized non-Alzheimer's speech is converted into PLP feature form and used as the input of the second feature extraction network, and the speech feature code for early screening of mild cognitive impairment is finally obtained end to end.
Further, the specific method of denoising with the Berouti spectral subtraction in step 1) is as follows: the power spectrum of the clean audio is obtained by formula (1):

|X(ω)|² = |Y(ω)|² − α·|D(ω)|², if |Y(ω)|² − α·|D(ω)|² > β·|D(ω)|²; otherwise |X(ω)|² = β·|D(ω)|²  (1)

wherein ω denotes a frame of the audio vector; |X(ω)|² is the clean power spectrum; |Y(ω)|² is the noisy power spectrum; |D(ω)|² is the noise power spectrum; α is an over-subtraction factor set according to the current audio signal-to-noise ratio; the average noise power spectrum of the first few frames of the original audio is taken as a substitute for |D(ω)|²; and β is set to 0.02;
the audio frame length is determined by taking the average noise spectrum of the first 5 frames as the noise spectrum, with a window overlap of 50%, wherein the frame rate is 16000,
Figure BDA0003810785850000032
the method comprises the steps of expanding the segmentation quantity of audio in an audio slicing mode, dividing the audio into 2-second short audio segments, wherein an overlapping window between the short audio segments is 1 second; considering that an audio vector is too short and may not contain valid audio information, the audio samples are filtered by the following formula:
Figure BDA0003810785850000041
where w is a complete audio vector, ω is a 2 second slice of audio, γ is a threshold set to 0.3, l is the length of w, and l is the length of ω, which represents the removal of audio samples if False is found, otherwise it is retained, and finally a data set for comparison experiments is generated.
Further, in the training of step 2), a triplet loss function is used as the loss function of the metric learning module; samples are input in triplet form (anchor sample a, positive sample p, negative sample n), where the anchor and positive samples are of the same class and the negative sample is of a different class; the loss function is formula (4):

L = max(‖f(a) − f(p)‖₂ − ‖f(a) − f(n)‖₂ + margin, 0)  (4)

wherein f(·) is the feature vector generated by the network and margin is an interval parameter for widening the gap between the anchor-positive pair and the anchor-negative pair; the loss function takes the L2 distance between feature vectors as the metric distance, and the network is trained continuously so that the L2 distance between anchor and positive samples becomes smaller than the interval parameter while the L2 distance between anchor and negative samples becomes larger than it, achieving the clustering effect on the feature vectors.
Further, in the training of step 2), the triplet samples are dynamically generated from memory by a decorrelated sample generator. In each training round, the generator selects samples as anchors in a random order and ensures that every sample serves as the anchor of one triplet per round, achieving decorrelation in the triplet order; the generator also constructs the triplets randomly, so the same triplet rarely reappears across training iterations, achieving decorrelation in the tuple structure.
Further, step 3) obtains 13-dimensional PLP features as the acoustic features for extracting the speech feature coding oriented to early mild cognitive impairment; the features use equal-loudness pre-emphasis and cubic-root compression, and cepstral coefficients are finally obtained with a linear prediction autoregressive model.
A method for applying the trained voice feature coding model for early screening of mild cognitive impairment: long speech is divided into multiple 2 s short audio segments; each 2 s segment is sent into the trained model, and a final vote over the per-segment classification results yields the category of the long audio.
The invention has the beneficial effects that the voice feature coding model for early screening of mild cognitive impairment and its training method address the difficulty, common in early-screening audio classification research on Alzheimer's disease, of separating mild cognitive impairment from healthy controls, and provide a new idea for early screening research on Alzheimer's disease.
Drawings
FIG. 1 is a diagram of a model architecture for use in the method of the present invention;
FIG. 2 is a schematic diagram of a 1D multi-convolution fusion network structure incorporating inter-channel attention in the method of the present invention;
FIG. 3 is a schematic diagram of a 2D convolution-based feature extraction network structure in the method of the present invention;
FIG. 4 is a schematic diagram of a metric learning module in the method of the present invention;
FIG. 5 is a schematic diagram of the structure of the decorrelated sample generator in the method of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. The embodiments are implemented on the premise of the technical solution of the invention, and detailed implementations and specific operation processes are given, but the scope of the invention is not limited to the following embodiments.
The core of the invention is the generation of the feature code; the feature code generated from a 2 s short audio segment is evaluated with a classifier inside the model. In practical applications, long speech is usually classified to improve classification accuracy. With the invention, long speech can be divided into multiple 2 s short audio segments, and a final vote over each segment's classification result yields the category of the long audio. The detailed voting scheme can be chosen freely, as in the sketch below, and is not described further here.
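Purely as an illustration (the patent leaves the voting scheme open), a simple majority vote over per-slice predictions might look as follows; the label names and the `predict` callable are assumptions, not part of the disclosure:

```python
from collections import Counter

def classify_long_audio(slices, predict):
    """Majority vote over 2 s slices; `predict` is any per-slice
    classifier returning a label such as "AD", "MCI" or "HC"
    (label names are illustrative only)."""
    votes = Counter(predict(s) for s in slices)
    return votes.most_common(1)[0][0]

# Call shape only: a dummy classifier that always answers "HC".
print(classify_long_audio(["slice1", "slice2", "slice3"], lambda s: "HC"))
```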
The invention relates to a speech feature coding method for early screening of mild cognitive impairment, using the model architecture shown in FIG. 1, in which the two feature extraction networks are the core components of the two modules. Notably, both modules pair a feature extraction network (feature generation net) with an XGBoost-based classifier. XGBoost is an ensemble learning model with fast training and excellent performance. The first classifier (XGBoost classifier one) uses the Alzheimer's-oriented speech feature coding to divide all data into the audio of Alzheimer's patients and the audio of non-Alzheimer's patients; the second classifier (XGBoost classifier two) uses the audio features oriented to mild cognitive impairment patients to divide the non-Alzheimer's audio into cognitively normal audio and audio of patients with mild cognitive impairment. The classification performance of the feature codes obtained through each feature extraction network is tested by the XGBoost-based classifier, the feature extraction network is fine-tuned according to the classification result, and a speech feature code with excellent classification performance for early screening of mild cognitive impairment is finally obtained.
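For orientation, a minimal sketch of this two-stage XGBoost routing, using randomly generated stand-ins for the 32-dimensional feature codes; all data shapes and hyperparameters here are assumptions:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
# Stand-ins for the 32-dim codes produced by the two feature extraction networks.
codes1, y_ad = rng.normal(size=(600, 32)), rng.integers(0, 2, 600)   # 1 = AD
codes2, y_mci = rng.normal(size=(400, 32)), rng.integers(0, 2, 400)  # 1 = MCI

clf_ad = xgb.XGBClassifier(n_estimators=200, max_depth=4)   # stage 1: AD vs non-AD
clf_ad.fit(codes1, y_ad)
clf_mci = xgb.XGBClassifier(n_estimators=200, max_depth=4)  # stage 2: MCI vs healthy
clf_mci.fit(codes2, y_mci)

# Routing: stage 2 only runs when stage 1 says "non-AD".
code1, code2 = codes1[:1], codes2[:1]
label = "AD" if clf_ad.predict(code1)[0] == 1 else (
        "MCI" if clf_mci.predict(code2)[0] == 1 else "HC")
print(label)
```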
The method specifically comprises the following steps:
step 1: the audio data set is preprocessed. Firstly, the audio files as training data and test data are subjected to unified noise reduction processing by adopting a technique of Berouti spectral subtraction. The spectrum subtraction is to subtract the spectrum of the noise signal from the spectrum of the noise signal, and the Berouti spectrum subtraction is improved on the basis, so that the condition that the power returns to zero after the spectrum subtraction is avoided. Secondly, the audio file after noise reduction is segmented into short audio sets with 2 seconds, the overlapping part of the segments is 1 second, and the data sets are expanded under the condition that original audio characteristic information is kept. And finally, the sliced and expanded short audio set is filtered out, so that some meaningless or low-value audio sections are removed, and the data set is used for the feature extraction network training and testing in the step 2.
Step 2: As shown in FIG. 1 and FIG. 2, a 1D multi-convolution fusion network combining inter-channel attention is constructed for feature extraction from the audio slice vectors. The network consists of three branch networks with convolution kernel sizes of 3, 5, and 7, giving the network multi-scale perception so that it can combine information under different receptive fields for comprehensive feature extraction. The tail end of each convolution branch is connected to an Efficient Channel Attention module, so that the network adaptively assigns different weights to different channel features, highlighting the more important features and weakening the influence of the others. Finally, the three vectors extracted by the three branches are concatenated, and a one-dimensional feature code of length 32, namely the Alzheimer's-oriented speech feature code, is obtained through a fully connected layer. A classifier is attached at the network's end; speech whose feature code is classified as Alzheimer's disease is filtered out, the speech of normal subjects and mild cognitive impairment patients is retained, and further speech feature code generation and discrimination oriented to early mild cognitive impairment is performed in step 3.
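A plausible PyTorch rendering of this branch-plus-attention design is sketched below. The kernel sizes 3/5/7, the ECA module at each branch end, and the 32-dim output follow the description; the channel width, the global average pooling, the raw-waveform input, and the ECA kernel size are assumptions:

```python
import torch
import torch.nn as nn

class ECA1d(nn.Module):
    """Efficient Channel Attention: per-channel weights from a small 1D
    conv over globally pooled channel descriptors (k=3 assumed)."""
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, k, padding=k // 2, bias=False)

    def forward(self, x):                      # x: (B, C, T)
        w = x.mean(dim=-1, keepdim=True)       # (B, C, 1) global average pool
        w = self.conv(w.transpose(1, 2)).transpose(1, 2)
        return x * torch.sigmoid(w)            # reweight channels

class MultiConvFusionNet(nn.Module):
    """Three 1D conv branches (kernels 3/5/7), each followed by ECA;
    branch outputs are pooled, concatenated, and projected to a
    32-dim feature code."""
    def __init__(self, ch=32):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv1d(1, ch, k, padding=k // 2),
                          nn.ReLU(), ECA1d())
            for k in (3, 5, 7)])
        self.fc = nn.Linear(3 * ch, 32)

    def forward(self, x):                      # x: (B, 1, T) raw 2 s slice
        feats = [b(x).mean(dim=-1) for b in self.branches]   # global pool
        return self.fc(torch.cat(feats, dim=1))              # (B, 32) code
```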
Step 3: As shown in FIG. 1 and FIG. 3, a feature extraction network based on 2D convolution is constructed. The high-dimensional audio features obtained by processing the audio judged non-Alzheimer's in step 2 are input to the network as samples. The sample input format is m × n × c, where m and n correspond to the width and height of the two-dimensional features and c to the number of channels. The network adopts a conventional continuous convolution-pooling structure, with a random inactivation (dropout) layer added at the end to reduce overfitting, and outputs a feature code of length 32. The invention thus obtains the speech feature code oriented to early mild cognitive impairment.
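Again only as a sketch under assumptions (depth, channel counts, and the 0.5 dropout rate are not specified in the description), the 2D convolution-pooling-dropout stack could be rendered as:

```python
import torch.nn as nn

class Conv2dFeatureNet(nn.Module):
    """Stacked conv+pool blocks, dropout against overfitting, 32-dim code.
    Input is an m x n x c acoustic feature map (e.g. PLP frames),
    passed in PyTorch layout (B, c, m, n)."""
    def __init__(self, in_ch=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))           # (B, 64, 1, 1)
        self.head = nn.Sequential(nn.Flatten(), nn.Dropout(0.5),
                                  nn.Linear(64, 32))

    def forward(self, x):
        return self.head(self.body(x))         # (B, 32) feature code
```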
For the audio data set in step 1, the invention uses spontaneous speech, such as picture description or free chat, which is often accompanied by pauses, repetitions, meaningless filler words, or incomplete passages, and is thus closer to people's daily conversation. Each speech segment is about 30 to 60 seconds long. All speech falls into three classes: from subjects with normal cognition, Alzheimer's patients, and patients with mild cognitive impairment.
After the data set is collected, noise reduction is required. The invention uses Berouti spectral subtraction; the power spectrum of the clean audio is obtained by formula (1):

|X(ω)|² = |Y(ω)|² − α·|D(ω)|², if |Y(ω)|² − α·|D(ω)|² > β·|D(ω)|²; otherwise |X(ω)|² = β·|D(ω)|²  (1)

where ω denotes a frame of the audio vector, |X(ω)|² is the clean power spectrum, |Y(ω)|² the noisy power spectrum, |D(ω)|² the noise power spectrum, and α an over-subtraction factor set according to the current audio signal-to-noise ratio. Since |D(ω)|² cannot be known directly, the average noise power spectrum of the first few frames of the original audio is generally taken as a substitute during calculation. In the invention, β is set to 0.02; the audio frame length is obtained by formula (2), the window overlap is 50%, the average noise spectrum of the first 5 frames is taken as the noise spectrum, and the frame rate in the formula is 16000.
[formula (2): frame-length expression; rendered only as an image in the original]
Because the total number of audio samples is not large, the invention expands the segmentation quantity of the audio by audio slicing: the audio is divided into 2-second short segments, with a 1-second overlapping window between adjacent segments. Considering that a slice may be too short to contain valid audio information, the audio samples are filtered by formula (3):

filter(w, ω) = False, if (1/l)·Σᵢ|ωᵢ| < γ·(1/L)·Σⱼ|wⱼ|; True, otherwise  (3)

where w is a complete audio vector, ω is a 2-second slice of it, γ is a threshold set to 0.3, L is the length of w, and l is the length of ω. A result of False means the audio sample is removed; otherwise it is retained. This finally yields the data set for the comparison experiments.
For the speech feature extraction of step 2, the invention proposes staged processing with metric learning modules based on a decision mechanism, as shown in FIG. 1. The method comprises two different speech feature extraction models that adopt different acoustic features and classification schemes, so that the classification accuracy at each stage can reach a high level, solving the problem of low recognition accuracy for mild cognitive impairment. The workflow of the two classifications is similar: the audio, or the corresponding acoustic features, is input into a metric learning module to generate corresponding feature vectors, which are then classified. The two metric learning modules perform self-supervised clustering of the feature vectors, guaranteeing the usability of the features generated by the feature extraction networks.
In the above method, to effectively discriminate different types of speech features, the invention uses a triplet loss function as the loss function of the metric learning modules, as shown in FIG. 4. During training, samples are input in triplet form (anchor sample a, positive sample p, negative sample n), where the anchor and positive samples are of the same class and the anchor and negative samples are of different classes. The loss function has the form of formula (4):

L = max(‖f(a) − f(p)‖₂ − ‖f(a) − f(n)‖₂ + margin, 0)  (4)

where f(·) is the feature vector generated by the network and margin is an interval parameter used to widen the gap between the anchor-positive pair and the anchor-negative pair. Formula (4) takes the L2 distance between feature vectors as the metric distance. By continuously training the network, the L2 distance between anchor and positive samples becomes smaller than the interval parameter while the L2 distance between anchor and negative samples becomes larger than it, achieving the clustering effect on the feature vectors; the interval parameters of the first and second metric learning modules are set to 3 and 2, respectively. The relationship between the feature generation network and the metric learning module is shown in FIG. 4. Considering that the current sample size is small and feature vector classification with an artificial neural network easily overfits severely, the classification work is done by a machine learning model, trained and tested with the feature vectors obtained from the feature extraction network.
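Formula (4) translates directly into a few lines of PyTorch; the margins 3 and 2 come from the description, the rest is a straightforward rendering:

```python
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=3.0):
    """Formula (4): L2 distances, hinged at `margin` (3 for the first
    metric learning module, 2 for the second, per the description)."""
    d_ap = F.pairwise_distance(anchor, positive, p=2)
    d_an = F.pairwise_distance(anchor, negative, p=2)
    return F.relu(d_ap - d_an + margin).mean()

# torch.nn.TripletMarginLoss(margin=3.0, p=2) is the equivalent built-in.
```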
The invention also provides a decorrelated sample generator. Conventional triplet sample sets have several drawbacks. First, the triplet samples never change once generated: the sample set is built by combining the existing non-tuple samples of different classes randomly or by a fixed rule, so some invalid triplets sit in memory from which the network can learn nothing new. Second, because the samples cannot be fully decorrelated, the network keeps learning the order information of the samples in the training set, which is useless and redundant here. Finally, the samples cannot be fully utilized: assuming there are N classes of samples with M samples each, the number of non-repeating triplets that can be constructed is

T = N·M · (M − 1) · (N − 1)·M  (5)

(N·M choices of anchor, M − 1 same-class positives, and (N − 1)·M different-class negatives). Obviously, so many triplet samples are hard to hold in computer memory, and only part can be loaded at a time, so the samples are under-utilized. The decorrelated sample generator proposed by the invention solves these problems well; its structure is shown in FIG. 5. First, it trains the network by dynamically generating triplets from memory, making the fullest possible use of the samples. Second, in each training round the generator selects samples as anchors in a random order and ensures that every sample serves as the anchor of one triplet per round, achieving decorrelation in the triplet order. Finally, the generator constructs triplets randomly, so the same triplet rarely reappears across training iterations, achieving decorrelation in the tuple structure and also preventing, to some extent, the same invalid triplet from continuously influencing network training.
For the choice of acoustic features in step 3, the invention compares acoustic features that perform well across fields: 13-dimensional MFCC features, 13-dimensional PLP features, 39-dimensional MFCC-delta-delta features, and 26-dimensional RASTA-PLP features. The final results show that, on classification performance for mild cognitive impairment versus healthy-control audio, PLP > RASTA-PLP > MFCC-delta-delta > MFCC. The 13-dimensional PLP features are therefore selected as the acoustic features for extracting the speech feature coding oriented to early mild cognitive impairment.
MFCC features extract the recognizable information in audio, overlap highly with the auditory characteristics of the human ear, and generalize robustly. The invention selects 26 triangular filters to map the audio onto 26 frequency bands and extracts 26-dimensional MFCC features; since human hearing is insensitive to high-frequency information, the first 13 MFCC dimensions are kept as usable features. On this basis, the first and second derivatives of the original MFCC are spliced onto it, yielding the 39-dimensional MFCC-delta-delta features. The invention also adopts PLP features, which use equal-loudness pre-emphasis and cubic-root compression and finally obtain cepstral coefficients with a linear prediction autoregressive model; 26 filters are likewise used, and the first 13 PLP dimensions are kept as usable features. On this basis, the invention further uses RASTA-PLP features, a modified linear prediction cepstral coefficient: the power spectrum is modified according to human auditory perception to obtain RASTA features, which are superimposed on the original PLP features, yielding the 26-dimensional RASTA-PLP features.
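For the comparison features, the 39-dimensional MFCC-delta-delta stack can be computed with librosa as below; librosa ships no PLP or RASTA-PLP, so those would come from another implementation (the `spafe` package is one candidate), which is why only the MFCC side is sketched:

```python
import librosa
import numpy as np

def mfcc_delta_delta(path):
    """13 MFCCs from 26 mel bands, plus first and second derivatives,
    giving the 39-dim MFCC-delta-delta features used for comparison."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_mels=26)
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2])   # shape: (39, n_frames)
```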
The above-mentioned embodiments express only several implementations of the invention, and although their description is specific and detailed, they should not be construed as limiting the scope of the invention. Various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention, and all such changes fall within the protection scope of the invention, which shall be subject to the appended claims.

Claims (7)

1. A voice feature coding model for early screening of mild cognitive impairment, characterized by comprising two feature extraction networks and two classifiers, wherein all data are sent into the first feature extraction network to extract the audio features of Alzheimer's patients; the resulting speech feature codes of Alzheimer's patients are sent into the first classifier, which separates the audio of Alzheimer's patients from the audio of non-Alzheimer's patients; the audio of non-Alzheimer's patients is sent into the second feature extraction network to extract the audio features of mild cognitive impairment patients; and the audio feature codes for early mild cognitive impairment are sent to the second classifier for classification and recognition;
the first feature extraction network is a 1D multi-convolution fusion network combined with attention among channels and comprises three branch networks, the tail end of each convolution network branch is connected with a high-efficiency channel attention module, so that the network adaptively distributes different weights to different channel features, splicing operation is carried out on three vectors extracted by the three branches, and finally one-dimensional feature codes, namely voice feature codes of Alzheimer patients, are obtained through a full connection layer;
and the second feature extraction network is a 2D convolution-based feature extraction network, a continuous convolution pooling structure is adopted, a random inactivation layer is added to reduce overfitting, feature codes are output, high-dimensional audio features are obtained after non-Alzheimer type audio is input and processed, the sample input format is mxnxc, wherein m and n respectively correspond to the width and the height of the two-dimensional features, and c corresponds to the number of channels of the features.
2. A speech feature coding model training method for early screening of mild cognitive impairment is characterized by comprising the following steps:
1) Collecting audio data and preprocessing an audio data set:
collecting spontaneous speech, wherein each speech segment is about 30 to 60 seconds long, and the speech falls into three classes, respectively from subjects with normal cognition, Alzheimer's disease patients, and mild cognitive impairment patients;
subjecting all audio files used as training and test data to unified noise reduction with the Berouti spectral subtraction method, expanding the segmentation quantity of the audio by slicing, and then screening and filtering, the resulting data set being used for training and testing the first feature extraction network;
2) The preprocessed audio data set is sent into the first feature extraction network for training; during each training pass, the audio data is processed in stages by metric learning modules based on a decision mechanism, with different acoustic features and classification schemes at each stage; the audio, or the corresponding acoustic features, is input into a metric learning module to generate corresponding feature vectors and classify them; the two metric learning modules perform self-supervised clustering of the feature vectors to guarantee the usability of the features generated by the feature extraction networks;
3) The second feature extraction network performs the second-stage acoustic feature extraction on the non-Alzheimer's speech recognized by the classifier of the first feature extraction network trained in step 2): the recognized non-Alzheimer's speech is converted into PLP feature form and used as the input of the second feature extraction network, and the speech feature code for early screening of mild cognitive impairment is finally obtained end to end.
3. The method for training the speech feature coding model for early screening of mild cognitive impairment as claimed in claim 2, wherein the specific method of denoising with the Berouti spectral subtraction in step 1) is as follows:

the power spectrum of the clean audio is obtained by formula (1):

|X(ω)|² = |Y(ω)|² − α·|D(ω)|², if |Y(ω)|² − α·|D(ω)|² > β·|D(ω)|²; otherwise |X(ω)|² = β·|D(ω)|²  (1)

wherein ω denotes a frame of the audio vector; |X(ω)|² is the clean power spectrum; |Y(ω)|² is the noisy power spectrum; |D(ω)|² is the noise power spectrum; α is an over-subtraction factor set according to the current audio signal-to-noise ratio; the average noise power spectrum of the first few frames of the original audio is taken as a substitute for |D(ω)|²; and β is set to 0.02;
the average noise spectrum of the first 5 frames is taken as the noise spectrum, the window overlap is 50%, and the audio frame length is obtained from the 16000 Hz frame rate by formula (2):

[formula (2): frame-length expression; rendered only as an image in the original]
the method comprises the steps of expanding the segmentation quantity of audio in an audio slicing mode, dividing the audio into 2-second short audio segments, wherein an overlapping window between the short audio segments is 1 second; considering that an audio vector is too short and may not contain valid audio information, the audio samples are filtered by the following formula:
Figure FDA0003810785840000023
where w is a complete audio vector, ω is a 2 second slice of audio, γ is a threshold set to 0.3, l is the length of w, and l is the length of ω, which represents the removal of audio samples if False is found, otherwise it is retained, and finally a data set for comparison experiments is generated.
4. The method as claimed in claim 2, wherein in the training of step 2) a triplet loss function is used as the loss function of the metric learning module, samples being input in triplet form (anchor sample a, positive sample p, negative sample n), where the anchor and positive samples are of the same class and the negative sample is of a different class, the loss function being formula (4):

L = max(‖f(a) − f(p)‖₂ − ‖f(a) − f(n)‖₂ + margin, 0)  (4)

wherein f(·) is the feature vector generated by the network and margin is an interval parameter for widening the gap between the anchor-positive pair and the anchor-negative pair; the loss function takes the L2 distance between feature vectors as the metric distance, and the network is trained continuously so that the L2 distance between anchor and positive samples becomes smaller than the interval parameter while the L2 distance between anchor and negative samples becomes larger than it, achieving the clustering effect on the feature vectors.
5. The method as claimed in claim 4, wherein in the training of step 2) the triplet samples are dynamically generated from memory by a decorrelated sample generator; in each training round, the generator selects samples as anchors in a random order and ensures that every sample serves as the anchor of one triplet per round, achieving decorrelation in the triplet order; and the generator constructs the triplets randomly, so that the same triplet rarely reappears across training iterations, achieving decorrelation in the tuple structure.
6. The method for training the speech feature coding model for early screening of mild cognitive impairment as claimed in claim 2, wherein step 3) obtains 13-dimensional PLP features as the acoustic features for extracting the speech feature coding oriented to early mild cognitive impairment, the features using equal-loudness pre-emphasis and cubic-root compression, with cepstral coefficients finally obtained by a linear prediction autoregressive model.
7. A method for applying the trained voice feature coding model for early screening of mild cognitive impairment, characterized in that long speech is divided into multiple 2 s short audio segments, each 2 s segment is sent into the trained voice feature coding model for early screening of mild cognitive impairment, and a final vote over the classification result of each short audio segment yields the category of the long audio.
CN202211010852.XA 2022-08-23 2022-08-23 Voice feature coding model for early screening of mild cognitive impairment and training method thereof Pending CN115376560A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211010852.XA CN115376560A (en) 2022-08-23 2022-08-23 Voice feature coding model for early screening of mild cognitive impairment and training method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211010852.XA CN115376560A (en) 2022-08-23 2022-08-23 Voice feature coding model for early screening of mild cognitive impairment and training method thereof

Publications (1)

Publication Number Publication Date
CN115376560A (en) 2022-11-22

Family

ID=84067157

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211010852.XA Pending CN115376560A (en) 2022-08-23 2022-08-23 Voice feature coding model for early screening of mild cognitive impairment and training method thereof

Country Status (1)

Country Link
CN (1) CN115376560A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116189668A (en) * 2023-04-24 2023-05-30 科大讯飞股份有限公司 Voice classification and cognitive disorder detection method, device, equipment and medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination