CN112349297A - Depression detection method based on microphone array - Google Patents
Depression detection method based on microphone array
- Publication number
- CN112349297A CN112349297A CN202011248610.5A CN202011248610A CN112349297A CN 112349297 A CN112349297 A CN 112349297A CN 202011248610 A CN202011248610 A CN 202011248610A CN 112349297 A CN112349297 A CN 112349297A
- Authority
- CN
- China
- Prior art keywords
- training
- neural network
- voice
- convolutional neural
- microphone array
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/16—Devices for psychotechnics; Testing reaction times ; Devices for evaluating the psychological state
- A61B5/165—Evaluating the state of mind, e.g. depression, anxiety
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/48—Other medical applications
- A61B5/4803—Speech analysis specially adapted for diagnostic purposes
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/72—Signal processing specially adapted for physiological signals or for diagnostic purposes
- A61B5/7235—Details of waveform analysis
- A61B5/7264—Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems
- A61B5/7267—Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems involving training the classification device
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/66—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses a depression detection method based on a microphone array, comprising the steps of: collecting a voice signal of a target patient with a microphone array and preprocessing it; extracting MFCC features from the preprocessed audio of the target patient and from the voice data of existing depression patients, and generating audio spectrograms; feeding the MFCC features into a 1D convolutional neural network to obtain P-dimensional MFCC features; feeding the audio spectrogram into a 2D convolutional neural network to obtain O-dimensional spectrogram features; inputting the O-dimensional features into a generative adversarial network (GAN) to generate new spectrum images, which are passed into the 2D convolutional neural network for training; fusing the P-dimensional MFCC features with the features obtained from training and reducing the dimensionality through a fully-connected layer; training a classifier with the dimension-reduced features; and recognizing the test voice with the trained classifier to obtain the recognition result. The method improves the accuracy of depression identification in non-experimental environments.
Description
Technical Field
The invention belongs to the technical field of voice recognition methods, and particularly relates to a depression detection method based on a microphone array.
Background
Currently, some progress has been made in the field of depression detection, but diagnosis still mainly requires the patient to perform voice signal acquisition in front of a fixed voice acquisition device and relies on a clinician for diagnosis. Common diagnostic instruments include the Beck Depression Inventory (BDI) and the Hamilton Depression Rating Scale (HAMD), so diagnostic results depend heavily on the experience and ability of the physician and, more importantly, require the patient's cooperation. As a result, most speech collected during such examinations is scripted and mechanical, which can make the collected speech unrepresentative of the patient. A detection device should therefore be able to collect the patient's voice, with background noise removed, in the natural conditions of daily life.
A microphone array is composed of a number of acoustic sensors and is a system for sampling and processing the spatial characteristics of a sound field. In a complex acoustic environment, noise arrives from all directions and often overlaps with the speech signal in both time and frequency; adding the effects of echo and reverberation, it is very difficult to capture relatively clean speech with a single microphone. A microphone array, by contrast, fuses the space-time information of the voice signal, so the sound source can be extracted and the noise suppressed at the same time.
Convolutional neural networks (CNN) are one of the established deep learning algorithms of recent years and offer good classification performance on large-scale images. The greatest advantage of the generative adversarial network (GAN) is that it addresses the experimental problem of insufficient sample data: by constructing a suitable network model it generates realistic synthetic samples, which can effectively aid the diagnosis and prediction of medical diseases and provide an additional diagnostic basis for medical research.
The invention combines the microphone array's ability to capture clean sound signals with the advantages of two deep learning methods, GAN and CNN, thereby improving the accuracy of depression identification.
Disclosure of Invention
The invention aims to provide a depression detection method based on a microphone array, which improves the accuracy of depression identification.
The technical scheme adopted by the invention is as follows: a microphone array based depression detection method comprising the steps of:
step 1, collecting a voice signal of a target patient by using a microphone array and preprocessing the voice signal;
step 2, extracting MFCC characteristics of the audio signal preprocessed in step 1 and of the voice data of existing depression patients, and generating an audio spectrogram;
step 3, sending the MFCC features extracted in step 2 into a 1D convolutional neural network to obtain P-dimensional features of the MFCC;
step 4, sending the audio spectrogram generated in step 2 into a 2D convolutional neural network to obtain O-dimensional features of the spectrogram;
step 5, inputting the O-dimensional features obtained in step 4 into a generative adversarial network to generate new spectrum images, and passing the generated images into the 2D convolutional neural network of step 4 for training;
step 6, fusing the P-dimensional MFCC features extracted in step 3 with the features obtained by training in step 5, and reducing the dimensionality through a fully-connected layer;
step 7, training a classifier with the dimension-reduced features obtained in step 6;
and step 8, recognizing the test voice with the classifier trained in step 7 to obtain the recognition result.
The present invention is also characterized in that,
the step 1 specifically comprises the following steps:
step 1.1, acquiring a target patient voice signal through a quaternary cross microphone array;
step 1.2, framing and windowing the collected voice signal of the target patient; converting the signal from the time domain to the frequency domain by fast Fourier transform; estimating the spectral factor by computing the smoothed power spectrum and the noise power spectrum; outputting the signal after spectral subtraction; and finally detecting the target patient's voice signal by computing the energy-to-entropy ratio to obtain the voice endpoint values;
step 1.3, combining the end point detection result, and judging the position of the sound source signal by using a DOA (direction of arrival) positioning method;
and step 1.4, for the voice signals after endpoint detection and sound-source localization, synthesizing the four signals into one signal by a super-directive beamforming algorithm, realizing synthesis, noise reduction and enhancement of the microphone array signals.
The step 2 specifically comprises the following steps:
step 2.1, first dividing the voice signal into frames with a Hamming window function; then generating cepstral feature vectors: a discrete Fourier transform is computed for each frame and only the logarithm of the amplitude spectrum is retained; after the spectrum is smoothed, 24 spectral components are collected over the Mel frequency range (for audio sampled at 44100 Hz); after applying the Karhunen-Loeve transform, it is approximated by the discrete cosine transform; finally, each frame yields a cepstral feature vector [f_1, f_2, ..., f_N];
step 2.2, according to the set frame number, performing framing and windowing on the voice signal of the target patient, performing short-time Fourier transform on the discrete voice signal x (m), and calculating the power spectrum of the discrete voice signal in the mth frame to obtain a spectrogram; when L filters are selected and L frames having the same size as the filters are selected in the time direction, an L × 3 spectrogram is generated, and the size of the generated color image is adjusted to M × 3.
The 1D convolutional neural network of step 3 is as follows: using the open-source TensorFlow-based Keras framework, only two 1D convolutional layers are built, each using the rectified linear unit (ReLU) as activation function; the input dimension is M×1, passed through w1 convolutional filters of size m×1, with dropout of 0.1 and max-pooling stride q1, outputting a feature vector S. In the stage of training the 1D convolutional neural network, the MFCC features of each frame of the voice signal, containing time-frequency information, are read into memory sequentially by traversal; the training set and test set are divided and labeled; and the processed data are fed into the convolutional neural network according to the set labels and iteratively trained, for B iterations in total.
The 2D convolutional neural network of step 4 is as follows: using the open-source TensorFlow-based Keras framework, a convolutional neural network is constructed containing w2 two-dimensional convolutional layers of size n×n, w1 max-pooling layers and 1 fully-connected layer with output dimension L, with ReLU as the activation function in both the convolutional and fully-connected layers. In the training stage, the spectrogram features of each frame of the voice signal, containing texture-like information, are read into memory sequentially by traversal; the training set and test set are divided and labeled; and the processed data are fed into the convolutional neural network according to the set labels and iteratively trained, for B iterations in total. The network is trained with stochastic gradient descent as the optimizer, with learning rate ε, learning-rate decay μ after each update, and momentum β.
The generative adversarial network of step 5 is: based on the DCGAN network structure, simplified and with adjusted parameters. The network model comprises a generator and a discriminator. The generator network consists of 1 fully-connected layer, 3 transposed-convolution layers and 2 batch-normalization layers, outputting a color picture of size M×3; the discriminator part comprises 3 convolutional layers and a fully-connected layer with a softmax function. The discriminator network is a 7-layer convolutional model composed of 3 convolutional layers, 2 batch-normalization layers and 2 fully-connected layers, finally outputting a probability value. A probability threshold λ is set; when, after multiple rounds of training, the probability produced by the discriminator exceeds λ, the spectrogram generated by the generator is saved.
Step 6 specifically comprises: fusing the P-dimensional MFCC features extracted by the 1D convolutional neural network with the O-dimensional spectrogram features to obtain a (P+O)-dimensional feature, whose dimensionality is reduced to 256 through a fully-connected layer.
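The fusion-and-reduction step can be sketched in numpy (a minimal illustration, not the patent's implementation: the example dimensions P=128 and O=64 are assumptions, and a ReLU activation on the fully-connected layer is assumed):

```python
import numpy as np

def fuse_and_reduce(p_feat, o_feat, W, b):
    # concatenate the P-dim MFCC features and the O-dim spectrogram
    # features, then apply a fully-connected layer (assumed ReLU)
    # mapping the (P+O)-dim vector down to 256 dimensions
    fused = np.concatenate([p_feat, o_feat])
    return np.maximum(W @ fused + b, 0.0)
```

In a trained system W and b would come from the fully-connected layer's learned parameters; here they are stand-ins with the right shapes.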
The step 7 specifically comprises the following steps:
step 7.1, taking the voice of the target patient as the test voice and the voice data of existing depression patients as training data; the training data comprise voice information of X individuals, and the labels indicating whether each of the X individuals suffers from depression form a label dictionary in which each label has a corresponding index number, used as the index of its class; after one test, the spectrogram generated for the target patient is added to the training data set;
and step 7.2, for each label, using the voices of subjects suffering from depression as the positive sample set and the voices of subjects without depression as the negative sample set, and training the binary SVM with the positive and negative sample sets to obtain the trained binary SVM classifier.
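The binary SVM training step can be illustrated with a small numpy sketch. This is a hand-rolled linear SVM trained by Pegasos-style sub-gradient descent on the hinge loss, standing in for whatever SVM library the patent's authors used; the hyperparameters (lam, epochs, lr) are assumed values:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    # Pegasos-style sub-gradient training of a linear binary SVM;
    # labels y are in {-1, +1} (depressed = +1, not depressed = -1)
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:                       # hinge-loss violation
                w = (1 - lr * lam) * w + lr * y[i] * X[i]
                b += lr * y[i]
            else:                                # only regularize
                w = (1 - lr * lam) * w
    return w, b

def predict(X, w, b):
    # sign of the decision function gives the class label
    return np.where(X @ w + b >= 0, 1, -1)
```

In the patent's pipeline, X would hold the 256-dimensional fused features from step 6 rather than raw points.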
The beneficial effects of the invention are: in the depression detection method based on a microphone array, the microphone used for voice acquisition is convenient to carry and can collect the patient's voice signal in a natural state; and by building on depression-recognition research that combines CNN, MFCC features and GAN-enhanced data sets, the method improves the accuracy of depression recognition in non-experimental environments.
Drawings
FIG. 1 is a schematic diagram of a microphone array based depression detection method of the present invention;
FIG. 2 is a schematic diagram of a microphone array used in a method of the present invention for depression detection based on a microphone array;
FIG. 3 is a schematic diagram of a CNN model in a depression detection method based on a microphone array according to the present invention;
fig. 4 is a schematic diagram of a GAN model in a depression detection method based on a microphone array according to the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention provides a depression detection method based on a microphone array, which comprises the following steps as shown in figures 1 to 4:
Step 1: using the annular microphone array, a pickup beam is formed in the direction of the target speaker for accurate sound localization, suppressing noise and reflected sound and enhancing the sound signal; speech at a distance of 3-5 m can be recognized accurately in a noisy environment, meeting the need to collect the patient's voice signal at any time in daily life. Specifically:
step 1.1, acquiring a patient voice signal through a quaternary cross microphone array;
Step 1.2: the collected voice signal of the target patient is framed and windowed, converted from the time domain to the frequency domain by fast Fourier transform, and the spectral factor is estimated by computing the smoothed power spectrum and the noise power spectrum; the signal after spectral subtraction is output. Finally, the energy-to-entropy ratio is used to detect whether the patient's voice signal is present and to obtain the voice endpoint values. The energy-to-entropy ratio is calculated as follows:
the energy per frame is calculated as:
xiand (m) is a signal of an ith frame, and the frame length is N. The energy relation expression is as follows:
Ei=log10(1+ei/a)
a is constant and proper adjustment can distinguish between unvoiced sounds and noise. The ith frame of voice signal is subjected to fast Fourier transform to obtain:
obtaining a frequency component energy spectrum corresponding to the kth spectral line:
the normalized spectral probability density is then:
short-time spectral entropy definition of a speech frame:
energy to entropy ratio EHiIs the ratio of energy and entropy spectrum:
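The energy-to-entropy-ratio computation above can be sketched in Python. This is a minimal per-frame illustration; the constant a = 2.0 and the small epsilons guarding the logarithms are assumed values, not taken from the patent:

```python
import numpy as np

def energy_entropy_ratio(frame, a=2.0):
    """Energy-to-entropy ratio EH_i of one speech frame x_i(m)."""
    # frame energy e_i and its log-compressed form E_i = log10(1 + e_i/a)
    e = np.sum(frame.astype(float) ** 2)
    E = np.log10(1.0 + e / a)

    # energy spectrum Y_i(k) = |X_i(k)|^2 via the FFT
    Y = np.abs(np.fft.rfft(frame)) ** 2
    # normalized spectral probability density p_i(k)
    p = Y / (np.sum(Y) + 1e-12)
    # short-time spectral entropy H_i = -sum p log p
    H = -np.sum(p * np.log(p + 1e-12))

    return E / H  # EH_i
```

Voiced frames concentrate energy in few spectral lines (low entropy), so their EH value is higher than that of broadband noise frames of the same energy, which is what makes EH usable for endpoint detection.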
Step 1.3: combining the endpoint detection result, the position of the sound source is determined with a DOA (direction of arrival) localization method. Taking one frame of signal data as an example: the voice data are read and the m-th frame is taken as the processing object; the 4 microphone signals corresponding to the m-th frame are combined into 1 signal, which is weighted by W_e(k). The energy sum E_s over the chosen frequency bands is then computed for each candidate angle, giving the energy value E_s(i) of the current frame for each angle i from 0° to 360°. The maximum E_smax(i) of these 360 energies, and the angle i at which it occurs, determine the sound source angle output for the current frame. The band energy of each frame signal corresponding to a given angle is:

E_s = Σ_{k=f1}^{f2} |X_sw(k)|^2

where f1 and f2 indicate the band limits, ranging from 1 to N/2+1, and X_sw(k) is the band-weighted version of the combined signal:

X_sw(k) = W_e(k) · X_s(k)

where W_e(k) is the band weighting factor:

W_e(k) = [W(k)]^λ

with exponent 0 < λ < 1, and W(k) is a masking weight factor, indicating that the band with the maximum signal-to-noise ratio (SNR) among the bands is selected for the current data. X_s(k) is the combination of the 4 signals into 1:

X_s(k) = Σ_{i=1}^{4} X_i(k)

where X_i(k) is one of the 4 signals.
Step 1.4: for the voice signals after endpoint detection and sound-source localization, the 4 signals are synthesized into 1 signal by a super-directive beamforming algorithm, realizing synthesis, noise reduction and enhancement of the microphone array signals. The super-directive beamforming algorithm is detailed as follows:
the microphone array quaternary cross array can be regarded as one of uniform circular arrays, and the arrival direction vector of a received signal at an angle theta is as follows according to the geometrical relationship of the array:
wherein the content of the first and second substances,
the voice environment used by the method is mainly indoor and daily life, so that the noise matrix calculated based on the scattered noise field has certain applicability to the current microphone voice environment; the scattered noise field only describes the equidirectional noise field of the three-dimensional sphere, and the expression of the correlation function of the scattered noise field is as follows:
where sinc (x) yields the sampling function sin π x/π x. The microphone array is composed of M array elements, and the signal received by the ith microphone is as follows:
wherein f represents frequency, AiThe amplitude is represented by a value representing the amplitude,expressing the phase, according to the mathematical model theory of the optimal solution of the super directivity, the noise signal correlation coefficient between any two points in the space is as follows:
the noise covariance matrix is normalized to:
Rnn=[ρij](i,j=1,2,...,N-1)
dijrepresenting the distance between any two array elements in the microphone array.
The invention adopts the principle of minimum-variance distortionless-response (MVDR) beamforming, which is the LCMV method under the single constraint w^H a(θ) = 1: the signal in the look direction is preserved while the noise variance is minimized, so MVDR maximizes the signal-to-noise ratio (SNR) of the array output. The aim is to select filter coefficients w that minimize the total output power under the constraint that the voice signal is undistorted; the key is therefore to solve for the optimal weight vector w. The constrained optimization is:

min_w  w^H R_x w   subject to   w^H a(θ_s) = 1

where a(θ_s) = [a_1(θ), ..., a_M(θ)]^T is the steering vector of the target signal, representing the transfer function between the sound-source direction and the microphones, obtainable from the propagation delays τ; R_x is the spatial covariance matrix of the signal. When k mutually uncorrelated noise signals arrive at the microphone elements from different directions, the spatial covariance matrix is defined as:

R_x = E[x(t) x^H(t)]

Solving with the Lagrange multiplier method gives:

w_opt = R_x^{-1} a(θ_s) / (a^H(θ_s) R_x^{-1} a(θ_s))

Substituting the normalized diffuse-field noise covariance matrix R_nn obtained above for the noise covariance matrix R_x in the MVDR solution, the super-directivity weighting coefficients are obtained as:

w = R_nn^{-1} a(θ_s) / (a^H(θ_s) R_nn^{-1} a(θ_s))

The optimized super-directive weighting coefficients complete the weighted beamforming of the multichannel microphone signals.
Step 2: extracting the MFCC features and generating the audio spectrogram, i.e., extracting a time-frequency representation and a texture-like representation of the audio signal simultaneously:
Step 2.1: first, the voice signal is divided into frames by a Hamming window function. Cepstral feature vectors are then generated: a discrete Fourier transform is computed for each frame and only the logarithm of the amplitude spectrum is retained; after the spectrum is smoothed, 24 spectral components are collected over the Mel frequency range (for audio sampled at 44100 Hz). The components of the Mel spectral vector computed for each frame are highly correlated; therefore a KL (Karhunen-Loeve) transform is applied, which is approximated by the discrete cosine transform (DCT). Finally, each frame yields a cepstral feature vector [f_1, f_2, ..., f_N];
and 2.2, performing framing and windowing on the voice signal of the patient according to the set frame number, performing short-time Fourier transform on the discrete voice signal x (m), and calculating the power spectrum of the discrete voice signal in the mth frame to obtain a spectrogram. To accommodate the input of the convolutional neural network, L filters are selected, and L frames that are as large as the filters are selected in the time direction, so that an L × 3 spectrogram is generated, and the size of the generated color image is adjusted to M × 3.
Step 3: the MFCC features obtained in step 2 are sent into a 1D convolutional neural network to obtain the P-dimensional MFCC features. The 1D convolutional neural network is as follows: using the open-source TensorFlow-based Keras framework, only two one-dimensional (1D) convolutional layers are built to prevent overfitting, each using the rectified linear unit (ReLU) as activation function; the input dimension is M×1, passed through w1 convolutional filters of size m×1, with dropout of 0.1 and max-pooling stride q1, outputting a feature vector S. In the stage of training the 1D convolutional neural network, the MFCC features of each frame of the voice signal, containing time-frequency information, are read into memory sequentially by traversal; the training set and test set are divided and labeled; and the processed data are fed into the convolutional neural network according to the set labels and iteratively trained, for B iterations in total.
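The convolution-ReLU-max-pooling pipeline of this step can be illustrated with a minimal numpy sketch, a hand-rolled stand-in for the Keras layers rather than the patent's actual model (w1, m and q1 remain free parameters here):

```python
import numpy as np

def conv1d(x, kernels, bias):
    # "valid" 1-D convolution: x has shape (n,), kernels (w1, m), bias (w1,)
    m = kernels.shape[1]
    windows = np.stack([x[i:i + m] for i in range(len(x) - m + 1)])
    return windows @ kernels.T + bias       # shape (n - m + 1, w1)

def relu(z):
    # rectified linear unit activation
    return np.maximum(z, 0.0)

def maxpool1d(z, q):
    # non-overlapping max pooling with stride q along the time axis
    n = (z.shape[0] // q) * q
    return z[:n].reshape(-1, q, z.shape[1]).max(axis=1)
```

Stacking two such conv/ReLU stages followed by pooling and flattening yields the P-dimensional feature vector described in the text; in practice the Keras `Conv1D` and `MaxPooling1D` layers perform the same computation with learned kernels.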
And 4, the spectrogram obtained in step 2 is fed into a 2D convolutional neural network to obtain an O-dimensional spectrogram feature, wherein the 2D convolutional neural network is as follows: using the open-source TensorFlow-based Keras framework and following AlexNet, a convolutional neural network is built containing w2 two-dimensional convolutional layers of size n × n, w1 max-pooling layers, and 1 fully connected layer with output dimension L, wherein rectified linear units (ReLU) are adopted as the activation function in both the convolutional and fully connected layers. In the training stage, the spectrogram features of each voice frame, containing texture-like information, are read into memory sequentially by traversal, the data are divided into a training set and a test set, labels are attached to each, the processed data are passed into the convolutional network according to the set labels, and iterative training is performed for B iterations in total. The network is trained with stochastic gradient descent as the optimizer, with learning rate ε, a learning-rate decay of μ after each update, and momentum β.
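The optimizer settings at the end of step 4 can be sketched as a single update rule; the Keras-style decay schedule lr = ε / (1 + μ·t) and the reading of β as a momentum coefficient are assumptions made for illustration:

```python
def sgd_step(w, grad, state, eps=0.01, mu=1e-6, beta=0.9):
    """One stochastic-gradient-descent update: base learning rate eps,
    per-update decay mu (lr = eps / (1 + mu * t)), momentum beta."""
    state['t'] += 1
    lr = eps / (1.0 + mu * state['t'])
    state['v'] = beta * state['v'] - lr * grad  # momentum accumulator
    return w + state['v']
```

On a toy quadratic objective this update converges to the minimum, which is the behavior the training loop relies on.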
And 5, the features obtained in step 4 are input into a generative adversarial network (GAN) to generate new spectral images; the generated images are added to the original spectrogram data, and the training of step 4 is then repeated. The GAN is as follows: its structure is based on DCGAN, simplified and with adjusted parameters. The network model comprises a generator and a discriminator. The generator network consists of 1 fully connected layer, 3 transposed convolutional layers, and 2 batch normalization layers, and outputs a color picture of size M × M × 3. The discriminator is a 7-layer convolutional neural network consisting of 3 convolutional layers, 2 batch normalization layers, and 2 fully connected layers, the last with a softmax function, and finally outputs a probability value. A probability threshold λ is set; when, after repeated training, the probability the discriminator assigns to a generator output exceeds λ, that spectrogram is stored. The qualifying generated spectrograms are passed into the convolutional network of step 4 for retraining.
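The selection-and-augmentation rule at the end of step 5 reduces to a simple filter; a sketch under the assumption that the discriminator's probabilities for the generated batch are already available:

```python
def augment_dataset(train_images, generated, disc_prob, lam=0.8):
    """Keep a generated spectrogram only when the discriminator assigns it
    a probability above the threshold lambda; the kept images are appended
    to the real training data before the 2D CNN is retrained."""
    kept = [g for g, p in zip(generated, disc_prob) if p > lam]
    return train_images + kept
```
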
And 6, the MFCC features extracted in step 3 are fused with the features obtained in step 4 from the augmented spectrogram data, and the dimensionality is reduced through a fully connected layer, specifically as follows: the P-dimensional MFCC feature extracted by the CNN is concatenated with the O-dimensional spectrogram feature to obtain a (P + O)-dimensional feature, which a fully connected layer reduces to 256 dimensions.
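A sketch of the fusion and dimension reduction in step 6; the dense-layer weights W and bias b are placeholder parameters (in the method itself they are learned), and the ReLU follows the activation used elsewhere in the network:

```python
import numpy as np

def fuse_features(mfcc_feat, spec_feat, W, b):
    """Concatenate the P-dim MFCC feature and the O-dim spectrogram feature
    into a (P+O)-dim vector, then a dense layer maps it to 256 dims."""
    fused = np.concatenate([mfcc_feat, spec_feat])  # (P + O,)
    return np.maximum(W @ fused + b, 0.0)           # dense + ReLU -> (256,)
```
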
And 7, a classifier is trained on the dimensionality-reduced features obtained in step 6, specifically as follows:
and 7.1, the target patient's voice is used as the test voice, and the voice data of known depression patients as training data. The training data comprise voice recordings of X individuals; the labels indicating whether each of the X individuals suffers from depression form a label dictionary, each label has a corresponding index number, and the label index numbers are used as the class indices. After one test, the spectrogram generated for the target patient is added to the training data set.
Step 7.2, for each label, the voices of subjects suffering from depression form the positive sample set and the voices of subjects without depression form the negative sample set. A binary SVM is trained on the positive and negative sample sets to obtain the trained binary SVM. The classifier training process is specifically as follows:
The kernel function parameter and the penalty factor of the SVM are determined by cyclically checking the accuracy on the SVM training set; the optimal pair of these two parameters is selected and the model is then trained with them. Let the training voice data be {xi, yi}, xi ∈ R^n, i = 1, 2, ..., n, where xi is the (O + P)-dimensional feature vector and yi the depression label. The SVM maps the training set into a high-dimensional space with a nonlinear mapping Φ(x); the optimal separating hyperplane that makes the nonlinear problem linearly separable is described as y = ω^T Φ(x) + b, where ω and b denote the weight and bias of the SVM.

To find the optimal ω and b, a slack factor ξi is introduced and the separating-plane problem is transformed into the quadratic optimization problem:

min (1/2)‖ω‖² + C Σi ξi
s.t. yi(ω · Φ(xi) + b) ≥ 1 − ξi,
ξi ≥ 0, i = 1, 2, ..., n

in the formula, C denotes the penalty parameter. Introducing Lagrange multipliers αi transforms the quadratic optimization problem into its dual, whose solution gives the weight vector ω = Σi αi yi Φ(xi). The decision function of the support vector machine can then be described as f(x) = sgn(Σi αi yi Φ(xi) · Φ(x) + b). To simplify computation, a Gaussian radial basis function (RBF) kernel K(xi, x) = exp(−‖x − xi‖² / (2σ²)) is introduced, giving the decision function f(x) = sgn(Σi αi yi K(xi, x) + b),
where σ represents the width parameter of the RBF.
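The RBF-kernel decision function above can be sketched directly; the support vectors, multipliers αi, labels yi, and bias b below are placeholders standing in for the quantities obtained by solving the dual problem:

```python
import numpy as np

def rbf_kernel(xi, x, sigma=1.0):
    """Gaussian RBF kernel K(xi, x) = exp(-||x - xi||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

def svm_decide(x, support_vecs, alphas, labels, b, sigma=1.0):
    """Decision function f(x) = sgn(sum_i alpha_i y_i K(x_i, x) + b)."""
    s = sum(a * y * rbf_kernel(sv, x, sigma)
            for sv, a, y in zip(support_vecs, alphas, labels))
    return 1 if s + b >= 0 else -1
```
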
And 8, the test voice is recognized by the classifier trained in step 7. The recognition result can be sent to the patient's guardian over Wi-Fi, so that the patient's condition can be monitored at any time.
In this way, the microphone used for voice acquisition in the microphone-array-based depression detection method is easy to carry and can collect the patient's voice signals in a natural state. Building on depression-recognition research that combines CNNs, MFCC features, and GAN-augmented data, the method exploits the complementary advantages of MFCC and CNN features to improve the accuracy of depression recognition outside the laboratory.
A depression recognition test was performed with the AVEC2013 audio-visual depression recognition challenge database using the microphone-array-based depression detection method of the present invention; the data set contains 340 voice recordings. The specific operation is as follows:
step 1, the voice signals under each subdirectory are preprocessed in turn by traversal and divided into frames by a Hamming window function. A discrete Fourier transform is computed for each frame and only the logarithm of the amplitude spectrum is retained. After the spectrum is smoothed, 24 spectral components spanning the mel frequency range (44100 Hz sampling) are collected. The components of the mel spectral vector computed for each frame are highly correlated, so the Karhunen-Loeve (KL) transform is approximated by a discrete cosine transform (DCT).
Step 2, MFCC features are extracted from the preprocessed signals and normalized. Each voice segment is limited to 10 seconds; at 50 frames per second, a 177-dimensional feature vector is obtained per frame, and the number of channels per voice is set to 50. The voice signal is then converted into a spectrogram, with the number of sampled frames limited to 64 per second; this yields a 64 × 64 × 3-pixel color spectrogram image, which is resized to 200 × 200 × 3 pixels.
And 4, the convolution and pooling layers are built as a 7-layer convolutional neural network model consisting of 3 convolutional layers, 3 max-pooling layers, and 1 fully connected layer. The first layer takes the 200 × 200 × 3 spectrogram as input and convolves it with 3 × 3 × 3 kernels that slide along the x and y axes of the image with a stride of 1 pixel; 64 kernels in total produce a 198 × 198 × 64 pixel layer. A ReLU is used as the activation function; the pixel layer is processed by the ReLU units into an activated layer, which is max-pooled with a 2 × 2 window and the default stride of 2, giving a pooled size of 99 × 99 × 64. During backpropagation, each convolution kernel has a bias value, i.e., the 64 kernels of the first layer correspond to 64 biases on the input from the layer above. The second layer uses 32 kernels of size 3 × 3 × 64 and produces a 97 × 97 × 32 pixel layer after convolution; after ReLU activation and 2 × 2 max pooling the size is 48 × 48 × 32, and a Dropout layer then randomly drops input neurons with probability 10% during parameter updates to prevent overfitting. In this layer's backpropagation each kernel again has a bias, i.e., the 32 kernels of the second layer correspond to 32 biases. Similarly, the third layer uses 32 kernels of size 3 × 3 × 32 and produces a 46 × 46 × 32 pixel layer after convolution.
After ReLU activation and 2 × 2 max pooling, the size is 23 × 23 × 32, and a Dropout layer again randomly drops input neurons with probability 10% during parameter updates. A flattening layer converts the multi-dimensional input into a one-dimensional array containing 16928 values in total, which is then fed into the fully connected layer for the next operation.
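The layer sizes quoted above are mutually consistent under the standard valid-convolution and pooling size formulas, which can be checked directly:

```python
def conv_out(n, k, stride=1):
    """Output width of a valid convolution: (n - k) // stride + 1."""
    return (n - k) // stride + 1

def pool_out(n, p=2):
    """Output width of p x p max pooling with stride p."""
    return n // p

assert conv_out(200, 3) == 198 and pool_out(198) == 99   # layer 1
assert conv_out(99, 3) == 97 and pool_out(97) == 48      # layer 2
assert conv_out(48, 3) == 46 and pool_out(46) == 23      # layer 3
assert 23 * 23 * 32 == 16928                             # flattened size
```
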
In order to extract spectrogram features to feed into the GAN for generating new spectrograms, the multi-dimensional spectrogram features must be reduced in dimension. A fully connected (Dense) layer is built that connects the 16928 inputs to 128 neural units; 128 values are produced after a ReLU activation, and after Dropout processing 128 values are output as the speech-emotion feature.
And step 5, the GAN generator network model consists of 1 fully connected layer, 3 transposed convolutional layers, and 2 batch normalization layers. The first layer takes the 128 values extracted in step 4 as input, connects them to 4608 neurons through the fully connected layer, and reshapes them to 3 × 3 × 512. The second layer uses a transposed convolution (kernel_size 3, stride 3) to reduce 512 channels to 256 and passes through a batch normalization layer; the third layer uses a transposed convolution (kernel_size 5, stride 2) to reduce 256 channels to 128, again followed by batch normalization; the fourth layer uses a transposed convolution (kernel_size 4, stride 3) to reduce 128 channels to 3.
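The generator shapes are mutually consistent under the usual no-padding transposed-convolution size formula out = (in − 1) · stride + kernel, which maps the 3 × 3 reshaped tensor up to the 64 × 64 output:

```python
def tconv_out(n, k, stride):
    """Output width of a no-padding transposed convolution."""
    return (n - 1) * stride + k

assert 3 * 3 * 512 == 4608        # dense layer reshaped to 3 x 3 x 512
assert tconv_out(3, 3, 3) == 9    # layer 2: kernel_size 3, stride 3
assert tconv_out(9, 5, 2) == 21   # layer 3: kernel_size 5, stride 2
assert tconv_out(21, 4, 3) == 64  # layer 4: 64 x 64 x 3 output
```
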
the GAN discriminator network model of the invention uses 7 layers of convolutional neural network model and consists of 3 convolutional layers, 2 batch normalization layers and 1 full connection layer. The input data of the first layer is a spectrogram of 64 multiplied by 3, a convolution operation is carried out on the spectrogram by adopting a convolution kernel of 5 multiplied by 3, the convolution kernel moves along the x axis and the y axis of the image, the step length is 1 pixel, 64 convolution kernels are used together to generate data of 60 multiplied by 24 pixel layers, a Leakly-ReLU function is used as an activation function, and the pixel layers are processed by a Leakly-ReLU unit to generate an activation pixel layer; the second layer uses 128 5 × 5 × 128 convolution kernels, and generates 57 × 57 × 128 pixel layers after convolution operation. The pixel layers are processed by a Leakly-ReLU unit to generate activated pixels, and the activated pixel layers are processed by a batch normalization layer to prevent overfitting; the third layer generates 53 × 53 × 256 pixel layers by convolution operation using 256 5 × 5 × 256 convolution kernels. The pixel layers are processed by a Leakly-ReLU unit to generate activated pixels, and the activated pixel layers are processed by a batch normalization layer to prevent overfitting; using a flattening layer to carry out one-dimensional input, carrying out flattening treatment, then using the pixels as input to transmit the input into a full-connection layer, wherein the last layer of output layer is provided with 1 node, and outputting a probability value; the standard-compliant 64 × 64 × 3 generated spectrogram size is modified to 200 × 200 × 3 pixel size and introduced into the convolutional network of step 4 for retraining.
And 6, a fully connected layer is built. The 1800-dimensional data extracted in step 3 and the 16928-dimensional data extracted in step 4 are concatenated into 18728-dimensional data, which are fully connected to 256 neural units; 256 values are produced after a ReLU activation and, after Dropout processing, output as the speech-emotion feature.
Step 7, because the data set contains 292 people, 43,800 voice segments in total are used after clipping and screening. The labels indicating whether each of the 292 people suffers from depression form a label dictionary; each label has a corresponding index number, and the label index numbers are used as the class indices. 90% of the voice signals are used as the training set and the remaining 10% as the test set;
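A sketch of the 90%/10% split described in step 7, under illustrative assumptions (the shuffling seed is a placeholder, and each sample is taken to be a (voice_clip, label) pair keyed by the label dictionary):

```python
import random

def split_dataset(samples, test_frac=0.1, seed=0):
    """samples: list of (voice_clip, label) pairs; returns a shuffled
    90%/10% train/test split of the clips."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(samples) * test_frac)
    test = [samples[i] for i in idx[:n_test]]
    train = [samples[i] for i in idx[n_test:]]
    return train, test
```
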
for each label, the voices of subjects suffering from depression form the positive sample set and the voices of subjects without depression form the negative sample set. A binary SVM is trained on the positive and negative sample sets to obtain the trained binary SVM;
and 8, the test voice is recognized by the binary SVM trained in step 7.
Claims (8)
1. A microphone array based depression detection method, comprising the steps of:
step 1, collecting a voice signal of a target patient by using a microphone array and preprocessing the voice signal;
step 2, extracting MFCC features from the audio signal of the target patient preprocessed in step 1 and from the voice data of known depression patients, and generating audio spectrograms;
step 3, sending the MFCC features extracted in the step 2 into a 1D convolutional neural network to obtain P-dimensional features of the MFCC;
step 4, sending the audio frequency spectrogram generated in the step 2 into a 2D convolutional neural network to obtain O-dimensional characteristics of the spectrogram;
step 5, inputting the O-dimensional features obtained in step 4 into a generative adversarial network to generate new spectral images, and passing the generated spectral images into the 2D convolutional neural network of step 4 for training;
step 6, fusing the P-dimensional characteristics of the MFCC extracted in the step 3 and the characteristics obtained by training in the step 5, and reducing the dimensions through a full connection layer;
step 7, training a classifier with the dimensionality-reduced features obtained in step 6;
and 8, identifying the test voice through the classifier trained in the step 7 to obtain an identification result.
2. The microphone array based depression detection method as claimed in claim 1, wherein the step 1 comprises the following steps:
step 1.1, acquiring a target patient voice signal through a quaternary cross microphone array;
step 1.2, performing framing and windowing on the collected voice signal of the target patient, converting the signal from the time domain to the frequency domain by fast Fourier transform, estimating the spectral factor by computing the smoothed power spectrum and the noise power spectrum, outputting the signal after spectral subtraction, and finally detecting the target patient's voice signal in combination with the energy-entropy ratio to obtain the voice endpoint values;
step 1.3, combining the end point detection result, and judging the position of the sound source signal by using a DOA (direction of arrival) positioning method;
and step 1.4, synthesizing four paths of signals into one path of signal through a super-directivity beam forming algorithm according to the voice signals subjected to endpoint detection and sound source positioning processing, and realizing synthesis, noise reduction and enhancement of the microphone array signals.
3. A microphone array based depression detection method according to claim 2, wherein the step 2 comprises the following steps:
step 2.1, firstly dividing the voice signal into frames through a Hamming window function; computing a discrete Fourier transform for each frame and retaining only the logarithm of the amplitude spectrum; after the spectrum is smoothed, collecting 24 spectral components spanning the mel frequency range (44100 Hz sampling), and approximating the Karhunen-Loeve transform by a discrete cosine transform; finally obtaining a cepstral feature vector [f1, f2, ..., fN] for each frame;
step 2.2, according to the set frame number, performing framing and windowing on the target patient's voice signal, performing a short-time Fourier transform on the discrete voice signal x(m), and computing the power spectrum of the m-th frame to obtain a spectrogram; selecting L filters and L frames of the same size as the filters in the time direction generates an L × L × 3 spectrogram, and the generated color image is resized to M × M × 3.
4. The microphone-array-based depression detection method as claimed in claim 3, wherein the 1D convolutional neural network of step 3 is: only two 1D convolutional layers are built using the open-source TensorFlow-based Keras framework, each adopting a rectified linear unit as the activation function; the input, of dimension M × 1, passes through w1 convolutional filters of size m × 1 with a dropout of 0.1 and a max-pooling stride of q1, and a feature vector of size S is output; in the training stage of the 1D convolutional neural network, the MFCC features of each voice frame, containing time-frequency information, are read into memory sequentially by traversal, the data are divided into a training set and a test set, labels are added to each, the processed data are passed into the convolutional network according to the set labels, and iterative training is performed for B iterations in total.
5. The microphone-array-based depression detection method of claim 4, wherein the 2D convolutional neural network of step 4 is: using the open-source TensorFlow-based Keras framework, a convolutional neural network is built containing w2 two-dimensional convolutional layers of size n × n, w1 max-pooling layers, and 1 fully connected layer with output dimension L, wherein rectified linear units are adopted as the activation function in both the convolutional and fully connected layers; in the training stage, the spectrogram features of each voice frame, containing texture-like information, are read into memory sequentially by traversal, the data are divided into a training set and a test set, labels are added to each, the processed data are passed into the convolutional network according to the set labels, and iterative training is performed for B iterations in total; the network is trained with stochastic gradient descent as the optimizer, with learning rate ε, a learning-rate decay of μ after each update, and momentum β.
6. The microphone-array-based depression detection method of claim 5, wherein the adversarial generation network of step 5 is: based on the DCGAN network structure, simplified and with adjusted parameters; the network model comprises a generator and a discriminator; the generator network consists of 1 fully connected layer, 3 transposed convolutional layers, and 2 batch normalization layers, and outputs a color picture of size M × M × 3; the discriminator is a 7-layer convolutional neural network consisting of 3 convolutional layers, 2 batch normalization layers, and 2 fully connected layers, the last with a softmax function, and finally outputs a probability value; a probability threshold λ is set, and when the probability generated by the discriminator after repeated training exceeds λ, the spectrogram generated by the generator is stored.
7. The microphone-array-based depression detection method according to claim 6, wherein step 6 is specifically: the P-dimensional MFCC feature extracted by the 1D convolutional neural network is fused with the O-dimensional spectrogram feature to obtain a (P + O)-dimensional feature, whose dimension is reduced to 256 through a fully connected layer.
8. The microphone array-based depression detection method according to claim 7, wherein the step 7 is specifically:
step 7.1, taking the voice of the target patient as a test voice, and taking the voice data of the existing depression patient as training data; the training data comprises voice information of X individuals, tags of whether the X individuals suffer from depression are used as a tag dictionary, each tag has a corresponding index number, and the tag index numbers are set as index numbers of the class; after one test, adding a spectrogram generated by a target patient into a training data set;
and 7.2, for each label, the voices of subjects suffering from depression are used as the positive sample set and the voices of subjects without depression as the negative sample set, and a binary SVM is trained on the positive and negative sample sets to obtain the trained binary SVM.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011248610.5A CN112349297B (en) | 2020-11-10 | 2020-11-10 | Depression detection method based on microphone array |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112349297A true CN112349297A (en) | 2021-02-09 |
CN112349297B CN112349297B (en) | 2023-07-04 |
Family
ID=74362344
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011248610.5A Active CN112349297B (en) | 2020-11-10 | 2020-11-10 | Depression detection method based on microphone array |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112349297B (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112687390A (en) * | 2021-03-12 | 2021-04-20 | 中国科学院自动化研究所 | Depression state detection method and device based on hybrid network and lp norm pooling |
CN112818892A (en) * | 2021-02-10 | 2021-05-18 | 杭州医典智能科技有限公司 | Multi-modal depression detection method and system based on time convolution neural network |
CN113012720A (en) * | 2021-02-10 | 2021-06-22 | 杭州医典智能科技有限公司 | Depression detection method by multi-voice characteristic fusion under spectral subtraction noise reduction |
CN113205803A (en) * | 2021-04-22 | 2021-08-03 | 上海顺久电子科技有限公司 | Voice recognition method and device with adaptive noise reduction capability |
CN113223507A (en) * | 2021-04-14 | 2021-08-06 | 重庆交通大学 | Abnormal speech recognition method based on double-input mutual interference convolutional neural network |
CN113476058A (en) * | 2021-07-22 | 2021-10-08 | 北京脑陆科技有限公司 | Intervention treatment method, device, terminal and medium for depression patients |
CN113679413A (en) * | 2021-09-15 | 2021-11-23 | 北方民族大学 | VMD-CNN-based lung sound feature identification and classification method and system |
CN113820693A (en) * | 2021-09-20 | 2021-12-21 | 西北工业大学 | Uniform linear array element failure calibration method based on generation of countermeasure network |
CN114219005A (en) * | 2021-11-17 | 2022-03-22 | 太原理工大学 | Depression classification method based on high-order spectral voice features |
CN116978409A (en) * | 2023-09-22 | 2023-10-31 | 苏州复变医疗科技有限公司 | Depression state evaluation method, device, terminal and medium based on voice signal |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107705806A (en) * | 2017-08-22 | 2018-02-16 | 北京联合大学 | A kind of method for carrying out speech emotion recognition using spectrogram and deep convolutional neural networks |
CN108831495A (en) * | 2018-06-04 | 2018-11-16 | 桂林电子科技大学 | A kind of sound enhancement method applied to speech recognition under noise circumstance |
CN109599129A (en) * | 2018-11-13 | 2019-04-09 | 杭州电子科技大学 | Voice depression recognition methods based on attention mechanism and convolutional neural networks |
CN110047506A (en) * | 2019-04-19 | 2019-07-23 | 杭州电子科技大学 | A kind of crucial audio-frequency detection based on convolutional neural networks and Multiple Kernel Learning SVM |
- 2020-11-10 CN CN202011248610.5A patent/CN112349297B/en active Active
Non-Patent Citations (4)
Title |
---|
LE YANG et al.: "Feature Augmenting Networks for Improving Depression Severity Estimation From Speech Signals", IEEE ACCESS *
ZHIYONG WANG et al.: "Recognition of Audio Depression Based on Convolutional Neural Network and Generative Antagonism Network Model", IEEE ACCESS *
LI Jinming et al.: "Audio depression recognition based on deep learning", Computer Applications and Software *
ZHONG Xinzi et al.: "Research on speech emotion recognition methods based on autoencoders", Electronic Design Engineering, no. 06 *
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112818892A (en) * | 2021-02-10 | 2021-05-18 | 杭州医典智能科技有限公司 | Multi-modal depression detection method and system based on time convolution neural network |
CN113012720A (en) * | 2021-02-10 | 2021-06-22 | 杭州医典智能科技有限公司 | Depression detection method by multi-voice characteristic fusion under spectral subtraction noise reduction |
CN113012720B (en) * | 2021-02-10 | 2023-06-16 | 杭州医典智能科技有限公司 | Depression detection method by multi-voice feature fusion under spectral subtraction noise reduction |
CN112687390B (en) * | 2021-03-12 | 2021-06-18 | 中国科学院自动化研究所 | Depression state detection method and device based on hybrid network and lp norm pooling |
CN112687390A (en) * | 2021-03-12 | 2021-04-20 | 中国科学院自动化研究所 | Depression state detection method and device based on hybrid network and lp norm pooling |
CN113223507B (en) * | 2021-04-14 | 2022-06-24 | 重庆交通大学 | Abnormal speech recognition method based on double-input mutual interference convolutional neural network |
CN113223507A (en) * | 2021-04-14 | 2021-08-06 | 重庆交通大学 | Abnormal speech recognition method based on double-input mutual interference convolutional neural network |
CN113205803A (en) * | 2021-04-22 | 2021-08-03 | 上海顺久电子科技有限公司 | Voice recognition method and device with adaptive noise reduction capability |
CN113205803B (en) * | 2021-04-22 | 2024-05-03 | 上海顺久电子科技有限公司 | Voice recognition method and device with self-adaptive noise reduction capability |
CN113476058B (en) * | 2021-07-22 | 2022-11-29 | 北京脑陆科技有限公司 | Intervention treatment method, device, terminal and medium for depression patients |
CN113476058A (en) * | 2021-07-22 | 2021-10-08 | 北京脑陆科技有限公司 | Intervention treatment method, device, terminal and medium for depression patients |
CN113679413A (en) * | 2021-09-15 | 2021-11-23 | 北方民族大学 | VMD-CNN-based lung sound feature identification and classification method and system |
CN113679413B (en) * | 2021-09-15 | 2023-11-10 | 北方民族大学 | VMD-CNN-based lung sound feature recognition and classification method and system |
CN113820693A (en) * | 2021-09-20 | 2021-12-21 | 西北工业大学 | Uniform linear array element failure calibration method based on generation of countermeasure network |
CN113820693B (en) * | 2021-09-20 | 2023-06-23 | 西北工业大学 | Uniform linear array element failure calibration method based on generation of countermeasure network |
CN114219005A (en) * | 2021-11-17 | 2022-03-22 | 太原理工大学 | Depression classification method based on high-order spectral voice features |
CN116978409A (en) * | 2023-09-22 | 2023-10-31 | 苏州复变医疗科技有限公司 | Depression state evaluation method, device, terminal and medium based on voice signal |
Also Published As
Publication number | Publication date |
---|---|
CN112349297B (en) | 2023-07-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112349297B (en) | Depression detection method based on microphone array | |
CN111243620B (en) | Voice separation model training method and device, storage medium and computer equipment | |
US10127922B2 (en) | Sound source identification apparatus and sound source identification method | |
CN109841226A (en) | A kind of single channel real-time noise-reducing method based on convolution recurrent neural network | |
JPH02160298A (en) | Noise removal system | |
CN108922543A (en) | Model library method for building up, audio recognition method, device, equipment and medium | |
WO2019232867A1 (en) | Voice discrimination method and apparatus, and computer device, and storage medium | |
Venkatesan et al. | Binaural classification-based speech segregation and robust speaker recognition system | |
Salvati et al. | A late fusion deep neural network for robust speaker identification using raw waveforms and gammatone cepstral coefficients | |
CN111243621A (en) | Construction method of GRU-SVM deep learning model for synthetic speech detection | |
CN113314127B (en) | Bird song identification method, system, computer equipment and medium based on space orientation | |
US6567771B2 (en) | Weighted pair-wise scatter to improve linear discriminant analysis | |
AU2362495A (en) | Speech-recognition system utilizing neural networks and method of using same | |
Lin et al. | Domestic activities clustering from audio recordings using convolutional capsule autoencoder network | |
CN109741733B (en) | Voice phoneme recognition method based on consistency routing network | |
CN115952840A (en) | Beam forming method, arrival direction identification method, device and chip thereof | |
Raju et al. | Automatic speech recognition system using MFCC-based LPC approach with back-propagated artificial neural networks | |
Kothapally et al. | Speech Detection and Enhancement Using Single Microphone for Distant Speech Applications in Reverberant Environments. | |
Sailor et al. | Unsupervised Representation Learning Using Convolutional Restricted Boltzmann Machine for Spoof Speech Detection. | |
Venkatesan et al. | Deep recurrent neural networks based binaural speech segregation for the selection of closest target of interest | |
CN111785262B (en) | Speaker age and gender classification method based on residual error network and fusion characteristics | |
CN113903344A (en) | Deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction | |
CN112259107A (en) | Voiceprint recognition method under meeting scene small sample condition | |
Jannu et al. | An Overview of Speech Enhancement Based on Deep Learning Techniques | |
CN114512133A (en) | Sound object recognition method, sound object recognition device, server and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||