CN112349297B - Depression detection method based on microphone array - Google Patents
- Publication number
- CN112349297B CN112349297B CN202011248610.5A CN202011248610A CN112349297B CN 112349297 B CN112349297 B CN 112349297B CN 202011248610 A CN202011248610 A CN 202011248610A CN 112349297 B CN112349297 B CN 112349297B
- Authority
- CN
- China
- Prior art keywords
- training
- voice
- neural network
- convolutional neural
- depression
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/16—Devices for psychotechnics; Testing reaction times ; Devices for evaluating the psychological state
- A61B5/165—Evaluating the state of mind, e.g. depression, anxiety
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/48—Other medical applications
- A61B5/4803—Speech analysis specially adapted for diagnostic purposes
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/72—Signal processing specially adapted for physiological signals or for diagnostic purposes
- A61B5/7235—Details of waveform analysis
- A61B5/7264—Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems
- A61B5/7267—Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems involving training the classification device
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/66—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses a depression detection method based on a microphone array. The method collects voice signals of a target patient with a microphone array and preprocesses them; extracts MFCC (Mel-frequency cepstral coefficient) features from the preprocessed audio of the target patient and from voice data of known depression patients, and generates audio spectrograms; feeds the MFCC features into a 1D convolutional neural network to obtain P-dimensional MFCC features; feeds the spectrograms into a 2D convolutional neural network to obtain O-dimensional spectrogram features; inputs the O-dimensional features into a generative adversarial network to generate new spectrogram images, which are fed back into the 2D convolutional neural network for training; fuses the P-dimensional MFCC features with the features obtained from training and reduces their dimension through a fully connected layer; trains a classifier on the dimension-reduced features; and uses the trained classifier to recognize test voice and obtain the recognition result. The invention improves the accuracy of depression recognition in non-experimental environments.
Description
Technical Field
The invention belongs to the technical field of voice recognition methods, and particularly relates to a depression detection method based on a microphone array.
Background
At present, voice signals have shown promise in the field of depression detection, but diagnosis still requires the patient to record speech in front of a fixed acquisition device and relies mainly on a clinician's judgment. Common instruments include the Beck Depression Inventory (BDI) and the Hamilton Depression Rating Scale (HAMD), so the diagnostic result depends on the doctor's experience and skill and, more importantly, on the patient's cooperation. Consequently, much of the speech collected during clinical examination is formulaic and mechanical, which can make the collected voice samples inaccurate. A detection device should therefore be able to collect the patient's voice in the natural conditions of daily life while removing background noise.
A microphone array, composed of a number of acoustic sensors, is a system for spatially sampling and processing a sound field. In complex acoustic environments, noise comes from all directions and often overlaps with the speech signal in both time and frequency; combined with echo and reverberation, this makes it very difficult to capture relatively clean speech with a single microphone. A microphone array fuses the spatio-temporal information of the speech signals and can extract sound sources while simultaneously suppressing noise.
A convolutional neural network (CNN) is one of the deep learning algorithms established in recent years and offers good classification performance for large-scale image processing. The greatest advantage of a generative adversarial network (GAN) is that it alleviates the experimental problem of insufficient sample data: an appropriately constructed network model generates realistic synthetic samples, which can effectively aid the diagnosis and prediction of medical conditions and provide additional diagnostic evidence for medical research.
Combining the microphone array's ability to capture clean sound signals with the strengths of the two deep learning methods, GAN and CNN, improves the accuracy of depression recognition.
Disclosure of Invention
The invention aims to provide a depression detection method based on a microphone array, which improves the accuracy of depression identification.
The technical scheme adopted by the invention is as follows: a depression detection method based on a microphone array, comprising the steps of:
step 1, a microphone array is used for collecting voice signals of a target patient and preprocessing the voice signals;
step 2, extracting MFCC features from the audio signal of the target patient preprocessed in step 1 and from the voice data of existing depression patients, and generating an audio spectrogram;
step 3, sending the MFCC features extracted in step 2 into a 1D convolutional neural network to obtain the P-dimensional features of the MFCC;
step 4, sending the audio spectrogram generated in step 2 into a 2D convolutional neural network to obtain the O-dimensional features of the spectrogram;
step 5, inputting the O-dimensional features obtained in step 4 into a generative adversarial network to generate new spectrogram images, and feeding the generated images into the 2D convolutional neural network of step 4 for training;
step 6, fusing the P-dimensional MFCC features extracted in step 3 with the features obtained by training in step 5, and reducing the dimension through a fully connected layer;
step 7, training a classifier on the dimension-reduced features obtained in step 6;
step 8, recognizing the test voice with the classifier trained in step 7 to obtain the recognition result.
The present invention is also characterized in that,
the step 1 specifically comprises the following steps:
step 1.1, collecting a voice signal of a target patient through a quaternary cross microphone array;
step 1.2, framing and windowing the collected voice signal of the target patient; transforming the signal from the time domain to the frequency domain by fast Fourier transform; estimating the spectral factor by computing the smoothed power spectrum and the noise power spectrum, and outputting the spectral-subtracted signal; finally, computing the energy-entropy ratio to detect the target patient's voice signal and obtain the endpoint values of the speech;
step 1.3, combining the end point detection result, and judging the position of the sound source signal by using a DOA positioning method;
step 1.4, passing the endpoint-detected and source-localized voice signals through a superdirective beamforming algorithm to synthesize the four channel signals into one, thereby achieving synthesis, noise reduction and enhancement of the microphone array signals.
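As a rough illustration of the capture-align-combine pipeline of steps 1.1-1.4 (not the patent's superdirective beamformer), the following sketch aligns four channels with a plain delay-and-sum and averages them; the per-channel delays are hypothetical stand-ins for the DOA estimate of step 1.3:

```python
import numpy as np

def combine_channels(channels, delays):
    """Align the four microphone channels by integer-sample delays and
    average them into one enhanced signal.  Plain delay-and-sum: a
    simplified stand-in for the superdirective beamformer of step 1.4;
    the delays would come from the DOA estimate of step 1.3."""
    n = min(len(c) - d for c, d in zip(channels, delays))
    aligned = np.stack([c[d:d + n] for c, d in zip(channels, delays)])
    return aligned.mean(axis=0)

# four noisy copies of one pulse, each arriving with a known extra delay
rng = np.random.default_rng(0)
pulse = np.zeros(64)
pulse[10] = 1.0
delays = [0, 2, 4, 6]
channels = [np.concatenate([np.zeros(d), pulse])
            + 0.01 * rng.standard_normal(64 + d)
            for d in delays]
enhanced = combine_channels(channels, delays)
```

After alignment the pulse adds coherently while the uncorrelated noise averages down, which is the basic enhancement effect the array provides.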
The step 2 specifically comprises the following steps:
step 2.1, first dividing the voice signal into frames with a Hamming window function; then generating a cepstral feature vector by computing the discrete Fourier transform of each frame and keeping only the logarithm of the magnitude spectrum; after smoothing the spectrum, collecting 24 spectral components in the Mel frequency range for 44100 Hz audio; applying a Karhunen-Loeve transform, approximated by the discrete cosine transform; finally obtaining the cepstral features [f_1, f_2, ..., f_N] for each frame;
step 2.2, framing and windowing the voice signal of the target patient according to the set frame count, performing a short-time Fourier transform on the discrete voice signal x(m), and computing the power spectrum of the m-th frame to obtain a spectrogram; selecting L filters, selecting L frames of the same size as the filters along the time direction, generating an L×L×3 spectrogram, and resizing the resulting color image to M×M×3.
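A minimal numpy sketch of the framing, Hamming windowing and per-frame power spectrum underlying steps 2.1-2.2 (frame length, hop and sampling rate are illustrative values, not ones specified by the patent):

```python
import numpy as np

def frames_power_spectrum(x, frame_len=256, hop=128):
    """Hamming-window each frame and return its power spectrum,
    mirroring step 2.2: framing, windowing, short-time Fourier
    transform, power spectrum.  Parameter values are illustrative."""
    win = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop: i * hop + frame_len] * win
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)        # short-time FT per frame
    return (np.abs(spec) ** 2) / frame_len    # power spectrum

x = np.sin(2 * np.pi * 440 * np.arange(4096) / 16000)  # 440 Hz test tone
P = frames_power_spectrum(x)
```

Stacking the per-frame power spectra over time is exactly the spectrogram that is then resized and fed to the 2D network.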
The 1D convolutional neural network of step 3 is built with the open-source TensorFlow-based Keras framework and contains only two 1D convolution layers, each using rectified linear units as the activation function. The input dimension is M×1; the data pass through w_1 convolution filters of size m×1 with a dropout of 0.1 and a max-pooling stride of q_1, outputting a feature vector of size S. In the training stage of the 1D convolutional neural network, the MFCC features containing the time-frequency information of each speech frame are read into memory sequentially by traversal, split into a training set and a test set, and each set is labeled; the processed data are then fed into the convolutional neural network according to the set labels for iterative training, for a total of B iterations.
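The arithmetic such a layer performs (1D convolution, ReLU activation, max-pooling) can be sketched in numpy; the single difference filter below is purely illustrative, not the patent's trained weights:

```python
import numpy as np

def conv1d_relu_maxpool(x, kernels, pool=2):
    """One 1D convolution layer with ReLU activation followed by
    max-pooling, as applied to a per-frame MFCC vector (step 3).
    `kernels` has shape (n_filters, k); sizes are illustrative."""
    n_filters, k = kernels.shape
    n_out = len(x) - k + 1
    # flip each kernel so np.convolve performs the cross-correlation
    # that CNN "convolution" layers actually compute
    conv = np.stack([np.convolve(x, kern[::-1], mode="valid")
                     for kern in kernels])          # (n_filters, n_out)
    act = np.maximum(conv, 0.0)                     # ReLU
    trimmed = act[:, : (n_out // pool) * pool]
    return trimmed.reshape(n_filters, -1, pool).max(axis=2)

x = np.arange(10, dtype=float)           # stand-in MFCC vector
kernels = np.array([[-1.0, 1.0]])        # one difference filter
feat = conv1d_relu_maxpool(x, kernels)
```

On the ramp input every local difference is 1, so the output is a constant feature map; a trained network would learn w_1 such filters instead.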
The 2D convolutional neural network of step 4 is built with the open-source TensorFlow-based Keras framework and contains w_2 two-dimensional convolution layers of size n×n, w_1 max-pooling layers and 1 fully connected layer with output dimension L; rectified linear units are used as the activation function in the convolution and fully connected layers. In the training stage, the spectrogram features containing the texture-like information of each speech frame are read into memory sequentially by traversal, split into a training set and a test set, and each set is labeled; the processed data are then fed into the convolutional neural network according to the set labels for iterative training, for a total of B iterations. The network is trained with stochastic gradient descent as the optimizer, with learning rate ε, a learning-rate decay of μ after each update, and momentum β.
The generative adversarial network of step 5 comprises a generator and a discriminator. The generator network consists of 1 fully connected layer, 3 transposed convolution layers and 2 batch normalization layers, and outputs a color picture of size M×M×3. The discriminator is a 7-layer convolutional neural network model consisting of 3 convolution layers, 2 batch normalization layers and 2 fully connected layers with a softmax function, and finally outputs a probability value. A probability threshold λ is set; when, after repeated training, the probability value produced by the discriminator exceeds λ, the spectrogram produced by the generator is saved.
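The acceptance rule of step 5 (keep a generated spectrogram only when the discriminator's probability exceeds λ) can be sketched with a stub discriminator standing in for the trained 7-layer model:

```python
import numpy as np

def filter_generated(images, discriminator, lam=0.8):
    """Keep only the generated spectrograms the discriminator scores
    above the probability threshold lambda (step 5).  `discriminator`
    here is any callable returning a probability; in the patent it is
    the trained DCGAN-style discriminator."""
    return [img for img in images if discriminator(img) > lam]

# stub discriminator: "probability" = mean pixel intensity (illustrative)
stub_d = lambda img: float(img.mean())
imgs = [np.full((4, 4, 3), 0.9), np.full((4, 4, 3), 0.3)]
kept = filter_generated(imgs, stub_d, lam=0.8)
```

Only spectrograms passing this check are added back into the 2D network's training data.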
Step 6 is specifically as follows: the P-dimensional MFCC features extracted by the 1D convolutional neural network are fused with the O-dimensional spectrogram features to obtain (P+O)-dimensional features, whose dimension is reduced to 256 through a fully connected layer.
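A sketch of the fusion and dimension reduction of step 6, with assumed sizes P=128 and O=512 and a random, untrained projection standing in for the learned fully connected layer:

```python
import numpy as np

rng = np.random.default_rng(1)
p_feat = rng.standard_normal(128)   # P-dimensional MFCC features (P=128 assumed)
o_feat = rng.standard_normal(512)   # O-dimensional spectrogram features (O=512 assumed)

# fuse by concatenation, then reduce to 256 dims with one FC layer + ReLU
fused = np.concatenate([p_feat, o_feat])             # (P+O,)
W = rng.standard_normal((256, fused.size)) * 0.01    # FC weights (untrained)
b = np.zeros(256)
reduced = np.maximum(W @ fused + b, 0.0)             # 256-d output
```

The 256-dimensional vector is what the classifier of step 7 is trained on.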
The step 7 is specifically as follows:
step 7.1, using the voice of the target patient as test voice and the voice data of existing depression patients as training data; the training data contain the voice information of X persons, whose labels (depressed or not) form a label dictionary in which each label has a corresponding index number, and the index numbers used for classification are set; after a test, the spectrogram generated for the target patient is added to the training data set;
step 7.2, for each label, using the depressed voices as the positive sample set and the non-depressed voices as the negative sample set, and training a two-class SVM with the positive and negative sample sets to obtain the trained two-class SVM.
The beneficial effects of the invention are as follows: the microphone used for voice acquisition is easy to carry and can capture the patient's voice signals in a natural state; building on depression-recognition research that combines CNN and MFCC features with GAN-augmented data sets, the method combines the advantages of MFCC and CNN and improves the accuracy of depression recognition in non-experimental environments.
Drawings
FIG. 1 is a schematic diagram of a method for detecting depression based on a microphone array of the present invention;
FIG. 2 is a schematic diagram of a microphone array used in a method for detecting depression based on a microphone array according to the present invention;
FIG. 3 is a schematic diagram of a CNN model in a microphone array-based depression detection method of the present invention;
fig. 4 is a schematic diagram of GAN model in a method for detecting depression based on a microphone array according to the present invention.
Detailed Description
The invention will be described in detail with reference to the accompanying drawings and detailed description.
The invention provides a depression detection method based on a microphone array, which is shown in fig. 1 to 4 and comprises the following steps:
Step 1: using an annular microphone array enables accurate sound source localization, forms a pickup beam in the direction of the target speaker, suppresses noise and reflected sound, and enhances the sound signal; it can accurately recognize speech at distances of 3-5 m in a noisy environment, meeting the need to capture the patient's daily-life voice signals at any time. Specifically:
step 1.1, collecting a patient voice signal through a quaternary cross microphone array;
step 1.2, framing and windowing the collected voice signal of the target patient, transforming the signal from the time domain to the frequency domain by fast Fourier transform, estimating the spectral factor by computing the smoothed power spectrum and the noise power spectrum, and outputting the spectral-subtracted signal. Finally, the energy-entropy ratio is computed to detect whether the patient's voice signal is present and to obtain the endpoint values of the speech. The energy-entropy ratio is computed as follows:
Let x_i(m) be the signal of the i-th frame, with frame length N. The energy of each frame is
e_i = sum_{m=1}^{N} x_i(m)^2
and the log-energy relation is
E_i = log10(1 + e_i / a)
where a is a constant whose proper adjustment helps distinguish unvoiced segments from noise. The fast Fourier transform of the i-th frame of the voice signal is
X_i(k) = sum_{m=1}^{N} x_i(m) e^{-j 2 pi k m / N}
and the energy spectrum of the frequency component corresponding to the k-th spectral line is
Y_i(k) = |X_i(k)|^2.
The normalized spectral probability density is
p_i(k) = Y_i(k) / sum_{l=1}^{N/2+1} Y_i(l)
and the short-term spectral entropy of the speech frame is
H_i = - sum_{k=1}^{N/2+1} p_i(k) log p_i(k).
The energy-entropy ratio EH_i is the ratio of energy to spectral entropy:
EH_i = E_i / H_i.
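The energy-entropy-ratio computation above can be sketched directly in numpy; the test signals and the constant a are illustrative:

```python
import numpy as np

def energy_entropy_ratio(frame, a=2.0):
    """Per-frame energy-to-entropy ratio used for endpoint detection
    (step 1.2), following the formulas above: log-energy divided by
    short-term spectral entropy.  The constant `a` is illustrative."""
    e = np.sum(frame ** 2)                      # frame energy e_i
    E = np.log10(1.0 + e / a)                   # log-energy E_i
    Y = np.abs(np.fft.rfft(frame)) ** 2         # energy spectrum Y_i(k)
    p = Y / np.sum(Y)                           # normalized spectral prob.
    p = p[p > 0]
    H = -np.sum(p * np.log(p))                  # short-term spectral entropy
    return E / H

fs, N = 16000, 256
t = np.arange(N) / fs
voiced = np.sin(2 * np.pi * 200 * t)                        # tonal "speech"
noise = 0.1 * np.random.default_rng(0).standard_normal(N)   # low-level noise
eh_voiced = energy_entropy_ratio(voiced)
eh_noise = energy_entropy_ratio(noise)
```

Speech frames have high energy and a peaked (low-entropy) spectrum, so their ratio is large; noise frames score low, which is what makes the ratio usable for endpoint detection.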
step 1.3, combining the endpoint detection result, the DOA localization method is used to judge the position of the sound source signal, described here for one frame of signal data: the voice data are read in and the m-th frame is taken as the processing object; the m-th-frame data of the 4 microphone channels are combined into 1 channel and weighted by W_c(k); then the energy sum E_s over the frequency bands is found for each candidate angle, yielding the energy value E_s(i) corresponding to each of the 360 angles of the current frame, with i ranging from 0 to 360 degrees. Taking the maximum E_smax(i) among the 360 energies and the angle i corresponding to that maximum gives the sound source angle determined by the current frame. The band energy of each frame signal for a given angle is
E_s = sum_{k=f_1}^{f_2} |X_sw(k)|^2
where f_1 and f_2 denote the set band limits within 1 to N/2+1, and X_sw(k) is the band-weighted version of the combined single-channel signal:
X_sw(k) = W_e(k) X_s(k)
in which W_e(k) is the band weighting factor, built from an exponent 0 < λ < 1 and a masking weight factor W(k) that selects, within each band of the current data, the frequency band with the maximum signal-to-noise ratio (SNR).
X_s(k) combines the 4 channel signals into 1:
X_s(k) = sum_{i=1}^{4} X_i(k)
where X_i(k) is one of the 4 channel signals.
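The angle scan of step 1.3 can be sketched as a steered-energy search; the delay table mapping angles to per-microphone sample delays is a hypothetical stand-in for the real quaternary-cross geometry and frequency-domain weighting:

```python
import numpy as np

def scan_doa(channels, delay_table):
    """Steered-energy scan (step 1.3, simplified to the time domain):
    for each candidate angle, align the four channels by that angle's
    delay pattern, sum them, and measure the energy E_s(i); the angle
    with maximum energy wins.  `delay_table` maps angle -> per-mic
    sample delays and is a hypothetical stand-in for the true array
    geometry."""
    best_angle, best_e = None, -1.0
    for angle, delays in delay_table.items():
        n = min(len(c) - d for c, d in zip(channels, delays))
        summed = sum(c[d:d + n] for c, d in zip(channels, delays))
        e = float(np.sum(summed ** 2))
        if e > best_e:
            best_angle, best_e = angle, e
    return best_angle

pulse = np.zeros(64)
pulse[5] = 1.0
true_delays = [0, 3, 6, 9]                  # source "at 90 degrees" (assumed)
channels = [np.concatenate([np.zeros(d), pulse]) for d in true_delays]
table = {0: [0, 0, 0, 0], 90: [0, 3, 6, 9], 180: [9, 6, 3, 0]}
angle = scan_doa(channels, table)
```

Only the delay pattern matching the true propagation aligns the pulses coherently, so its summed energy dominates, mirroring the E_smax(i) selection in the text.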
Step 1.4: the voice signals that have undergone endpoint detection and sound source localization are synthesized from 4 channels into 1 channel by a superdirective beamforming algorithm, thereby achieving synthesis, noise reduction and enhancement of the microphone array signals. The superdirective beamforming algorithm is detailed as follows:
The microphone array of the invention is a quaternary cross array, which can be regarded as a kind of uniform circular array. From the array geometry, the direction-of-arrival steering vector of a signal received from angle θ is
a(θ) = [a_1(θ), ..., a_M(θ)]^T
where each element a_i(θ) = e^{-j 2 pi f τ_i(θ)} carries the delay τ_i(θ) of the i-th microphone relative to the array reference.
The voice environment targeted by the method is mainly indoor daily life, so a noise matrix computed for a diffuse noise field has good applicability to the present microphone environment. The diffuse noise field describes a three-dimensional spherically isotropic noise field, with correlation function
Γ_ij(f) = sinc(2 f d_ij / c)
where sinc(x) = sin(pi x) / (pi x) is the sampling function. The microphone array consists of M elements, and the signal received by the i-th microphone is
x_i(t) = A_i e^{j(2 pi f t + φ_i)}
where f denotes the frequency, A_i the amplitude and φ_i the phase. According to the mathematical-model theory of the optimal "superdirectivity" solution, the correlation coefficient of the noise signals between any two points in space is ρ_ij = Γ_ij(f), and the normalized noise covariance matrix is
R_nn = [ρ_ij] (i, j = 1, 2, ..., M)
where d_ij denotes the distance between any two array elements of the microphone array.
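The normalized diffuse-field noise covariance R_nn can be sketched in numpy with the sinc coherence above; the 4 cm element radius is an illustrative geometry, not a value taken from the patent:

```python
import numpy as np

def diffuse_noise_cov(positions, f, c=343.0):
    """Normalized noise covariance for a spherically diffuse noise
    field: rho_ij = sinc(2 f d_ij / c), with sinc(x) = sin(pi x)/(pi x)
    (np.sinc uses exactly this normalized definition).  `positions`
    are microphone coordinates in metres."""
    pos = np.asarray(positions, dtype=float)
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=2)
    return np.sinc(2.0 * f * d / c)

# quaternary cross array, elements 4 cm from the centre (illustrative)
r = 0.04
mics = [(r, 0.0), (0.0, r), (-r, 0.0), (0.0, -r)]
R = diffuse_noise_cov(mics, f=1000.0)
```

The diagonal is 1 (each element is fully coherent with itself) and coherence falls off with element spacing and frequency.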
The invention adopts the minimum variance distortionless response (MVDR) beamforming principle, i.e., the LCMV method under the constraint w^H a(θ) = 1. This approach preserves the signal strength while minimizing the noise variance; in other words, MVDR maximizes the signal-to-noise ratio (SNR) of the array output signal. The goal is to choose filter coefficients w that minimize the total output power under the constraint that the speech signal is not distorted, so the key is to solve for the optimal weight vector w. The constrained problem is
min_w w^H R_x w subject to w^H a(θ_s) = 1
where a(θ_s) = [a_1(θ), ..., a_M(θ)]^T is the target steering vector, representing the transfer function between the source direction and the microphones, computable from the delay times τ; R_x is the spatial signal covariance matrix, which, when k mutually temporally uncorrelated noise signals arrive at the array elements from different directions, is defined as
R_x = E[x(t) x^H(t)].
Solving with a Lagrange multiplier gives
w = R_x^{-1} a(θ_s) / (a^H(θ_s) R_x^{-1} a(θ_s)).
Replacing the covariance matrix R_x in the MVDR solution above with the normalized noise covariance matrix R_nn obtained earlier yields the superdirective weighting factor
w = R_nn^{-1} a(θ_s) / (a^H(θ_s) R_nn^{-1} a(θ_s)).
The weighted beamforming of the multi-channel microphone is completed with this optimized superdirective weighting coefficient.
Step 2: extracting the MFCC features and generating the audio spectrogram, i.e., simultaneously extracting a time-frequency representation and a texture-like representation of the audio signal:
step 2.1, first dividing the voice signal into frames with a Hamming window function; then generating a cepstral feature vector by computing the discrete Fourier transform of each frame and keeping only the logarithm of the magnitude spectrum; after the spectrum is smoothed, collecting 24 spectral components in the Mel frequency range for 44100 Hz audio. Because the components of the Mel spectral vector computed for each frame are highly correlated, a Karhunen-Loeve (KL) transform is applied, approximated by the discrete cosine transform (DCT), finally yielding the cepstral features [f_1, f_2, ..., f_N] for each frame;
step 2.2, framing and windowing the patient's voice signal according to the set frame count, performing a short-time Fourier transform on the discrete voice signal x(m), and computing the power spectrum of the m-th frame to obtain a spectrogram. To fit the input of the convolutional neural network, L filters are selected and L frames of the same size as the filters are selected along the time direction, generating an L×L×3 spectrogram; the resulting color image is resized to M×M×3.
Step 4: the spectrogram of step 2 is fed into a 2D convolutional neural network to obtain the O-dimensional features of the spectrogram. The 2D convolutional neural network is built with the open-source TensorFlow-based Keras framework as a simplification of the AlexNet architecture, containing w_2 two-dimensional convolution layers of size n×n, w_1 max-pooling layers and 1 fully connected layer with output dimension L; rectified linear units (ReLUs) serve as the activation function in the convolution and fully connected layers. In the training stage, the spectrogram features containing the texture-like information of each speech frame are read into memory sequentially by traversal, split into labeled training and test sets, and fed into the convolutional neural network according to the set labels for iterative training, for a total of B iterations. The network is trained with stochastic gradient descent as the optimizer, with learning rate ε, a learning-rate decay of μ after each update, and momentum β.
And 5, inputting the features obtained in step 4 into an adversarial generation network to generate new spectrum images, adding the generated spectra to the original spectrum data, and then performing the training of step 4 again. The adversarial generation network is a simplified, re-parameterized network based on the DCGAN structure. The network model comprises a generator and a discriminator: the generator network model consists of 1 fully connected layer, 3 transposed convolution layers, and 2 batch normalization layers, and outputs a color picture of size M×M×3, while the discriminator part comprises 3 convolution layers and a fully connected layer with a softmax function; the discriminator network model consists of 3 convolution layers, 2 batch normalization layers, and 2 fully connected layers, using a 7-layer convolutional neural network model, and finally outputs a probability value. A probability threshold λ is set; when the probability value produced by the discriminator after multiple rounds of training is greater than λ, the spectrogram generated by the generator is saved. The generated spectrograms meeting this standard are fed into the convolutional network of step 4 for retraining.
And step 6, fusing the MFCC features extracted in step 3 with the features obtained from the expanded spectrogram data in step 4, and performing dimension reduction through a fully connected layer, specifically: the P-dimensional MFCC features extracted through the CNN are fused with the O-dimensional spectrogram features to obtain (P+O)-dimensional features, whose dimension is reduced to 256 through a fully connected layer.
And 7, training a classifier on the dimension-reduced features obtained in step 6, specifically as follows:
and 7.1, the voice of the target patient is taken as the test voice, and the voice data of existing depression patients as the training data. The training data comprise voice information of X persons; the labels indicating whether each of the X persons suffers from depression form a label dictionary, each label is given a corresponding index number, and that index number is used as the class index. After one test, the spectrograms generated for the target patient are added to the training data set.
Step 7.2, for each label, the voices of depressed subjects are taken as the positive sample set and the voices of non-depressed subjects as the negative sample set. The two-class SVM is trained with the positive and negative sample sets to obtain a trained two-class SVM; the classifier training process is as follows:
The two parameters of the SVM, the kernel function and the penalty factor, are determined by cyclically checking the accuracy on the SVM training set; after the optimal parameters are selected, model training is performed with those parameters. Let the training sample voice data be {x_i, y_i}, x_i ∈ R^n, i = 1, 2, ..., n, where x_i is the (O+P)-dimensional feature vector and y_i is the depression label. The SVM maps the training set to a high-dimensional space with a nonlinear mapping Φ(x), so that the nonlinear problem becomes linear; the optimal classification surface is described as y = ω^T Φ(x) + b, where ω and b are the weight and bias vector of the SVM.
To find the optimal ω and b, a relaxation factor ξ_i is introduced and the classification surface is transformed into its quadratic optimization problem, namely:

min_{ω,b,ξ} (1/2)‖ω‖² + C Σ_{i=1}^{n} ξ_i

s.t. y_i(ω·Φ(x_i) + b) ≥ 1 − ξ_i

ξ_i ≥ 0, i = 1, 2, ..., n

wherein C represents the penalty parameter. Introducing Lagrange multipliers α_i transforms the quadratic optimization problem into its dual:

max_α Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j Φ(x_i)·Φ(x_j)

s.t. Σ_{i=1}^{n} α_i y_i = 0, 0 ≤ α_i ≤ C.
the weight vector ω is calculated as: omega = Σα i y i Φ(x i ) Φ (x), the decision function of the support vector machine can be described as: f (x) =sgn (α) i y i Φ(x i )·Φ(x j ) +b), simplifying calculation, and introducing a Gaussian direct basis (RBF) kernel function to make a decision function as follows:
where σ represents the width parameter of the RBF.
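The RBF-kernel decision function above can be sketched directly. This is a minimal illustration: the support vectors, multipliers α_i, and bias b are assumed to be given (in practice they come from solving the dual problem), and the toy values in the usage below are invented for demonstration.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian RBF kernel K(a, b) = exp(-||a - b||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2) / (2 * sigma ** 2))

def svm_decision(x, support_x, support_y, alpha, b, sigma=1.0):
    """Decision function f(x) = sgn(sum_i alpha_i * y_i * K(x_i, x) + b).
    alpha and b are assumed inputs obtained from training."""
    s = sum(a * y * rbf_kernel(xi, x, sigma)
            for a, y, xi in zip(alpha, support_y, support_x))
    return 1 if s + b >= 0 else -1
```

With two hypothetical support vectors at [0, 0] (positive) and [2, 2] (negative), a point near the origin is classified positive and a point at [2, 2] negative, matching the sign of the kernel-weighted sum.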
And 8, recognizing the test voice with the classifier trained in step 7. The generated recognition result can be sent over WIFI to the patient's guardian, so that the patient's condition can be observed at any time.
In this way, the microphone used for voice acquisition is convenient to carry, and the voice signal of the patient in a natural state can be acquired; building on depression-recognition research that combines CNN features, MFCCs, and GAN-enhanced data sets, the advantages of the MFCC and the CNN are combined, and the accuracy of depression recognition in a non-laboratory environment is improved.
The depression recognition challenge database of AVEC2013 (audiovisual depression recognition) is used to test the microphone-array-based depression detection method; the data set contains voice information from 340 individuals. The specific operation is as follows:
step 1, preprocessing the voice signals under each sub-directory sequentially by using a traversing method, and dividing the voice signals into frames by using a Hamming window function. A cepstral feature vector is then generated and a discrete fourier transform is computed for each frame. Only the logarithm of the amplitude is retained. After the spectrum is smoothed, 24 spectrum components of 44100 frequency bands are collected in the Mel frequency range. The components of the mel-spectrum vector calculated for each frame are highly correlated. Therefore, KL (Karhunen-Loeve) transform is applied, and then approximated as Discrete Cosine Transform (DCT).
Step 2, after preprocessing, the MFCC features are extracted and normalized; by dividing the voice into segments, the length of each segment is limited to 10 seconds, and at 50 frames per second a 177-dimensional feature vector is obtained for each frame, so the number of channels per second of voice is 50. The voice signal is also converted into a spectrogram, with the sampling limited to 64 frames per second; a color spectrogram picture of 64×64×3 pixels is obtained and resized to 200×200×3 pixels.
And 4, the convolution and pooling layers are built: a 7-layer convolutional neural network model with 3 convolution layers, 3 max-pooling layers, and 1 fully connected layer. The input to the first layer is a 200×200×3 spectrogram, convolved with 3×3 convolution kernels that move along the X and Y axes of the image with a stride of 1 pixel; 64 convolution kernels generate a 198×198×64 pixel layer. The ReLU function is used as the activation function: the pixel layer is processed by ReLU units to produce an activated pixel layer, which is then max-pooled with a 2×2 pooling scale and a default stride of 2, giving a pooled size of 99×99×64. During back-propagation, each convolution kernel corresponds to one bias value, i.e. the 64 convolution kernels of the first layer correspond to 64 bias values for the input from the previous layer. The second layer uses 32 convolution kernels of 3×3×64; the convolution produces a 97×97×32 pixel layer, which is processed by ReLU units and then max-pooled with a pooling scale of 2, giving a pooled size of 48×48×32; a Dropout layer then disconnects input neurons with 10% probability during parameter updates to prevent overfitting. In the back-propagation of this layer, each convolution kernel again corresponds to one bias value, i.e. the 32 convolution kernels of the second layer correspond to 32 bias values for the input from the previous layer. Similarly, the third layer uses 32 convolution kernels of 3×3×32; the convolution produces a 46×46×32 pixel layer, which is processed by ReLU units and max-pooled with a 2×2 pooling scale, giving a pooled size of 23×23×32, followed again by a Dropout layer that disconnects input neurons with 10% probability during parameter updates. A flattening layer then converts the multi-dimensional input into one dimension, outputting a one-dimensional pixel array containing 16928 values in total, which is passed as input into the fully connected layer for further operation.
To extract features of the spectrogram itself to feed into the GAN network for generating new spectrograms, the obtained multi-dimensional features must be dimension-reduced: a fully connected (Dense) layer is built that fully connects the 16928 input values to 128 neural units; after ReLU activation, 128 values are produced, and after Dropout, 128 values are output as the voice emotion features.
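The layer-size arithmetic of step 4 can be checked with a few lines of pure Python (this traces shapes only, not the network itself; "valid" convolutions with stride 1 are assumed, as the sizes in the text imply):

```python
def conv_out(size, kernel, stride=1):
    """Output size of a 'valid' convolution: (size - kernel) // stride + 1."""
    return (size - kernel) // stride + 1

def pool_out(size, scale=2):
    """Max pooling with a scale x scale window and stride = scale."""
    return size // scale

# Trace the three conv + pool stages of step 4 (200 x 200 x 3 input):
s = 200
s = pool_out(conv_out(s, 3))   # layer 1: conv -> 198, pool -> 99
s = pool_out(conv_out(s, 3))   # layer 2: conv -> 97,  pool -> 48
s = pool_out(conv_out(s, 3))   # layer 3: conv -> 46,  pool -> 23
flattened = s * s * 32         # 23 * 23 * 32 = 16928 values into the Dense layer
```

The trace reproduces the 16928 flattened units that the 128-unit Dense layer then reduces.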
Step 5, the GAN generator network model of the invention consists of 1 fully connected layer, 3 transposed convolution layers, and 2 batch normalization layers. The input to the first layer is the 128 values extracted in step 4, which are connected to 4608 neurons through the fully connected layer and reshaped to 3×3×512. The second layer uses a transposed convolution to reduce 512 channels to 256 channels, with kernel_size 3 and stride 3, followed by a batch normalization layer; the third layer uses a transposed convolution to reduce 256 channels to 128 channels, with kernel_size 5 and stride 2, followed by a batch normalization layer; the fourth layer uses a transposed convolution to reduce 128 channels to 3 channels, with kernel_size 4 and stride 3.
the GAN discriminator network model of the invention consists of 3 convolutional layers, 2 batch normalization layers, and 1 fully connected layer using a 7-layer convolutional neural network model. The input data of the first layer is a spectrogram of 64 multiplied by 3, convolution operation is carried out on the spectrogram and a convolution kernel of 5 multiplied by 3, the convolution kernel moves along the X axis and the Y axis of the image, the step length is 1 pixel, 64 convolution kernels are used for generating 60 multiplied by 24 pixel layer data, a leakage-ReLU function is used as an activation function, and the pixel layers are subjected to the processing of the leakage-ReLU unit to generate an activated pixel layer; the second layer uses 128 5×5×128 convolution kernels, and 57×57×128 pixel layers are generated after convolution operation. The pixel layers are processed by a Leakly-ReLU unit to generate activated pixels, and the activated pixel layers are processed by a batch normalization layer for preventing overfitting; the third layer uses 256 5×5×256 convolution kernels, and generates 53×53×256 pixel layers after convolution operation. The pixel layers are processed by a Leakly-ReLU unit to generate activated pixels, and the activated pixel layers are processed by a batch normalization layer for preventing overfitting; using a flattening layer to unify multi-dimensional input, after flattening treatment, using the pixels as input to enter a full-connection layer, wherein the final output layer is 1 node, and outputting a probability value; the size of the generated spectrogram which meets the standard and is 64 multiplied by 3 is modified to be 200 multiplied by 3 pixel size, and then the spectrogram is transmitted into the convolution network in the step 4 for retraining.
And 6, a fully connected layer is built: the 1800-dimensional data extracted in step 3 and the 16928-dimensional data extracted in step 4 are concatenated into 18728-dimensional data and fully connected to 256 neural units; after ReLU activation, 256 values are produced, and after Dropout, 256 values are output as the voice emotion features.
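The fusion and dimension reduction of step 6 amount to a concatenation followed by one dense layer. A minimal numpy sketch, with random feature vectors and untrained (random) Dense weights standing in for the real branch outputs and learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
mfcc_feat = rng.standard_normal(1800)      # 1800-dim MFCC branch output (step 3)
spec_feat = rng.standard_normal(16928)     # 16928-dim spectrogram branch (step 4)

fused = np.concatenate([mfcc_feat, spec_feat])     # 18728-dim fused features
W = rng.standard_normal((256, fused.size)) * 0.01  # illustrative Dense weights
b = np.zeros(256)
reduced = np.maximum(W @ fused + b, 0.0)           # ReLU(Wx + b) -> 256-dim output
```

In the trained network the 256-dimensional output is the voice emotion feature fed to the classifier.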
Step 7, since the data set contains 292 persons, a total of 43,800 voice clips are used after clipping and screening. The labels indicating whether each of the 292 persons suffers from depression form a label dictionary, each label is given a corresponding index number used as the class index; 90% of the labeled voice signals are used as the training set and the remaining 10% as the test set;
for each label, the voices of depressed subjects are taken as the positive sample set and the voices of non-depressed subjects as the negative sample set. The two-class SVM is trained with the positive and negative sample sets to obtain a trained two-class SVM;
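The label dictionary and 90/10 split of step 7 can be sketched as follows. The person identifiers, dummy labels, and clip names are invented for illustration; only the 292 subjects and 43,800 clips come from the text above.

```python
# Label dictionary mapping each label to its class index (step 7).
label_dict = {"not_depressed": 0, "depressed": 1}

# Dummy per-person labels for 292 subjects (alternating, for illustration).
person_labels = {f"person_{i:03d}": i % 2 for i in range(292)}

# 43,800 (clip, class-index) pairs, split 90% train / 10% test.
clips = [(f"clip_{j:05d}.wav", person_labels[f"person_{j % 292:03d}"])
         for j in range(43800)]
split = int(len(clips) * 0.9)
train_set, test_set = clips[:split], clips[split:]
```

A real split would be stratified by speaker so that no person appears in both sets; this sketch only shows the bookkeeping.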
And 8, recognizing the test voice with the two-class SVM trained in step 7.
Claims (8)
1. A method for detecting depression based on a microphone array, comprising the steps of:
step 1, a microphone array is used for collecting voice signals of a target patient and preprocessing the voice signals;
step 2, extracting the MFCC features of the audio signal of the target patient preprocessed in step 1 and of the voice data of existing depression patients, and generating an audio spectrogram;
step 3, the MFCC features extracted in the step 2 are sent into a 1D convolutional neural network to obtain P-dimensional features of the MFCC;
step 4, sending the audio spectrogram generated in the step 2 into a 2D convolutional neural network to obtain the O-dimensional characteristic of the spectrogram;
step 5, inputting the O-dimensional features obtained in step 4 into an adversarial generation network to generate new spectrum images, and feeding the generated spectrum images into the 2D convolutional neural network of step 4 for training;
step 6, fusing the P-dimensional characteristics of the MFCC extracted in the step 3 with the characteristics obtained by training in the step 5, and reducing the dimension through the full connection layer;
step 7, training a classifier through the feature after dimension reduction obtained in the step 6;
and 8, recognizing the test voice through the trained classifier in the step 7 to obtain a recognition result.
2. The method for detecting depression based on microphone array as claimed in claim 1, wherein the step 1 specifically comprises the steps of:
step 1.1, collecting a voice signal of a target patient through a quaternary cross microphone array;
step 1.2, framing and windowing the collected voice signal of the target patient, transforming the signal from the time domain to the frequency domain by a fast Fourier transform, completing the estimation of the spectral factor by calculating a smoothed power spectrum and a noise power spectrum, outputting the spectrally subtracted signal, and finally detecting the endpoint values of the target patient's voice by combining the energy-entropy ratio;
step 1.3, combining the end point detection result, and judging the position of the sound source signal by using a DOA positioning method;
and 1.4, synthesizing four paths of signals into one path of signal through a superdirective beam forming algorithm by using the voice signals subjected to end point detection and sound source positioning processing, so as to realize synthesis, noise reduction and enhancement of microphone array signals.
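The spectral subtraction of step 1.2 can be sketched for a single frame as follows. This is a minimal illustration: the noise magnitude spectrum is assumed to be already estimated (e.g. by averaging noise-only frames), and the smoothing and energy-entropy endpoint detection steps are omitted.

```python
import numpy as np

def spectral_subtract(frame, noise_mag, window=None):
    """One-frame spectral subtraction: FFT to the frequency domain, subtract
    an estimated noise magnitude spectrum, resynthesize with the original
    phase. noise_mag must have len(frame)//2 + 1 bins (rfft layout)."""
    if window is None:
        window = np.hamming(len(frame))
    spec = np.fft.rfft(frame * window)
    mag, phase = np.abs(spec), np.angle(spec)
    clean_mag = np.maximum(mag - noise_mag, 0.0)   # half-wave rectification
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(frame))
```

With a zero noise estimate the frame passes through (windowed) unchanged; with a real estimate the stationary noise floor is attenuated before endpoint detection.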
3. The method for detecting depression based on microphone array as claimed in claim 2, wherein the step 2 specifically comprises the steps of:
step 2.1, firstly dividing the voice signal into frames through a Hamming window function; then generating a cepstral feature vector: calculating a discrete Fourier transform for each frame, keeping only the logarithm of the amplitude spectrum, collecting 24 spectral components over the 44100 frequency bands in the mel frequency range after the spectrum is smoothed, and, after applying a Karhunen-Loeve transform, approximating it by a discrete cosine transform; finally obtaining a cepstral feature vector [f_1, f_2, ..., f_N] for each frame;
step 2.2, according to the set frame number, framing and windowing the voice signal of the target patient, performing a short-time Fourier transform on the discrete voice signal x(m), and calculating the power spectrum of the m-th frame to obtain a spectrogram; selecting L filters and L frames of the same size as the filters in the time direction, generating an L×L×3 spectrogram, and resizing the generated color image to M×M×3.
4. The method for detecting depression based on microphone array as claimed in claim 3, wherein the 1D convolutional neural network of step 3 is: using the open-source TensorFlow-based Keras framework, only two 1D convolution layers are built, each adopting a rectified linear unit as the activation function; the input dimension is M×1, passed through w_1 convolution-layer filters of size m×1 with a Dropout of 0.1 and a max-pooling stride of q_1, outputting a feature vector S; in the stage of training the 1D convolutional neural network, the MFCC features containing time-frequency information of each frame of the voice signal are read into memory sequentially by a traversal method, divided into a training set and a testing set, and labeled; the processed data are then fed into the convolutional neural network for iterative training according to the set labels, for B iterations in total.
5. The method for detecting depression based on microphone array as claimed in claim 4, wherein the 2D convolutional neural network of step 4 is: using the open-source TensorFlow-based Keras framework, a convolutional neural network is built containing w_2 two-dimensional convolution layers of size n×n, w_1 max-pooling layers, and 1 fully connected layer with output dimension L, wherein rectified linear units are adopted as the activation function in the convolution layers and the fully connected layer; in the stage of training the convolutional neural network, the spectrogram features containing texture-like information of each frame of the voice signal are read into memory sequentially by a traversal method, divided into a training set and a testing set, and labeled; the processed data are then fed into the convolutional neural network for iterative training according to the set labels, for B iterations in total; the convolutional neural network is trained with stochastic gradient descent as the optimizer, with learning rate ε, a learning-rate decay of μ after each update, and momentum β.
6. The method for detecting depression based on microphone array as claimed in claim 5, wherein the adversarial generation network of step 5 is: a network model comprising a generator and a discriminator, wherein the generator network model consists of 1 fully connected layer, 3 transposed convolution layers, and 2 batch normalization layers and outputs a color picture of size M×M×3, while the discriminator part comprises 3 convolution layers and a fully connected layer with a softmax function; the discriminator network model consists of 3 convolution layers, 2 batch normalization layers, and 2 fully connected layers, using a 7-layer convolutional neural network model, and finally outputs a probability value; a probability threshold λ is set, and when the probability value produced by the discriminator after multiple rounds of training is greater than λ, the spectrogram generated by the generator is saved.
7. The method for detecting depression based on microphone array as claimed in claim 6, wherein step 6 is specifically: the P-dimensional MFCC features extracted through the 1D convolutional neural network are fused with the O-dimensional spectrogram features to obtain (P+O)-dimensional features, whose dimension is reduced to 256 through a fully connected layer.
8. The method for detecting depression based on microphone array as claimed in claim 7, wherein said step 7 is specifically:
step 7.1, taking the voice of the target patient as the test voice and the voice data of existing depression patients as the training data; the training data comprise voice information of X persons, the labels indicating whether each of the X persons suffers from depression form a label dictionary, and each label is given a corresponding index number used as the class index; after one test, the spectrograms generated for the target patient are added to the training data set;
and 7.2, for each label, taking the voices of depressed subjects as the positive sample set and the voices of non-depressed subjects as the negative sample set, and training the two-class SVM with the positive and negative sample sets to obtain a trained two-class SVM.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011248610.5A CN112349297B (en) | 2020-11-10 | 2020-11-10 | Depression detection method based on microphone array |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112349297A CN112349297A (en) | 2021-02-09 |
CN112349297B true CN112349297B (en) | 2023-07-04 |
Family
ID=74362344
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011248610.5A Active CN112349297B (en) | 2020-11-10 | 2020-11-10 | Depression detection method based on microphone array |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112349297B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113012720B (en) * | 2021-02-10 | 2023-06-16 | 杭州医典智能科技有限公司 | Depression detection method by multi-voice feature fusion under spectral subtraction noise reduction |
CN112818892B (en) * | 2021-02-10 | 2023-04-07 | 杭州医典智能科技有限公司 | Multi-modal depression detection method and system based on time convolution neural network |
CN112687390B (en) * | 2021-03-12 | 2021-06-18 | 中国科学院自动化研究所 | Depression state detection method and device based on hybrid network and lp norm pooling |
CN113223507B (en) * | 2021-04-14 | 2022-06-24 | 重庆交通大学 | Abnormal speech recognition method based on double-input mutual interference convolutional neural network |
CN113205803B (en) * | 2021-04-22 | 2024-05-03 | 上海顺久电子科技有限公司 | Voice recognition method and device with self-adaptive noise reduction capability |
CN113476058B (en) * | 2021-07-22 | 2022-11-29 | 北京脑陆科技有限公司 | Intervention treatment method, device, terminal and medium for depression patients |
CN113679413B (en) * | 2021-09-15 | 2023-11-10 | 北方民族大学 | VMD-CNN-based lung sound feature recognition and classification method and system |
CN113820693B (en) * | 2021-09-20 | 2023-06-23 | 西北工业大学 | Uniform linear array element failure calibration method based on generation of countermeasure network |
CN114219005B (en) * | 2021-11-17 | 2023-04-18 | 太原理工大学 | Depression classification method based on high-order spectrum voice features |
CN116978409A (en) * | 2023-09-22 | 2023-10-31 | 苏州复变医疗科技有限公司 | Depression state evaluation method, device, terminal and medium based on voice signal |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107705806A (en) * | 2017-08-22 | 2018-02-16 | 北京联合大学 | A kind of method for carrying out speech emotion recognition using spectrogram and deep convolutional neural networks |
CN108831495A (en) * | 2018-06-04 | 2018-11-16 | 桂林电子科技大学 | A kind of sound enhancement method applied to speech recognition under noise circumstance |
CN109599129A (en) * | 2018-11-13 | 2019-04-09 | 杭州电子科技大学 | Voice depression recognition methods based on attention mechanism and convolutional neural networks |
CN110047506A (en) * | 2019-04-19 | 2019-07-23 | 杭州电子科技大学 | A kind of crucial audio-frequency detection based on convolutional neural networks and Multiple Kernel Learning SVM |
Non-Patent Citations (4)
Title |
---|
Feature Augmenting Networks for Improving Depression Severity Estimation From Speech Signals;LE YANG等;IEEE ACCESS;全文 * |
Recognition of Audio Depression Based on Convolutional Neural Network and Generative Antagonism Network Model;ZHIYONG WANG等;IEEE ACCESS;全文 * |
基于深度学习的音频抑郁症识别;李金鸣等;计算机应用与软件;全文 * |
基于自编码器的语音情感识别方法研究;钟昕孜 等;电子设计工程(第06期);全文 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112349297B (en) | Depression detection method based on microphone array | |
US10901063B2 (en) | Localization algorithm for sound sources with known statistics | |
CN109272989B (en) | Voice wake-up method, apparatus and computer readable storage medium | |
US10127922B2 (en) | Sound source identification apparatus and sound source identification method | |
Stöter et al. | Countnet: Estimating the number of concurrent speakers using supervised learning | |
Glodek et al. | Multiple classifier systems for the classification of audio-visual emotional states | |
US5621848A (en) | Method of partitioning a sequence of data frames | |
JPS62201500A (en) | Continuous speech recognition | |
Suvorov et al. | Deep residual network for sound source localization in the time domain | |
CN113314127B (en) | Bird song identification method, system, computer equipment and medium based on space orientation | |
CN112329819A (en) | Underwater target identification method based on multi-network fusion | |
Salvati et al. | A late fusion deep neural network for robust speaker identification using raw waveforms and gammatone cepstral coefficients | |
Venkatesan et al. | Binaural classification-based speech segregation and robust speaker recognition system | |
US5832181A (en) | Speech-recognition system utilizing neural networks and method of using same | |
Salvati et al. | Time Delay Estimation for Speaker Localization Using CNN-Based Parametrized GCC-PHAT Features. | |
Lin et al. | Domestic activities clustering from audio recordings using convolutional capsule autoencoder network | |
CN117762372A (en) | Multi-mode man-machine interaction system | |
CN115952840A (en) | Beam forming method, arrival direction identification method, device and chip thereof | |
Ganchev et al. | Automatic height estimation from speech in real-world setup | |
Amami et al. | A robust voice pathology detection system based on the combined bilstm–cnn architecture | |
Kanisha et al. | Speech recognition with advanced feature extraction methods using adaptive particle swarm optimization | |
Raju et al. | AUTOMATIC SPEECH RECOGNITION SYSTEM USING MFCC-BASED LPC APPROACH WITH BACK PROPAGATED ARTIFICIAL NEURAL NETWORKS. | |
Venkatesan et al. | Deep recurrent neural networks based binaural speech segregation for the selection of closest target of interest | |
Sailor et al. | Unsupervised Representation Learning Using Convolutional Restricted Boltzmann Machine for Spoof Speech Detection. | |
Kothapally et al. | Speech Detection and Enhancement Using Single Microphone for Distant Speech Applications in Reverberant Environments. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||