CN112349297B - Depression detection method based on microphone array - Google Patents


Info

Publication number
CN112349297B
CN112349297B (application CN202011248610.5A)
Authority
CN
China
Prior art keywords
training
voice
neural network
convolutional neural
depression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011248610.5A
Other languages
Chinese (zh)
Other versions
CN112349297A (en)
Inventor
焦亚萌
周成智
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Polytechnic University
Original Assignee
Xian Polytechnic University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Polytechnic University filed Critical Xian Polytechnic University
Priority to CN202011248610.5A priority Critical patent/CN112349297B/en
Publication of CN112349297A publication Critical patent/CN112349297A/en
Application granted granted Critical
Publication of CN112349297B publication Critical patent/CN112349297B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 characterised by the type of extracted parameters
    • G10L25/24 the extracted parameters being the cepstrum
    • G10L25/27 characterised by the analysis technique
    • G10L25/30 using neural networks
    • G10L25/48 specially adapted for particular use
    • G10L25/51 for comparison or discrimination
    • G10L25/66 for extracting parameters related to health condition
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/16 Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state
    • A61B5/165 Evaluating the state of mind, e.g. depression, anxiety
    • A61B5/48 Other medical applications
    • A61B5/4803 Speech analysis specially adapted for diagnostic purposes
    • A61B5/72 Signal processing specially adapted for physiological signals or for diagnostic purposes
    • A61B5/7235 Details of waveform analysis
    • A61B5/7264 Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems
    • A61B5/7267 involving training the classification device
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Molecular Biology (AREA)
  • Psychiatry (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Biophysics (AREA)
  • Pathology (AREA)
  • Biomedical Technology (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Surgery (AREA)
  • Animal Behavior & Ethology (AREA)
  • Veterinary Medicine (AREA)
  • Acoustics & Sound (AREA)
  • Evolutionary Computation (AREA)
  • Child & Adolescent Psychology (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physiology (AREA)
  • Epidemiology (AREA)
  • Fuzzy Systems (AREA)
  • Developmental Disabilities (AREA)
  • Educational Technology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychology (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a depression detection method based on a microphone array. The method comprises the steps of collecting and preprocessing the voice signal of a target patient with a microphone array; extracting MFCC (Mel-frequency cepstral coefficient) features from the preprocessed audio of the target patient and from the voice data of existing depression patients, and generating audio spectrograms; feeding the MFCC features into a 1D convolutional neural network to obtain P-dimensional MFCC features; feeding the audio spectrograms into a 2D convolutional neural network to obtain O-dimensional spectrogram features; inputting the O-dimensional features into a generative adversarial network to generate new spectrogram images, and passing the generated images back into the 2D convolutional neural network for training; fusing the P-dimensional MFCC features with the features obtained from training and reducing the dimension through a fully connected layer; training a classifier with the dimension-reduced features; and recognizing test speech with the trained classifier to obtain the recognition result. The invention improves the accuracy of depression recognition in a non-experimental environment.

Description

Depression detection method based on microphone array
Technical Field
The invention belongs to the technical field of voice recognition methods, and particularly relates to a depression detection method based on a microphone array.
Background
At present, voice signals have made some progress in the field of depression detection, but assessment of a patient's condition still requires the patient to record speech in front of a fixed acquisition device and relies mainly on a clinician's diagnosis. Common diagnostic tools include the Beck Depression Inventory (BDI) and the Hamilton Depression Rating Scale (HAMD), so the result depends on the doctor's experience and ability and, more importantly, on the patient's cooperation. As a result, the speech collected during such examinations tends to be scripted and mechanical, which can make the collected speech an inaccurate reflection of the patient. A detection device is therefore needed that can collect the patient's speech in the natural setting of daily life while removing background noise.
A microphone array, composed of a number of acoustic sensors, is a system for sampling and processing the spatial characteristics of a sound field. In complex acoustic environments noise arrives from all directions and often overlaps with the speech signal in both time and spectrum; with the added effects of echo and reverberation, it is very difficult to capture relatively clean speech with a single microphone. A microphone array fuses the spatio-temporal information of the speech signals and can extract the sound source while suppressing noise.
A convolutional neural network (CNN) is one of the deep learning algorithms established in recent years and offers good classification performance for large-scale image processing. The main advantage of a generative adversarial network (GAN) is that it addresses the problem of insufficient sample data: by constructing a suitable network model it can generate realistic synthetic samples, which effectively supports the diagnosis and prediction of medical conditions and provides additional diagnostic evidence for medical research.
Combining the microphone array's ability to capture clean sound signals with the strengths of the two deep learning methods, GAN and CNN, improves the accuracy of depression recognition.
Disclosure of Invention
The invention aims to provide a depression detection method based on a microphone array, which improves the accuracy of depression identification.
The technical scheme adopted by the invention is as follows: a depression detection method based on a microphone array, comprising the steps of:
step 1, collecting the voice signal of a target patient with a microphone array and preprocessing it;
step 2, extracting MFCC features from the audio of the target patient preprocessed in step 1 and from the voice data of existing depression patients, and generating audio spectrograms;
step 3, feeding the MFCC features extracted in step 2 into a 1D convolutional neural network to obtain P-dimensional MFCC features;
step 4, feeding the audio spectrograms generated in step 2 into a 2D convolutional neural network to obtain O-dimensional spectrogram features;
step 5, inputting the O-dimensional features obtained in step 4 into a generative adversarial network to generate new spectrogram images, and passing the generated images into the 2D convolutional neural network of step 4 for training;
step 6, fusing the P-dimensional MFCC features extracted in step 3 with the features obtained from the training in step 5, and reducing the dimension through a fully connected layer;
step 7, training a classifier with the dimension-reduced features obtained in step 6;
step 8, recognizing the test speech with the classifier trained in step 7 to obtain the recognition result.
The present invention is also characterized in that,
the step 1 specifically comprises the following steps:
step 1.1, collecting a voice signal of a target patient through a quaternary cross microphone array;
step 1.2, framing and windowing the collected voice signal of the target patient, transforming the signal from the time domain to the frequency domain with the fast Fourier transform, estimating the spectral factor by computing the smoothed power spectrum and the noise power spectrum, outputting the spectrally subtracted signal, and finally computing the energy-entropy ratio to detect the target patient's speech and obtain the endpoint values of the voice;
step 1.3, combining the end point detection result, and judging the position of the sound source signal by using a DOA positioning method;
and 1.4, synthesizing four paths of signals into one path of signal through a superdirective beam forming algorithm by using the voice signals subjected to end point detection and sound source positioning processing, so as to realize synthesis, noise reduction and enhancement of microphone array signals.
The step 2 specifically comprises the following steps:
step 2.1, first dividing the voice signal into frames with a Hamming window function; then generating a cepstral feature vector by computing the discrete Fourier transform of each frame, keeping only the logarithm of the magnitude spectrum, collecting 24 spectral components of the 44100 Hz band over the Mel frequency range after smoothing the spectrum, and applying a Karhunen-Loeve transform approximated as a discrete cosine transform; finally obtaining the cepstral features [f_1, f_2, ..., f_N] for each frame;
step 2.2, according to the set frame number, framing and windowing the voice signal of the target patient, performing short-time Fourier transform on the discrete voice signal x (m), and calculating the power spectrum of the discrete voice signal x (m) in the m-th frame to obtain a spectrogram; l filters are selected, and L frames having the same size as the filters are selected in the time direction, a spectrogram of L×L×3 is generated, and the size of the generated color image is adjusted to M×M×3.
The 1D convolutional neural network of step 3 is: using the open-source TensorFlow-based Keras framework, only two 1D convolution layers are built, each using rectified linear units as the activation function; the input dimension is M×1, and after w_1 convolution filters of size m×1, dropout of 0.1 and max pooling with step q_1, a feature vector of dimension S is output; in the training stage of the 1D convolutional neural network, the MFCC features containing the time-frequency information of each frame of the voice signal are read into memory in turn by traversal, the training set and test set are divided and labelled, and the processed data are fed into the convolutional neural network according to the set labels for iterative training, for a total of B iterations.
The 2D convolutional neural network of step 4 is: using the open-source TensorFlow-based Keras framework, a convolutional neural network is built containing w_2 two-dimensional convolution layers of size n×n, w_1 max pooling layers and 1 fully connected layer with output dimension L, where rectified linear units are used as the activation function in the convolution and fully connected layers; in the training stage, the spectrogram features containing the texture-like information of each frame of the voice signal are read into memory in turn by traversal, the training set and test set are divided and labelled, and the processed data are fed into the convolutional neural network according to the set labels for iterative training, for a total of B iterations; the convolutional neural network is trained with stochastic gradient descent as the optimizer, with learning rate ε, decay value μ after each update, and power β.
The generative adversarial network of step 5 is: the network model comprises a generator and a discriminator; the generator network model consists of 1 fully connected layer, 3 transposed convolution layers and 2 batch normalization layers and outputs a colour picture of size M×M×3, while the discriminator comprises 3 convolution layers and a fully connected layer with a softmax function; the discriminator network model is a 7-layer convolutional neural network composed of 3 convolution layers, 2 batch normalization layers and 2 fully connected layers, and finally outputs a probability value; a probability threshold λ is set, and when the probability value produced by the discriminator after repeated training is greater than λ, the spectrogram generated by the generator is kept.
The step 6 is specifically: the P-dimensional MFCC features extracted by the 1D convolutional neural network are fused with the O-dimensional spectrogram features to obtain (P+O)-dimensional features, whose dimension is reduced to 256 through a fully connected layer.
The step 7 is specifically as follows:
step 7.1, taking the voice of the target patient as test speech and the voice data of existing depression patients as training data; the training data comprise the voice information of X persons, the labels indicating whether each of the X persons suffers from depression form a label dictionary, each label has a corresponding index number, and the label's index number is used as the class index; after one test, the spectrogram generated for the target patient is added to the training data set;
and 7.2, for each label, the voice with depression is taken as a positive sample set, the voice without depression is taken as a negative sample set, and the positive sample set and the negative sample set are used for training the two-class SVM, so that the trained two-class SVM is obtained.
The beneficial effects of the invention are as follows: the microphone used for voice acquisition is easy to carry and can collect the patient's voice signal in a natural state; building on depression-recognition research that combines CNN features, MFCC features and GAN-augmented data sets, the advantages of the MFCC and the CNN are combined, and the accuracy of depression recognition in a non-experimental environment is improved.
Drawings
FIG. 1 is a schematic diagram of a method for detecting depression based on a microphone array of the present invention;
FIG. 2 is a schematic diagram of a microphone array used in a method for detecting depression based on a microphone array according to the present invention;
FIG. 3 is a schematic diagram of a CNN model in a microphone array-based depression detection method of the present invention;
fig. 4 is a schematic diagram of GAN model in a method for detecting depression based on a microphone array according to the present invention.
Detailed Description
The invention will be described in detail with reference to the accompanying drawings and detailed description.
The invention provides a depression detection method based on a microphone array, which is shown in fig. 1 to 4 and comprises the following steps:
Step 1: by using an annular (circular) microphone array, accurate sound source localization can be carried out and a pickup beam formed in the direction of the target speaker, suppressing noise and reflected sound and enhancing the speech signal; speech at a distance of 3-5 m can be recognized accurately in a noisy environment, meeting the need to collect the patient's speech at any time in daily life. Specifically:
step 1.1, collecting a patient voice signal through a quaternary cross microphone array;
step 1.2, framing and windowing the collected voice signal of the target patient, transforming the signal from the time domain to the frequency domain with the fast Fourier transform, and estimating the spectral factor by computing the smoothed power spectrum and the noise power spectrum. The spectrally subtracted signal is output. Finally, the energy-entropy ratio is computed to detect whether the patient's speech is present, giving the endpoint values of the voice. The energy-entropy ratio is computed as follows:
The energy of the i-th frame is

e_i = \sum_{m=0}^{N-1} x_i^2(m)

where x_i(m) is the signal of the i-th frame and N is the frame length. The log-energy is

E_i = \log_{10}(1 + e_i / a)

where a is a constant whose proper adjustment helps distinguish unvoiced speech from noise. The fast Fourier transform of the i-th frame is

X_i(k) = \sum_{m=0}^{N-1} x_i(m) e^{-j 2\pi k m / N}

The energy spectrum of the frequency component corresponding to the k-th spectral line is

Y_i(k) = |X_i(k)|^2

The normalized spectral probability density is

p_i(k) = \frac{Y_i(k)}{\sum_{l=0}^{N/2} Y_i(l)}

The short-term spectral entropy of a speech frame is defined as

H_i = -\sum_{k=0}^{N/2} p_i(k) \log p_i(k)

The energy-entropy ratio EH_i is the ratio of the energy to the spectral entropy:

EH_i = \frac{E_i}{H_i}
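As a concrete illustration of this endpoint-detection step, the following Python sketch computes the per-frame energy-entropy ratio; the constant a and the thresholding strategy are assumptions, since the text only states that a is an adjustable constant.

```python
import numpy as np

def energy_entropy_ratio(frames, a=2.0, eps=1e-10):
    """Per-frame energy-to-entropy ratio EH_i for voice endpoint detection.

    `frames` is a (num_frames, frame_len) array of windowed speech samples;
    `a` is the constant from E_i = log10(1 + e_i / a) (value assumed here).
    """
    e = np.sum(frames ** 2, axis=1)                      # frame energy e_i
    E = np.log10(1.0 + e / a)                            # log-compressed energy E_i
    Y = np.abs(np.fft.rfft(frames, axis=1)) ** 2         # energy spectrum Y_i(k)
    p = Y / (np.sum(Y, axis=1, keepdims=True) + eps)     # spectral probability p_i(k)
    H = -np.sum(p * np.log(p + eps), axis=1)             # spectral entropy H_i
    return E / (H + eps)                                 # energy-entropy ratio EH_i

# Frames whose EH value exceeds a chosen threshold are treated as speech,
# which yields the endpoint values used in step 1.2.
```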
Step 1.3: combining the endpoint detection result, the DOA localization method is used to determine the position of the sound source signal. Taking the processing of one frame of signal data as an example: the voice data are read in, the m-th frame is taken as the processing object, the data of the m-th frame are taken from the 4 microphone signals, the 4 signals are combined into 1 signal, and the signal is weighted by W_c(k); then the energy sum E_s corresponding to a given angle over the chosen frequency band is computed, giving the energy values E_s(i) for the 360 angles of the current frame, with i ranging from 0 to 360 degrees. The maximum E_smax(i) of these 360 energies and its corresponding angle i give the sound source angle determined by the current frame. The band energy of each frame signal corresponding to a given angle is:
E_s = \sum_{k=f_1}^{f_2} |X_{sw}(k)|^2

where f_1 and f_2 define the frequency-band range 1 to N/2+1, and X_{sw}(k) is the band-weighted version of the combined single-channel signal:
X_{sw}(k) = W_e(k) X_s(k)
where W_e(k) is the band weighting factor,

W_e(k) = [W(k)]^{\lambda}

in which the exponent satisfies 0 < λ < 1, and W(k) is a masking weight factor that selects, for the current data, the frequency band with the maximum signal-to-noise ratio (SNR) within each band.
X_s(k) is the combination of the 4 signals into one:

X_s(k) = \sum_{i=1}^{4} X_i(k)

where X_i(k) is one of the 4 microphone signals.
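A minimal sketch of this per-frame angle scan in Python is given below; the array geometry (radius, microphone angles), the scanned frequency band and the plain steered-sum combination are assumptions standing in for the W_c(k) and W_e(k) weighting described above.

```python
import numpy as np

def doa_band_energy_scan(frame, mic_angles, radius, fs, c=343.0,
                         f_lo=300.0, f_hi=3400.0):
    """Estimate the source angle of one frame from a 4-channel cross array.

    `frame` is a (4, N) array of time-domain samples; `mic_angles` are the
    microphone azimuths in radians and `radius` the array radius in metres
    (both assumed values for illustration).
    """
    n = frame.shape[1]
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    X = np.fft.rfft(frame, axis=1)                     # per-channel spectra X_i(k)
    band = (freqs >= f_lo) & (freqs <= f_hi)           # frequency band f1..f2
    energies = np.zeros(360)
    for deg in range(360):
        theta = np.deg2rad(deg)
        tau = radius * np.cos(theta - mic_angles) / c   # circular-array delays
        steer = np.exp(2j * np.pi * freqs[None, :] * tau[:, None])
        Xs = np.sum(steer * X, axis=0)                 # combined one-channel spectrum
        energies[deg] = np.sum(np.abs(Xs[band]) ** 2)   # band energy E_s(i)
    return int(np.argmax(energies)), energies

# Example: mic_angles = np.deg2rad([0, 90, 180, 270]) for a quaternary cross array.
```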
Step 1.4, synthesizing 4 paths of signals into 1 path of signals through a superdirective beam forming algorithm by using the voice signals subjected to end point detection and sound source positioning processing, thereby realizing synthesis, noise reduction and enhancement of microphone array signals. The super-directivity beamforming algorithm is detailed as follows:
the microphone array of the invention selects a quaternary cross array, which can be regarded as one of uniform circular arrays, and the geometric relation of the array can know that the direction vector of arrival of a receiving signal with an angle theta is as follows:
Figure BDA0002770860390000085
wherein,,
Figure BDA0002770860390000091
the voice environment used by the method is mainly indoor and daily life, so that the noise matrix calculated based on the scattered noise field has certain applicability to the current microphone voice environment; the scattered noise field only describes the three-dimensional spherical surface homodromous noise field, and the related function expression is as follows:
\Gamma_{ij}(f) = \mathrm{sinc}\left(\frac{2 f d_{ij}}{c}\right)

where sinc(x) denotes the sampling function sin(πx)/(πx) and d_{ij} is the distance between array elements i and j. The microphone array consists of M array elements, and the signal received by the i-th microphone is:
x_i(t) = A_i e^{j(2\pi f t + \varphi_i)}

where f is the frequency, A_i the amplitude and φ_i the phase. According to the mathematical model of the optimal "superdirectivity" solution, the correlation coefficient of the noise signals between any two points in space is:
\rho_{ij} = \frac{\sin(2\pi f d_{ij} / c)}{2\pi f d_{ij} / c}

The normalized noise covariance matrix is then

R_{nn} = [\rho_{ij}], \quad i, j = 1, 2, \dots, M

where d_{ij} is the distance between any two array elements of the microphone array.
The invention adopts the minimum variance distortionless response (MVDR) beamforming principle, i.e. the LCMV method under the constraint w^H a(θ) = 1. This approach preserves the signal while minimizing the noise variance, so MVDR maximizes the signal-to-noise ratio (SNR) of the array output. The goal is to choose the filter coefficients w that minimize the total output power under the constraint that the speech signal is not distorted; the key is therefore to solve for the optimal weight vector w, with the constrained problem:

\min_{w} \; w^H R_x w \quad \text{s.t.} \quad w^H a(\theta_s) = 1
where a(θ_s) = [a_1(θ), ..., a_M(θ)]^T is the steering vector of the target signal, representing the transfer function between the sound source direction and the microphones and computable from the delays τ, and R_x is the spatial correlation covariance matrix of the signal. When k mutually uncorrelated noise signals arrive at the microphone array elements from different directions, the spatial correlation covariance matrix is defined as:

R_x = \sum_{i=1}^{k} \sigma_i^2 \, a(\theta_i) a^H(\theta_i) + \sigma_n^2 I
the method is calculated by lagrange Multiplier:
Figure BDA0002770860390000102
Replacing the noise covariance matrix R_x in the MVDR solution above with the normalized noise covariance matrix R_nn obtained earlier gives the superdirective weighting factor:

w_{sd} = \frac{R_{nn}^{-1} a(\theta_s)}{a^H(\theta_s) R_{nn}^{-1} a(\theta_s)}
The weighted beamforming of the multichannel microphone signals is completed with the optimized superdirective weighting coefficients.
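The following Python sketch computes such a superdirective weight vector at a single frequency from the diffuse-noise coherence matrix; the planar far-field steering model and the diagonal loading term are assumptions added for illustration and numerical stability.

```python
import numpy as np

def superdirective_weights(freq, theta_s, mic_pos, c=343.0, diag_load=1e-2):
    """Superdirective (MVDR with diffuse-noise covariance) weights at one frequency.

    `mic_pos` is an (M, 2) array of microphone coordinates in metres and
    `theta_s` the look direction in radians; `diag_load` regularizes R_nn.
    """
    M = mic_pos.shape[0]
    # steering vector a(theta_s) from far-field plane-wave delays
    direction = np.array([np.cos(theta_s), np.sin(theta_s)])
    tau = mic_pos @ direction / c
    a = np.exp(-2j * np.pi * freq * tau)
    # diffuse-field coherence: entries sinc(2 f d_ij / c), numpy's sinc = sin(pi x)/(pi x)
    d = np.linalg.norm(mic_pos[:, None, :] - mic_pos[None, :, :], axis=-1)
    R_nn = np.sinc(2.0 * freq * d / c) + diag_load * np.eye(M)
    # w = R_nn^{-1} a / (a^H R_nn^{-1} a)
    Rinv_a = np.linalg.solve(R_nn, a)
    return Rinv_a / (a.conj() @ Rinv_a)
```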
Step 2, extracting MFCC features and generating an audio spectrogram, specifically extracting a time-frequency representation and a texture-like representation of an audio signal simultaneously:
Step 2.1: first divide the voice signal into frames with a Hamming window function. Then generate a cepstral feature vector by computing the discrete Fourier transform of each frame. Only the logarithm of the magnitude spectrum is kept, and after smoothing the spectrum, 24 spectral components of the 44100 Hz band are collected over the Mel frequency range. The components of the Mel spectral vector computed for each frame are highly correlated, so a Karhunen-Loeve (KL) transform is applied, approximated as a discrete cosine transform (DCT). Finally the cepstral features [f_1, f_2, ..., f_N] are obtained for each frame;
and 2.2, according to the set frame number, framing and windowing the patient voice signal, performing short-time Fourier transform on the discrete voice signal x (m), and calculating the power spectrum of the discrete voice signal x (m) in the m-th frame to obtain a spectrogram. In order to accommodate the input of the convolutional neural network, L filters are selected, and L frames having the same size as the filters are selected in the time direction, a spectrogram of l×l×3 is generated, and the size of the generated color image is adjusted to mxm×3.
Step 3: the MFCC features from step 2 are fed into a 1D convolutional neural network to obtain P-dimensional MFCC features. The 1D convolutional neural network is: using the open-source TensorFlow-based Keras framework, and to prevent overfitting, only two one-dimensional (1D) convolution layers are built, each using a rectified linear unit (ReLU) as the activation function; the input dimension is M×1, and after w_1 convolution filters of size m×1, dropout of 0.1 and max pooling with step q_1, a feature vector of dimension S is output. In the training stage of the 1D convolutional neural network, the MFCC features containing the time-frequency information of each frame of the voice signal are read into memory in turn by traversal, the training set and test set are divided and labelled, and the processed data are fed into the convolutional neural network according to the set labels for iterative training, for a total of B iterations.
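A minimal Keras sketch of such a two-layer 1D CNN follows; the filter counts, kernel size, pooling step and output dimension are placeholder values standing in for w_1, m, q_1 and S.

```python
from tensorflow.keras import layers, models

def build_mfcc_cnn(input_len, n_filters=100, kernel=5, pool=4, out_dim=128):
    """Two-layer 1D CNN mapping an (input_len, 1) MFCC vector to a feature vector."""
    return models.Sequential([
        layers.Input(shape=(input_len, 1)),
        layers.Conv1D(n_filters, kernel, activation='relu'),
        layers.Dropout(0.1),
        layers.MaxPooling1D(pool),
        layers.Conv1D(2 * n_filters, kernel, activation='relu'),
        layers.Dropout(0.1),
        layers.MaxPooling1D(pool),
        layers.Flatten(),
        layers.Dense(out_dim, activation='relu'),  # P-dimensional MFCC feature
    ])
```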
Step 4: the spectrogram from step 2 is fed into a 2D convolutional neural network to obtain the O-dimensional spectrogram feature. The 2D convolutional neural network is: using the open-source TensorFlow-based Keras framework and referring to the AlexNet network structure, a simplified convolutional neural network is built containing w_2 two-dimensional convolution layers of size n×n, w_1 max pooling layers and 1 fully connected layer with output dimension L, where rectified linear units (ReLUs) are used as the activation function in the convolution and fully connected layers; in the training stage, the spectrogram features containing the texture-like information of each frame of the voice signal are read into memory in turn by traversal, the training set and test set are divided and labelled, and the processed data are fed into the convolutional neural network according to the set labels for iterative training, for a total of B iterations. The convolutional neural network is trained with stochastic gradient descent as the optimizer, with learning rate ε, decay value μ after each update, and power β.
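A sketch of this spectrogram CNN in Keras is shown below; the three convolution blocks, the 128-dimensional output and the inverse-time learning-rate schedule are assumed stand-ins for w_2, w_1, L and the (ε, μ, β) decay parameters.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_spectrogram_cnn(img_size=200, out_dim=128):
    """Simplified AlexNet-style 2D CNN over M x M x 3 spectrogram images."""
    model = models.Sequential([
        layers.Input(shape=(img_size, img_size, 3)),
        layers.Conv2D(64, 3, activation='relu'),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, activation='relu'),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, activation='relu'),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(out_dim, activation='relu'),  # O-dimensional spectrogram feature
    ])
    # SGD with a decaying learning rate; the schedule stands in for (epsilon, mu, beta)
    schedule = tf.keras.optimizers.schedules.InverseTimeDecay(1e-3, 1000, 0.5)
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=schedule),
                  loss='binary_crossentropy', metrics=['accuracy'])
    return model
```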
Step 5: the features obtained in step 4 are input into a generative adversarial network to generate new spectrogram images, the generated spectrograms are added to the original spectrogram data, and the training of step 4 is then performed again. The generative adversarial network is a simplified, re-tuned version of the DCGAN network structure. The network model comprises a generator and a discriminator; the generator network model consists of 1 fully connected layer, 3 transposed convolution layers and 2 batch normalization layers and outputs a colour picture of size M×M×3, while the discriminator comprises 3 convolution layers and a fully connected layer with a softmax function; the discriminator network model is a 7-layer convolutional neural network composed of 3 convolution layers, 2 batch normalization layers and 2 fully connected layers, and finally outputs a probability value. A probability threshold λ is set, and when the probability value produced by the discriminator after repeated training is greater than λ, the spectrogram generated by the generator is kept. The generated spectrograms that meet this criterion are fed into the convolutional network of step 4 for retraining.
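A rough DCGAN-style generator/discriminator pair in Keras is sketched below; the layer widths, the 64×64 output size and the sigmoid output are assumptions that mirror the description rather than the exact patented architecture.

```python
from tensorflow.keras import layers, models

def build_generator(latent_dim=128):
    """1 dense layer + 3 transposed convolutions, producing a 64 x 64 x 3 image."""
    return models.Sequential([
        layers.Input(shape=(latent_dim,)),
        layers.Dense(8 * 8 * 256, activation='relu'),
        layers.Reshape((8, 8, 256)),
        layers.Conv2DTranspose(128, 4, strides=2, padding='same', activation='relu'),
        layers.BatchNormalization(),
        layers.Conv2DTranspose(64, 4, strides=2, padding='same', activation='relu'),
        layers.BatchNormalization(),
        layers.Conv2DTranspose(3, 4, strides=2, padding='same', activation='tanh'),
    ])

def build_discriminator(img_size=64):
    """3 convolutions + dense output giving the probability that an image is real."""
    return models.Sequential([
        layers.Input(shape=(img_size, img_size, 3)),
        layers.Conv2D(64, 5, strides=2, padding='same'),
        layers.LeakyReLU(0.2),
        layers.Conv2D(128, 5, strides=2, padding='same'),
        layers.BatchNormalization(),
        layers.LeakyReLU(0.2),
        layers.Conv2D(256, 5, strides=2, padding='same'),
        layers.BatchNormalization(),
        layers.LeakyReLU(0.2),
        layers.Flatten(),
        layers.Dense(1, activation='sigmoid'),
    ])

# Generated spectrograms whose discriminator score exceeds the threshold lambda
# can be kept and added back to the 2D CNN training data.
```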
Step 6: the MFCC features extracted in step 3 and the features obtained from the spectrogram data expanded in step 4 are fused, and the dimension is reduced through a fully connected layer. Specifically: the P-dimensional MFCC features extracted by the CNN are fused with the O-dimensional spectrogram features to obtain (P+O)-dimensional features, whose dimension is reduced to 256 through a fully connected layer.
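The fusion-and-reduction step can be expressed with a small Keras model such as the following sketch, where the dropout rate is an added assumption:

```python
from tensorflow.keras import layers, models

def build_fusion_head(p_dim, o_dim):
    """Concatenate the P-dim MFCC and O-dim spectrogram features, reduce to 256."""
    mfcc_in = layers.Input(shape=(p_dim,))
    spec_in = layers.Input(shape=(o_dim,))
    fused = layers.Concatenate()([mfcc_in, spec_in])       # (P + O)-dimensional feature
    reduced = layers.Dense(256, activation='relu')(fused)   # 256-dim feature for the SVM
    reduced = layers.Dropout(0.1)(reduced)
    return models.Model([mfcc_in, spec_in], reduced)
```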
Step 7: a classifier is trained with the dimension-reduced features from step 6, specifically as follows:
and 7.1, taking the voice of the target patient as test voice, and taking the voice data of the existing depression patient as training data. The training data comprises voice information of X persons, labels of whether the X persons suffer from depression are used as a label dictionary, each label is provided with a corresponding index number, and the index numbers of the label index numbers are set as index numbers of the class. After one test, adding the spectrogram generated by the target patient into the training data set.
Step 7.2, for each label, the voice with depression is taken as a positive sample set, and the voice without depression is taken as a negative sample set. Training the two-classification SVM by using a positive example sample set and a negative example sample set to obtain a trained two-classification SVM; the classifier training process is specifically as follows:
and determining two parameters, namely a kernel function and a penalty factor of the SVM by circularly checking the accuracy of the SVM training set, and performing model training by using the parameters after selecting the optimal parameters. Let training sample speech data be: { x i ,y i },x i ∈R n ,i=1,2,..,n,x i Is the O+P dimension feature vector, y i To determine whether a depression label is present, the SVM maps the training set to a high-dimensional space using a nonlinear mapping Φ (x), the most classified face that makes the nonlinear problem linear is described as: y=ω T Phi (x) +b, omega and b represent the weight and bias vector of the SVM.
To find the optimal ω and b, slack variables ξ_i are introduced and the classification plane is obtained from the quadratic optimization problem:

\min_{\omega, b, \xi} \; \frac{1}{2}\|\omega\|^2 + C \sum_{i=1}^{n} \xi_i

\text{s.t.} \quad y_i(\omega \cdot \Phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, \dots, n
wherein: c represents a penalty parameter. The secondary optimization problem is transformed by introducing Lagrangian multipliers to obtain the method:
Figure BDA0002770860390000132
the weight vector ω is calculated as: omega = Σα i y i Φ(x i ) Φ (x), the decision function of the support vector machine can be described as: f (x) =sgn (α) i y i Φ(x i )·Φ(x j ) +b), simplifying calculation, and introducing a Gaussian direct basis (RBF) kernel function to make a decision function as follows:
Figure BDA0002770860390000133
where σ represents the width parameter of the RBF.
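In practice this classifier can be trained with scikit-learn; the sketch below grid-searches the penalty factor C and the kernel width, where the candidate grid and 5-fold cross-validation are assumed choices.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def train_depression_svm(features, labels):
    """Fit a binary RBF-kernel SVM on the fused 256-dimensional features."""
    grid = GridSearchCV(
        SVC(kernel='rbf'),
        param_grid={'C': [0.1, 1, 10, 100], 'gamma': ['scale', 0.01, 0.001]},
        cv=5, scoring='accuracy')
    grid.fit(features, labels)   # labels: 1 = depressed, 0 = not depressed
    return grid.best_estimator_
```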
Step 8: the test speech is recognized with the classifier trained in step 7. The recognition result can be sent to the patient's guardian over Wi-Fi so that the patient's condition can be monitored at any time.
In this way, the microphone used for voice acquisition is easy to carry and can collect the patient's voice signal in a natural state; building on depression-recognition research that combines CNN features, MFCC features and GAN-augmented data sets, the advantages of the MFCC and the CNN are combined, and the accuracy of depression recognition in a non-experimental environment is improved.
A depression recognition test was carried out with the microphone-array-based depression detection method on the AVEC2013 audio-visual depression recognition challenge database, which contains the voice information of 340 individuals. The specific procedure is as follows:
step 1, preprocessing the voice signals under each sub-directory sequentially by using a traversing method, and dividing the voice signals into frames by using a Hamming window function. A cepstral feature vector is then generated and a discrete fourier transform is computed for each frame. Only the logarithm of the amplitude is retained. After the spectrum is smoothed, 24 spectrum components of 44100 frequency bands are collected in the Mel frequency range. The components of the mel-spectrum vector calculated for each frame are highly correlated. Therefore, KL (Karhunen-Loeve) transform is applied, and then approximated as Discrete Cosine Transform (DCT).
Step 2, extracting MFCC characteristics after preprocessing signals, normalizing the MFCC characteristics, limiting the length of each section of voice to 10 seconds by dividing voice fragments, obtaining a 177-dimensional characteristic vector of each frame by 50 frames per second, and enabling the number of channels of each second of voice to be 50; converting the voice signal into a spectrogram, wherein the spectrogram limits the sampling frame number to 64 frames per second; a color picture having a spectrum of 64×64×3 pixels is obtained, and the picture size is adjusted to 200×200×3 pixel size.
Step 3, constructing a convolution pooling layer, wherein a model of a 5-layer convolution neural network is composed of 2 convolution layers, 2 maximum pooling layers and 1 full connection layer, input data of a first layer is 177 multiplied by 1 multiplied by 50 MFCC characteristics, convolution operation is carried out on the MFCC characteristics and the MFCC characteristics by adopting a convolution kernel of 5 multiplied by 1, the convolution kernel moves along the X axis and the y axis of the MFCC characteristics, the step length is 1 pixel, 100 convolution kernels are used in total to generate 173 multiplied by 1 multiplied by 100 pixel layers, a ReLU function is used as an activation function, the pixel layers are subjected to treatment of a ReLU unit to generate activated pixel layers, the activated pixel layers are subjected to treatment of maximum pooling operation, the scale of pooling operation is 4 multiplied by 1, the step length is default to be 1, and the pixel size after pooling is 43 multiplied by 1 multiplied by 100; the second layer uses a convolution kernel of 5×1×200, and a convolution operation is performed to generate 39×1×200 pixel layers. The pixel layers are processed by a ReLU unit to generate activated pixel layers, the activated pixel layers are processed by the maximum pool operation, the pool operation scale is 4 multiplied by 1, the image size after the pool is 9 multiplied by 1 multiplied by 200, and then input neurons are disconnected immediately with 10% probability to update parameters when the parameters are updated by the Dropout layer; the multi-dimensional input is unidimensionally input by using a flattening layer, after flattening treatment, a group of unidimensionally pixel arrays are output, the sum contains 1800 data, and then the pixels are used as input to be transmitted into a full-connection layer for further operation.
And 4, building convolution pooling layers, wherein the method comprises 3 convolution layers, 3 maximum pooling layers and 1 full connection layer by using a 7-layer convolution neural network model. The input data of the first layer is a spectrogram of 200 multiplied by 3, convolution operation is carried out on the input data and the spectrogram by adopting convolution kernels of 3 multiplied by 3, the convolution kernels move along the X axis and the Y axis of the image, the step length is 1 pixel, 64 convolution kernels are used for generating 198 multiplied by 64 pixel layer data, a ReLU function is used as an activation function, the pixel layers are processed by the ReLU unit to generate activated pixel layers, the activated pixel layers are processed by the maximum pooling operation, the scale of the pooling operation is 2 multiplied by 2, the step length is 2 by default, and the pixel size after pooling is 99 multiplied by 64; during back propagation, each convolution kernel corresponds to one deviation value, namely 64 convolution kernels of the first layer correspond to 64 deviation values input by the upper layer; the second layer uses 32 3×3×64 convolution kernels, and 97×97×32 pixel layers are generated after convolution operation. The pixel layers are processed by a ReLU unit to generate activated pixel layers, the activated pixel layers are processed by the maximum pool operation, the pooled image size is 48 multiplied by 32 by using a pooled operation scale of 2, and then input neurons are immediately disconnected with 10% probability to update parameters when the parameters are updated by a Dropout layer, so that the parameters are prevented from being overfitted; in the back propagation in this layer, each convolution kernel corresponds to one offset value, i.e. 64 convolution kernels of the first layer correspond to 32 offset values of the upper layer input; similarly, the third layer uses 32 3×3×32 convolution kernels, and a convolution operation is performed to generate 46×46×32 pixel layers. The pixel layers are processed by a ReLU unit to generate activated pixel layers, the activated pixel layers are processed by the maximum pool operation, the pool operation scale is 2 multiplied by 2, the image size after the pool is 23 multiplied by 32, and then input neurons are disconnected immediately with 10% probability to update parameters when the parameters are updated by the Dropout layer; the multi-dimensional input is unidimensionally processed by using a flattening layer, a group of unidimensionally pixel arrays are output after flattening treatment, the sum of the unidimensionally pixel arrays contains 16928 data, and then the pixels are taken as input to be transmitted into a full-connection layer for further operation.
In order to extract the characteristics of the spectrogram itself to send into a GAN network to generate a new spectrogram, the obtained multidimensional characteristics are required to be subjected to dimension reduction, a full-connection layer is built, the full-connection (Dense) is used for fully connecting the input 16928 data to 128 nerve units, 128 data are generated after the ReLU activation function processing, and 128 data are output after the Dropout processing to serve as voice emotion characteristics.
Step 5, the GAN generator network model of the invention consists of 1 full connection layer, 3 transposed convolution layers and 2 batch normalization layers. The first layer input data is 128 data extracted in the step 4, is connected with 4608 neurons through a full connection layer, and is converted into a shape of 3 multiplied by 512; the second layer uses transpose convolution to reduce 512 channels to 256 channels, kernel_size 3, step size 3, and pass through the batch normalization layer; the third layer uses transpose convolution to reduce 256 channels to 128 channels, kernel_size 5, step size 2, and pass through the batch normalization layer; the fourth layer uses transpose convolution to reduce 128 channels to 3 channels, kernel_size 4, step size 3;
the GAN discriminator network model of the invention consists of 3 convolutional layers, 2 batch normalization layers, and 1 fully connected layer using a 7-layer convolutional neural network model. The input data of the first layer is a spectrogram of 64 multiplied by 3, convolution operation is carried out on the spectrogram and a convolution kernel of 5 multiplied by 3, the convolution kernel moves along the X axis and the Y axis of the image, the step length is 1 pixel, 64 convolution kernels are used for generating 60 multiplied by 24 pixel layer data, a leakage-ReLU function is used as an activation function, and the pixel layers are subjected to the processing of the leakage-ReLU unit to generate an activated pixel layer; the second layer uses 128 5×5×128 convolution kernels, and 57×57×128 pixel layers are generated after convolution operation. The pixel layers are processed by a Leakly-ReLU unit to generate activated pixels, and the activated pixel layers are processed by a batch normalization layer for preventing overfitting; the third layer uses 256 5×5×256 convolution kernels, and generates 53×53×256 pixel layers after convolution operation. The pixel layers are processed by a Leakly-ReLU unit to generate activated pixels, and the activated pixel layers are processed by a batch normalization layer for preventing overfitting; using a flattening layer to unify multi-dimensional input, after flattening treatment, using the pixels as input to enter a full-connection layer, wherein the final output layer is 1 node, and outputting a probability value; the size of the generated spectrogram which meets the standard and is 64 multiplied by 3 is modified to be 200 multiplied by 3 pixel size, and then the spectrogram is transmitted into the convolution network in the step 4 for retraining.
And 6, constructing a full-connection layer, combining 1800-dimensional data extracted in the step 3 and 16928-dimensional data extracted in the step 4 into 18728-dimensional data, performing full connection on the 18728-dimensional data and 256 nerve units, generating 256 data after being processed by a ReLU activation function, and outputting 256 data after being processed by a Dropout as voice emotion characteristics.
Step 7: since the data set contains 292 persons, 43,800 voice recordings in total are used after clipping and screening; the labels indicating whether each of the 292 persons suffers from depression are used as a label dictionary, each label has a corresponding index number, and the label's index number serves as the class index; 90% of the labelled voice signals are used as the training set and the remaining 10% as the test set;
for each label, the speech that always had depression was taken as a positive sample set and the speech that did not have depression was taken as a negative sample set. Training the two-classification SVM by using a positive example sample set and a negative example sample set to obtain a trained two-classification SVM;
and 8, recognizing test voice through the trained two-class SVM in the step 7.

Claims (8)

1. A depression detection method based on a microphone array, comprising the following steps:
step 1, collecting the voice signal of a target patient with a microphone array and preprocessing it;
step 2, extracting MFCC features from the audio of the target patient preprocessed in step 1 and from the voice data of existing depression patients, and generating audio spectrograms;
step 3, feeding the MFCC features extracted in step 2 into a 1D convolutional neural network to obtain P-dimensional MFCC features;
step 4, feeding the audio spectrograms generated in step 2 into a 2D convolutional neural network to obtain O-dimensional spectrogram features;
step 5, inputting the O-dimensional features obtained in step 4 into a generative adversarial network to generate new spectrogram images, and passing the generated images into the 2D convolutional neural network of step 4 for training;
step 6, fusing the P-dimensional MFCC features extracted in step 3 with the features obtained from the training in step 5, and reducing the dimension through a fully connected layer;
step 7, training a classifier with the dimension-reduced features obtained in step 6;
step 8, recognizing the test speech with the classifier trained in step 7 to obtain the recognition result.
2. The method for detecting depression based on microphone array as claimed in claim 1, wherein the step 1 specifically comprises the steps of:
step 1.1, collecting a voice signal of a target patient through a quaternary cross microphone array;
step 1.2, framing and windowing the collected voice signal of the target patient, transforming the signal from the time domain to the frequency domain with the fast Fourier transform, estimating the spectral factor by computing the smoothed power spectrum and the noise power spectrum, outputting the spectrally subtracted signal, and finally computing the energy-entropy ratio to detect the target patient's speech and obtain the endpoint values of the voice;
step 1.3, combining the end point detection result, and judging the position of the sound source signal by using a DOA positioning method;
and 1.4, synthesizing four paths of signals into one path of signal through a superdirective beam forming algorithm by using the voice signals subjected to end point detection and sound source positioning processing, so as to realize synthesis, noise reduction and enhancement of microphone array signals.
3. The method for detecting depression based on microphone array as claimed in claim 2, wherein the step 2 specifically comprises the steps of:
step 2.1, first dividing the voice signal into frames with a Hamming window function; then generating a cepstral feature vector by computing the discrete Fourier transform of each frame, keeping only the logarithm of the magnitude spectrum, collecting 24 spectral components of the 44100 Hz band over the Mel frequency range after smoothing the spectrum, and applying a Karhunen-Loeve transform approximated as a discrete cosine transform; finally obtaining the cepstral features [f_1, f_2, ..., f_N] for each frame;
step 2.2, according to the set frame number, framing and windowing the voice signal of the target patient, performing short-time Fourier transform on the discrete voice signal x (m), and calculating the power spectrum of the discrete voice signal x (m) in the m-th frame to obtain a spectrogram; l filters are selected, and L frames having the same size as the filters are selected in the time direction, a spectrogram of L×L×3 is generated, and the size of the generated color image is adjusted to M×M×3.
4. The microphone-array-based depression detection method according to claim 3, wherein the 1D convolutional neural network of step 3 is: using the open-source TensorFlow-based Keras framework, only two 1D convolution layers are built, each using rectified linear units as the activation function; the input dimension is M×1, and after w_1 convolution filters of size m×1, dropout of 0.1 and max pooling with step q_1, a feature vector of dimension S is output; in the training stage of the 1D convolutional neural network, the MFCC features containing the time-frequency information of each frame of the voice signal are read into memory in turn by traversal, the training set and test set are divided and labelled, and the processed data are fed into the convolutional neural network according to the set labels for iterative training, for a total of B iterations.
5. The microphone-array-based depression detection method according to claim 4, wherein the 2D convolutional neural network of step 4 is: using the open-source TensorFlow-based Keras framework, a convolutional neural network is built containing w_2 two-dimensional convolution layers of size n×n, w_1 max pooling layers and 1 fully connected layer with output dimension L, where rectified linear units are used as the activation function in the convolution and fully connected layers; in the training stage, the spectrogram features containing the texture-like information of each frame of the voice signal are read into memory in turn by traversal, the training set and test set are divided and labelled, and the processed data are fed into the convolutional neural network according to the set labels for iterative training, for a total of B iterations; the convolutional neural network is trained with stochastic gradient descent as the optimizer, with learning rate ε, decay value μ after each update, and power β.
6. The microphone-array-based depression detection method according to claim 5, wherein the generative adversarial network of step 5 is: the network model comprises a generator and a discriminator; the generator network model consists of 1 fully connected layer, 3 transposed convolution layers and 2 batch normalization layers and outputs a colour picture of size M×M×3, while the discriminator comprises 3 convolution layers and a fully connected layer with a softmax function; the discriminator network model is a 7-layer convolutional neural network composed of 3 convolution layers, 2 batch normalization layers and 2 fully connected layers, and finally outputs a probability value; a probability threshold λ is set, and when the probability value produced by the discriminator after repeated training is greater than λ, the spectrogram generated by the generator is kept.
7. The microphone-array-based depression detection method according to claim 6, wherein step 6 is specifically: the P-dimensional MFCC features extracted by the 1D convolutional neural network are fused with the O-dimensional spectrogram features to obtain (P+O)-dimensional features, whose dimension is reduced to 256 through a fully connected layer.
8. The method for detecting depression based on microphone array as claimed in claim 7, wherein said step 7 is specifically:
step 7.1, taking the voice of the target patient as test speech and the voice data of existing depression patients as training data; the training data comprise the voice information of X persons, the labels indicating whether each of the X persons suffers from depression form a label dictionary, each label has a corresponding index number, and the label's index number is used as the class index; after one test, the spectrogram generated for the target patient is added to the training data set;
and 7.2, for each label, the voice with depression is taken as a positive sample set, the voice without depression is taken as a negative sample set, and the positive sample set and the negative sample set are used for training the two-class SVM, so that the trained two-class SVM is obtained.
CN202011248610.5A 2020-11-10 2020-11-10 Depression detection method based on microphone array Active CN112349297B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011248610.5A CN112349297B (en) 2020-11-10 2020-11-10 Depression detection method based on microphone array

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011248610.5A CN112349297B (en) 2020-11-10 2020-11-10 Depression detection method based on microphone array

Publications (2)

Publication Number Publication Date
CN112349297A CN112349297A (en) 2021-02-09
CN112349297B true CN112349297B (en) 2023-07-04

Family

ID=74362344

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011248610.5A Active CN112349297B (en) 2020-11-10 2020-11-10 Depression detection method based on microphone array

Country Status (1)

Country Link
CN (1) CN112349297B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113012720B (en) * 2021-02-10 2023-06-16 杭州医典智能科技有限公司 Depression detection method by multi-voice feature fusion under spectral subtraction noise reduction
CN112818892B (en) * 2021-02-10 2023-04-07 杭州医典智能科技有限公司 Multi-modal depression detection method and system based on time convolution neural network
CN112687390B (en) * 2021-03-12 2021-06-18 中国科学院自动化研究所 Depression state detection method and device based on hybrid network and lp norm pooling
CN113223507B (en) * 2021-04-14 2022-06-24 重庆交通大学 Abnormal speech recognition method based on double-input mutual interference convolutional neural network
CN113205803B (en) * 2021-04-22 2024-05-03 上海顺久电子科技有限公司 Voice recognition method and device with self-adaptive noise reduction capability
CN113476058B (en) * 2021-07-22 2022-11-29 北京脑陆科技有限公司 Intervention treatment method, device, terminal and medium for depression patients
CN113679413B (en) * 2021-09-15 2023-11-10 北方民族大学 VMD-CNN-based lung sound feature recognition and classification method and system
CN113820693B (en) * 2021-09-20 2023-06-23 西北工业大学 Uniform linear array element failure calibration method based on generation of countermeasure network
CN114219005B (en) * 2021-11-17 2023-04-18 太原理工大学 Depression classification method based on high-order spectrum voice features
CN116978409A (en) * 2023-09-22 2023-10-31 苏州复变医疗科技有限公司 Depression state evaluation method, device, terminal and medium based on voice signal


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107705806A (en) * 2017-08-22 2018-02-16 北京联合大学 A kind of method for carrying out speech emotion recognition using spectrogram and deep convolutional neural networks
CN108831495A (en) * 2018-06-04 2018-11-16 桂林电子科技大学 A kind of sound enhancement method applied to speech recognition under noise circumstance
CN109599129A (en) * 2018-11-13 2019-04-09 杭州电子科技大学 Voice depression recognition methods based on attention mechanism and convolutional neural networks
CN110047506A (en) * 2019-04-19 2019-07-23 杭州电子科技大学 A kind of crucial audio-frequency detection based on convolutional neural networks and Multiple Kernel Learning SVM

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Feature Augmenting Networks for Improving Depression Severity Estimation From Speech Signals; LE YANG et al.; IEEE ACCESS; full text *
Recognition of Audio Depression Based on Convolutional Neural Network and Generative Antagonism Network Model; ZHIYONG WANG et al.; IEEE ACCESS; full text *
Audio depression recognition based on deep learning; Li Jinming et al.; Computer Applications and Software; full text *
Research on speech emotion recognition method based on autoencoder; Zhong Xinzi et al.; Electronic Design Engineering (Issue 06); full text *

Also Published As

Publication number Publication date
CN112349297A (en) 2021-02-09

Similar Documents

Publication Publication Date Title
CN112349297B (en) Depression detection method based on microphone array
US10901063B2 (en) Localization algorithm for sound sources with known statistics
CN109272989B (en) Voice wake-up method, apparatus and computer readable storage medium
US10127922B2 (en) Sound source identification apparatus and sound source identification method
Stöter et al. Countnet: Estimating the number of concurrent speakers using supervised learning
Glodek et al. Multiple classifier systems for the classification of audio-visual emotional states
US5621848A (en) Method of partitioning a sequence of data frames
JPS62201500A (en) Continuous speech recognition
Suvorov et al. Deep residual network for sound source localization in the time domain
CN113314127B (en) Bird song identification method, system, computer equipment and medium based on space orientation
CN112329819A (en) Underwater target identification method based on multi-network fusion
Salvati et al. A late fusion deep neural network for robust speaker identification using raw waveforms and gammatone cepstral coefficients
Venkatesan et al. Binaural classification-based speech segregation and robust speaker recognition system
US5832181A (en) Speech-recognition system utilizing neural networks and method of using same
Salvati et al. Time Delay Estimation for Speaker Localization Using CNN-Based Parametrized GCC-PHAT Features.
Lin et al. Domestic activities clustering from audio recordings using convolutional capsule autoencoder network
CN117762372A (en) Multi-mode man-machine interaction system
CN115952840A (en) Beam forming method, arrival direction identification method, device and chip thereof
Ganchev et al. Automatic height estimation from speech in real-world setup
Amami et al. A robust voice pathology detection system based on the combined bilstm–cnn architecture
Kanisha et al. Speech recognition with advanced feature extraction methods using adaptive particle swarm optimization
Raju et al. AUTOMATIC SPEECH RECOGNITION SYSTEM USING MFCC-BASED LPC APPROACH WITH BACK PROPAGATED ARTIFICIAL NEURAL NETWORKS.
Venkatesan et al. Deep recurrent neural networks based binaural speech segregation for the selection of closest target of interest
Sailor et al. Unsupervised Representation Learning Using Convolutional Restricted Boltzmann Machine for Spoof Speech Detection.
Kothapally et al. Speech Detection and Enhancement Using Single Microphone for Distant Speech Applications in Reverberant Environments.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant