CN112349297A - Depression detection method based on microphone array - Google Patents

Depression detection method based on microphone array

Info

Publication number
CN112349297A
Authority
CN
China
Prior art keywords
training
neural network
voice
convolutional neural
microphone array
Prior art date
Legal status
Granted
Application number
CN202011248610.5A
Other languages
Chinese (zh)
Other versions
CN112349297B (en)
Inventor
焦亚萌 (Jiao Yameng)
周成智 (Zhou Chengzhi)
Current Assignee
Xi'an Polytechnic University
Original Assignee
Xi'an Polytechnic University
Priority date
Filing date
Publication date
Application filed by Xi'an Polytechnic University filed Critical Xi'an Polytechnic University
Priority to CN202011248610.5A priority Critical patent/CN112349297B/en
Publication of CN112349297A publication Critical patent/CN112349297A/en
Application granted granted Critical
Publication of CN112349297B publication Critical patent/CN112349297B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/16 Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state
    • A61B5/165 Evaluating the state of mind, e.g. depression, anxiety
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/48 Other medical applications
    • A61B5/4803 Speech analysis specially adapted for diagnostic purposes
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/72 Signal processing specially adapted for physiological signals or for diagnostic purposes
    • A61B5/7235 Details of waveform analysis
    • A61B5/7264 Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems
    • A61B5/7267 Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems involving training the classification device
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/66 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Public Health (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Surgery (AREA)
  • Veterinary Medicine (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Psychiatry (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Biomedical Technology (AREA)
  • Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Educational Technology (AREA)
  • Developmental Disabilities (AREA)
  • Psychology (AREA)
  • Social Psychology (AREA)
  • Epidemiology (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physiology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a depression detection method based on a microphone array, which comprises the steps of: collecting a voice signal of a target patient with the microphone array and preprocessing it; extracting MFCC features from the preprocessed audio of the target patient and from the voice data of existing depression patients, and generating audio spectrograms; feeding the MFCC features into a 1D convolutional neural network to obtain P-dimensional MFCC features; feeding the audio spectrogram into a 2D convolutional neural network to obtain O-dimensional spectrogram features; inputting the O-dimensional features into a generative adversarial network to generate new spectrogram images, and feeding the generated images back into the 2D convolutional neural network for training; fusing the P-dimensional MFCC features with the features obtained from training and reducing the dimensionality through a fully connected layer; training a classifier with the dimension-reduced features; and recognizing the test voice with the trained classifier to obtain the recognition result. The method improves the accuracy of depression recognition in non-experimental environments.

Description

Depression detection method based on microphone array
Technical Field
The invention belongs to the technical field of voice recognition methods, and particularly relates to a depression detection method based on a microphone array.
Background
Currently, some progress has been made in the field of depression detection, but diagnosis of a patient's condition mainly requires the patient to record speech in front of a fixed voice acquisition device and relies chiefly on a clinician's judgment. Common diagnostic instruments include the Beck Depression Inventory (BDI) and the Hamilton Depression Rating Scale (HAMD), so the diagnostic result depends heavily on the experience and ability of the physician and, more importantly, requires the patient's cooperation. As a result, most of the speech collected during examination is scripted and mechanical, which can make the collected speech unrepresentative of the patient's true state. A detection device must therefore be able to collect the patient's speech under the natural conditions of daily life while removing background noise.
A microphone array is composed of a number of acoustic sensors and samples and processes the spatial characteristics of a sound field. In a complex acoustic environment, noise arrives from all directions and often overlaps with the speech signal in both time and frequency; together with echo and reverberation, this makes it very difficult to capture relatively clean speech with a single microphone. A microphone array, by fusing the spatio-temporal information of the speech signal, can extract the sound source and suppress noise at the same time.
Convolutional neural networks (CNN) are one of the deep learning algorithms established in recent years and perform well in classifying large-scale images. The greatest advantage of the generative adversarial network (GAN) is that it alleviates the problem of insufficient sample data: by constructing a suitable network model it can generate realistic synthetic samples, which facilitates the diagnosis and prediction of medical conditions and provides additional diagnostic evidence for medical research.
Combining the ability of the microphone array to acquire clean sound signals with the advantages of the two deep learning methods GAN and CNN therefore improves the accuracy of depression recognition.
Disclosure of Invention
The invention aims to provide a depression detection method based on a microphone array, which improves the accuracy of depression identification.
The technical scheme adopted by the invention is as follows: a microphone array based depression detection method comprising the steps of:
step 1, collecting a voice signal of a target patient by using a microphone array and preprocessing the voice signal;
step 2, extracting MFCC characteristics of the audio signal preprocessed by the target patient and the voice data of the existing depression patient in the step 1 to generate an audio frequency spectrogram;
step 3, sending the MFCC features extracted in the step 2 into a 1D convolutional neural network to obtain P-dimensional features of the MFCC;
step 4, sending the audio frequency spectrogram generated in the step 2 into a 2D convolutional neural network to obtain O-dimensional characteristics of the spectrogram;
step 5, inputting the O-dimensional features obtained in step 4 into a generative adversarial network to generate new spectrogram images, and feeding the generated spectrogram images into the 2D convolutional neural network of step 4 for training;
step 6, fusing the P-dimensional characteristics of the MFCC extracted in the step 3 and the characteristics obtained by training in the step 5, and reducing the dimensions through a full connection layer;
step 7, training a classifier with the dimension-reduced features obtained in step 6;
and step 8, recognizing the test voice with the classifier trained in step 7 to obtain the recognition result.
The present invention is also characterized in that,
the step 1 specifically comprises the following steps:
step 1.1, acquiring a target patient voice signal through a quaternary cross microphone array;
step 1.2, performing framing and windowing on the collected voice signal of the target patient, converting the signal from the time domain to the frequency domain using the fast Fourier transform, completing the estimation of the spectral factor by computing the smoothed power spectrum and the noise power spectrum, outputting the spectrally subtracted signal, and finally detecting the target patient's voice signal in combination with the energy-entropy ratio to obtain the voice endpoint values;
step 1.3, combining the end point detection result, and judging the position of the sound source signal by using a DOA (direction of arrival) positioning method;
and step 1.4, synthesizing four paths of signals into one path of signal through a super-directivity beam forming algorithm according to the voice signals subjected to endpoint detection and sound source positioning processing, and realizing synthesis, noise reduction and enhancement of the microphone array signals.
The step 2 specifically comprises the following steps:
step 2.1, first dividing the voice signal into frames with a Hamming window function; generating cepstral feature vectors by computing the discrete Fourier transform of each frame and keeping only the logarithm of the magnitude spectrum; after smoothing the spectrum, collecting 24 spectral components of 44100 bands in the Mel frequency range, and, because the Karhunen-Loeve transform of the highly correlated Mel spectral components can be approximated by the discrete cosine transform, applying the DCT; finally obtaining the cepstral features [f_1, f_2, ..., f_N] for each frame;
step 2.2, performing framing and windowing on the voice signal of the target patient according to the set number of frames, applying the short-time Fourier transform to the discrete voice signal x(m), and computing the power spectrum of the m-th frame to obtain the spectrogram; with L filters selected and L frames of the same size as the filters selected along the time direction, an L × L × 3 spectrogram is generated, and the generated color image is resized to M × M × 3.
The 1D convolutional neural network of step 3 is: only two 1D convolutional layers are built using the open-source Keras framework based on TensorFlow, each layer using the rectified linear unit (ReLU) as the activation function; the input of dimension M × 1 passes through w_1 convolutional filters of size m × 1, dropout of 0.1 and max pooling with step q_1, and a feature vector S is output; in the training stage of the 1D convolutional neural network, the MFCC features of each frame of the voice signal, containing time-frequency information, are read into memory in sequence by traversal, the training set and test set are divided and labeled respectively, the processed data are fed into the convolutional neural network according to the set labels, and iterative training is carried out for B iterations in total.
The 2D convolutional neural network of step 4 is: using the open-source Keras framework based on TensorFlow, a convolutional neural network is built containing w_2 two-dimensional convolutional layers of size n × n, w_1 max-pooling layers and 1 fully connected layer with output dimension L, with the rectified linear unit used as the activation function in both the convolutional and fully connected layers; in the training stage, the spectrogram features of each frame of the voice signal, containing texture-like information, are read into memory in sequence by traversal, the training set and test set are divided and labeled respectively, the processed data are fed into the convolutional neural network according to the set labels, and iterative training is carried out for B iterations in total; the convolutional neural network is trained with stochastic gradient descent as the optimizer, the learning rate set to ε, the learning-rate decay after each update set to μ, and the momentum set to β.
The generative adversarial network of step 5 is: based on the DCGAN network structure, simplified and with adjusted parameters; the network model comprises a generator and a discriminator, the generator network model consisting of 1 fully connected layer, 3 transposed convolutional layers and 2 batch normalization layers and outputting a color picture of size M × M × 3, and the discriminator comprising 3 convolutional layers and a fully connected layer with a softmax function; the discriminator network model uses a 7-layer convolutional neural network composed of 3 convolutional layers, 2 batch normalization layers and 2 fully connected layers, with the final output being a probability value; a probability threshold λ is set, and when the probability value produced by the discriminator after multiple rounds of training is larger than λ, the spectrogram generated by the generator is saved.
Step 6 is specifically: fusing the P-dimensional MFCC feature extracted by the 1D convolutional neural network with the O-dimensional spectrogram feature to obtain a (P+O)-dimensional feature, and reducing the (P+O)-dimensional feature to 256 dimensions through a fully connected layer.
The step 7 specifically comprises the following steps:
step 7.1, taking the voice of the target patient as a test voice, and taking the voice data of the existing depression patient as training data; the training data comprises voice information of X individuals, tags of whether the X individuals suffer from depression are used as a tag dictionary, each tag has a corresponding index number, and the tag index numbers are set as index numbers of the class; after one test, adding a spectrogram generated by a target patient into a training data set;
and step 7.2, for each label, using the voices of subjects suffering from depression as the positive sample set and the voices of subjects not suffering from depression as the negative sample set, and training a binary SVM with the positive and negative sample sets to obtain the trained binary SVM.
The invention has the beneficial effects that: in the depression detection method based on a microphone array, the microphone used for voice acquisition is easy to carry and can acquire the patient's voice signal in a natural state; building on depression recognition research that combines CNN, MFCC features and GAN-augmented data, the method exploits the complementary advantages of MFCC and CNN to improve the accuracy of depression recognition in non-experimental environments.
Drawings
FIG. 1 is a schematic diagram of a microphone array based depression detection method of the present invention;
FIG. 2 is a schematic diagram of a microphone array used in a method of the present invention for depression detection based on a microphone array;
FIG. 3 is a schematic diagram of a CNN model in a depression detection method based on a microphone array according to the present invention;
fig. 4 is a schematic diagram of a GAN model in a depression detection method based on a microphone array according to the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention provides a depression detection method based on a microphone array, which comprises the following steps as shown in figures 1 to 4:
Step 1: by using an annular microphone array, a pickup beam can be formed in the direction of the target speaker for accurate sound source localization, suppressing noise and reflected sound and enhancing the sound signal; speech at a distance of 3-5 m can be recognized accurately in a noisy environment, meeting the need to collect the patient's speech at any time in daily life. Specifically:
step 1.1, acquiring a patient voice signal through a quaternary cross microphone array;
Step 1.2, performing framing and windowing on the collected voice signal of the target patient, converting the signal from the time domain to the frequency domain using the fast Fourier transform, and completing the estimation of the spectral factor by computing the smoothed power spectrum and the noise power spectrum. The spectrally subtracted signal is output. Finally, the energy-entropy ratio is used to detect whether the signal contains the patient's speech and to obtain the endpoint values of the voice. The energy-entropy ratio is calculated as follows:
the energy per frame is calculated as:
Figure BDA0002770860390000061
xiand (m) is a signal of an ith frame, and the frame length is N. The energy relation expression is as follows:
Ei=log10(1+ei/a)
a is constant and proper adjustment can distinguish between unvoiced sounds and noise. The ith frame of voice signal is subjected to fast Fourier transform to obtain:
Figure BDA0002770860390000071
obtaining a frequency component energy spectrum corresponding to the kth spectral line:
Figure BDA0002770860390000072
the normalized spectral probability density is then:
Figure BDA0002770860390000073
short-time spectral entropy definition of a speech frame:
Figure BDA0002770860390000074
energy to entropy ratio EHiIs the ratio of energy and entropy spectrum:
Figure BDA0002770860390000075
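As an illustration only, the per-frame energy-entropy ratio above could be computed with a short NumPy sketch such as the following; the constant a, the use of np.fft.rfft and the small guards against division by zero are assumptions for illustration rather than details taken from the patent.

```python
import numpy as np

def energy_entropy_ratio(frames, a=2.0):
    """Per-frame energy-entropy ratio EH_i for framed, windowed speech.

    frames: array of shape (num_frames, frame_len).
    a:      constant of the log-energy relation E_i = log10(1 + e_i / a).
    """
    # Short-time energy and log-energy of each frame
    e = np.sum(frames ** 2, axis=1)
    E = np.log10(1.0 + e / a)

    # Power spectrum over the positive frequencies
    Y = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Normalised spectral probability density and short-time spectral entropy
    p = Y / (np.sum(Y, axis=1, keepdims=True) + 1e-12)
    H = -np.sum(p * np.log(p + 1e-12), axis=1)

    # Ratio of energy to spectral entropy; large values indicate speech frames
    return E / (H + 1e-12)
```

Frames whose ratio exceeds a chosen threshold would then be marked as speech for the endpoint decision.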
Step 1.3, combining the endpoint detection result, the position of the sound source is determined with a DOA localization method. Taking one frame of signal data as an example: the voice data are read, the m-th frame is taken as the processing object, the 4 microphone signals corresponding to the m-th frame are taken, the 4 signals are combined into 1 signal, and the signal is weighted by W_c(k); then the band energy E_s corresponding to a given angle over the selected frequency bands is obtained, and the energy values E_s(i) of the current frame are computed for i over the full 360 degrees. The maximum E_smax(i) of these 360 energies and the angle i corresponding to the maximum energy are taken, and the sound source angle determined by the current frame is output. The band energy of each frame signal corresponding to a given angle is:

E_s = \sum_{k=f_1}^{f_2} |X_{sw}(k)|^2
where f_1 and f_2 indicate the set band range, from 1 to N/2+1, and X_{sw}(k) is the band-weighted version of the combined single-channel signal:

X_{sw}(k) = W_e(k)\, X_s(k)
where W_e(k) is the band weighting factor:

W_e(k) = [W(k)]^{\lambda}

in which the exponent satisfies 0 < λ < 1 and W(k) is a masking weight factor indicating that, for the current data, the band with the maximum signal-to-noise ratio (SNR) among all bands is selected.
X_s(k) is the result of combining the 4 signals into 1 signal:

X_s(k) = \sum_{i=1}^{4} X_i(k)

where X_i(k) is one of the 4 signals.
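The 360-degree angle search of step 1.3 might be sketched as follows; the per-angle alignment weights (here the argument steering) and the band limits f1 and f2 are illustrative placeholders, since the patent does not spell out how they are computed.

```python
import numpy as np

def doa_frame(mic_ffts, steering, f1, f2):
    """Return the angle (in degrees) with maximum band energy for one frame.

    mic_ffts: (4, num_bins) FFTs of the four microphone channels for frame m.
    steering: (360, 4, num_bins) assumed per-angle phase-alignment weights.
    f1, f2:   bin indices bounding the frequency band used for the energy sum.
    """
    energies = np.empty(360)
    for i in range(360):
        # Align the four channels for candidate angle i and combine them
        combined = np.sum(steering[i] * mic_ffts, axis=0)
        # Band energy E_s(i) of the combined, weighted signal
        energies[i] = np.sum(np.abs(combined[f1:f2]) ** 2)
    return int(np.argmax(energies)), energies
```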
And step 1.4, synthesizing the 4 paths of signals into 1 path of signal through a super-directional beam forming algorithm on the voice signals after the endpoint detection and the sound source positioning processing, thereby realizing the synthesis, noise reduction and enhancement of the microphone array signals. The super-directional beamforming algorithm is detailed as follows:
the microphone array quaternary cross array can be regarded as one of uniform circular arrays, and the arrival direction vector of a received signal at an angle theta is as follows according to the geometrical relationship of the array:
Figure BDA0002770860390000085
wherein the content of the first and second substances,
Figure BDA0002770860390000091
the voice environment used by the method is mainly indoor and daily life, so that the noise matrix calculated based on the scattered noise field has certain applicability to the current microphone voice environment; the scattered noise field only describes the equidirectional noise field of the three-dimensional sphere, and the expression of the correlation function of the scattered noise field is as follows:
Figure BDA0002770860390000092
where sinc(x) denotes the sampling function sin(πx)/(πx). The microphone array consists of M array elements, and the signal received by the i-th microphone is:

x_i(f) = A_i\, e^{j \varphi_i(f)}

where f denotes frequency, A_i denotes the amplitude and \varphi_i(f) denotes the phase. According to the mathematical model of the superdirective optimal solution, the noise correlation coefficient between any two points in space is:

\rho_{ij}(f) = \mathrm{sinc}\!\left(\frac{2 f d_{ij}}{c}\right)
the noise covariance matrix is normalized to:
Rnn=[ρij](i,j=1,2,...,N-1)
dijrepresenting the distance between any two array elements in the microphone array.
The invention adopts the principle of minimum variance distortionless response (MVDR) beamforming, which is the LCMV method under the constraint w^H a(θ) = 1: the signal in the look direction is kept undistorted while the noise variance is minimized, so MVDR maximizes the signal-to-noise ratio (SNR) of the array output. The aim is to choose the filter coefficients w that minimize the total output power subject to the distortionless constraint on the speech signal; the key is therefore to solve for the optimal weight vector w, with the constrained problem:

\min_{w} \; w^H R_x w \quad \text{s.t.} \quad w^H a(\theta_s) = 1
where a(θ_s) = [a_1(θ), ..., a_M(θ)]^T is the steering vector of the target signal, representing the transfer function between the sound source direction and the microphones, which can be obtained from the propagation delays τ; R_x is the spatial covariance matrix of the signal. When k noise signals that are mutually uncorrelated in time arrive at the microphone elements from different directions, the spatial covariance matrix is defined as:
R_x = \sum_{i=1}^{k} \sigma_i^2 \, a(\theta_i)\, a^H(\theta_i)
Solving with the Lagrange multiplier method gives:

w_{\mathrm{opt}} = \frac{R_x^{-1} a(\theta_s)}{a^H(\theta_s)\, R_x^{-1}\, a(\theta_s)}
Using the normalized noise covariance matrix R_{nn} obtained above in place of the covariance matrix R_x in the MVDR solution, the superdirective weighting coefficients are obtained as:

w = \frac{R_{nn}^{-1} a(\theta_s)}{a^H(\theta_s)\, R_{nn}^{-1}\, a(\theta_s)}
and completing weighted beam forming of the multi-channel microphone by using the optimized super-directional weighting coefficient.
Step 2, extracting MFCC characteristics and generating an audio frequency spectrogram, specifically extracting time-frequency representation and similar texture representation of an audio signal at the same time:
Step 2.1, first, the voice signal is divided into frames with a Hamming window function. Cepstral feature vectors are then generated by computing the discrete Fourier transform of each frame. Only the logarithm of the magnitude spectrum is retained, and after the spectrum is smoothed, 24 spectral components of 44100 bands are collected over the Mel frequency range. The components of the Mel spectral vector computed for each frame are highly correlated, so after applying the Karhunen-Loeve (KL) transform it is approximated by the discrete cosine transform (DCT). Finally, the cepstral features [f_1, f_2, ..., f_N] are obtained for each frame;
and 2.2, performing framing and windowing on the voice signal of the patient according to the set frame number, performing short-time Fourier transform on the discrete voice signal x (m), and calculating the power spectrum of the discrete voice signal in the mth frame to obtain a spectrogram. To accommodate the input of the convolutional neural network, L filters are selected, and L frames that are as large as the filters are selected in the time direction, so that an L × 3 spectrogram is generated, and the size of the generated color image is adjusted to M × 3.
Step 3, the MFCC features obtained in step 2 are fed into a 1D convolutional neural network to obtain the P-dimensional MFCC features. The 1D convolutional neural network is as follows: using the open-source Keras framework based on TensorFlow and, to prevent overfitting, only two one-dimensional (1D) convolutional layers are built, each using the rectified linear unit (ReLU) as the activation function; the input of dimension M × 1 passes through w_1 convolutional filters of size m × 1, dropout of 0.1 and max pooling with step q_1, and a feature vector S is output. In the training stage of the 1D convolutional neural network, the MFCC features of each frame of the voice signal, containing time-frequency information, are read into memory in sequence by traversal, the training set and test set are divided and labeled respectively, the processed data are fed into the convolutional neural network according to the set labels, and iterative training is carried out for B iterations in total.
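A Keras sketch of the two-layer 1D convolutional branch of step 3; the concrete sizes (177-point input, 100 and 200 filters of length 5, pooling of 4, dropout 0.1, 1800-dimensional output) follow the embodiment described later in the text and should be read as one possible configuration.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_1d_cnn(input_len=177):
    """Two 1D conv layers mapping a per-frame MFCC vector to a flat feature."""
    inputs = keras.Input(shape=(input_len, 1))
    x = layers.Conv1D(100, 5, activation='relu')(inputs)   # 173 x 100
    x = layers.MaxPooling1D(4)(x)                           # 43 x 100
    x = layers.Dropout(0.1)(x)
    x = layers.Conv1D(200, 5, activation='relu')(x)         # 39 x 200
    x = layers.MaxPooling1D(4)(x)                            # 9 x 200
    x = layers.Dropout(0.1)(x)
    features = layers.Flatten()(x)                           # 1800-dim (P)
    return keras.Model(inputs, features, name='mfcc_1d_cnn')
```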
Step 4, the spectrogram obtained in step 2 is fed into a 2D convolutional neural network to obtain the O-dimensional spectrogram features. The 2D convolutional neural network is as follows: using the open-source Keras framework based on TensorFlow and referring to AlexNet, a convolutional neural network is built containing w_2 two-dimensional convolutional layers of size n × n, w_1 max-pooling layers and 1 fully connected layer with output dimension L, with the rectified linear unit (ReLU) used as the activation function in both the convolutional and fully connected layers. In the training stage, the spectrogram features of each frame of the voice signal, containing texture-like information, are read into memory in sequence by traversal, the training set and test set are divided and labeled respectively, the processed data are fed into the convolutional neural network according to the set labels, and iterative training is carried out for B iterations in total. The convolutional neural network is trained with stochastic gradient descent as the optimizer, the learning rate set to ε, the learning-rate decay after each update set to μ, and the momentum set to β.
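A corresponding Keras sketch of the spectrogram branch of step 4; the three convolutional layers with 64, 32 and 32 kernels, 2 × 2 pooling, dropout of 0.1 and the 16928-dimensional flattened output follow the embodiment described later, and the dense layer producing the O-dimensional feature is shown with the 128 units used there.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_2d_cnn(img_size=200, out_dim=128):
    """AlexNet-inspired 2D CNN for the spectrogram branch (O-dim output)."""
    inputs = keras.Input(shape=(img_size, img_size, 3))
    x = inputs
    for n_filters in (64, 32, 32):
        x = layers.Conv2D(n_filters, 3, activation='relu')(x)
        x = layers.MaxPooling2D(2)(x)
        x = layers.Dropout(0.1)(x)
    x = layers.Flatten()(x)                       # 23 x 23 x 32 = 16928
    features = layers.Dense(out_dim, activation='relu')(x)
    return keras.Model(inputs, features, name='spectrogram_2d_cnn')
```

Training with stochastic gradient descent (learning rate ε, decay μ, momentum β) as described above could use keras.optimizers.SGD once a classification head is attached.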
Step 5, the features obtained in step 4 are input into a generative adversarial network to generate new spectrogram images, the generated images are added to the original spectrogram data, and the training of step 4 is then performed again. The generative adversarial network is: based on the DCGAN network structure, simplified and with adjusted parameters. The network model comprises a generator and a discriminator; the generator network model consists of 1 fully connected layer, 3 transposed convolutional layers and 2 batch normalization layers and outputs a color picture of size M × M × 3, while the discriminator comprises 3 convolutional layers and a fully connected layer with a softmax function; the discriminator network model uses a 7-layer convolutional neural network composed of 3 convolutional layers, 2 batch normalization layers and 2 fully connected layers, and its final output is a probability value. A probability threshold λ is set, and when the probability value produced by the discriminator after multiple rounds of training is larger than λ, the spectrogram generated by the generator is saved. The generated spectrograms that meet this criterion are fed into the convolutional network of step 4 for retraining.
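A DCGAN-style sketch of the generator and discriminator of step 5; the layer counts and the transposed-convolution kernel and stride settings follow the embodiment described later (with 'valid' padding they produce a 64 × 64 × 3 image from a 128-dimensional input), while the activations and the LeakyReLU slope are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_generator(latent_dim=128):
    """Dense layer, reshape, then three transposed convolutions up to 64x64x3."""
    return keras.Sequential([
        keras.Input(shape=(latent_dim,)),
        layers.Dense(3 * 3 * 512, activation='relu'),
        layers.Reshape((3, 3, 512)),
        layers.Conv2DTranspose(256, 3, strides=3, activation='relu'),  # 9x9
        layers.BatchNormalization(),
        layers.Conv2DTranspose(128, 5, strides=2, activation='relu'),  # 21x21
        layers.BatchNormalization(),
        layers.Conv2DTranspose(3, 4, strides=3, activation='tanh'),    # 64x64x3
    ], name='generator')

def build_discriminator(img_size=64):
    """Three convolutions ending in a single real/fake probability."""
    return keras.Sequential([
        keras.Input(shape=(img_size, img_size, 3)),
        layers.Conv2D(64, 5), layers.LeakyReLU(0.2),
        layers.Conv2D(128, 5), layers.LeakyReLU(0.2), layers.BatchNormalization(),
        layers.Conv2D(256, 5), layers.LeakyReLU(0.2), layers.BatchNormalization(),
        layers.Flatten(),
        layers.Dense(1, activation='sigmoid'),
    ], name='discriminator')
```

Generated spectrograms whose discriminator score exceeds the threshold λ would then be resized and added back to the 2D CNN training data, as described above.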
Step 6, the MFCC features extracted in step 3 are fused with the features obtained in step 4 from the augmented spectrogram data, and the dimensionality is reduced through a fully connected layer. Specifically, the P-dimensional MFCC feature extracted by the CNN is fused with the O-dimensional spectrogram feature to obtain a (P+O)-dimensional feature, which is reduced to 256 dimensions through a fully connected layer.
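The fusion of step 6 could then be sketched as follows, reusing the two feature extractors from the sketches above; the 256-dimensional fully connected layer follows the text, while the dropout rate is an assumption.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_fusion(mfcc_branch, spec_branch, fused_dim=256):
    """Concatenate the P-dim MFCC and O-dim spectrogram features, reduce to 256."""
    fused = layers.Concatenate()([mfcc_branch.output, spec_branch.output])
    fused = layers.Dense(fused_dim, activation='relu')(fused)
    fused = layers.Dropout(0.1)(fused)
    return keras.Model([mfcc_branch.input, spec_branch.input], fused,
                       name='fused_features')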
Step 7, a classifier is trained with the dimension-reduced features obtained in step 6; the classifier training specifically comprises the following steps:
and 7.1, taking the voice of the target patient as the test voice, and taking the voice data of the existing depression patient as training data. The training data comprises voice information of X individuals, tags of whether the X individuals suffer from depression are used as a tag dictionary, each tag has a corresponding index number, and the tag index numbers are set as index numbers of the class. After one test, the spectrogram generated by the target patient is added to the training data set.
Step 7.2, for each label, the speech of subjects suffering from depression is taken as the positive sample set and the speech of subjects not suffering from depression as the negative sample set. A binary SVM is trained with the positive and negative sample sets to obtain the trained binary SVM; the classifier training process is as follows:
The kernel function and penalty factor of the SVM are determined by cyclically checking the accuracy on the SVM training set and selecting the optimal values of these two parameters, which are then used to train the model. Let the training speech data be {x_i, y_i}, x_i ∈ R^n, i = 1, 2, ..., n, where x_i is the (O+P)-dimensional feature vector and y_i is the label indicating whether depression is present. The SVM maps the training set into a high-dimensional space using a nonlinear mapping Φ(x); the optimal classification surface that makes the nonlinear problem linear is described as y = ω^T Φ(x) + b, where ω and b are the weight and bias of the SVM.
To find the optimal ω and b, a slack variable ξ_i is introduced and the classification surface is transformed into a quadratic optimization problem:

\min_{\omega, b, \xi} \; \frac{1}{2} \|\omega\|^2 + C \sum_{i=1}^{n} \xi_i

\text{s.t.} \quad y_i\,(\omega \cdot \Phi(x_i) + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1, 2, \ldots, n
in the formula: c denotes a penalty parameter. Transforming the quadratic optimization problem by introducing a Lagrange multiplier to obtain:
Figure BDA0002770860390000132
the formula for the weight vector ω is: ω ═ Σ αiyiΦ(xi) Φ (x), the decision function of the support vector machine can be described as: f (x) sgn (α)iyiΦ(xi)·Φ(xj) + b), simplified computation, introducing a gaussian Radial Basis (RBF) kernel function and then a decision function:
f(x) = \mathrm{sgn}\left( \sum_{i=1}^{n} \alpha_i y_i \exp\!\left( -\frac{\|x - x_i\|^2}{2\sigma^2} \right) + b \right)

where σ denotes the width parameter of the RBF kernel.
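The classifier training of step 7 might look as follows with scikit-learn; the RBF kernel matches the decision function above, while the grid of C and gamma values stands in for the cyclic parameter search described above and is an assumed example.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def train_depression_svm(features, labels):
    """Train the binary RBF-kernel SVM on the fused 256-dimensional features.

    labels: 1 for speech of depressed subjects, 0 otherwise.
    """
    param_grid = {'C': [0.1, 1, 10, 100], 'gamma': ['scale', 0.01, 0.001]}
    search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
    search.fit(features, labels)
    return search.best_estimator_
```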
Step 8, the test voice is recognized with the classifier trained in step 7. The generated recognition result can be sent to the patient's guardian over Wi-Fi, so that the patient's condition can be monitored at any time.
In this way, the microphone used for voice acquisition in the microphone array based depression detection method is easy to carry and can acquire the patient's voice signal in a natural state; building on depression recognition research that combines CNN, MFCC features and GAN-augmented data, the method exploits the complementary advantages of MFCC and CNN to improve the accuracy of depression recognition in non-experimental environments.
A depression recognition experiment was carried out with the microphone array based depression detection method of the present invention on the AVEC2013 audio-visual depression recognition challenge database, which contains speech recordings from 340 individuals. The specific operation is as follows:
step 1, preprocessing the voice signals under each subdirectory in sequence by using a traversal method, and dividing the voice signals into frames by using a Hamming window function. Cepstral feature vectors are then generated and a discrete fourier transform is computed for each frame. Only the logarithm of the amplitude is retained. After the spectrum is smoothed, 24 spectral components of 44100 bands are collected over the Mel frequency range. The components of the mel-frequency spectral vector calculated for each frame are highly correlated. Therefore, after applying KL (Karhunen-Loeve) transform, it is approximated as Discrete Cosine Transform (DCT).
Step 2, extracting MFCC features after preprocessing signals, normalizing the MFCC features, limiting the length of each section of voice to be 10 seconds by dividing voice fragments, obtaining 177-dimensional feature vectors of each frame by 50 frames per second, and setting the number of channels of each voice to be 50; then converting the voice signal into a spectrogram, wherein the spectrogram limits the number of sampling frames to 64 frames per second; obtaining a color picture with a spectrogram of 64 × 64 × 3 pixels, and adjusting the picture size to 200 × 200 × 3 pixels.
Step 3, building a convolution pooling layer, wherein a model of a 5-layer convolution neural network is composed of 2 convolution layers, 2 maximum pooling layers and 1 full-connection layer, the input data of the first layer is MFCC characteristics of 177 × 1 × 50, convolution operation is carried out by adopting a convolution kernel of 5 × 1 and the MFCC characteristics, the convolution kernel moves along two directions of an x axis and a y axis of the MFCC characteristics, the step length is 1 pixel, 100 convolution kernels are used in total to generate 173 × 1 × 100 pixel layers, a ReLU function is used as an activation function, the pixel layers are processed by a ReLU unit to generate activation pixel layers, the activation pixel layers are processed by maximum pooling operation, the scale of the pooling operation is 4 × 1, the step length is defaulted to be 1, and the size of the pooled pixels is 43 × 1 × 100; the second layer uses 5 × 1 × 200 convolution kernels, and generates 39 × 1 × 200 pixel layers after convolution operation. The pixel layers are processed by a ReLU unit to generate active pixel layers, the active pixel layers are processed by maximum pool operation, the size of the image after the pool operation is 4 multiplied by 1, the size of the image after the pool operation is 9 multiplied by 1 multiplied by 200, and then the input neurons are immediately disconnected with a probability of 10% to update the parameters when the parameters are updated by a Dropout layer; the multi-dimensional input is subjected to one-dimensional input by using a flattening layer, a group of one-dimensional pixel arrays are output after flattening processing, 1800 data are contained in the array, and then the pixels are used as input and transmitted to a full connection layer for the next operation.
And 4, building a convolution pooling layer, wherein a 7-layer convolution neural network model is composed of 3 convolution layers, 3 maximum pooling layers and 1 full-connection layer. The input data of the first layer is a spectrogram of 200 × 200 × 3, convolution operation is performed on the spectrogram by adopting a convolution kernel of 3 × 3 × 3, the convolution kernel moves along the x axis and the y axis of an image, the step size is 1 pixel, 64 convolution kernels are used in total to generate data of 198 × 198 × 64 pixels, a ReLU function is used as an activation function, the pixel layers are processed by a ReLU unit to generate activated pixel layers, the activated pixel layers are processed by maximum pooling operation, the scale of the pooling operation is 2 × 2, the step size is default to 2, and the size of the pooled pixels is 99 × 99 × 64; when reversely propagating, each convolution kernel should have an offset value, namely 64 convolution kernels of the first layer correspond to 64 offset values input by the upper layer; the second layer uses 32 convolution kernels, 3 × 3 × 64, and generates 97 × 97 × 32 pixel layers after convolution operation. The pixel layers are processed by a ReLU unit to generate active pixel layers, the active pixel layers are processed by maximum pool operation, the size of the image after the pool operation is 2 x 2, the size of the image after the pool operation is 48 x 32, and then when the parameters are updated by a Dropout layer, input neurons are immediately disconnected with a probability of 10% to update the parameters so as to prevent overfitting; in the backward propagation in this layer, each convolution kernel should have an offset value, i.e. 64 convolution kernels in the first layer correspond to 32 offset values input by the upper layer; similarly, the third layer uses 32 3 × 3 × 32 convolution kernels, and generates 46 × 46 × 32 pixel layers after convolution operation. The pixel layers are processed by a ReLU unit to generate active pixel layers, the active pixel layers are processed by maximum pool operation, the size of the image after the pool operation is 2 multiplied by 2, the size of the image after the pool operation is 23 multiplied by 32, and then the input neurons are immediately disconnected with a probability of 10% to update the parameters when the parameters are updated by a Dropout layer; the multi-dimensional input is subjected to one-dimensional input by using a flattening layer, a group of one-dimensional pixel arrays are output after flattening treatment, the total number of the one-dimensional pixel arrays comprises 16928 data, and then the pixels are used as input and transmitted into a full-connection layer to carry out the next operation.
In order to extract the features of the spectrogram to be sent to a GAN network to generate a new spectrogram, the multidimensional features obtained by the spectrogram need to be subjected to dimensionality reduction, a full connection layer is built, the full connection (Dense) is used for fully connecting 16928 input data to 128 neural units, then 128 data are generated after the processing of a ReLU activation function, and 128 data are output after the processing of Dropout and serve as speech emotion features.
And step 5, the GAN generator network model consists of 1 full-connection layer, 3 transposition convolution layers and 2 batch standardization layers. The first layer input data is 128 data extracted in step 4, is connected with 4608 neurons through a full connection layer, and is converted into a shape of 3 × 3 × 512; the second layer reduces 512 channels to 256 channels using transposed convolution, kernel _ size is 3, step size is 3, and passes through the batch normalization layer; the third layer reduces 256 channels to 128 channels using transposed convolution, kernel _ size is 5, step size is 2, and passes through the batch normalization layer; the fourth layer reduces 128 channels to 3 channels using transposed convolution, with a kernel _ size of 4 and a step size of 3;
the GAN discriminator network model of the invention uses 7 layers of convolutional neural network model and consists of 3 convolutional layers, 2 batch normalization layers and 1 full connection layer. The input data of the first layer is a spectrogram of 64 multiplied by 3, a convolution operation is carried out on the spectrogram by adopting a convolution kernel of 5 multiplied by 3, the convolution kernel moves along the x axis and the y axis of the image, the step length is 1 pixel, 64 convolution kernels are used together to generate data of 60 multiplied by 24 pixel layers, a Leakly-ReLU function is used as an activation function, and the pixel layers are processed by a Leakly-ReLU unit to generate an activation pixel layer; the second layer uses 128 5 × 5 × 128 convolution kernels, and generates 57 × 57 × 128 pixel layers after convolution operation. The pixel layers are processed by a Leakly-ReLU unit to generate activated pixels, and the activated pixel layers are processed by a batch normalization layer to prevent overfitting; the third layer generates 53 × 53 × 256 pixel layers by convolution operation using 256 5 × 5 × 256 convolution kernels. The pixel layers are processed by a Leakly-ReLU unit to generate activated pixels, and the activated pixel layers are processed by a batch normalization layer to prevent overfitting; using a flattening layer to carry out one-dimensional input, carrying out flattening treatment, then using the pixels as input to transmit the input into a full-connection layer, wherein the last layer of output layer is provided with 1 node, and outputting a probability value; the standard-compliant 64 × 64 × 3 generated spectrogram size is modified to 200 × 200 × 3 pixel size and introduced into the convolutional network of step 4 for retraining.
And 6, building a full connection layer, combining the 1800 dimensional data extracted in the step 3 and the 16928 dimensional data extracted in the step 4 into 18728 dimensional data, fully connecting the 18728 dimensional data with 256 neural units, processing by a ReLU activation function to generate 256 data, and outputting the 256 data after Dropout processing to serve as the speech emotion characteristics.
Step 7, because the data set contains 292 people, 43,800 sections of voice information are used together through clipping and screening, the tags of 292 people whether suffering from depression are used as a tag dictionary, each tag has a corresponding index number, the index numbers of the tags are set as the class index numbers, 90% of voice signals of the tags are used as a training set, and the rest 10% of voice signals are used as a testing set;
for each tag, the voices that always suffered from depression were taken as the positive sample set, and the voices that did not suffer from depression were taken as the negative sample set. Training a two-classification SVM by using the positive sample set and the negative sample set to obtain a trained two-classification SVM;
and 8, recognizing the test voice through the two-classification SVM trained in the step 7.

Claims (8)

1. A microphone array based depression detection method, comprising the steps of:
step 1, collecting a voice signal of a target patient by using a microphone array and preprocessing the voice signal;
step 2, extracting MFCC characteristics of the audio signal preprocessed by the target patient and the voice data of the existing depression patient in the step 1 to generate an audio frequency spectrogram;
step 3, sending the MFCC features extracted in the step 2 into a 1D convolutional neural network to obtain P-dimensional features of the MFCC;
step 4, sending the audio frequency spectrogram generated in the step 2 into a 2D convolutional neural network to obtain O-dimensional characteristics of the spectrogram;
step 5, inputting the O-dimensional features obtained in step 4 into a generative adversarial network to generate new spectrogram images, and feeding the generated spectrogram images into the 2D convolutional neural network of step 4 for training;
step 6, fusing the P-dimensional characteristics of the MFCC extracted in the step 3 and the characteristics obtained by training in the step 5, and reducing the dimensions through a full connection layer;
step 7, training a classifier with the dimension-reduced features obtained in step 6;
and step 8, recognizing the test voice with the classifier trained in step 7 to obtain the recognition result.
2. The microphone array based depression detection method as claimed in claim 1, wherein the step 1 comprises the following steps:
step 1.1, acquiring a target patient voice signal through a quaternary cross microphone array;
step 1.2, performing framing and windowing on the collected voice signal of the target patient, converting the signal from the time domain to the frequency domain using the fast Fourier transform, completing the estimation of the spectral factor by computing the smoothed power spectrum and the noise power spectrum, outputting the spectrally subtracted signal, and finally detecting the target patient's voice signal in combination with the energy-entropy ratio to obtain the voice endpoint values;
step 1.3, combining the end point detection result, and judging the position of the sound source signal by using a DOA (direction of arrival) positioning method;
and step 1.4, synthesizing four paths of signals into one path of signal through a super-directivity beam forming algorithm according to the voice signals subjected to endpoint detection and sound source positioning processing, and realizing synthesis, noise reduction and enhancement of the microphone array signals.
3. A microphone array based depression detection method according to claim 2, wherein the step 2 comprises the following steps:
step 2.1, first dividing the voice signal into frames with a Hamming window function; generating cepstral feature vectors by computing the discrete Fourier transform of each frame and keeping only the logarithm of the magnitude spectrum; after smoothing the spectrum, collecting 24 spectral components of 44100 bands in the Mel frequency range, and, because the Karhunen-Loeve transform of the highly correlated Mel spectral components can be approximated by the discrete cosine transform, applying the DCT; finally obtaining the cepstral features [f_1, f_2, ..., f_N] for each frame;
step 2.2, performing framing and windowing on the voice signal of the target patient according to the set number of frames, applying the short-time Fourier transform to the discrete voice signal x(m), and computing the power spectrum of the m-th frame to obtain the spectrogram; with L filters selected and L frames of the same size as the filters selected along the time direction, an L × L × 3 spectrogram is generated, and the generated color image is resized to M × M × 3.
4. The microphone array-based depression detection method as claimed in claim 3, wherein the 1D convolutional neural network of step 3 is: only two 1D convolutional layers are built using the open-source Keras framework based on TensorFlow, each layer using the rectified linear unit as the activation function; the input of dimension M × 1 passes through w_1 convolutional filters of size m × 1, dropout of 0.1 and max pooling with step q_1, and a feature vector S is output; in the training stage of the 1D convolutional neural network, the MFCC features of each frame of the voice signal, containing time-frequency information, are read into memory in sequence by traversal, the training set and test set are divided and labeled respectively, the processed data are fed into the convolutional neural network according to the set labels, and iterative training is carried out for B iterations in total.
5. The microphone array-based depression detection method of claim 4, wherein the 2D convolutional neural network of step 4 is: using the open-source Keras framework based on TensorFlow, a convolutional neural network is built containing w_2 two-dimensional convolutional layers of size n × n, w_1 max-pooling layers and 1 fully connected layer with output dimension L, with the rectified linear unit used as the activation function in both the convolutional and fully connected layers; in the training stage, the spectrogram features of each frame of the voice signal, containing texture-like information, are read into memory in sequence by traversal, the training set and test set are divided and labeled respectively, the processed data are fed into the convolutional neural network according to the set labels, and iterative training is carried out for B iterations in total; the convolutional neural network is trained with stochastic gradient descent as the optimizer, the learning rate set to ε, the learning-rate decay after each update set to μ, and the momentum set to β.
6. The microphone array-based depression detection method of claim 5, wherein the generative adversarial network of step 5 is: based on the DCGAN network structure, simplified and with adjusted parameters; the network model comprises a generator and a discriminator, the generator network model consisting of 1 fully connected layer, 3 transposed convolutional layers and 2 batch normalization layers and outputting a color picture of size M × M × 3, and the discriminator comprising 3 convolutional layers and a fully connected layer with a softmax function; the discriminator network model uses a 7-layer convolutional neural network composed of 3 convolutional layers, 2 batch normalization layers and 2 fully connected layers, with the final output being a probability value; a probability threshold λ is set, and when the probability value produced by the discriminator after multiple rounds of training is larger than λ, the spectrogram generated by the generator is saved.
7. The microphone array-based depression detection method according to claim 6, wherein step 6 is specifically: fusing the P-dimensional MFCC feature extracted by the 1D convolutional neural network with the O-dimensional spectrogram feature to obtain a (P+O)-dimensional feature, and reducing the (P+O)-dimensional feature to 256 dimensions through a fully connected layer.
8. The microphone array-based depression detection method according to claim 7, wherein the step 7 is specifically:
step 7.1, taking the voice of the target patient as a test voice, and taking the voice data of the existing depression patient as training data; the training data comprises voice information of X individuals, tags of whether the X individuals suffer from depression are used as a tag dictionary, each tag has a corresponding index number, and the tag index numbers are set as index numbers of the class; after one test, adding a spectrogram generated by a target patient into a training data set;
and step 7.2, for each label, using the voices of subjects suffering from depression as the positive sample set and the voices of subjects not suffering from depression as the negative sample set, and training a binary SVM with the positive and negative sample sets to obtain the trained binary SVM.
CN202011248610.5A 2020-11-10 2020-11-10 Depression detection method based on microphone array Active CN112349297B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011248610.5A CN112349297B (en) 2020-11-10 2020-11-10 Depression detection method based on microphone array

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011248610.5A CN112349297B (en) 2020-11-10 2020-11-10 Depression detection method based on microphone array

Publications (2)

Publication Number Publication Date
CN112349297A true CN112349297A (en) 2021-02-09
CN112349297B CN112349297B (en) 2023-07-04

Family

ID=74362344

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011248610.5A Active CN112349297B (en) 2020-11-10 2020-11-10 Depression detection method based on microphone array

Country Status (1)

Country Link
CN (1) CN112349297B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107705806A (en) * 2017-08-22 2018-02-16 北京联合大学 A kind of method for carrying out speech emotion recognition using spectrogram and deep convolutional neural networks
CN108831495A (en) * 2018-06-04 2018-11-16 桂林电子科技大学 A kind of sound enhancement method applied to speech recognition under noise circumstance
CN109599129A (en) * 2018-11-13 2019-04-09 杭州电子科技大学 Voice depression recognition methods based on attention mechanism and convolutional neural networks
CN110047506A (en) * 2019-04-19 2019-07-23 杭州电子科技大学 A kind of crucial audio-frequency detection based on convolutional neural networks and Multiple Kernel Learning SVM

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LE YANG et al.: "Feature Augmenting Networks for Improving Depression Severity Estimation From Speech Signals", IEEE ACCESS *
ZHIYONG WANG et al.: "Recognition of Audio Depression Based on Convolutional Neural Network and Generative Antagonism Network Model", IEEE ACCESS *
LI JINMING et al.: "Audio depression recognition based on deep learning", Computer Applications and Software *
ZHONG XINZI et al.: "Research on speech emotion recognition method based on autoencoder", Electronic Design Engineering, no. 06 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818892A (en) * 2021-02-10 2021-05-18 杭州医典智能科技有限公司 Multi-modal depression detection method and system based on time convolution neural network
CN113012720A (en) * 2021-02-10 2021-06-22 杭州医典智能科技有限公司 Depression detection method by multi-voice characteristic fusion under spectral subtraction noise reduction
CN113012720B (en) * 2021-02-10 2023-06-16 杭州医典智能科技有限公司 Depression detection method by multi-voice feature fusion under spectral subtraction noise reduction
CN112687390B (en) * 2021-03-12 2021-06-18 中国科学院自动化研究所 Depression state detection method and device based on hybrid network and lp norm pooling
CN112687390A (en) * 2021-03-12 2021-04-20 中国科学院自动化研究所 Depression state detection method and device based on hybrid network and lp norm pooling
CN113223507B (en) * 2021-04-14 2022-06-24 重庆交通大学 Abnormal speech recognition method based on double-input mutual interference convolutional neural network
CN113223507A (en) * 2021-04-14 2021-08-06 重庆交通大学 Abnormal speech recognition method based on double-input mutual interference convolutional neural network
CN113205803A (en) * 2021-04-22 2021-08-03 上海顺久电子科技有限公司 Voice recognition method and device with adaptive noise reduction capability
CN113205803B (en) * 2021-04-22 2024-05-03 上海顺久电子科技有限公司 Voice recognition method and device with self-adaptive noise reduction capability
CN113476058B (en) * 2021-07-22 2022-11-29 北京脑陆科技有限公司 Intervention treatment method, device, terminal and medium for depression patients
CN113476058A (en) * 2021-07-22 2021-10-08 北京脑陆科技有限公司 Intervention treatment method, device, terminal and medium for depression patients
CN113679413A (en) * 2021-09-15 2021-11-23 北方民族大学 VMD-CNN-based lung sound feature identification and classification method and system
CN113679413B (en) * 2021-09-15 2023-11-10 北方民族大学 VMD-CNN-based lung sound feature recognition and classification method and system
CN113820693A (en) * 2021-09-20 2021-12-21 西北工业大学 Uniform linear array element failure calibration method based on generation of countermeasure network
CN113820693B (en) * 2021-09-20 2023-06-23 西北工业大学 Uniform linear array element failure calibration method based on generation of countermeasure network
CN114219005A (en) * 2021-11-17 2022-03-22 太原理工大学 Depression classification method based on high-order spectral voice features
CN116978409A (en) * 2023-09-22 2023-10-31 苏州复变医疗科技有限公司 Depression state evaluation method, device, terminal and medium based on voice signal

Also Published As

Publication number Publication date
CN112349297B (en) 2023-07-04

Similar Documents

Publication Publication Date Title
CN112349297B (en) Depression detection method based on microphone array
CN111243620B (en) Voice separation model training method and device, storage medium and computer equipment
US10127922B2 (en) Sound source identification apparatus and sound source identification method
CN109841226A (en) A kind of single channel real-time noise-reducing method based on convolution recurrent neural network
JPH02160298A (en) Noise removal system
CN108922543A (en) Model library method for building up, audio recognition method, device, equipment and medium
WO2019232867A1 (en) Voice discrimination method and apparatus, and computer device, and storage medium
Venkatesan et al. Binaural classification-based speech segregation and robust speaker recognition system
Salvati et al. A late fusion deep neural network for robust speaker identification using raw waveforms and gammatone cepstral coefficients
CN111243621A (en) Construction method of GRU-SVM deep learning model for synthetic speech detection
CN113314127B (en) Bird song identification method, system, computer equipment and medium based on space orientation
US6567771B2 (en) Weighted pair-wise scatter to improve linear discriminant analysis
AU2362495A (en) Speech-recognition system utilizing neural networks and method of using same
Lin et al. Domestic activities clustering from audio recordings using convolutional capsule autoencoder network
CN109741733B (en) Voice phoneme recognition method based on consistency routing network
CN115952840A (en) Beam forming method, arrival direction identification method, device and chip thereof
Raju et al. AUTOMATIC SPEECH RECOGNITION SYSTEM USING MFCC-BASED LPC APPROACH WITH BACK PROPAGATED ARTIFICIAL NEURAL NETWORKS.
Kothapally et al. Speech Detection and Enhancement Using Single Microphone for Distant Speech Applications in Reverberant Environments.
Sailor et al. Unsupervised Representation Learning Using Convolutional Restricted Boltzmann Machine for Spoof Speech Detection.
Venkatesan et al. Deep recurrent neural networks based binaural speech segregation for the selection of closest target of interest
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
CN113903344A (en) Deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction
CN112259107A (en) Voiceprint recognition method under meeting scene small sample condition
Jannu et al. An Overview of Speech Enhancement Based on Deep Learning Techniques
CN114512133A (en) Sound object recognition method, sound object recognition device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant