CN111326179B - Deep learning method for detecting crying of baby - Google Patents
- Publication number
- CN111326179B (application CN202010125193.9A)
- Authority
- CN
- China
- Legal status (an assumption, not a legal conclusion): Active
Classifications
- G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
- G10L25/63: Speech or voice analysis techniques for estimating an emotional state
- G10L15/02: Feature extraction for speech recognition; selection of recognition unit
- G10L15/04: Segmentation; word boundary detection
- G10L15/08: Speech classification or search
- G10L25/30: Speech or voice analysis techniques using neural networks
- G10L25/45: Speech or voice analysis techniques characterised by the type of analysis window
Abstract
The invention discloses a deep learning method for detecting infant crying, and relates to the technical field of voice signal processing. The method comprises the following steps: a. collecting a voice signal; b. framing the voice signal segment and extracting cochlear voice features from each frame; c. inputting the adjacent N frames of voice features into a pre-trained deep learning model for infant cry detection and inferring whether crying is present; d. voting on the N frame classification results under the majority-priority voting principle to judge whether infant crying exists in the N frames. The cochlear voice features adopted by the invention are voice feature parameters that better match human auditory perception, and a convolutional network combined with a long short-term memory recurrent neural network serves as the acoustic inference model for infant cry detection, so the method can adapt to speech environments with a low signal-to-noise ratio and achieves higher accuracy than traditional methods.
Description
Technical Field
The invention belongs to the technical field of voice signal processing, and particularly relates to a deep learning method for detecting crying of infants.
Background
In modern society, parents or grandparents, busy with work or housework, can easily neglect the care of a newborn infant, and infants can express emotions and needs only by crying; home care based on infant cry detection therefore has great market demand.
A current voice signal recognition system usually consists of three parts: voice signal preprocessing, feature extraction, and classification. Feature extraction is the most important part, and its quality directly influences the recognition result. The speech features proposed by previous researchers are mostly based on prosodic and voice-quality features of speech and are all manually designed, so such systems have low robustness and are easily influenced by the environment.
The present invention provides a deep learning method for detecting infant crying that solves the above problems.
Disclosure of Invention
The invention aims to provide a deep learning method for detecting infant crying, which detects infant crying from a voice signal.
In order to solve the technical problems, the invention is realized by the following technical scheme: the invention relates to a deep learning method for detecting crying of infants, which comprises the following steps:
s001: collecting voice signals;
s002: framing the voice signal segment, and extracting cochlear voice characteristics from each frame;
s003: and (3) performing infant crying detection to output a detection result, wherein the infant crying detection comprises the following steps of:
s0031: establishing a deep learning classifier based on a convolutional network and a long-short-term memory recurrent neural network;
s0032: and inputting the extracted adjacent N frames of cochlea speech features into a deep learning classifier to obtain N frames of classification results, voting the N frames of classification results by applying a majority priority voting principle, and obtaining a final infant crying detection result.
Further, the manner of voice signal acquisition in the step S001 is as follows:
s0011: inputting a voice signal with a microphone device;
s0012: the corresponding speech signal is obtained by sample quantization.
Further, the sampling frequency of the sampling quantization is 16 kHz, and the quantization precision is 16 bits.
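As a concrete illustration of steps S0011 and S0012, the sketch below reads a 16 kHz / 16-bit PCM capture into a normalized float array using the Python standard library plus NumPy; the function name `read_pcm16` and the round-trip demo are illustrative, not from the patent.

```python
import os
import tempfile
import wave

import numpy as np

def read_pcm16(path):
    """Read a mono 16 kHz / 16-bit PCM WAV file into floats in [-1, 1]."""
    with wave.open(path, "rb") as w:
        assert w.getframerate() == 16000 and w.getsampwidth() == 2
        pcm = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    return pcm.astype(np.float32) / 32768.0

# Round-trip demo: write 0.1 s of a 440 Hz tone, then read it back.
tone = (np.sin(2 * np.pi * 440 * np.arange(1600) / 16000) * 16000).astype(np.int16)
fd, path = tempfile.mkstemp(suffix=".wav")
os.close(fd)
with wave.open(path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)       # 16-bit quantization precision
    w.setframerate(16000)   # 16 kHz sampling frequency
    w.writeframes(tone.tobytes())
x = read_pcm16(path)
os.remove(path)
```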
Further, in step S002, the frame length used for framing the voice signal segment ranges from 20 ms to 30 ms, and the frame shift ranges from 10 ms to 15 ms.
Further, the method for extracting the cochlear speech features in step S002 comprises the following steps:
s0021: constructing a Gammatone filter bank based on a human ear cochlear auditory model, wherein the time domain expression form is as follows:
g(f, t) = k·t^(a−1)·e^(−2πbt)·cos(2πft + φ), t ≥ 0;
wherein k is the filter gain, a is the filter order, f is the center frequency, Φ is the phase, b is the attenuation factor, the attenuation factor determines the bandwidth of the corresponding filter, and the relationship between the attenuation factor and the center frequency is:
b=24.7(4.37f/1000+1);
s0022: filtering the voice signal using an FFT-based overlap-add method to obtain an output response signal R(n, t), where n is the number of filter channels and t is the length of the output response signal (a natural number); the length of the output response signal is kept equal to that of the input signal;
s0023: framing the output response signal R(n, t) to obtain the response energy within each frame and thereby a cochleagram-like map, according to the formula: GF_m(i) = log(|R(i, m)|^(1/2));
where i denotes the i-th Gammatone filter, i = 0, 1, 2, …, N−1, with N = 64 the number of filters in the bank; m denotes the m-th frame, m = 0, 1, 2, …, M−1, with M the number of frames after framing; each frame of the cochleagram-like map is called a Gammatone feature coefficient (GF), and one GF feature vector is composed of 64 frequency components.
Further, the Gammatone filter bank adopts 64-channel fourth-order Gammatone filters with center frequencies between 50 Hz and 8000 Hz.
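The Gammatone impulse response g(f, t) and the 64 center frequencies between 50 Hz and 8000 Hz can be sketched as follows; the unit gain k = 1, the 25 ms response duration, and the ERB-rate spacing of the center frequencies (Glasberg and Moore) are assumptions, as the patent fixes only the endpoints.

```python
import numpy as np

def gammatone_ir(f, fs=16000, a=4, k=1.0, phi=0.0, duration=0.025):
    """Impulse response g(f, t) = k * t^(a-1) * exp(-2*pi*b*t) * cos(2*pi*f*t + phi)."""
    t = np.arange(int(duration * fs)) / fs
    b = 24.7 * (4.37 * f / 1000.0 + 1.0)  # decay factor b = 24.7(4.37 f/1000 + 1)
    return k * t ** (a - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * f * t + phi)

def erb_center_freqs(n=64, f_lo=50.0, f_hi=8000.0):
    """n center frequencies spaced uniformly on the ERB-rate scale (an assumption)."""
    # ERB-rate transform: E(f) = 21.4 * log10(4.37 * f / 1000 + 1)
    e = np.linspace(21.4 * np.log10(4.37 * f_lo / 1000 + 1),
                    21.4 * np.log10(4.37 * f_hi / 1000 + 1), n)
    return (10 ** (e / 21.4) - 1) * 1000 / 4.37  # invert E(f) back to Hz

ir = gammatone_ir(1000.0)   # one 4th-order channel at 1 kHz, 25 ms long
cfs = erb_center_freqs()    # 64 channels spanning 50 Hz to 8 kHz
```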
Further, the adjacent N frames of cochlear speech features in step S0032 refers to splicing the one-dimensional cochlear speech features of adjacent frames into an M×N two-dimensional feature matrix, recorded as: X = [GF_1, GF_2, …, GF_N] ∈ R^(M×N).
Further, the deep learning classifier in S0031 comprises two convolution layers and a long short-term memory recurrent neural network layer; the specific method for training the deep learning classifier based on the convolutional network and the long short-term memory recurrent neural network comprises the following steps:
s00311: establish the training set {(X_i, L_i)}, i = 1, 2, …, n, of the deep learning classifier of the convolutional network and long short-term memory recurrent neural network, where n is the number of training samples in the training set and i is a training sample index; X_i ∈ R^(M×N) is a 64×9 two-dimensional cochlear feature; the training label corresponding to X_i is L_i ∈ {0, 1}, where 0 indicates that X_i is an environmental background sound sample without infant crying and 1 indicates that X_i is an infant crying sample;
s00312: establish a deep learning classifier model of the convolutional network and long short-term memory recurrent neural network; in model inference order, the classifier model comprises an input layer, two convolution layers, an LSTM layer, and a softmax layer;
s00313: establishing an input layer, wherein the input layer is a two-dimensional cochlea feature map with a width of N and a height of M;
s00314: establish two convolution layers; the convolution kernel size of the first layer is 3×3 with 8 kernels, and after the first convolution operation on the input layer, a three-dimensional convolution feature map with 8 output channels, height M, and width N serves as the input to the second convolution layer; the convolution kernel size of the second layer is 3×3 with 16 kernels, and after the second convolution operation, a three-dimensional convolution feature map with 16 output channels, height M, and width N is output; the convolution layer activation function is ReLU, and the convolution operation formula is as follows: Z_j = Σ_i W_ij·X_i, y_j = ReLU(Z_j);
where W_ij is a weight parameter of the classifier model, X_i is the neuron input, Z_j is an intermediate result, and y_j is the neural network activation output, used as the input of the next layer;
s00315: before establishing the LSTM layer, a reshape operation must be performed on the output of the second convolution layer before it is connected to the LSTM layer: the original 16×M×N three-dimensional feature map is reshaped into a (16·M)×N two-dimensional feature map; for the LSTM, 16·M is the length of the input feature and N is the time-sequence length;
s00316: the LSTM is a long short-term memory network, a neural cell mathematical model with memory, consisting of three gate structures: a forget gate, an input gate, and an output gate;
The forget gate reads h_{t−1} passed from the previous state and the current input x_t, and outputs a number between 0 and 1 for each element of the cell state C_{t−1}, where 1 means complete retention and 0 means complete rejection; the forget gate mathematical model formula is as follows:
f_t = σ(W_f · [h_{t−1}, x_t] + b_f);
where W_f is the neural network weight parameter of the forget gate, b_f is a bias term, and σ denotes the sigmoid function;
The mathematical model formulas of the input gate are as follows:
i_t = σ(W_i · [h_{t−1}, x_t] + b_i);
C̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c);
C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t;
where W_i and W_c are the neural network weight parameters of the input gate, b_i and b_c are bias terms, and σ denotes the sigmoid function;
The output gate mathematical model formulas are as follows:
o_t = σ(W_o · [h_{t−1}, x_t] + b_o);
h_t = o_t ∗ tanh(C_t);
where W_o is the neural network weight parameter of the output gate, b_o is a bias term, and σ denotes the sigmoid function;
s00317: establish the classifier loss function layer using a two-class softmax classifier; after the (16·M)×N two-dimensional feature map passes through the LSTM layer, N feature vectors h_t are output in time order; each h_t feature layer is connected to the two-class softmax, which outputs N losses, with softmax(x_t)_j = e^(x_{t,j}) / Σ_k e^(x_{t,k});
where x_t = W_s · h_t + b_s is the output of the h_t feature layer after the softmax classifier.
Further, the method for voting on the N frame classification results using the majority-priority voting principle in S0032 is as follows: input the cochlear speech features extracted from each frame into the deep learning classifier to obtain N frame classification results C_i, i = 1, 2, …, N, and compute the average p = (1/N)·Σ_{i=1}^{N} C_i; if p ≥ 0.5, infant crying is present; otherwise it is not.
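The majority-priority vote described above reduces to averaging the N binary frame decisions and thresholding at 0.5; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def majority_vote(frame_labels):
    """Majority-priority vote over N per-frame decisions C_i in {0, 1}:
    p = mean(C_i); infant crying is reported when p >= 0.5."""
    p = float(np.mean(frame_labels))
    return int(p >= 0.5)

result = majority_vote([1, 1, 0, 1, 0])  # three of five frames flagged as crying
```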
The invention has the following beneficial effects:
In view of the fact that traditional voice recognition is easily affected by environmental changes, the cochlear voice features adopted by the invention are voice feature parameters that better match human auditory perception, and the convolutional network combined with the long short-term memory recurrent neural network classifier serves as the acoustic inference model for infant cry detection, so the method can adapt to speech environments with a low signal-to-noise ratio; moreover, the long short-term memory recurrent neural network can fully utilize contextual speech information, giving richer feature dimensions and a higher recognition rate than traditional methods.
Of course, it is not necessary for any one product to practice the invention to achieve all of the advantages set forth above at the same time.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a block diagram of a system of the present invention;
FIG. 2 is a schematic view of cochlear speech feature extraction of the present invention;
fig. 3 is a schematic diagram of the convolutional network and long short-term memory recurrent neural network classifier of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1-3, the invention discloses a deep learning method for detecting crying of infants, which comprises the following steps:
s001: collecting voice signals;
s002: framing the voice signal segment, and extracting cochlear voice characteristics from each frame;
s003: and (3) performing infant crying detection to output a detection result, wherein the infant crying detection comprises the following steps of:
s0031: establishing a deep learning classifier based on a convolutional network and a long-short-term memory recurrent neural network;
s0032: and inputting the extracted adjacent N frames of cochlea speech features into a deep learning classifier to obtain N frames of classification results, voting the N frames of classification results by applying a majority priority voting principle, and obtaining a final infant crying detection result.
The manner of voice signal acquisition in step S001 is as follows:
s0011: inputting a voice signal with a microphone device;
s0012: the corresponding voice signal is obtained through sampling and quantization; the sampling frequency is 16 kHz and the quantization precision is 16 bits.
In step S002, the voice signal segment is framed with a 20 ms window, i.e. a frame length of 320 samples, and a window sliding step (stride) of 10 ms, i.e. 160 samples; letting the voice length be Length, the number of voice segment frames N is given by: N = ⌊(Length − 320) / 160⌋ + 1.
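The framing arithmetic above (320-sample window, 160-sample stride at 16 kHz) can be checked with a short helper; `num_frames` is an illustrative name:

```python
def num_frames(length, frame_len=320, stride=160):
    """Number of frames from `length` samples with a 20 ms window and 10 ms
    stride at 16 kHz: N = floor((length - frame_len) / stride) + 1."""
    if length < frame_len:
        return 0  # not even one full window fits
    return (length - frame_len) // stride + 1

n = num_frames(16000)  # one second of 16 kHz audio
```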
the method for extracting the cochlear speech features in the step S002 comprises the following steps:
s0021: constructing a Gammatone filter bank based on a human ear cochlear auditory model, wherein the time domain expression form is as follows:
g(f, t) = k·t^(a−1)·e^(−2πbt)·cos(2πft + φ), t ≥ 0;
wherein k is the filter gain, a is the filter order, f is the center frequency, phi is the phase, b is the attenuation factor, the attenuation factor determines the bandwidth of the corresponding filter, and the relationship between the attenuation factor and the center frequency is:
b=24.7(4.37f/1000+1);
The Gammatone filter bank adopts 64-channel fourth-order Gammatone filters with center frequencies between 50 Hz and 8000 Hz;
s0022: filtering the voice signal using an FFT-based overlap-add method to obtain an output response signal R(n, t), where n is the number of filter channels and t is the length of the output response signal (a natural number); the length of the output response signal is kept equal to that of the input signal;
s0023: framing the output response signal R(n, t) to obtain the response energy within each frame and thereby a cochleagram-like map, according to the formula: GF_m(i) = log(|R(i, m)|^(1/2));
where i denotes the i-th Gammatone filter, i = 0, 1, 2, …, N−1, with N = 64 the number of filters in the bank; m denotes the m-th frame, m = 0, 1, 2, …, M−1, with M the number of frames after framing; each frame of the cochleagram-like map is called a Gammatone feature coefficient (GF), and one GF feature vector is composed of 64 frequency components.
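Under the assumption that R(i, m) denotes the sum of squared filter responses within frame m (the text says "response energy in the frame"), the GF cochleagram computation can be sketched as:

```python
import numpy as np

def cochleagram(R, frame_len=320, stride=160):
    """GF features from filterbank responses R of shape (channels, samples).
    GF_m(i) = log(E(i, m)^(1/2)), with E the in-frame response energy
    (the energy definition is an assumption)."""
    n_ch, length = R.shape
    n_frames = (length - frame_len) // stride + 1
    gf = np.empty((n_ch, n_frames))
    for m in range(n_frames):
        seg = R[:, m * stride : m * stride + frame_len]
        energy = np.sum(seg ** 2, axis=1) + 1e-12  # floor to avoid log(0)
        gf[:, m] = np.log(np.sqrt(energy))
    return gf

R = np.random.default_rng(0).standard_normal((64, 16000))  # stand-in responses
G = cochleagram(R)  # 64 channels x 99 frames for one second of audio
```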
The adjacent N frames of cochlear speech features in step S0032 means that the one-dimensional cochlear speech features of adjacent frames are spliced into an M×N two-dimensional feature matrix, recorded as: X = [GF_1, GF_2, …, GF_N] ∈ R^(M×N).
The deep learning classifier in S0031 comprises two convolution layers and a long short-term memory recurrent neural network layer (LSTM); the specific method for training the deep learning classifier based on the convolutional network and the long short-term memory recurrent neural network comprises the following steps:
s00311: establish the training set {(X_i, L_i)}, i = 1, 2, …, n, of the deep learning classifier of the convolutional network and long short-term memory recurrent neural network, where n is the number of training samples in the training set and i is a training sample index; X_i ∈ R^(M×N) is a 64×9 two-dimensional speech cochlear feature; the training label corresponding to X_i is L_i ∈ {0, 1}, where 0 indicates that X_i is an environmental background sound sample without infant crying and 1 indicates that X_i is an infant crying sample;
s00312: establish a deep learning classifier model of the convolutional network and long short-term memory recurrent neural network; in model inference order, the classifier model comprises an input layer, two convolution layers, an LSTM layer, and a softmax layer;
s00313: establishing an input layer, wherein the input layer is a two-dimensional cochlea feature map with a width of N and a height of M;
s00314: establish two convolution layers; the convolution kernel size of the first layer is 3×3 with 8 kernels, and after the first convolution operation on the input layer, a three-dimensional convolution feature map with 8 output channels, height M, and width N serves as the input to the second convolution layer; the convolution kernel size of the second layer is 3×3 with 16 kernels, and after the second convolution operation, a three-dimensional convolution feature map with 16 output channels, height M, and width N is output; the convolution layer activation function is ReLU, and the convolution operation formula is as follows: Z_j = Σ_i W_ij·X_i, y_j = ReLU(Z_j);
where W_ij is a weight parameter of the classifier model, X_i is the neuron input, Z_j is an intermediate result, and y_j is the neural network activation output, used as the input of the next layer;
s00315: before establishing the LSTM layer, a reshape operation must be performed on the output of the second convolution layer before it is connected to the LSTM layer: the original 16×M×N three-dimensional feature map is reshaped into a (16·M)×N two-dimensional feature map; for the LSTM, 16·M is the length of the input feature and N is the time-sequence length;
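A sketch of the classifier architecture in steps S00312 through S00315, written in PyTorch (an assumption; the patent names no framework). The same-padding convolutions, the reshape to N time steps of 16·M features, and the per-step two-class head follow the text; the LSTM hidden size is an illustrative choice.

```python
import torch
import torch.nn as nn

class CryNet(nn.Module):
    """Two 3x3 conv layers (8 then 16 channels, ReLU), reshape to a sequence
    of N steps of 16*M features, LSTM, and a per-step two-class head.
    M=64 frequency bins, N=9 frames; hidden=128 is an assumption."""
    def __init__(self, M=64, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),   # -> (8, M, N)
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),  # -> (16, M, N)
        )
        self.lstm = nn.LSTM(input_size=16 * M, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 2)  # two-class logits; softmax in the loss

    def forward(self, x):                            # x: (batch, 1, M, N)
        z = self.conv(x)                             # (batch, 16, M, N)
        b, c, M, N = z.shape
        z = z.reshape(b, c * M, N).permute(0, 2, 1)  # N steps of 16*M features
        h, _ = self.lstm(z)                          # (batch, N, hidden)
        return self.fc(h)                            # (batch, N, 2) per-step logits

logits = CryNet()(torch.randn(2, 1, 64, 9))  # batch of two 64x9 cochlear features
```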
s00316: the LSTM is a long short-term memory network, a neural cell mathematical model with memory, mainly comprising three gate structures: a forget gate, an input gate, and an output gate;
In the first step, the forget gate decides what information will be forgotten from the cell state: it reads h_{t−1} passed from the previous state and the current input x_t, and outputs a number between 0 and 1 for each element of the cell state C_{t−1}, where 1 means complete retention and 0 means complete rejection; the forget gate mathematical model formula is as follows:
f_t = σ(W_f · [h_{t−1}, x_t] + b_f);
where W_f is the neural network weight parameter of the forget gate, b_f is a bias term, and σ denotes the sigmoid function;
In the second step, the input gate decides how much new information to add to the cell state: first, an input selection layer i_t determines which information needs updating, and an input candidate layer generates the candidate content C̃_t; then the old cell state C_{t−1} is updated to C_t by multiplying the old state by f_t and adding i_t ∗ C̃_t; the mathematical model formulas of the input gate are as follows:
i_t = σ(W_i · [h_{t−1}, x_t] + b_i);
C̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c);
C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t;
where W_i and W_c are the neural network weight parameters of the input gate, b_i and b_c are bias terms, and σ denotes the sigmoid function;
In the third step, the output gate decides what value to output: first, a sigmoid layer determines which part of the cell state will be output; next, the cell state is processed through tanh (yielding a value between −1 and 1) and multiplied by the output of the sigmoid gate, so that only the determined part is output; the mathematical model formulas of the output gate are as follows:
o_t = σ(W_o · [h_{t−1}, x_t] + b_o);
h_t = o_t ∗ tanh(C_t);
where W_o is the neural network weight parameter of the output gate, b_o is a bias term, and σ denotes the sigmoid function;
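The three gate equations above can be exercised with a plain-NumPy single-step LSTM cell; the weight shapes and random initialization are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM cell step following the gate equations above. W and b hold
    the four parameter sets keyed 'f', 'i', 'c', 'o'; each W[k] acts on the
    concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
    f = sigmoid(W["f"] @ z + b["f"])      # forget gate f_t
    i = sigmoid(W["i"] @ z + b["i"])      # input gate i_t
    C_hat = np.tanh(W["c"] @ z + b["c"])  # candidate state C~_t
    C = f * C_prev + i * C_hat            # C_t = f_t * C_{t-1} + i_t * C~_t
    o = sigmoid(W["o"] @ z + b["o"])      # output gate o_t
    h = o * np.tanh(C)                    # h_t = o_t * tanh(C_t)
    return h, C

rng = np.random.default_rng(0)
H, D = 4, 3  # hidden size and input size (illustrative)
W = {k: rng.standard_normal((H, H + D)) * 0.1 for k in "fico"}
b = {k: np.zeros(H) for k in "fico"}
h, C = lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), W, b)
```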
s00317: establish the classifier loss function layer using a two-class softmax classifier; after the (16·M)×N two-dimensional feature map passes through the LSTM layer, N feature vectors h_t are output in time order; each h_t feature layer is connected to the two-class softmax, which outputs N losses, with softmax(x_t)_j = e^(x_{t,j}) / Σ_k e^(x_{t,k});
where x_t = W_s · h_t + b_s is the output of the h_t feature layer after the softmax classifier; during training, the network bias terms and weights are updated through a back-propagation algorithm.
The method for voting on the N frame classification results using the majority-priority voting principle in S0032 is as follows: input the cochlear speech features extracted from each frame into the deep learning classifier to obtain N frame classification results C_i, i = 1, 2, …, N, and compute the average p = (1/N)·Σ_{i=1}^{N} C_i; if p ≥ 0.5, infant crying is present; otherwise it is not.
The beneficial effects of adopting the above technical scheme are as follows: in view of the fact that traditional speech recognition is easily influenced by environmental changes, the cochlear voice features adopted by the invention are voice feature parameters that better match human auditory perception, and the convolutional network combined with the long short-term memory recurrent neural network classifier serves as the acoustic inference model for infant cry detection, so the method can adapt to speech environments with a low signal-to-noise ratio; moreover, the long short-term memory recurrent neural network can fully utilize contextual speech information, giving richer feature dimensions and a higher recognition rate than traditional methods.
In the description of the present specification, the descriptions of the terms "one embodiment," "example," "specific example," and the like, mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended only to assist in the explanation of the invention. The preferred embodiments are not exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention. The invention is limited only by the claims and the full scope and equivalents thereof.
Claims (6)
1. A deep learning method for infant crying detection, characterized by comprising the following steps:
s001: collecting voice signals;
s002: framing the voice signal segment, and extracting cochlear voice characteristics from each frame;
s003: and (3) performing infant crying detection to output a detection result, wherein the infant crying detection comprises the following steps of:
s0031: establishing a deep learning classifier based on a convolutional network and a long-short-term memory recurrent neural network;
s0032: inputting the extracted adjacent N frames of cochlea speech features into a deep learning classifier to obtain N frames of classification results, voting the N frames of classification results by applying a majority priority voting principle, and obtaining a final infant crying detection result;
the deep learning classifier in the step S0031 comprises two convolution layers and a long-short-time memory recurrent neural network layer; the specific method for training the deep learning classifier based on the convolutional network and the long-short-term memory recurrent neural network comprises the following steps:
s00311: training set for establishing deep learning classifier of convolution network and long-short-term memory recurrent neural networkn is the number of training samples in the training set; i represents a training sample sequence number; x is X i ∈R M×N A 64x9 two-dimensional cochlear feature; x is X i The corresponding training label is L i ,L i ∈Z={0,1} * 0 represents X i Is an environment background sound sample without baby crying, 1 represents X i Is a sample of infant crying;
S00312: establishing the deep learning classifier model of the convolutional network and LSTM recurrent neural network, the classifier model comprising, in model-inference order, an input layer, two convolutional layers, an LSTM layer, and a softmax layer;
S00313: establishing the input layer, which is a two-dimensional cochlear feature map of width N and height M;
S00314: establishing the two convolutional layers: the first convolutional layer has 3×3 convolution kernels, 8 kernels in total; after the first convolution operation is applied to the input layer, the resulting three-dimensional convolution feature map with 8 output channels, height M, and width N serves as the input of the second convolutional layer; the second convolutional layer has 3×3 convolution kernels, 16 in total, and after the second convolution operation outputs a three-dimensional convolution feature map with 16 output channels, height M, and width N; the activation function of the convolutional layers is ReLU, and the convolution operation formula is as follows:
Z_j = Σ_i W_ij · X_i,  y_j = ReLU(Z_j) = max(0, Z_j);
wherein W_ij is a weight parameter of the classifier model, X_i is a neuron input, Z_j is an intermediate result, and y_j is the activated neural network output, used as the input of the next layer;
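The convolution-plus-ReLU operation above can be sketched as follows (a minimal NumPy illustration of Z_j = Σ_i W_ij·X_i followed by y_j = max(0, Z_j); the toy shapes and values are assumptions for illustration, not the claimed 3×3-kernel layers):

```python
import numpy as np

def conv_relu(x, w):
    """Z_j = sum_i W_ij * X_i (intermediate result), then the ReLU
    activation y_j = max(0, Z_j), used as input to the next layer."""
    z = w @ x
    return np.maximum(z, 0.0)

# Toy shapes and values for illustration only (not the claimed 3x3 kernels).
x = np.array([1.0, 2.0, 0.5])            # neuron inputs X_i
w = np.array([[0.2, 0.1, -0.3],          # weight parameters W_ij
              [-0.5, 0.4, 0.1]])
y = conv_relu(x, w)                      # y = [0.25, 0.35]
```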
S00315: before establishing the LSTM layer, a reshape operation is applied to the output of the second convolutional layer: the original 16×M×N three-dimensional feature map is reshaped into a (16·M)×N two-dimensional feature map; for the LSTM, 16×M is the length of the input feature and N is the time-sequence length;
S00316: LSTM (long short-term memory) is a neural cell mathematical model with memory, composed of three gate structures: a forget gate, an input gate, and an output gate;
the forget gate reads h_{t-1}, passed from the previous state, and the current input x_t, and outputs a value between 0 and 1 for each element of the cell state C_{t-1}, where 1 means complete retention and 0 means complete discarding; the forget gate mathematical model formula is as follows:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f);
wherein W_f is the neural network weight parameter of the forget gate, b_f is a bias term, and σ denotes the sigmoid function;
the mathematical model formula of the input gate is as follows:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i);
C̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c);
C_t = f_t * C_{t-1} + i_t * C̃_t;
wherein W_i and W_c are the neural network weight parameters of the input gate, b_i and b_c are bias terms, and σ denotes the sigmoid function;
the output gate mathematical model formula is as follows:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o);
h_t = o_t * tanh(C_t);
wherein W_o is the neural network weight parameter of the output gate, b_o is a bias term, and σ denotes the sigmoid function;
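The three-gate LSTM cell described in S00316 can be sketched as follows (a minimal NumPy illustration of the forget-gate, input-gate, and output-gate equations; the hidden/input sizes and random parameters are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step with the three gates named in the claim:
    forget gate f_t, input gate i_t (plus candidate state), output gate o_t."""
    concat = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ concat + b["f"])      # forget gate
    i_t = sigmoid(W["i"] @ concat + b["i"])      # input gate
    c_tilde = np.tanh(W["c"] @ concat + b["c"])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde           # updated cell state C_t
    o_t = sigmoid(W["o"] @ concat + b["o"])      # output gate
    h_t = o_t * np.tanh(c_t)                     # hidden output h_t
    return h_t, c_t

# Hypothetical sizes: hidden size 4, input size 3 (illustration only).
rng = np.random.default_rng(0)
H, D = 4, 3
W = {k: 0.1 * rng.standard_normal((H, H + D)) for k in "fico"}
b = {k: np.zeros(H) for k in "fico"}
h, c = lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), W, b)
```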
S00317: establishing the classifier loss function layer using a two-class softmax classifier; after the (16·M)×N two-dimensional feature map passes through the LSTM layer, N feature vectors h_t are output in time order; each h_t feature layer is connected to the two-class softmax, producing N losses, with the specific formula as follows:
loss_t = -log(e^{x_t[L_i]} / Σ_j e^{x_t[j]});
wherein x_t = W_s · h_t + b_s is the output of the h_t feature layer, which is then normalized by the softmax classifier;
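The two-class softmax loss of S00317 can be sketched as follows (a minimal NumPy illustration of standard softmax cross-entropy over the logits x_t; the exact loss form is an assumption, since the claim states it only in prose):

```python
import numpy as np

def softmax_loss(x_t, label):
    """Cross-entropy for one time step: x_t = W_s @ h_t + b_s are the two-class
    logits; the loss is -log of the softmax probability of the true label."""
    e = np.exp(x_t - x_t.max())  # subtract max for numerical stability
    p = e / e.sum()
    return -np.log(p[label])

# Hypothetical logits for one time step.
loss = softmax_loss(np.array([2.0, 0.5]), label=1)
```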
the method for extracting the cochlear speech features in step S002 comprises the following steps:
S0021: constructing a Gammatone filter bank based on the human cochlear auditory model, whose time-domain expression is:
g(f, t) = k · t^{a-1} · e^{-2πbt} · cos(2πft + Φ), t ≥ 0;
wherein k is the filter gain, a is the filter order, f is the center frequency, Φ is the phase, b is the attenuation factor, and t is time; the attenuation factor determines the bandwidth of the corresponding filter, and its relationship to the center frequency is:
b=24.7(4.37f/1000+1);
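The Gammatone impulse response and its ERB-based bandwidth can be sketched as follows (a minimal Python illustration; the sampling rate, gain, phase, and duration are assumptions for illustration):

```python
import numpy as np

def gammatone_ir(fc, fs=16000, order=4, k=1.0, phase=0.0, duration=0.025):
    """g(f, t) = k * t**(a-1) * exp(-2*pi*b*t) * cos(2*pi*f*t + phi), t >= 0,
    with attenuation factor b = 24.7 * (4.37 * f / 1000 + 1)."""
    b = 24.7 * (4.37 * fc / 1000.0 + 1.0)   # ERB-based bandwidth
    t = np.arange(int(duration * fs)) / fs  # discrete time axis, t >= 0
    return k * t**(order - 1) * np.exp(-2.0*np.pi*b*t) * np.cos(2.0*np.pi*fc*t + phase)

ir = gammatone_ir(fc=1000.0)  # one filter of a 64-channel bank
```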
S0022: filtering the voice signal using an FFT-based overlap-add method to obtain the output response signal R(n, t), where n is the filter channel index and t is the sample index of the output response signal, a natural number; the length of the output response signal equals that of the input signal;
s0023: the output response signal R (n, t) is subjected to framing to obtain response energy in the frame so as to obtain a cochlea-like diagram, wherein the formula is as follows: gm (i) =log ([ |r| (i, m))] 1/2 );
Wherein i represents an i-th gammatine filter, i=0, 1, 2, …, N-1, n=64 is the number of filter banks; m represents an mth frame, m=0, 1, 2, …, M-1, M being the number of frames after framing; each frame of the cochlea-like map is called a gammatine feature coefficient GF, and one GF feature vector is composed of 8 frequency components;
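The framing-and-log-energy step of S0023 can be sketched as follows (a minimal NumPy illustration; the precise in-frame energy computation, the epsilon guard, and the frame sizes are assumptions, since the claim's formula is terse):

```python
import numpy as np

def cochleagram(R, frame_len, frame_step):
    """Frame each channel of the filter response R (channels x samples) and
    take log(energy ** 0.5) per frame, yielding a cochleagram G."""
    n_ch, T = R.shape
    M = 1 + (T - frame_len) // frame_step           # number of frames
    G = np.empty((n_ch, M))
    for m in range(M):
        frame = R[:, m * frame_step : m * frame_step + frame_len]
        energy = np.sum(frame ** 2, axis=1)         # in-frame response energy
        G[:, m] = np.log(np.sqrt(energy) + 1e-12)   # epsilon avoids log(0)
    return G

# 64 channels, 0.2 s at 16 kHz; 30 ms frames with 15 ms steps (hypothetical).
R = np.random.default_rng(1).standard_normal((64, 3200))
G = cochleagram(R, frame_len=480, frame_step=240)
```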
the adjacent N frames of cochlear speech features in step S0032 are formed by splicing the one-dimensional cochlear speech features of consecutive frames into an M×N two-dimensional feature matrix, denoted X_i ∈ R^(M×N).
2. The deep learning method for infant cry detection according to claim 1, wherein the voice signal collection in step S001 is as follows:
S0011: inputting a voice signal through a microphone device;
S0012: obtaining the corresponding voice signal by sampling and quantization.
3. The deep learning method for infant cry detection according to claim 2, wherein the sampling frequency of the sampling quantization is 16 kHz and the quantization precision is 16 bits.
4. The deep learning method for infant cry detection according to claim 1, wherein the frame length for framing the voice signal in step S002 is 20 ms to 30 ms, and the frame step is 10 ms to 15 ms.
5. The deep learning method for infant cry detection according to claim 1, wherein the Gammatone filter bank consists of 64-channel 4th-order Gammatone filters with center frequencies between 50 Hz and 8000 Hz.
6. The deep learning method for infant cry detection according to claim 1, wherein the method of voting on the N frames of classification results using majority-priority voting in S0032 is as follows: inputting the cochlear speech features extracted from each frame into the deep learning classifier to obtain N frame classification results C_i, i = 1, 2, …, N, and averaging them: p = (1/N) Σ_{i=1}^{N} C_i; if p ≥ 0.5, infant crying is indicated; otherwise, no infant crying is indicated.
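The majority-priority voting of claim 6 can be sketched as follows (a minimal Python illustration of averaging the per-frame results and thresholding at 0.5):

```python
def majority_vote(frame_results):
    """Majority-priority voting: average the N per-frame results C_i (0 or 1);
    p >= 0.5 is taken to indicate infant crying."""
    p = sum(frame_results) / len(frame_results)
    return p >= 0.5

detected = majority_vote([1, 1, 0, 1, 0])  # p = 0.6 -> crying indicated
```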
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010125193.9A CN111326179B (en) | 2020-02-27 | 2020-02-27 | Deep learning method for detecting crying of baby |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111326179A CN111326179A (en) | 2020-06-23 |
CN111326179B true CN111326179B (en) | 2023-05-26 |
Family
ID=71172973
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010125193.9A Active CN111326179B (en) | 2020-02-27 | 2020-02-27 | Deep learning method for detecting crying of baby |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111326179B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112382311B (en) * | 2020-11-16 | 2022-08-19 | 谭昊玥 | Infant crying intention identification method and device based on hybrid neural network |
US20240115859A1 (en) * | 2021-02-18 | 2024-04-11 | The Johns Hopkins University | Method and system for processing input signals using machine learning for neural activation |
CN117037849A (en) * | 2021-02-26 | 2023-11-10 | 武汉星巡智能科技有限公司 | Infant crying classification method, device and equipment based on feature extraction and classification |
CN113392736A (en) * | 2021-05-31 | 2021-09-14 | 五八到家有限公司 | Monitoring method, system, equipment and medium for improving safety of home service |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102163427A (en) * | 2010-12-20 | 2011-08-24 | 北京邮电大学 | Method for detecting audio exceptional event based on environmental model |
CN107818779A (en) * | 2017-09-15 | 2018-03-20 | 北京理工大学 | A kind of infant's crying sound detection method, apparatus, equipment and medium |
CN109243493A (en) * | 2018-10-30 | 2019-01-18 | 南京工程学院 | Based on the vagitus emotion identification method for improving long memory network in short-term |
CN109509484A (en) * | 2018-12-25 | 2019-03-22 | 科大讯飞股份有限公司 | A kind of prediction technique and device of baby crying reason |
CN110428843A (en) * | 2019-03-11 | 2019-11-08 | 杭州雄迈信息技术有限公司 | A kind of voice gender identification deep learning method |
Non-Patent Citations (1)
Title |
---|
Type-2 Fuzzy Sets Applied to Pattern Matching for the Classification of Cries of Infants under Neurological Risk; Karen Santiago-Sánchez et al.; ICIC 2009; 2009-12-31 * |
Also Published As
Publication number | Publication date |
---|---|
CN111326179A (en) | 2020-06-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111326179B (en) | Deep learning method for detecting crying of baby | |
CN110428843B (en) | Voice gender recognition deep learning method | |
CN109326302B (en) | Voice enhancement method based on voiceprint comparison and generation of confrontation network | |
US11908455B2 (en) | Speech separation model training method and apparatus, storage medium and computer device | |
WO2021143327A1 (en) | Voice recognition method, device, and computer-readable storage medium | |
CN110245608B (en) | Underwater target identification method based on half tensor product neural network | |
CN108766419B (en) | Abnormal voice distinguishing method based on deep learning | |
CN109841226A (en) | A kind of single channel real-time noise-reducing method based on convolution recurrent neural network | |
CN112581979B (en) | Speech emotion recognition method based on spectrogram | |
CN109427328B (en) | Multichannel voice recognition method based on filter network acoustic model | |
CN110364143A (en) | Voice awakening method, device and its intelligent electronic device | |
CN111461176A (en) | Multi-mode fusion method, device, medium and equipment based on normalized mutual information | |
CN105206270A (en) | Isolated digit speech recognition classification system and method combining principal component analysis (PCA) with restricted Boltzmann machine (RBM) | |
CN111899757A (en) | Single-channel voice separation method and system for target speaker extraction | |
CN113924786B (en) | Neural network model for cochlear mechanics and processing | |
CN113643723A (en) | Voice emotion recognition method based on attention CNN Bi-GRU fusion visual information | |
CN111951824A (en) | Detection method for distinguishing depression based on sound | |
CN112587153A (en) | End-to-end non-contact atrial fibrillation automatic detection system and method based on vPPG signal | |
CN115602152B (en) | Voice enhancement method based on multi-stage attention network | |
CN113763965A (en) | Speaker identification method with multiple attention characteristics fused | |
CN113763966B (en) | End-to-end text irrelevant voiceprint recognition method and system | |
Watrous | Phoneme discrimination using connectionist networks | |
CN113974607A (en) | Sleep snore detecting system based on impulse neural network | |
CN111723717A (en) | Silent voice recognition method and system | |
CN112329819A (en) | Underwater target identification method based on multi-network fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CP03 | Change of name, title or address | ||
Address after: 311422 4th floor, building 9, Yinhu innovation center, 9 Fuxian Road, Yinhu street, Fuyang District, Hangzhou City, Zhejiang Province Patentee after: Zhejiang Xinmai Microelectronics Co.,Ltd. Address before: 311400 4th floor, building 9, Yinhu innovation center, No.9 Fuxian Road, Yinhu street, Fuyang District, Hangzhou City, Zhejiang Province Patentee before: Hangzhou xiongmai integrated circuit technology Co.,Ltd. |