CN111326179B - Deep learning method for detecting crying of baby

Deep learning method for detecting crying of baby

Info

Publication number
CN111326179B
CN111326179B (Application CN202010125193.9A)
Authority
CN
China
Prior art keywords
layer
deep learning
convolution
voice
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010125193.9A
Other languages
Chinese (zh)
Other versions
CN111326179A (en)
Inventor
罗世操
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Xinmai Microelectronics Co ltd
Original Assignee
Hangzhou Xiongmai Integrated Circuit Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Xiongmai Integrated Circuit Technology Co Ltd filed Critical Hangzhou Xiongmai Integrated Circuit Technology Co Ltd
Priority to CN202010125193.9A
Publication of CN111326179A
Application granted
Publication of CN111326179B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques specially adapted for estimating an emotional state
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L25/45 Speech or voice analysis techniques characterised by the type of analysis window
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/08 Speech classification or search

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a deep learning method for detecting crying of infants, and relates to the technical field of voice signal processing. The method comprises the following steps: a. collecting voice signals; b. framing the voice signal segments and extracting cochlear voice features from each frame; c. inputting the adjacent N frames of voice features into a pre-trained infant crying detection deep learning model for inference to judge whether crying exists; d. voting on the N frame classification results using the majority priority voting principle to judge whether baby crying is present in the N frames. The cochlear voice features adopted by the invention are voice feature parameters that better match human auditory perception, and a convolutional network together with a long short-term memory recurrent neural network is adopted as the acoustic inference model for infant crying detection, so that the invention can adapt to voice environments with a low signal-to-noise ratio and achieves higher accuracy than traditional methods.

Description

Deep learning method for detecting crying of baby
Technical Field
The invention belongs to the technical field of voice signal processing, and particularly relates to a deep learning method for detecting crying of infants.
Background
In modern society, parents or grandparents are often too busy with work or housework to keep a constant watch over a newborn infant, and infants can express emotion and needs only by crying, so home care based on infant crying detection has great market demand.
A current voice signal recognition system is usually composed of three parts: voice signal preprocessing, feature extraction and classification. Feature extraction is the most important part, and its quality directly influences the recognition result. The voice features proposed by earlier researchers are mostly based on the prosodic features and voice quality features of speech and are all manually designed, so the robustness of such systems is low and easily affected by the environment.
A deep learning method for detecting the crying of a baby is therefore provided to solve the above problems.
Disclosure of Invention
The invention aims to provide a deep learning method for detecting crying of a baby, which detects baby crying from a voice signal.
In order to solve the technical problems, the invention is realized by the following technical scheme: the invention relates to a deep learning method for detecting crying of infants, which comprises the following steps:
s001: collecting voice signals;
s002: framing the voice signal segment, and extracting cochlear voice characteristics from each frame;
s003: performing infant crying detection and outputting a detection result, wherein the infant crying detection comprises the following steps:
s0031: establishing a deep learning classifier based on a convolutional network and a long-short-term memory recurrent neural network;
s0032: and inputting the extracted adjacent N frames of cochlea speech features into a deep learning classifier to obtain N frames of classification results, voting the N frames of classification results by applying a majority priority voting principle, and obtaining a final infant crying detection result.
Further, the manner of voice signal acquisition in the step S001 is as follows:
s0011: inputting a voice signal with a microphone device;
s0012: the corresponding speech signal is obtained by sample quantization.
Further, the sampling frequency of the sampling quantization is 16KHz, and the quantization precision is 16 bits.
Further, in the step S002, the frame length used for framing the voice signal segment ranges from 20 ms to 30 ms, and the frame step length ranges from 10 ms to 15 ms.
Further, the method for extracting the cochlear speech features in the step S002 comprises the following steps:
s0021: constructing a Gammatone filter bank based on a human ear cochlear auditory model, wherein the time domain expression form is as follows:
g(f,t) = k·t^(a-1)·e^(-2πbt)·cos(2πft + Φ), t ≥ 0;
wherein k is the filter gain, a is the filter order, f is the center frequency, Φ is the phase, b is the attenuation factor, the attenuation factor determines the bandwidth of the corresponding filter, and the relationship between the attenuation factor and the center frequency is:
b=24.7(4.37f/1000+1);
s0022: filtering the voice signal by utilizing an FFT-based overlap-add method to obtain an output response signal R(n, t), wherein n is the number of filter channels, t is the length of the output response signal and a natural number, and the length of the output response signal is kept equal to that of the input signal;
s0023: framing the output response signal R(n, t) and taking the response energy within each frame to obtain a cochleagram-like map, wherein the formula is as follows: G_m(i) = log( [|R(i, m)|]^(1/2) );
wherein i represents the i-th Gammatone filter, i = 0, 1, 2, …, N-1, where N = 64 is the number of filters in the bank; m represents the m-th frame, m = 0, 1, 2, …, M-1, M being the number of frames after framing; each frame of the cochleagram-like map is called a Gammatone feature coefficient GF, and one GF feature vector is composed of 8 frequency components.
Further, the Gammatone filter bank adopts 4th-order Gammatone filters with 64 channels, and the center frequencies lie between 50 Hz and 8000 Hz.
Further, the adjacent N frames of cochlear speech features in step S0032 refer to splicing the one-dimensional cochlear speech features of the preceding and following frames into an M×N two-dimensional feature matrix, recorded as:
F = [GF_1, GF_2, …, GF_N];
wherein F ∈ R^(M×N) and N is an odd number.
Further, the deep learning classifier in S0031 includes two convolution layers and a long-short-term memory recurrent neural network layer; the specific method for training the deep learning classifier based on the convolutional network and the long-short-term memory recurrent neural network comprises the following steps:
s00311: establishing the training set of the deep learning classifier of the convolutional network and the long short-term memory recurrent neural network:
D = {(X_i, L_i)}, i = 1, 2, …, n;
wherein n is the number of training samples in the training set; i represents the training sample index; X_i ∈ R^(M×N) is a 64×9 two-dimensional cochlear feature; the training label corresponding to X_i is L_i, L_i ∈ {0, 1}, where 0 indicates that X_i is an environmental background sound sample without baby crying and 1 indicates that X_i is a baby crying sample;
s00312: establishing a deep learning classifier model of a convolutional network and a long-short-term memory recurrent neural network, wherein the classifier model sequentially comprises an input layer, two convolutional layers, an LSTM layer and a softmax layer according to a model reasoning sequence;
s00313: establishing an input layer, wherein the input layer is a two-dimensional cochlea feature map with a width of N and a height of M;
s00314: establishing two layers of convolution layers, wherein the convolution kernel size of the first layer of convolution layer is 3x3, the number of convolution kernels is 8, and after the first layer of convolution operation is performed on the input layer, a three-dimensional convolution characteristic diagram with 8 output channels, M height and N width is used as the input layer of the second layer of convolution; the convolution kernel size of the second layer of convolution layers is 3x3, the number of convolution kernels is 16, and after the second layer of convolution operation, the three-dimensional convolution characteristic diagram with 16 output channels, M height and N width is output; the convolutional layer activation function is ReLU, and the convolutional operation formula is as follows:
z_j = Σ_i W_ij * x_i, where * denotes the convolution operation;
y_j = ReLU(z_j);
W_ij are the weight parameters of the classifier model, x_i is the neuron input, z_j is an intermediate result, and y_j is the neural network activation output, used as the input of the next layer;
s00315: establishing the LSTM layer; before the output of the second convolution layer is connected to the LSTM layer, a reshape operation is required, reshaping the original 16×M×N three-dimensional feature map into a (16·M)×N two-dimensional feature map; for the LSTM, 16·M represents the length of the input feature and N represents the time sequence length;
s00316: the LSTM is a long short-term memory network, a neural cell mathematical model with memory, consisting of three gate structures: a forget gate, an input gate and an output gate;
The forget gate reads the hidden state h_(t-1) passed from the previous state and the current input x_t, and outputs a value between 0 and 1 for each element of the cell state C_(t-1), where 1 means keep completely and 0 means discard completely; the forget gate mathematical model formula is as follows:
f_t = σ(W_f · [h_(t-1), x_t] + b_f);
wherein W_f is the neural network weight parameter of the forget gate, b_f is a bias term, and σ represents the sigmoid function;
the mathematical model formula of the input gate is as follows:
i_t = σ(W_i · [h_(t-1), x_t] + b_i);
C̃_t = tanh(W_c · [h_(t-1), x_t] + b_c);
C_t = f_t * C_(t-1) + i_t * C̃_t;
wherein W_i and W_c are the neural network weight parameters of the input gate, b_i and b_c are bias terms, and σ represents the sigmoid function;
the output gate mathematical model formula is as follows:
o_t = σ(W_o · [h_(t-1), x_t] + b_o);
h_t = o_t * tanh(C_t);
wherein W_o is the neural network weight parameter of the output gate, b_o is a bias term, and σ represents the sigmoid function;
s00317: establishing a classifier loss function layer, and adopting a two-class softmax classifier, wherein the 16-X MxN classifier is adoptedAfter the two-dimensional feature map passes through the LSTM layer, N h are output according to the time sequence t The feature layer is connected with the two kinds of softmax, outputs N loss, and the specific formula is as follows:
Figure BDA0002394195400000053
wherein x is t =W s ·h t +b s Is h t The feature layer is output after passing through the softmax classifier.
Further, the method for voting on the N frame classification results by applying the majority priority voting principle in S0032 is as follows: inputting the cochlear speech features extracted from each frame into the deep learning classifier to obtain N frame classification results C_i, i = 1, 2, …, N, and taking the average
p = (1/N) · Σ_(i=1)^(N) C_i;
if p ≥ 0.5, baby crying is indicated; otherwise, no baby crying is indicated.
The invention has the following beneficial effects:
Aiming at the fact that traditional voice recognition is easily affected by environmental changes, the cochlear voice features adopted by the invention are voice feature parameters that better match human auditory perception, and the convolutional network and long short-term memory recurrent neural network classifier are adopted as the acoustic inference model for infant crying detection, so that the invention can adapt to voice environments with a low signal-to-noise ratio; moreover, the long short-term memory recurrent neural network can make full use of contextual voice information, so the feature dimensions are richer and the invention achieves a higher recognition rate than traditional methods.
Of course, it is not necessary for any one product to practice the invention to achieve all of the advantages set forth above at the same time.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a block diagram of a system of the present invention;
FIG. 2 is a schematic view of cochlear speech feature extraction of the present invention;
fig. 3 is a schematic diagram of a convolutional network and long and short memory recurrent neural network classifier of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1-3, the invention discloses a deep learning method for detecting crying of infants, which comprises the following steps:
s001: collecting voice signals;
s002: framing the voice signal segment, and extracting cochlear voice characteristics from each frame;
s003: performing infant crying detection and outputting a detection result, wherein the infant crying detection comprises the following steps:
s0031: establishing a deep learning classifier based on a convolutional network and a long-short-term memory recurrent neural network;
s0032: and inputting the extracted adjacent N frames of cochlea speech features into a deep learning classifier to obtain N frames of classification results, voting the N frames of classification results by applying a majority priority voting principle, and obtaining a final infant crying detection result.
The manner of voice signal acquisition in step S001 is as follows:
s0011: inputting a voice signal with a microphone device;
s0012: the corresponding voice signal is obtained through sampling and quantization, the sampling frequency of the sampling and quantization is 16KHz, and the quantization precision is 16 bits.
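For illustration only (not part of the claimed method), the sampling step can be reproduced with Python's standard wave module; the file name is hypothetical and the normalization to floating point is an added convenience:

```python
import wave

import numpy as np


def load_pcm16(path):
    """Read a mono 16-bit PCM WAV file and return (samples, sample_rate)."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2, "expected 16-bit quantization"
        assert wf.getnchannels() == 1, "expected a mono microphone signal"
        rate = wf.getframerate()              # e.g. 16000 Hz as in the patent
        raw = wf.readframes(wf.getnframes())
    samples = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
    return samples, rate


# Usage (hypothetical file name):
# signal, fs = load_pcm16("baby_room.wav")
```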
In step S002, the voice signal segment is framed with a 20 ms window, i.e. a frame length of 320 points, and a window sliding step (stride) of 10 ms, i.e. 160 points; with the voice length denoted Length, the number of voice segment frames N is given by the following formula:
N = floor((Length - 320) / 160) + 1;
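A minimal framing sketch consistent with the 320-point window and 160-point stride above; the frame-count expression mirrors the formula given (illustrative code, not the patent's implementation):

```python
import numpy as np


def frame_signal(signal, frame_len=320, stride=160):
    """Split a 1-D signal into overlapping frames of `frame_len` samples."""
    length = len(signal)
    n_frames = (length - frame_len) // stride + 1   # N = floor((Length-320)/160) + 1
    frames = np.stack([signal[i * stride: i * stride + frame_len]
                       for i in range(n_frames)])
    return frames                                    # shape: (n_frames, frame_len)


# Example: 1 s of audio at 16 kHz -> 99 frames of 20 ms with a 10 ms hop
frames = frame_signal(np.zeros(16000))
print(frames.shape)   # (99, 320)
```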
the method for extracting the cochlear speech features in the step S002 comprises the following steps:
s0021: constructing a Gammatone filter bank based on a human ear cochlear auditory model, wherein the time domain expression form is as follows:
g(f,t) = k·t^(a-1)·e^(-2πbt)·cos(2πft + Φ), t ≥ 0;
wherein k is the filter gain, a is the filter order, f is the center frequency, phi is the phase, b is the attenuation factor, the attenuation factor determines the bandwidth of the corresponding filter, and the relationship between the attenuation factor and the center frequency is:
b=24.7(4.37f/1000+1);
the Gammatone filter bank adopts 4th-order Gammatone filters with 64 channels, and the center frequencies lie between 50 Hz and 8000 Hz;
s0022: filtering the voice signal by utilizing an FFT-based overlap-add method to obtain an output response signal R(n, t), wherein n is the number of filter channels, t is the length of the output response signal and a natural number, and the length of the output response signal is kept equal to that of the input signal;
s0023: framing the output response signal R(n, t) and taking the response energy within each frame to obtain a cochleagram-like map, wherein the formula is as follows: G_m(i) = log( [|R(i, m)|]^(1/2) );
wherein i represents the i-th Gammatone filter, i = 0, 1, 2, …, N-1, where N = 64 is the number of filters in the bank; m represents the m-th frame, m = 0, 1, 2, …, M-1, M being the number of frames after framing; each frame of the cochleagram-like map is called a Gammatone feature coefficient GF, and one GF feature vector is composed of 8 frequency components.
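The feature extraction of steps S0021-S0023 can be sketched as follows. This is an illustrative reconstruction, not the patented implementation: the ERB-scale spacing of the 64 center frequencies, the impulse-response length, unit gain (k = 1) with zero phase, and the interpretation of |R|(i, m) as the in-frame energy are all assumptions beyond what the text specifies; scipy.signal.fftconvolve stands in for the FFT-based overlap-add filtering.

```python
import numpy as np
from scipy.signal import fftconvolve


def gammatone_ir(fc, fs=16000, order=4, duration=0.064):
    """Impulse response g(f,t) = k*t^(a-1)*exp(-2*pi*b*t)*cos(2*pi*f*t),
    with b = 24.7*(4.37*f/1000 + 1) as in the patent (phase 0, gain k = 1)."""
    t = np.arange(int(duration * fs)) / fs
    b = 24.7 * (4.37 * fc / 1000.0 + 1.0)
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.max(np.abs(g))        # crude normalization (assumption)


def center_frequencies(n=64, fmin=50.0, fmax=8000.0):
    """64 center frequencies between 50 Hz and 8000 Hz, spaced on the ERB scale
    (the spacing rule is an assumption; the patent only gives the range)."""
    erb = lambda f: 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)
    erb_inv = lambda e: (10 ** (e / 21.4) - 1.0) * 1000.0 / 4.37
    return erb_inv(np.linspace(erb(fmin), erb(fmax), n))


def cochleagram(signal, fs=16000, frame_len=320, stride=160):
    """Filter with the 64-channel bank (FFT-based convolution, output length kept
    equal to the input), then take log(sqrt(energy)) per channel and frame: Gm(i)."""
    fcs = center_frequencies()
    n_frames = (len(signal) - frame_len) // stride + 1
    G = np.empty((len(fcs), n_frames))
    for i, fc in enumerate(fcs):
        r = fftconvolve(signal, gammatone_ir(fc, fs), mode="same")   # R(i, t)
        for m in range(n_frames):
            e = np.sum(r[m * stride: m * stride + frame_len] ** 2)   # in-frame energy
            G[i, m] = np.log(np.sqrt(e) + 1e-12)                     # Gm(i)
    return G                                                         # shape: (64, M)
```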
The adjacent N frames of cochlear speech features in step S0032 refer to splicing the one-dimensional cochlear speech features of the preceding and following frames into an M×N two-dimensional feature matrix, recorded as:
F = [GF_1, GF_2, …, GF_N];
wherein F ∈ R^(M×N) and N is an odd number.
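Assembling the M×N classifier inputs (M = 64, N = 9) from a cochleagram can be sketched as below; taking each frame together with its 4 preceding and 4 following neighbours and padding the edges by repetition are assumptions, since the text only states that adjacent frames are spliced into an M×N matrix:

```python
import numpy as np


def stack_context(G, n_context=9):
    """Stack each frame of a (64, M) cochleagram with its neighbours into a
    (64, n_context) matrix F, n_context odd; edges are padded by repetition."""
    assert n_context % 2 == 1
    half = n_context // 2
    padded = np.pad(G, ((0, 0), (half, half)), mode="edge")   # padding is an assumption
    m_frames = G.shape[1]
    return np.stack([padded[:, m:m + n_context] for m in range(m_frames)])


# Example: a 64 x 100 cochleagram yields 100 matrices of shape 64 x 9
F = stack_context(np.random.randn(64, 100))
print(F.shape)   # (100, 64, 9)
```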
Wherein, the deep learning classifier in S0031 comprises two convolution layers and a long and short memory recurrent neural network Layer (LSTM); the specific method for training the deep learning classifier based on the convolutional network and the long-short-term memory recurrent neural network comprises the following steps:
s00311: establishing the training set of the deep learning classifier of the convolutional network and the long short-term memory recurrent neural network:
D = {(X_i, L_i)}, i = 1, 2, …, n;
wherein n is the number of training samples in the training set; i represents the training sample index; X_i ∈ R^(M×N) is a 64×9 two-dimensional speech cochlear feature; the training label corresponding to X_i is L_i, L_i ∈ {0, 1}, where 0 indicates that X_i is an environmental background sound sample without baby crying and 1 indicates that X_i is a baby crying sample;
s00312: establishing a deep learning classifier model of a convolutional network and a long-short-term memory recurrent neural network, wherein the classifier model sequentially comprises an input layer, two convolutional layers, an LSTM layer and a softmax layer according to a model reasoning sequence;
s00313: establishing an input layer, wherein the input layer is a two-dimensional cochlea feature map with a width of N and a height of M;
s00314: establishing two layers of convolution layers, wherein the convolution kernel size of the first layer of convolution layer is 3x3, the number of convolution kernels is 8, and after the first layer of convolution operation is performed on the input layer, a three-dimensional convolution characteristic diagram with 8 output channels, M height and N width is used as the input layer of the second layer of convolution; the convolution kernel size of the second layer of convolution layers is 3x3, the number of convolution kernels is 16, and after the second layer of convolution operation, the three-dimensional convolution characteristic diagram with 16 output channels, M height and N width is output; the convolutional layer activation function is ReLU, and the convolutional operation formula is as follows:
z_j = Σ_i W_ij * x_i, where * denotes the convolution operation;
y_j = ReLU(z_j);
W_ij are the weight parameters of the classifier model, x_i is the neuron input, z_j is an intermediate result, and y_j is the neural network activation output, used as the input of the next layer;
s00315: establishing the LSTM layer; before the output of the second convolution layer is connected to the LSTM layer, a reshape operation is required, reshaping the original 16×M×N three-dimensional feature map into a (16·M)×N two-dimensional feature map; for the LSTM, 16·M represents the length of the input feature and N represents the time sequence length;
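A minimal PyTorch sketch of the layer order described in S00312-S00315 (input layer, two 3×3 convolution layers with 8 and 16 kernels, reshape to a (16·M)-dimensional feature per time step, LSTM over the N time steps, and a two-class output per step). The LSTM hidden size, the 'same' padding that keeps the M×N spatial size, and all variable names are assumptions not fixed by the text:

```python
import torch
import torch.nn as nn


class CryDetector(nn.Module):
    def __init__(self, m_bands=64, hidden=128):          # hidden size is an assumption
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)    # keeps M x N size
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        self.lstm = nn.LSTM(input_size=16 * m_bands, hidden_size=hidden,
                            batch_first=True)
        self.head = nn.Linear(hidden, 2)                  # two-class softmax layer

    def forward(self, x):                 # x: (batch, 1, M, N) cochlear feature maps
        x = self.relu(self.conv1(x))      # (batch, 8, M, N)
        x = self.relu(self.conv2(x))      # (batch, 16, M, N)
        b, c, m, n = x.shape
        x = x.reshape(b, c * m, n)        # (batch, 16*M, N): feature length x time
        x = x.permute(0, 2, 1)            # (batch, N, 16*M) for a batch-first LSTM
        h, _ = self.lstm(x)               # (batch, N, hidden): one h_t per time step
        return self.head(h)               # (batch, N, 2) logits, softmaxed in the loss


logits = CryDetector()(torch.randn(4, 1, 64, 9))
print(logits.shape)   # torch.Size([4, 9, 2])
```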
s00316: the LSTM is a long short-term memory network, a neural cell mathematical model with memory, mainly comprising three gate structures: a forget gate, an input gate and an output gate;
The first step is to determine what information will be forgotten from the cell state, which is done by the forget gate. The forget gate reads h_(t-1) passed from the previous state and the current input x_t, and outputs a value between 0 and 1 for each element of the cell state C_(t-1), where 1 means keep completely and 0 means discard completely; the forget gate mathematical model formula is as follows:
f_t = σ(W_f · [h_(t-1), x_t] + b_f);
wherein W_f is the neural network weight parameter of the forget gate, b_f is a bias term, and σ represents the sigmoid function;
The second step is to determine how much new information will be added to the cell state, which is accomplished by the input gate: a sigmoid layer i_t first decides which information needs to be updated, and a candidate layer C̃_t produces the content to be added; the old cell state C_(t-1) is then updated to C_t by multiplying the old state by f_t and adding i_t · C̃_t.
The mathematical model formula of the input gate is as follows:
i_t = σ(W_i · [h_(t-1), x_t] + b_i);
C̃_t = tanh(W_c · [h_(t-1), x_t] + b_c);
C_t = f_t * C_(t-1) + i_t * C̃_t;
wherein W_i and W_c are the neural network weight parameters of the input gate, b_i and b_c are bias terms, and σ represents the sigmoid function;
The third step is to decide what value to output, which is done by the output gate: a sigmoid layer is first run to determine which part of the cell state will be output; the cell state is then processed through tanh (yielding values between -1 and 1) and multiplied by the output of the sigmoid gate, so that only the determined part is output; the mathematical model formula of the output gate is as follows:
o_t = σ(W_o · [h_(t-1), x_t] + b_o);
h_t = o_t * tanh(C_t);
wherein W_o is the neural network weight parameter of the output gate, b_o is a bias term, and σ represents the sigmoid function;
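To make the gate formulas above concrete, one LSTM time step can be written directly in NumPy; this is a didactic sketch with randomly initialized weights and illustrative dimensions, not the trained model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following the forget/input/output gate equations.
    W maps the concatenated [h_(t-1), x_t] to each gate; b are the bias terms."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])            # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])            # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])        # candidate cell content
    c_t = f_t * c_prev + i_t * c_tilde            # C_t = f_t*C_(t-1) + i_t*C~_t
    o_t = sigmoid(W["o"] @ z + b["o"])            # output gate
    h_t = o_t * np.tanh(c_t)                      # h_t = o_t * tanh(C_t)
    return h_t, c_t

# Toy dimensions: 16*M = 1024 input features, hidden size 128 (both illustrative)
dim_in, dim_h = 1024, 128
rng = np.random.default_rng(0)
W = {k: 0.01 * rng.standard_normal((dim_h, dim_h + dim_in)) for k in "fico"}
b = {k: np.zeros(dim_h) for k in "fico"}
h, c = lstm_step(rng.standard_normal(dim_in), np.zeros(dim_h), np.zeros(dim_h), W, b)
print(h.shape, c.shape)   # (128,) (128,)
```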
s00317: establishing a classifier loss function layer, adopting a two-class softmax classifier, and outputting N h according to time sequence after a 16-X MxN two-dimensional feature map passes through an LSTM layer t The feature layer is connected with the two kinds of softmax, outputs N loss, and the specific formula is as follows:
Figure BDA0002394195400000111
wherein x is t =W s ·h t +b s Is h t The feature layer updates the network bias term and weight through the output after the softmax classifier by a reverse transfer algorithm during training.
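The per-time-step two-class softmax cross-entropy of S00317 can be sketched as follows (NumPy); treating the frame label as a one-hot vector shared by all N time steps is an assumption consistent with y_k ∈ {0, 1}:

```python
import numpy as np


def softmax_losses(X, y):
    """X: (N, 2) scores x_t = W_s*h_t + b_s for the N time steps;
    y: (2,) one-hot frame label. Returns the N cross-entropy losses."""
    X = X - X.max(axis=1, keepdims=True)            # numerical stability
    probs = np.exp(X) / np.exp(X).sum(axis=1, keepdims=True)
    return -(y * np.log(probs)).sum(axis=1)         # loss_n = -sum_k y_k log p_k


losses = softmax_losses(np.random.randn(9, 2), np.array([0.0, 1.0]))
print(losses.shape)   # (9,) -- one loss per time step, summed or averaged in training
```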
The method for voting on the N frame classification results by applying the majority priority voting principle in S0032 is as follows: inputting the cochlear speech features extracted from each frame into the deep learning classifier to obtain N frame classification results C_i, i = 1, 2, …, N, and taking the average
p = (1/N) · Σ_(i=1)^(N) C_i;
if p ≥ 0.5, baby crying is indicated; otherwise, no baby crying is indicated.
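The majority-priority vote then reduces to averaging the N per-frame decisions and thresholding at 0.5, for example:

```python
import numpy as np


def majority_vote(frame_decisions):
    """frame_decisions: iterable of N per-frame results C_i (1 = crying, 0 = not).
    Returns True if the average p = (1/N) * sum(C_i) is >= 0.5."""
    p = float(np.mean(frame_decisions))
    return p >= 0.5


print(majority_vote([1, 1, 0, 1, 0, 1, 1, 0, 1]))   # True: 6 of 9 frames voted "crying"
```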
The beneficial effects of the above technical scheme are as follows: aiming at the fact that traditional voice recognition is easily affected by environmental changes, the cochlear voice features adopted by the invention are voice feature parameters that better match human auditory perception, and the convolutional network and long short-term memory recurrent neural network classifier are adopted as the acoustic inference model for infant crying detection, so that the method can adapt to voice environments with a low signal-to-noise ratio; moreover, the long short-term memory recurrent neural network can make full use of contextual voice information, so the feature dimensions are richer and the method achieves a higher recognition rate than traditional methods.
In the description of the present specification, the descriptions of the terms "one embodiment," "example," "specific example," and the like, mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended only to assist in the explanation of the invention. The preferred embodiments are not exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention. The invention is limited only by the claims and the full scope and equivalents thereof.

Claims (6)

1. A method for deep learning of infant crying detection, which is characterized by comprising the following steps:
s001: collecting voice signals;
s002: framing the voice signal segment, and extracting cochlear voice characteristics from each frame;
s003: performing infant crying detection and outputting a detection result, wherein the infant crying detection comprises the following steps:
s0031: establishing a deep learning classifier based on a convolutional network and a long-short-term memory recurrent neural network;
s0032: inputting the extracted adjacent N frames of cochlea speech features into a deep learning classifier to obtain N frames of classification results, voting the N frames of classification results by applying a majority priority voting principle, and obtaining a final infant crying detection result;
the deep learning classifier in the step S0031 comprises two convolution layers and a long-short-time memory recurrent neural network layer; the specific method for training the deep learning classifier based on the convolutional network and the long-short-term memory recurrent neural network comprises the following steps:
s00311: establishing the training set of the deep learning classifier of the convolutional network and the long short-term memory recurrent neural network:
D = {(X_i, L_i)}, i = 1, 2, …, n;
wherein n is the number of training samples in the training set; i represents the training sample index; X_i ∈ R^(M×N) is a 64×9 two-dimensional cochlear feature; the training label corresponding to X_i is L_i, L_i ∈ {0, 1}, where 0 indicates that X_i is an environmental background sound sample without baby crying and 1 indicates that X_i is a baby crying sample;
s00312: establishing a deep learning classifier model of a convolutional network and a long-short-term memory recurrent neural network, wherein the classifier model sequentially comprises an input layer, two convolutional layers, an LSTM layer and a softmax layer according to a model reasoning sequence;
s00313: establishing an input layer, wherein the input layer is a two-dimensional cochlea feature map with a width of N and a height of M;
s00314: establishing two layers of convolution layers, wherein the convolution kernel size of the first layer of convolution layer is 3x3, the number of convolution kernels is 8, and after the first layer of convolution operation is performed on the input layer, a three-dimensional convolution characteristic diagram with 8 output channels, M height and N width is used as the input layer of the second layer of convolution; the convolution kernel size of the second layer of convolution layers is 3x3, the number of convolution kernels is 16, and after the second layer of convolution operation, the three-dimensional convolution characteristic diagram with 16 output channels, M height and N width is output; the convolutional layer activation function is ReLU, and the convolutional operation formula is as follows:
z_j = Σ_i W_ij * x_i, where * denotes the convolution operation;
y_j = ReLU(z_j);
W_ij are the weight parameters of the classifier model, x_i is the neuron input, z_j is an intermediate result, and y_j is the neural network activation output, used as the input of the next layer;
s00315: establishing the LSTM layer; before the output of the second convolution layer is connected to the LSTM layer, a reshape operation is required, reshaping the original 16×M×N three-dimensional feature map into a (16·M)×N two-dimensional feature map; for the LSTM, 16·M represents the length of the input feature and N represents the time sequence length;
s00316: the LSTM is a long short-term memory network, a neural cell mathematical model with memory, consisting of three gate structures: a forget gate, an input gate and an output gate;
the forget gate reads the hidden state h_(t-1) passed from the previous state and the current input x_t, and outputs a value between 0 and 1 for each element of the cell state C_(t-1), where 1 means keep completely and 0 means discard completely; the forget gate mathematical model formula is as follows:
f_t = σ(W_f · [h_(t-1), x_t] + b_f);
wherein W_f is the neural network weight parameter of the forget gate, b_f is a bias term, and σ represents the sigmoid function;
the mathematical model formula of the input gate is as follows:
i_t = σ(W_i · [h_(t-1), x_t] + b_i);
C̃_t = tanh(W_c · [h_(t-1), x_t] + b_c);
C_t = f_t * C_(t-1) + i_t * C̃_t;
wherein W_i and W_c are the neural network weight parameters of the input gate, b_i and b_c are bias terms, and σ represents the sigmoid function;
the output gate mathematical model formula is as follows:
o_t = σ(W_o · [h_(t-1), x_t] + b_o);
h_t = o_t * tanh(C_t);
wherein W_o is the neural network weight parameter of the output gate, b_o is a bias term, and σ represents the sigmoid function;
s00317: establishing a classifier loss function layer, adopting a two-class softmax classifier, and outputting N h according to time sequence after a 16-X MxN two-dimensional feature map passes through an LSTM layer t The feature layer is connected with the two kinds of softmax, outputs N loss, and the specific formula is as follows:
Figure FDA0004086358890000033
n=1、2、…、N;k=1、2;j=1、2;y k ∈{0,1};
wherein x is t =W s ·h t +b s Is h t The feature layer is output after passing through a softmax classifier;
the method for extracting the cochlear speech features in the step S002 comprises the following steps:
s0021: constructing a Gammatone filter bank based on a human ear cochlear auditory model, wherein the time domain expression form is as follows:
g(f,t) = k·t^(a-1)·e^(-2πbt)·cos(2πft + Φ), t ≥ 0;
wherein k is the filter gain, a is the filter order, f is the center frequency, Φ is the phase, b is the attenuation factor, t is the length of the output response signal, the attenuation factor determines the bandwidth of the corresponding filter, and the relationship between the attenuation factor and the center frequency is:
b=24.7(4.37f/1000+1);
s0022: filtering the voice signal by utilizing an FFT-based overlap-add method to obtain an output response signal R (n, t), wherein n is the number of channels of the filter, t is the length of the output response signal, and t is a natural number, and the length of the output response signal is equal to that of the input signal;
s0023: framing the output response signal R(n, t) and taking the response energy within each frame to obtain a cochleagram-like map, wherein the formula is as follows: G_m(i) = log( [|R(i, m)|]^(1/2) );
wherein i represents the i-th Gammatone filter, i = 0, 1, 2, …, N-1, where N = 64 is the number of filters in the bank; m represents the m-th frame, m = 0, 1, 2, …, M-1, M being the number of frames after framing; each frame of the cochleagram-like map is called a Gammatone feature coefficient GF, and one GF feature vector is composed of 8 frequency components;
the adjacent N frames of cochlear speech features in step S0032 refer to splicing the one-dimensional cochlear speech features of the preceding and following frames into an M×N two-dimensional feature matrix, recorded as:
F = [GF_1, GF_2, …, GF_N];
wherein F ∈ R^(M×N) and N is an odd number.
2. The deep learning method for infant crying detection according to claim 1, wherein the voice signal collection in step S001 is as follows:
s0011: inputting a voice signal with a microphone device;
s0012: the corresponding speech signal is obtained by sample quantization.
3. The deep learning method for detecting infant crying according to claim 2, wherein the sampling frequency of the sampling quantization is 16KHz, and the quantization precision is 16 bits.
4. The deep learning method for detecting infant crying as claimed in claim 1, wherein the frame length of the step S002 of framing the voice signal segment is 20 ms-30 ms, and the frame step length is 10 ms-15 ms.
5. The method for deep learning for detecting infant crying according to claim 1, wherein the Gammatone filter bank consists of 4th-order Gammatone filters with 64 channels, and the center frequencies lie between 50 Hz and 8000 Hz.
6. The deep learning method for detecting infant crying as claimed in claim 1, wherein the method for voting on the N frame classification results using the majority priority voting principle in S0032 is as follows: inputting the cochlear speech features extracted from each frame into the deep learning classifier to obtain N frame classification results C_i, i = 1, 2, …, N, and taking the average
p = (1/N) · Σ_(i=1)^(N) C_i;
if p ≥ 0.5, baby crying is indicated; otherwise, no baby crying is indicated.
CN202010125193.9A 2020-02-27 2020-02-27 Deep learning method for detecting crying of baby Active CN111326179B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010125193.9A CN111326179B (en) 2020-02-27 2020-02-27 Deep learning method for detecting crying of baby

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010125193.9A CN111326179B (en) 2020-02-27 2020-02-27 Deep learning method for detecting crying of baby

Publications (2)

Publication Number Publication Date
CN111326179A CN111326179A (en) 2020-06-23
CN111326179B true CN111326179B (en) 2023-05-26

Family

ID=71172973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010125193.9A Active CN111326179B (en) 2020-02-27 2020-02-27 Deep learning method for detecting crying of baby

Country Status (1)

Country Link
CN (1) CN111326179B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112382311B (en) * 2020-11-16 2022-08-19 谭昊玥 Infant crying intention identification method and device based on hybrid neural network
US20240115859A1 (en) * 2021-02-18 2024-04-11 The Johns Hopkins University Method and system for processing input signals using machine learning for neural activation
CN117037849A (en) * 2021-02-26 2023-11-10 武汉星巡智能科技有限公司 Infant crying classification method, device and equipment based on feature extraction and classification
CN113392736A (en) * 2021-05-31 2021-09-14 五八到家有限公司 Monitoring method, system, equipment and medium for improving safety of home service

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102163427A (en) * 2010-12-20 2011-08-24 北京邮电大学 Method for detecting audio exceptional event based on environmental model
CN107818779A (en) * 2017-09-15 2018-03-20 北京理工大学 A kind of infant's crying sound detection method, apparatus, equipment and medium
CN109243493A (en) * 2018-10-30 2019-01-18 南京工程学院 Based on the vagitus emotion identification method for improving long memory network in short-term
CN109509484A (en) * 2018-12-25 2019-03-22 科大讯飞股份有限公司 A kind of prediction technique and device of baby crying reason
CN110428843A (en) * 2019-03-11 2019-11-08 杭州雄迈信息技术有限公司 A kind of voice gender identification deep learning method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102163427A (en) * 2010-12-20 2011-08-24 北京邮电大学 Method for detecting audio exceptional event based on environmental model
CN107818779A (en) * 2017-09-15 2018-03-20 北京理工大学 A kind of infant's crying sound detection method, apparatus, equipment and medium
CN109243493A (en) * 2018-10-30 2019-01-18 南京工程学院 Based on the vagitus emotion identification method for improving long memory network in short-term
CN109509484A (en) * 2018-12-25 2019-03-22 科大讯飞股份有限公司 A kind of prediction technique and device of baby crying reason
CN110428843A (en) * 2019-03-11 2019-11-08 杭州雄迈信息技术有限公司 A kind of voice gender identification deep learning method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Karen Santiago-Sánchez et al.; "Type-2 Fuzzy Sets Applied to Pattern Matching for the Classification of Cries of Infants under Neurological Risk"; ICIC 2009; 2009-12-31 *

Also Published As

Publication number Publication date
CN111326179A (en) 2020-06-23

Similar Documents

Publication Publication Date Title
CN111326179B (en) Deep learning method for detecting crying of baby
CN110428843B (en) Voice gender recognition deep learning method
CN109326302B (en) Voice enhancement method based on voiceprint comparison and generation of confrontation network
US11908455B2 (en) Speech separation model training method and apparatus, storage medium and computer device
WO2021143327A1 (en) Voice recognition method, device, and computer-readable storage medium
CN110245608B (en) Underwater target identification method based on half tensor product neural network
CN108766419B (en) Abnormal voice distinguishing method based on deep learning
CN109841226A (en) A kind of single channel real-time noise-reducing method based on convolution recurrent neural network
CN112581979B (en) Speech emotion recognition method based on spectrogram
CN109427328B (en) Multichannel voice recognition method based on filter network acoustic model
CN110364143A (en) Voice awakening method, device and its intelligent electronic device
CN111461176A (en) Multi-mode fusion method, device, medium and equipment based on normalized mutual information
CN105206270A (en) Isolated digit speech recognition classification system and method combining principal component analysis (PCA) with restricted Boltzmann machine (RBM)
CN111899757A (en) Single-channel voice separation method and system for target speaker extraction
CN113924786B (en) Neural network model for cochlear mechanics and processing
CN113643723A (en) Voice emotion recognition method based on attention CNN Bi-GRU fusion visual information
CN111951824A (en) Detection method for distinguishing depression based on sound
CN112587153A (en) End-to-end non-contact atrial fibrillation automatic detection system and method based on vPPG signal
CN115602152B (en) Voice enhancement method based on multi-stage attention network
CN113763965A (en) Speaker identification method with multiple attention characteristics fused
CN113763966B (en) End-to-end text irrelevant voiceprint recognition method and system
Watrous Phoneme discrimination using connectionist networks
CN113974607A (en) Sleep snore detecting system based on impulse neural network
CN111723717A (en) Silent voice recognition method and system
CN112329819A (en) Underwater target identification method based on multi-network fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP03 Change of name, title or address
CP03 Change of name, title or address

Address after: 311422 4th floor, building 9, Yinhu innovation center, 9 Fuxian Road, Yinhu street, Fuyang District, Hangzhou City, Zhejiang Province

Patentee after: Zhejiang Xinmai Microelectronics Co.,Ltd.

Address before: 311400 4th floor, building 9, Yinhu innovation center, No.9 Fuxian Road, Yinhu street, Fuyang District, Hangzhou City, Zhejiang Province

Patentee before: Hangzhou xiongmai integrated circuit technology Co.,Ltd.