CN111326179A - Deep learning method for baby cry detection - Google Patents

Deep learning method for baby cry detection

Info

Publication number
CN111326179A
Authority
CN
China
Prior art keywords
layer
convolution
voice
cochlear
deep learning
Prior art date
Legal status
Granted
Application number
CN202010125193.9A
Other languages
Chinese (zh)
Other versions
CN111326179B (en)
Inventor
罗世操
Current Assignee
Zhejiang Xinmai Microelectronics Co ltd
Original Assignee
Hangzhou Xiongmai Integrated Circuit Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Xiongmai Integrated Circuit Technology Co Ltd
Priority to CN202010125193.9A priority Critical patent/CN111326179B/en
Publication of CN111326179A publication Critical patent/CN111326179A/en
Application granted granted Critical
Publication of CN111326179B publication Critical patent/CN111326179B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04: Segmentation; Word boundary detection
    • G10L15/08: Speech classification or search
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G10L25/45: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a deep learning method for baby cry detection and relates to the technical field of voice signal processing. The invention comprises the following steps: a. collecting a voice signal; b. framing the voice signal segment and extracting cochlear voice features for each frame; c. inputting the cochlear voice features of N adjacent frames into a pre-trained baby cry detection deep learning model for inference to judge whether crying is present; d. voting on the N frame classification results using the majority-first voting principle to decide whether the baby is crying within these N frames. The cochlear voice features adopted by the invention are voice feature parameters that better match human auditory perception, and a convolutional network combined with a long short-term memory recurrent neural network is adopted as the acoustic inference model for baby cry detection, so the method can adapt to voice environments with a low signal-to-noise ratio and achieves higher accuracy than traditional methods.

Description

Deep learning method for baby cry detection
Technical Field
The invention belongs to the technical field of voice signal processing, and particularly relates to a deep learning method for baby cry detection.
Background
In modern society, parents or elders are often too busy with work or housework to keep a close watch on a newborn, and a baby can express its emotions and needs only by crying, so home care based on baby cry detection has a large market demand.
A current speech signal recognition system generally consists of three parts: speech signal preprocessing, feature extraction, and classification. Feature extraction is the most important part, and its quality directly affects the recognition result. Most of the speech features previously proposed by researchers are based on prosodic and voice-quality characteristics of speech; they are artificially designed features, so the robustness of the system is low and it is easily affected by the environment.
A deep learning method for baby cry detection is therefore provided to solve the above problems.
Disclosure of Invention
The invention aims to provide a deep learning method for baby cry detection, which realizes baby cry detection on voice signals.
In order to solve the technical problems, the invention is realized by the following technical scheme: the invention relates to a deep learning method for detecting baby cry, which comprises the following steps:
s001: collecting voice signals;
s002: framing the voice signal segment, and extracting cochlear voice features of each frame;
s003: carrying out baby cry detection to output a detection result, wherein the baby cry detection comprises the following steps:
s0031: establishing a deep learning classifier based on a convolutional network and a long-time and short-time memory recurrent neural network;
s0032: inputting the extracted adjacent N frames of cochlear voice features into a deep learning classifier to obtain N frames of classification results, and voting the N frames of classification results by using a majority-first voting principle to obtain a final baby crying detection result.
Further, the voice signal acquisition mode in step S001 is as follows:
s0011: inputting a voice signal by using a microphone device;
s0012: the corresponding speech signal is obtained by sample quantization.
Further, the sampling frequency of the sampling quantization is 16KHz, and the quantization precision is 16 bits.
Further, in step S002, the frame length range for framing the speech signal segment is 20ms to 30ms, and the frame step range is 10ms to 15 ms.
Further, the method for extracting cochlear speech features in step S002 includes the steps of:
s0021: constructing a Gammatone filter bank based on the human cochlear auditory model, wherein its time-domain expression form is as follows:
g(f, t) = k · t^(a-1) · e^(-2πbt) · cos(2πft + Φ), t ≥ 0;
wherein k is the filter gain, a is the filter order, f is the center frequency, Φ is the phase, b is the attenuation factor, the attenuation factor determines the bandwidth of the corresponding filter, the relationship between the attenuation factor and the center frequency is:
b=24.7(4.37f/1000+1);
s0022: filtering the voice signal with the Gammatone filter bank using an FFT-based overlap-add method to obtain the output response signal R(n, t), wherein n is the filter channel index (the number of channels is 64) and t is the time index (a natural number); the length of the output response signal is kept equal to that of the input signal;
s0023: the response energy within each frame is obtained by framing the output response signal R(n, t), yielding a cochleagram-like map, with the formula: G_m(i) = log(|R(i, m)|^(1/2));
wherein i represents the i-th Gammatone filter, i = 0, 1, 2, …, 63 (the filter bank has 64 filters); m represents the m-th frame, m = 0, 1, 2, …, M-1, where M is the number of frames after framing; each frame of the cochleagram-like map is called a Gammatone feature coefficient (GF), and one GF feature vector is composed of 64 frequency components.
Further, the Gammatone filter bank adopts 64-channel 4th-order Gammatone filters, with center frequencies between 50 Hz and 8000 Hz.
Further, the adjacent N frames of cochlear speech features in step S0032 refer to splicing the one-dimensional cochlear speech features of the preceding and following frames into an M × N two-dimensional feature matrix, recorded as:
F = [GF_{m-(N-1)/2}, …, GF_m, …, GF_{m+(N-1)/2}];
wherein F ∈ R^{M×N} and N is an odd number.
Further, the deep learning classifier in S0031 includes two convolutional layers and a long-term memory recurrent neural network layer; the specific method for training the deep learning classifier based on the convolutional network and the long-time and short-time memory recurrent neural network comprises the following steps:
s00311: establishing the training set of the deep learning classifier of the convolutional network and long short-term memory recurrent neural network:
{(X_i, L_i)}, i = 1, 2, …, n;
wherein n is the number of training samples in the training set; i represents the training sample index; X_i ∈ R^{M×N} is a 64x9 two-dimensional cochlear speech feature; the training label corresponding to X_i is L_i ∈ {0, 1}, where 0 represents that X_i is a sample of environmental background sound without baby crying and 1 represents that X_i is a sample of baby crying;
s00312: establishing the convolutional network and long short-term memory recurrent neural network deep learning classifier model, wherein the classifier model consists, in model inference order, of an input layer, two convolution layers, an LSTM layer and a softmax layer;
s00313: establishing an input layer, wherein the input layer is a two-dimensional cochlear feature map with the width of N and the height of M;
s00314: establishing two layers of convolution layers, wherein the size of convolution kernels of the first layer of convolution layers is 3x3, the number of convolution kernels is 8, and after the input layer is subjected to the first layer of convolution operation, a three-dimensional convolution characteristic graph with the output channel number of 8, the height of M and the width of N is used as an input layer of the second layer of convolution; the convolution kernel size of the second layer of convolution layer is 3x3, the number of the convolution kernels is 16, and after the second layer of convolution operation, a three-dimensional convolution characteristic diagram with 16 output channels, M height and N width is output; the convolutional layer activation function is ReLU, and the convolution operation formula is as follows:
z_j = Σ_i W_ij · x_i;
y_j = ReLU(z_j);
wherein W_ij is a classifier model weight parameter, x_i is the neuron input, z_j is the intermediate result, and y_j is the activation output of the neural network, which also serves as the input of the next layer;
s00315: establishing an LSTM layer, wherein a reshape operation is required before the second convolution layer is connected with the LSTM layer: the original 16 x M x N three-dimensional feature map is reshaped into a (16·M) x N two-dimensional feature map, where for the LSTM 16·M denotes the length of the input features and N denotes the time-sequence length;
s00316: the LSTM is a long-short term memory network, is a neural cell mathematical model with memory and consists of three gate structures, namely a forgetting gate, an input gate and an output gate;
the forget gate reads the previous state h_{t-1} and the current input x_t and outputs, for each number in the cell state C_{t-1}, a value between 0 and 1, where 1 means complete retention and 0 means complete rejection; the forget gate mathematical model formula is as follows:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f);
wherein W_f is the neural network weight parameter of the forget gate, b_f is a bias term, and σ denotes the sigmoid function;
the input gate mathematical model formula is as follows:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i);
C̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c);
C_t = f_t * C_{t-1} + i_t * C̃_t;
wherein W_i and W_c are the neural network weight parameters of the input gate, b_i and b_c are bias terms, and σ denotes the sigmoid function;
the mathematical model formula of the output gate is as follows:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o);
h_t = o_t * tanh(C_t);
wherein W_o is the neural network weight parameter of the output gate, b_o is a bias term, and σ denotes the sigmoid function;
s00317: establishing a classifier loss function layer, adopting a two-class softmax classifier; after the (16·M) x N two-dimensional feature map passes through the LSTM layer, N feature vectors h_t are output in time order, each h_t feature layer is connected to the two-class softmax, and N losses are output, with the specific formula:
loss_t = -log( e^{x_t[l]} / Σ_j e^{x_t[j]} );
wherein x_t = W_s · h_t + b_s is the output of the feature layer h_t fed into the softmax classifier and l is the ground-truth class label.
Further, the method for voting on the N frame classification results using the majority-first voting principle in S0032 is: inputting the cochlear speech features extracted for each frame into the deep learning classifier to obtain N frame classification results C_i, i = 1, 2, …, N, and computing the average
p = (1/N) · Σ_{i=1}^{N} C_i;
if p ≥ 0.5, the baby is crying; otherwise, the baby is not crying.
The invention has the following beneficial effects:
Aiming at the problem that traditional voice recognition is easily affected by environmental changes, the cochlear voice features adopted by the invention are voice feature parameters that better match human auditory perception, and a convolutional network with a long short-term memory recurrent neural network classifier is adopted as the acoustic inference model for baby cry detection, so the method can adapt to voice environments with a low signal-to-noise ratio; moreover, the long short-term memory recurrent neural network can make full use of contextual voice information, so the features have richer dimensions and a higher recognition rate than traditional methods.
Of course, it is not necessary for any product in which the invention is practiced to achieve all of the above-described advantages at the same time.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a block diagram of the system of the present invention;
FIG. 2 is a schematic diagram of cochlear speech feature extraction of the present invention;
FIG. 3 is a schematic diagram of a convolutional network and a long-and-short memory recurrent neural network classifier according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-3, the present invention provides a deep learning method for detecting baby cry, comprising the following steps:
s001: collecting voice signals;
s002: framing the voice signal segment, and extracting cochlear voice features of each frame;
s003: carrying out baby cry detection to output a detection result, wherein the baby cry detection comprises the following steps:
s0031: establishing a deep learning classifier based on a convolutional network and a long-time and short-time memory recurrent neural network;
s0032: inputting the extracted adjacent N frames of cochlear voice features into a deep learning classifier to obtain N frames of classification results, and voting the N frames of classification results by using a majority-first voting principle to obtain a final baby crying detection result.
The voice signal acquisition mode in step S001 is as follows:
s0011: inputting a voice signal by using a microphone device;
s0012: the corresponding voice signal is obtained through sampling and quantization, the sampling frequency of the sampling and quantization is 16KHz, and the quantization precision is 16 bits.
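As an illustration only, the 16 kHz / 16-bit signal described in S0011-S0012 could be loaded as follows; reading from a WAV file here merely stands in for the microphone capture, which the patent does not describe in code form, and the function name is hypothetical.

```python
import wave
import numpy as np

def read_pcm16(path: str) -> np.ndarray:
    """Load a mono 16-bit PCM WAV file sampled at 16 kHz (see S0011-S0012)."""
    with wave.open(path, "rb") as wf:
        assert wf.getframerate() == 16000 and wf.getsampwidth() == 2
        pcm = wf.readframes(wf.getnframes())
    # convert to float in [-1, 1); quantization precision is 16 bits
    return np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32768.0
```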
In step S002, the speech signal segment is framed using a 20 ms window, i.e., a frame length of 320 samples, with a 10 ms window sliding step (stride), i.e., 160 samples; if the speech length is Length, the number of frames N of the speech segment is given by the following formula:
N = ⌊(Length - 320) / 160⌋ + 1;
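A minimal framing sketch under the parameters above (20 ms frame = 320 samples, 10 ms stride = 160 samples at 16 kHz); the function name is illustrative and not taken from the patent.

```python
import numpy as np

def frame_signal(speech: np.ndarray, frame_len: int = 320, stride: int = 160) -> np.ndarray:
    """Split a 16 kHz signal into overlapping 20 ms frames with a 10 ms step."""
    num_frames = (len(speech) - frame_len) // stride + 1   # N in the formula above
    return np.stack([speech[i * stride: i * stride + frame_len]
                     for i in range(num_frames)])          # shape (N, 320)
```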
the method for extracting the cochlear speech features in the step S002 comprises the following steps:
s0021: constructing a Gammatone filter bank based on the human cochlear auditory model, wherein its time-domain expression form is as follows:
g(f, t) = k · t^(a-1) · e^(-2πbt) · cos(2πft + Φ), t ≥ 0;
wherein k is the filter gain, a is the filter order, f is the center frequency, Φ is the phase, b is the attenuation factor, the attenuation factor determines the bandwidth of the corresponding filter, the relationship between the attenuation factor and the center frequency is:
b=24.7(4.37f/1000+1);
wherein the Gammatone filter bank adopts 64-channel 4th-order Gammatone filters, with center frequencies between 50 Hz and 8000 Hz;
s0022: filtering the voice signal with the Gammatone filter bank using an FFT-based overlap-add method to obtain the output response signal R(n, t), wherein n is the filter channel index (the number of channels is 64) and t is the time index (a natural number); the length of the output response signal is kept equal to that of the input signal;
s0023: the output response signal R(n, t) is framed and the response energy within each frame is determined to obtain a cochleagram-like map, with the formula: G_m(i) = log(|R(i, m)|^(1/2));
wherein i represents the i-th Gammatone filter, i = 0, 1, 2, …, 63 (the filter bank has 64 filters); m represents the m-th frame, m = 0, 1, 2, …, M-1, where M is the number of frames after framing; each frame of the cochleagram-like map is called a Gammatone feature coefficient (GF), and one GF feature vector is composed of 64 frequency components.
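A sketch of the cochlear (GF) feature extraction of S0021-S0023, assuming ERB-scale spacing of the 64 centre frequencies between 50 Hz and 8000 Hz and reading G_m(i) as the log of the square root of the per-frame response energy; the impulse-response length, the gain normalisation and the function names are illustrative choices, not values fixed by the patent.

```python
import numpy as np
from scipy.signal import fftconvolve   # FFT-based (overlap-add style) convolution

FS, NUM_CH = 16000, 64
FRAME_LEN, STRIDE = 320, 160            # 20 ms frames, 10 ms step

def erb_bandwidth(fc):
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)        # b = 24.7(4.37 f/1000 + 1)

def gammatone_ir(fc, order=4, duration=0.064, fs=FS):
    """4th-order gammatone impulse response k * t^(a-1) * exp(-2*pi*b*t) * cos(2*pi*f*t)."""
    t = np.arange(int(duration * fs)) / fs
    g = t ** (order - 1) * np.exp(-2 * np.pi * erb_bandwidth(fc) * t) * np.cos(2 * np.pi * fc * t)
    return g / np.max(np.abs(g))                    # simple choice for the gain k

def center_freqs(num_ch=NUM_CH, fmin=50.0, fmax=8000.0):
    """Centre frequencies spaced uniformly on the ERB scale (one common convention)."""
    erb = lambda f: 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)
    erb_inv = lambda e: (10 ** (e / 21.4) - 1.0) * 1000.0 / 4.37
    return erb_inv(np.linspace(erb(fmin), erb(fmax), num_ch))

def gf_features(speech):
    """Cochleagram-like GF features, shape (64, number_of_frames)."""
    rows = []
    for fc in center_freqs():
        r = fftconvolve(speech, gammatone_ir(fc), mode="full")[:len(speech)]
        m_frames = (len(r) - FRAME_LEN) // STRIDE + 1
        energy = np.array([np.sum(r[m * STRIDE: m * STRIDE + FRAME_LEN] ** 2)
                           for m in range(m_frames)])
        rows.append(np.log(np.sqrt(energy) + 1e-12))   # G_m(i) = log(|R(i, m)|^(1/2))
    return np.stack(rows)
```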
The adjacent N frames of cochlear speech features in step S0032 are obtained by splicing the one-dimensional cochlear speech features of the preceding and following frames into an M × N two-dimensional feature matrix, recorded as:
F = [GF_{m-(N-1)/2}, …, GF_m, …, GF_{m+(N-1)/2}];
wherein F ∈ R^{M×N} and N is an odd number.
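A small sketch of splicing adjacent frames into the M × N (here 64 x 9) matrix F described above; padding the boundaries by repeating the first and last frames is an assumption, since the patent does not say how edge frames are handled.

```python
import numpy as np

def stack_context(gf: np.ndarray, n_context: int = 9) -> np.ndarray:
    """gf: (64, num_frames) GF features -> (num_frames, 64, n_context) centred windows."""
    half = n_context // 2                                       # N is odd, window is centred
    padded = np.pad(gf, ((0, 0), (half, half)), mode="edge")    # repeat edge frames (assumption)
    return np.stack([padded[:, m: m + n_context] for m in range(gf.shape[1])])
```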
Wherein, the deep learning classifier in S0031 comprises two convolution layers and a long-time memory recurrent neural network Layer (LSTM); the specific method for training the deep learning classifier based on the convolutional network and the long-time and short-time memory recurrent neural network comprises the following steps:
s00311: establishing the training set of the deep learning classifier of the convolutional network and long short-term memory recurrent neural network:
{(X_i, L_i)}, i = 1, 2, …, n;
wherein n is the number of training samples in the training set; i represents the training sample index; X_i ∈ R^{M×N} is a 64x9 two-dimensional cochlear speech feature; the training label corresponding to X_i is L_i ∈ {0, 1}, where 0 represents that X_i is a sample of environmental background sound without baby crying and 1 represents that X_i is a sample of baby crying;
s00312: establishing the convolutional network and long short-term memory recurrent neural network deep learning classifier model, wherein the classifier model consists, in model inference order, of an input layer, two convolution layers, an LSTM layer and a softmax layer;
s00313: establishing an input layer, wherein the input layer is a two-dimensional cochlear feature map with the width of N and the height of M;
s00314: establishing two layers of convolution layers, wherein the size of convolution kernels of the first layer of convolution layers is 3x3, the number of convolution kernels is 8, and after the input layer is subjected to the first layer of convolution operation, a three-dimensional convolution characteristic graph with the output channel number of 8, the height of M and the width of N is used as an input layer of the second layer of convolution; the convolution kernel size of the second layer of convolution layer is 3x3, the number of the convolution kernels is 16, and after the second layer of convolution operation, a three-dimensional convolution characteristic diagram with 16 output channels, M height and N width is output; the convolutional layer activation function is ReLU, and the convolution operation formula is as follows:
z_j = Σ_i W_ij · x_i;
y_j = ReLU(z_j);
wherein W_ij is a classifier model weight parameter, x_i is the neuron input, z_j is the intermediate result, and y_j is the activation output of the neural network, which also serves as the input of the next layer;
s00315: establishing an LSTM layer, wherein a reshape operation is required before the second convolution layer is connected with the LSTM layer: the original 16 x M x N three-dimensional feature map is reshaped into a (16·M) x N two-dimensional feature map, where for the LSTM 16·M denotes the length of the input features and N denotes the time-sequence length;
s00316: the LSTM is a long-short term memory network, is a neural cell mathematical model with memory and mainly comprises three gate structures of a forgetting gate, an input gate and an output gate;
The first step determines what information will be discarded from the cell state; this is done by the forget gate, which reads the previous state h_{t-1} and the current input x_t and outputs, for each number in the cell state C_{t-1}, a value between 0 and 1, where 1 means complete retention and 0 means complete rejection; the forget gate mathematical model formula is as follows:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f);
wherein W_f is the neural network weight parameter of the forget gate, b_f is a bias term, and σ denotes the sigmoid function;
The second step determines how much new information to add to the cell state; this is done by the input gate: first, a sigmoid input-selection layer i_t decides which information needs to be updated, and a tanh layer produces the candidate content C̃_t for the update; then the old cell state C_{t-1} is updated to C_t by multiplying the old state by f_t and adding i_t * C̃_t.
The input gate mathematical model formula is as follows:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i);
C̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c);
C_t = f_t * C_{t-1} + i_t * C̃_t;
wherein W_i and W_c are the neural network weight parameters of the input gate, b_i and b_c are bias terms, and σ denotes the sigmoid function;
The third step determines what value is output; this is done by the output gate: first, a sigmoid layer decides which part of the cell state will be output; then the cell state is passed through tanh (to get values between -1 and 1) and multiplied by the output of the sigmoid gate, so that only the determined part is finally output; the output gate mathematical model formula is as follows:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o);
h_t = o_t * tanh(C_t);
wherein W_o is the neural network weight parameter of the output gate, b_o is a bias term, and σ denotes the sigmoid function;
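The three gate equations above can be exercised directly with a minimal NumPy LSTM-cell step; the weight shapes are the standard ones (each W of shape (H, H + d) for hidden size H and input size d), which the patent does not spell out, so they are an assumption of this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """One LSTM time step following the forget / input / output gate formulas above."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # input gate
    c_tilde = np.tanh(W_c @ z + b_c)         # candidate cell content
    c_t = f_t * c_prev + i_t * c_tilde       # updated cell state C_t
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o_t * np.tanh(c_t)                 # hidden output h_t
    return h_t, c_t
```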
s00317: establishing a classifier loss function layer, adopting a two-class softmax classifier; after the (16·M) x N two-dimensional feature map passes through the LSTM layer, N feature vectors h_t are output in time order, each h_t feature layer is connected to the two-class softmax, and N losses are output, with the specific formula:
loss_t = -log( e^{x_t[l]} / Σ_j e^{x_t[j]} );
wherein x_t = W_s · h_t + b_s is the output of the feature layer h_t fed into the softmax classifier and l is the ground-truth class label; during training, the network weights and bias terms are updated by the backpropagation algorithm.
In S0032, the method of voting on the N frame classification results using the majority-first voting principle is: inputting the cochlear speech features extracted for each frame into the deep learning classifier to obtain N frame classification results C_i, i = 1, 2, …, N, and computing the average
p = (1/N) · Σ_{i=1}^{N} C_i;
if p ≥ 0.5, the baby is crying; otherwise, the baby is not crying.
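A sketch of the majority-first voting rule above; the per-frame outputs C_i are assumed here to be hard 0/1 decisions taken from the classifier.

```python
import numpy as np

def vote_baby_cry(frame_predictions) -> bool:
    """frame_predictions: iterable of 0/1 decisions C_i over N adjacent frames."""
    p = float(np.mean(list(frame_predictions)))   # p = (1/N) * sum_i C_i
    return p >= 0.5                               # crying if at least half the frames vote 1
```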
The beneficial effects produced by the above technical solution are as follows: the invention provides a deep learning method for baby cry detection; aiming at the problem that traditional voice recognition is easily affected by environmental changes, the cochlear voice features adopted are voice feature parameters that better match human auditory perception, and a convolutional network with a long short-term memory recurrent neural network classifier is adopted as the acoustic inference model for baby cry detection, so the method can adapt to voice environments with a low signal-to-noise ratio; in addition, the long short-term memory recurrent neural network can make full use of contextual voice information, so the features have richer dimensions and a higher recognition rate than traditional methods.
In the description herein, references to the description of "one embodiment," "an example," "a specific example" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (9)

1. A deep learning method for detecting baby crying is characterized by comprising the following steps:
s001: collecting voice signals;
s002: framing the voice signal segment, and extracting cochlear voice features of each frame;
s003: carrying out baby cry detection to output a detection result, wherein the baby cry detection comprises the following steps:
s0031: establishing a deep learning classifier based on a convolutional network and a long-time and short-time memory recurrent neural network;
s0032: inputting the extracted adjacent N frames of cochlear voice features into a deep learning classifier to obtain N frames of classification results, and voting the N frames of classification results by using a majority-first voting principle to obtain a final baby crying detection result.
2. The method as claimed in claim 1, wherein the voice signal is collected in step S001 in the following manner:
s0011: inputting a voice signal by using a microphone device;
s0012: the corresponding speech signal is obtained by sample quantization.
3. The method as claimed in claim 2, wherein the sampling frequency of the sampling quantization is 16KHz, and the quantization precision is 16 bits.
4. The method as claimed in claim 1, wherein the frame length of the framing of the speech signal segment in step S002 is in the range of 20ms to 30ms, and the frame step length is in the range of 10ms to 15 ms.
5. The method as claimed in claim 1, wherein the step S002 of extracting the cochlear speech feature comprises the steps of:
s0021: constructing a Gammatone filter bank based on the human cochlear auditory model, wherein its time-domain expression form is as follows:
g(f, t) = k · t^(a-1) · e^(-2πbt) · cos(2πft + Φ), t ≥ 0;
wherein k is the filter gain, a is the filter order, f is the center frequency, Φ is the phase, b is the attenuation factor, the attenuation factor determines the bandwidth of the corresponding filter, the relationship between the attenuation factor and the center frequency is:
b=24.7(4.37f/1000+1);
s0022: filtering the voice signal with the Gammatone filter bank using an FFT-based overlap-add method to obtain the output response signal R(n, t), wherein n is the filter channel index (the number of channels is 64) and t is the time index (a natural number); the length of the output response signal is kept equal to that of the input signal;
s0023: the response energy within each frame is obtained by framing the output response signal R(n, t), yielding a cochleagram-like map, with the formula: G_m(i) = log(|R(i, m)|^(1/2));
wherein i represents the i-th Gammatone filter, i = 0, 1, 2, …, 63 (the filter bank has 64 filters); m represents the m-th frame, m = 0, 1, 2, …, M-1, where M is the number of frames after framing; each frame of the cochleagram-like map is called a Gammatone feature coefficient (GF), and one GF feature vector is composed of 64 frequency components.
6. The method as claimed in claim 5, wherein the Gammatone filter bank adopts 64-channel 4th-order Gammatone filters, with center frequencies between 50 Hz and 8000 Hz.
7. The method as claimed in claim 1, wherein the adjacent N frames of cochlear speech features in step S0032 are obtained by concatenating the one-dimensional cochlear speech features of preceding and following frames into an M × N two-dimensional feature matrix, recorded as:
F = [GF_{m-(N-1)/2}, …, GF_m, …, GF_{m+(N-1)/2}];
wherein F ∈ R^{M×N} and N is an odd number.
8. The method as claimed in claim 1, wherein the deep learning classifier in S0031 comprises two convolution layers and a long-term memory recurrent neural network layer; the specific method for training the deep learning classifier based on the convolutional network and the long-time and short-time memory recurrent neural network comprises the following steps:
s00311: establishing the training set of the deep learning classifier of the convolutional network and long short-term memory recurrent neural network:
{(X_i, L_i)}, i = 1, 2, …, n;
wherein n is the number of training samples in the training set; i represents the training sample index; X_i ∈ R^{M×N} is a 64x9 two-dimensional cochlear speech feature; the training label corresponding to X_i is L_i ∈ {0, 1}, where 0 represents that X_i is a sample of environmental background sound without baby crying and 1 represents that X_i is a sample of baby crying;
s00312: establishing the convolutional network and long short-term memory recurrent neural network deep learning classifier model, wherein the classifier model consists, in model inference order, of an input layer, two convolution layers, an LSTM layer and a softmax layer;
s00313: establishing an input layer, wherein the input layer is a two-dimensional cochlear feature map with the width of N and the height of M;
s00314: establishing two layers of convolution layers, wherein the size of convolution kernels of the first layer of convolution layers is 3x3, the number of convolution kernels is 8, and after the input layer is subjected to the first layer of convolution operation, a three-dimensional convolution characteristic graph with the output channel number of 8, the height of M and the width of N is used as an input layer of the second layer of convolution; the convolution kernel size of the second layer of convolution layer is 3x3, the number of the convolution kernels is 16, and after the second layer of convolution operation, a three-dimensional convolution characteristic diagram with 16 output channels, M height and N width is output; the convolutional layer activation function is ReLU, and the convolution operation formula is as follows:
z_j = Σ_i W_ij · x_i;
y_j = ReLU(z_j);
wherein W_ij is a classifier model weight parameter, x_i is the neuron input, z_j is the intermediate result, and y_j is the activation output of the neural network, which also serves as the input of the next layer;
s00315: establishing an LSTM layer, wherein a reshape operation is required before the second convolution layer is connected with the LSTM layer: the original 16 x M x N three-dimensional feature map is reshaped into a (16·M) x N two-dimensional feature map, where for the LSTM 16·M denotes the length of the input features and N denotes the time-sequence length;
s00316: the LSTM is a long-short term memory network, is a neural cell mathematical model with memory and consists of three gate structures, namely a forgetting gate, an input gate and an output gate;
the forget gate reads the previous state h_{t-1} and the current input x_t and outputs, for each number in the cell state C_{t-1}, a value between 0 and 1, where 1 means complete retention and 0 means complete rejection; the forget gate mathematical model formula is as follows:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f);
wherein W_f is the neural network weight parameter of the forget gate, b_f is a bias term, and σ denotes the sigmoid function;
the input gate mathematical model formula is as follows:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i);
C̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c);
C_t = f_t * C_{t-1} + i_t * C̃_t;
wherein W_i and W_c are the neural network weight parameters of the input gate, b_i and b_c are bias terms, and σ denotes the sigmoid function;
the mathematical model formula of the output gate is as follows:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o);
h_t = o_t * tanh(C_t);
wherein W_o is the neural network weight parameter of the output gate, b_o is a bias term, and σ denotes the sigmoid function;
s00317: establishing a classifier loss function layer, adopting a two-class softmax classifier; after the (16·M) x N two-dimensional feature map passes through the LSTM layer, N feature vectors h_t are output in time order, each h_t feature layer is connected to the two-class softmax, and N losses are output, with the specific formula:
loss_t = -log( e^{x_t[l]} / Σ_j e^{x_t[j]} );
wherein x_t = W_s · h_t + b_s is the output of the feature layer h_t fed into the softmax classifier and l is the ground-truth class label.
9. The method as claimed in claim 1, wherein the method for voting on the N frame classification results using the majority-first voting principle in S0032 comprises: inputting the cochlear speech features extracted for each frame into the deep learning classifier to obtain N frame classification results C_i, i = 1, 2, …, N, and computing the average
p = (1/N) · Σ_{i=1}^{N} C_i;
if p ≥ 0.5, the baby is crying; otherwise, the baby is not crying.
CN202010125193.9A 2020-02-27 2020-02-27 Deep learning method for detecting crying of baby Active CN111326179B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010125193.9A CN111326179B (en) 2020-02-27 2020-02-27 Deep learning method for detecting crying of baby

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010125193.9A CN111326179B (en) 2020-02-27 2020-02-27 Deep learning method for detecting crying of baby

Publications (2)

Publication Number Publication Date
CN111326179A true CN111326179A (en) 2020-06-23
CN111326179B CN111326179B (en) 2023-05-26

Family

ID=71172973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010125193.9A Active CN111326179B (en) 2020-02-27 2020-02-27 Deep learning method for detecting crying of baby

Country Status (1)

Country Link
CN (1) CN111326179B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102163427A (en) * 2010-12-20 2011-08-24 北京邮电大学 Method for detecting audio exceptional event based on environmental model
CN107818779A (en) * 2017-09-15 2018-03-20 北京理工大学 A kind of infant's crying sound detection method, apparatus, equipment and medium
CN109243493A (en) * 2018-10-30 2019-01-18 南京工程学院 Based on the vagitus emotion identification method for improving long memory network in short-term
CN109509484A (en) * 2018-12-25 2019-03-22 科大讯飞股份有限公司 A kind of prediction technique and device of baby crying reason
CN110428843A (en) * 2019-03-11 2019-11-08 杭州雄迈信息技术有限公司 A kind of voice gender identification deep learning method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KAREN SANTIAGO-SÁNCHEZ 等: "Type-2 Fuzzy Sets Applied to Pattern Matching for the Classification of Cries of Infants under Neurological Risk", 《ICIC 2009》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112382311A (en) * 2020-11-16 2021-02-19 谭昊玥 Infant crying intention identification method and device based on hybrid neural network
CN112382311B (en) * 2020-11-16 2022-08-19 谭昊玥 Infant crying intention identification method and device based on hybrid neural network
WO2022178316A1 (en) * 2021-02-18 2022-08-25 The Johns Hopkins University Processing input signals using machine learning for neural activation
CN117037849A (en) * 2021-02-26 2023-11-10 武汉星巡智能科技有限公司 Infant crying classification method, device and equipment based on feature extraction and classification
CN113392736A (en) * 2021-05-31 2021-09-14 五八到家有限公司 Monitoring method, system, equipment and medium for improving safety of home service

Also Published As

Publication number Publication date
CN111326179B (en) 2023-05-26

Similar Documents

Publication Publication Date Title
CN111326179B (en) Deep learning method for detecting crying of baby
CN109326302B (en) Voice enhancement method based on voiceprint comparison and generation of confrontation network
WO2021143327A1 (en) Voice recognition method, device, and computer-readable storage medium
CN110428843B (en) Voice gender recognition deep learning method
EP4002362A1 (en) Method and apparatus for training speech separation model, storage medium, and computer device
CN112581979B (en) Speech emotion recognition method based on spectrogram
CN109473120A (en) A kind of abnormal sound signal recognition method based on convolutional neural networks
CN108766419A (en) A kind of abnormal speech detection method based on deep learning
JP2654917B2 (en) Speaker independent isolated word speech recognition system using neural network
CN109243493B (en) Infant crying emotion recognition method based on improved long-time and short-time memory network
CN110853680A (en) double-BiLSTM structure with multi-input multi-fusion strategy for speech emotion recognition
CN111951824A (en) Detection method for distinguishing depression based on sound
CN105206270A (en) Isolated digit speech recognition classification system and method combining principal component analysis (PCA) with restricted Boltzmann machine (RBM)
CN108520753A (en) Voice lie detection method based on the two-way length of convolution memory network in short-term
CN111899757A (en) Single-channel voice separation method and system for target speaker extraction
CN112587153A (en) End-to-end non-contact atrial fibrillation automatic detection system and method based on vPPG signal
CN113643723A (en) Voice emotion recognition method based on attention CNN Bi-GRU fusion visual information
Świetlicka et al. Hierarchical ANN system for stuttering identification
CN115862684A (en) Audio-based depression state auxiliary detection method for dual-mode fusion type neural network
CN115602152B (en) Voice enhancement method based on multi-stage attention network
CN115346561B (en) Depression emotion assessment and prediction method and system based on voice characteristics
CN115881164A (en) Voice emotion recognition method and system
CN112466284A (en) Mask voice identification method
CN111723717A (en) Silent voice recognition method and system
CN115171878A (en) Depression detection method based on BiGRU and BiLSTM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 311422 4th floor, building 9, Yinhu innovation center, 9 Fuxian Road, Yinhu street, Fuyang District, Hangzhou City, Zhejiang Province

Patentee after: Zhejiang Xinmai Microelectronics Co.,Ltd.

Address before: 311400 4th floor, building 9, Yinhu innovation center, No.9 Fuxian Road, Yinhu street, Fuyang District, Hangzhou City, Zhejiang Province

Patentee before: Hangzhou xiongmai integrated circuit technology Co.,Ltd.