CN111326179A - Deep learning method for baby cry detection - Google Patents

Deep learning method for baby cry detection

Info

Publication number
CN111326179A
Authority
CN
China
Prior art keywords
layer
convolution
voice
cochlear
deep learning
Prior art date
Legal status
Granted
Application number
CN202010125193.9A
Other languages
Chinese (zh)
Other versions
CN111326179B (en)
Inventor
罗世操
Current Assignee
Zhejiang Xinmai Microelectronics Co ltd
Original Assignee
Hangzhou Xiongmai Integrated Circuit Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Xiongmai Integrated Circuit Technology Co Ltd
Priority to CN202010125193.9A priority Critical patent/CN111326179B/en
Publication of CN111326179A publication Critical patent/CN111326179A/en
Application granted granted Critical
Publication of CN111326179B publication Critical patent/CN111326179B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04: Segmentation; Word boundary detection
    • G10L15/08: Speech classification or search
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G10L25/45: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a deep learning method for baby cry detection and relates to the technical field of voice signal processing. The invention comprises the following steps: a. collecting a voice signal; b. framing the voice signal segment and extracting cochlear voice features for each frame; c. inputting the cochlear voice features of N adjacent frames into a pre-trained baby cry detection deep learning model for inference to judge whether crying is present; d. voting on the N frame classification results using the majority-first voting principle to decide whether the baby is crying within these N frames. The cochlear voice features adopted by the invention are voice feature parameters that better match human auditory perception, and a convolutional network combined with a long short-term memory recurrent neural network is adopted as the acoustic inference model for baby cry detection, so the method can adapt to voice environments with a low signal-to-noise ratio and achieves higher accuracy than traditional methods.

Description

Deep learning method for baby cry detection
Technical Field
The invention belongs to the technical field of voice signal processing, and particularly relates to a deep learning method for baby cry detection.
Background
In modern society, parents or elders are often too busy with work or housework to keep a close watch on a newborn, and a baby can express its emotions and needs only by crying, so home care based on baby cry detection has a large market demand.
A current speech signal recognition system generally consists of three parts: speech signal preprocessing, feature extraction, and classification. Feature extraction is the most important part, and its quality directly affects the recognition result. Most of the speech features previously proposed by researchers are based on prosodic and voice-quality characteristics of speech; they are artificially designed features, so the robustness of the system is low and it is easily affected by the environment.
A deep learning method for baby cry detection is therefore provided to solve the above problems.
Disclosure of Invention
The invention aims to provide a deep learning method for baby cry detection, which realizes baby cry detection on voice signals.
In order to solve the technical problems, the invention is realized by the following technical scheme: the invention relates to a deep learning method for detecting baby cry, which comprises the following steps:
s001: collecting voice signals;
s002: framing the voice signal segment, and extracting cochlear voice features of each frame;
s003: carrying out baby cry detection to output a detection result, wherein the baby cry detection comprises the following steps:
s0031: establishing a deep learning classifier based on a convolutional network and a long-time and short-time memory recurrent neural network;
s0032: inputting the extracted adjacent N frames of cochlear voice features into a deep learning classifier to obtain N frames of classification results, and voting the N frames of classification results by using a majority-first voting principle to obtain a final baby crying detection result.
Further, the voice signal acquisition mode in step S001 is as follows:
s0011: inputting a voice signal by using a microphone device;
s0012: the corresponding speech signal is obtained by sample quantization.
Further, the sampling frequency of the sampling quantization is 16KHz, and the quantization precision is 16 bits.
Further, in step S002, the frame length range for framing the speech signal segment is 20ms to 30ms, and the frame step range is 10ms to 15 ms.
Further, the method for extracting cochlear speech features in step S002 includes the steps of:
s0021: constructing a Gammatone filter bank based on the human cochlear auditory model, wherein its time-domain expression form is as follows:
g(f, t) = k · t^(a-1) · e^(-2πbt) · cos(2πft + Φ), t ≥ 0;
wherein k is the filter gain, a is the filter order, f is the center frequency, Φ is the phase, b is the attenuation factor, the attenuation factor determines the bandwidth of the corresponding filter, the relationship between the attenuation factor and the center frequency is:
b=24.7(4.37f/1000+1);
s0022: filtering the voice signal with the Gammatone filter bank using an FFT-based overlap-add method to obtain the output response signal R(n, t), wherein n is the filter channel index (the number of channels is 64) and t is the time index (a natural number); the length of the output response signal is kept equal to that of the input signal;
s0023: the response energy within each frame is obtained by framing the output response signal R(n, t), yielding a cochleagram-like map, with the formula: G_m(i) = log(|R(i, m)|^(1/2));
wherein i represents the i-th Gammatone filter, i = 0, 1, 2, …, 63 (the filter bank has 64 filters); m represents the m-th frame, m = 0, 1, 2, …, M-1, where M is the number of frames after framing; each frame of the cochleagram-like map is called a Gammatone feature coefficient (GF), and one GF feature vector is composed of 64 frequency components.
Further, the Gammatone filter bank adopts 64-channel 4th-order Gammatone filters, with center frequencies between 50 Hz and 8000 Hz.
Further, the adjacent N frames of cochlear speech features in step S0032 refer to splicing the one-dimensional cochlear speech features of the preceding and following frames into an M × N two-dimensional feature matrix, recorded as:
F = [GF_{m-(N-1)/2}, …, GF_m, …, GF_{m+(N-1)/2}];
wherein F ∈ R^{M×N} and N is an odd number.
Further, the deep learning classifier in S0031 includes two convolutional layers and a long-term memory recurrent neural network layer; the specific method for training the deep learning classifier based on the convolutional network and the long-time and short-time memory recurrent neural network comprises the following steps:
s00311: establishing the training set of the deep learning classifier of the convolutional network and long short-term memory recurrent neural network:
{(X_i, L_i)}, i = 1, 2, …, n;
wherein n is the number of training samples in the training set; i represents the training sample index; X_i ∈ R^{M×N} is a 64x9 two-dimensional cochlear speech feature; the training label corresponding to X_i is L_i ∈ {0, 1}, where 0 represents that X_i is a sample of environmental background sound without baby crying and 1 represents that X_i is a sample of baby crying;
s00312: establishing the convolutional network and long short-term memory recurrent neural network deep learning classifier model, wherein the classifier model consists, in model inference order, of an input layer, two convolution layers, an LSTM layer and a softmax layer;
s00313: establishing an input layer, wherein the input layer is a two-dimensional cochlear feature map with the width of N and the height of M;
s00314: establishing two layers of convolution layers, wherein the size of convolution kernels of the first layer of convolution layers is 3x3, the number of convolution kernels is 8, and after the input layer is subjected to the first layer of convolution operation, a three-dimensional convolution characteristic graph with the output channel number of 8, the height of M and the width of N is used as an input layer of the second layer of convolution; the convolution kernel size of the second layer of convolution layer is 3x3, the number of the convolution kernels is 16, and after the second layer of convolution operation, a three-dimensional convolution characteristic diagram with 16 output channels, M height and N width is output; the convolutional layer activation function is ReLU, and the convolution operation formula is as follows:
z_j = Σ_i W_ij · x_i;
y_j = ReLU(z_j);
wherein W_ij is a classifier model weight parameter, x_i is the neuron input, z_j is the intermediate result, and y_j is the activation output of the neural network, which also serves as the input of the next layer;
s00315: establishing an LSTM layer, wherein a reshape operation is required before the second convolution layer is connected with the LSTM layer: the original 16 x M x N three-dimensional feature map is reshaped into a (16·M) x N two-dimensional feature map, where for the LSTM 16·M denotes the length of the input features and N denotes the time-sequence length;
s00316: the LSTM is a long-short term memory network, is a neural cell mathematical model with memory and consists of three gate structures, namely a forgetting gate, an input gate and an output gate;
the forget gate reads the previous state h_{t-1} and the current input x_t and outputs, for each number in the cell state C_{t-1}, a value between 0 and 1, where 1 means complete retention and 0 means complete rejection; the forget gate mathematical model formula is as follows:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f);
wherein W_f is the neural network weight parameter of the forget gate, b_f is a bias term, and σ denotes the sigmoid function;
the input gate mathematical model formula is as follows:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i);
C̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c);
C_t = f_t * C_{t-1} + i_t * C̃_t;
wherein W_i and W_c are the neural network weight parameters of the input gate, b_i and b_c are bias terms, and σ denotes the sigmoid function;
the mathematical model formula of the output gate is as follows:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o);
h_t = o_t * tanh(C_t);
wherein W_o is the neural network weight parameter of the output gate, b_o is a bias term, and σ denotes the sigmoid function;
s00317: establishing a classifier loss function layer, adopting a two-class softmax classifier; after the (16·M) x N two-dimensional feature map passes through the LSTM layer, N feature vectors h_t are output in time order, each h_t feature layer is connected to the two-class softmax, and N losses are output, with the specific formula:
loss_t = -log( e^{x_t[l]} / Σ_j e^{x_t[j]} );
wherein x_t = W_s · h_t + b_s is the output of the feature layer h_t fed into the softmax classifier and l is the ground-truth class label.
Further, the method for voting on the N frame classification results using the majority-first voting principle in S0032 is: inputting the cochlear speech features extracted for each frame into the deep learning classifier to obtain N frame classification results C_i, i = 1, 2, …, N, and computing the average
p = (1/N) · Σ_{i=1}^{N} C_i;
if p ≥ 0.5, the baby is crying; otherwise, the baby is not crying.
The invention has the following beneficial effects:
Aiming at the problem that traditional voice recognition is easily affected by environmental changes, the cochlear voice features adopted by the invention are voice feature parameters that better match human auditory perception, and a convolutional network with a long short-term memory recurrent neural network classifier is adopted as the acoustic inference model for baby cry detection, so the method can adapt to voice environments with a low signal-to-noise ratio; moreover, the long short-term memory recurrent neural network can make full use of contextual voice information, so the features have richer dimensions and a higher recognition rate than traditional methods.
Of course, it is not necessary for any product in which the invention is practiced to achieve all of the above-described advantages at the same time.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a block diagram of the system of the present invention;
FIG. 2 is a schematic diagram of cochlear speech feature extraction of the present invention;
FIG. 3 is a schematic diagram of a convolutional network and a long-and-short memory recurrent neural network classifier according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-3, the present invention provides a deep learning method for detecting baby cry, comprising the following steps:
s001: collecting voice signals;
s002: framing the voice signal segment, and extracting cochlear voice features of each frame;
s003: carrying out baby cry detection to output a detection result, wherein the baby cry detection comprises the following steps:
s0031: establishing a deep learning classifier based on a convolutional network and a long-time and short-time memory recurrent neural network;
s0032: inputting the extracted adjacent N frames of cochlear voice features into a deep learning classifier to obtain N frames of classification results, and voting the N frames of classification results by using a majority-first voting principle to obtain a final baby crying detection result.
The voice signal acquisition mode in step S001 is as follows:
s0011: inputting a voice signal by using a microphone device;
s0012: the corresponding voice signal is obtained through sampling and quantization, the sampling frequency of the sampling and quantization is 16KHz, and the quantization precision is 16 bits.
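As an illustration only, the 16 kHz / 16-bit signal described in S0011-S0012 could be loaded as follows; reading from a WAV file here merely stands in for the microphone capture, which the patent does not describe in code form, and the function name is hypothetical.

```python
import wave
import numpy as np

def read_pcm16(path: str) -> np.ndarray:
    """Load a mono 16-bit PCM WAV file sampled at 16 kHz (see S0011-S0012)."""
    with wave.open(path, "rb") as wf:
        assert wf.getframerate() == 16000 and wf.getsampwidth() == 2
        pcm = wf.readframes(wf.getnframes())
    # convert to float in [-1, 1); quantization precision is 16 bits
    return np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32768.0
```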
In step S002, the speech signal segment is framed using a 20 ms window, i.e., a frame length of 320 samples, with a 10 ms window sliding step (stride), i.e., 160 samples; if the speech length is Length, the number of frames N of the speech segment is given by the following formula:
N = ⌊(Length - 320) / 160⌋ + 1;
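A minimal framing sketch under the parameters above (20 ms frame = 320 samples, 10 ms stride = 160 samples at 16 kHz); the function name is illustrative and not taken from the patent.

```python
import numpy as np

def frame_signal(speech: np.ndarray, frame_len: int = 320, stride: int = 160) -> np.ndarray:
    """Split a 16 kHz signal into overlapping 20 ms frames with a 10 ms step."""
    num_frames = (len(speech) - frame_len) // stride + 1   # N in the formula above
    return np.stack([speech[i * stride: i * stride + frame_len]
                     for i in range(num_frames)])          # shape (N, 320)
```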
the method for extracting the cochlear speech features in the step S002 comprises the following steps:
s0021: constructing a Gammatone filter bank based on the human cochlear auditory model, wherein its time-domain expression form is as follows:
g(f, t) = k · t^(a-1) · e^(-2πbt) · cos(2πft + Φ), t ≥ 0;
wherein k is the filter gain, a is the filter order, f is the center frequency, Φ is the phase, b is the attenuation factor, the attenuation factor determines the bandwidth of the corresponding filter, the relationship between the attenuation factor and the center frequency is:
b=24.7(4.37f/1000+1);
wherein the Gammatone filter bank adopts 64-channel 4th-order Gammatone filters, with center frequencies between 50 Hz and 8000 Hz;
s0022: filtering the voice signal with the Gammatone filter bank using an FFT-based overlap-add method to obtain the output response signal R(n, t), wherein n is the filter channel index (the number of channels is 64) and t is the time index (a natural number); the length of the output response signal is kept equal to that of the input signal;
s0023: the output response signal R(n, t) is framed and the response energy within each frame is determined to obtain a cochleagram-like map, with the formula: G_m(i) = log(|R(i, m)|^(1/2));
wherein i represents the i-th Gammatone filter, i = 0, 1, 2, …, 63 (the filter bank has 64 filters); m represents the m-th frame, m = 0, 1, 2, …, M-1, where M is the number of frames after framing; each frame of the cochleagram-like map is called a Gammatone feature coefficient (GF), and one GF feature vector is composed of 64 frequency components.
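A sketch of the cochlear (GF) feature extraction of S0021-S0023, assuming ERB-scale spacing of the 64 centre frequencies between 50 Hz and 8000 Hz and reading G_m(i) as the log of the square root of the per-frame response energy; the impulse-response length, the gain normalisation and the function names are illustrative choices, not values fixed by the patent.

```python
import numpy as np
from scipy.signal import fftconvolve   # FFT-based (overlap-add style) convolution

FS, NUM_CH = 16000, 64
FRAME_LEN, STRIDE = 320, 160            # 20 ms frames, 10 ms step

def erb_bandwidth(fc):
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)        # b = 24.7(4.37 f/1000 + 1)

def gammatone_ir(fc, order=4, duration=0.064, fs=FS):
    """4th-order gammatone impulse response k * t^(a-1) * exp(-2*pi*b*t) * cos(2*pi*f*t)."""
    t = np.arange(int(duration * fs)) / fs
    g = t ** (order - 1) * np.exp(-2 * np.pi * erb_bandwidth(fc) * t) * np.cos(2 * np.pi * fc * t)
    return g / np.max(np.abs(g))                    # simple choice for the gain k

def center_freqs(num_ch=NUM_CH, fmin=50.0, fmax=8000.0):
    """Centre frequencies spaced uniformly on the ERB scale (one common convention)."""
    erb = lambda f: 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)
    erb_inv = lambda e: (10 ** (e / 21.4) - 1.0) * 1000.0 / 4.37
    return erb_inv(np.linspace(erb(fmin), erb(fmax), num_ch))

def gf_features(speech):
    """Cochleagram-like GF features, shape (64, number_of_frames)."""
    rows = []
    for fc in center_freqs():
        r = fftconvolve(speech, gammatone_ir(fc), mode="full")[:len(speech)]
        m_frames = (len(r) - FRAME_LEN) // STRIDE + 1
        energy = np.array([np.sum(r[m * STRIDE: m * STRIDE + FRAME_LEN] ** 2)
                           for m in range(m_frames)])
        rows.append(np.log(np.sqrt(energy) + 1e-12))   # G_m(i) = log(|R(i, m)|^(1/2))
    return np.stack(rows)
```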
The adjacent N frames of cochlear speech features in step S0032 are obtained by splicing the one-dimensional cochlear speech features of the preceding and following frames into an M × N two-dimensional feature matrix, recorded as:
F = [GF_{m-(N-1)/2}, …, GF_m, …, GF_{m+(N-1)/2}];
wherein F ∈ R^{M×N} and N is an odd number.
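A small sketch of splicing adjacent frames into the M × N (here 64 x 9) matrix F described above; padding the boundaries by repeating the first and last frames is an assumption, since the patent does not say how edge frames are handled.

```python
import numpy as np

def stack_context(gf: np.ndarray, n_context: int = 9) -> np.ndarray:
    """gf: (64, num_frames) GF features -> (num_frames, 64, n_context) centred windows."""
    half = n_context // 2                                       # N is odd, window is centred
    padded = np.pad(gf, ((0, 0), (half, half)), mode="edge")    # repeat edge frames (assumption)
    return np.stack([padded[:, m: m + n_context] for m in range(gf.shape[1])])
```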
Wherein, the deep learning classifier in S0031 comprises two convolution layers and a long-time memory recurrent neural network Layer (LSTM); the specific method for training the deep learning classifier based on the convolutional network and the long-time and short-time memory recurrent neural network comprises the following steps:
s00311: establishing the training set of the deep learning classifier of the convolutional network and long short-term memory recurrent neural network:
{(X_i, L_i)}, i = 1, 2, …, n;
wherein n is the number of training samples in the training set; i represents the training sample index; X_i ∈ R^{M×N} is a 64x9 two-dimensional cochlear speech feature; the training label corresponding to X_i is L_i ∈ {0, 1}, where 0 represents that X_i is a sample of environmental background sound without baby crying and 1 represents that X_i is a sample of baby crying;
s00312: establishing the convolutional network and long short-term memory recurrent neural network deep learning classifier model, wherein the classifier model consists, in model inference order, of an input layer, two convolution layers, an LSTM layer and a softmax layer;
s00313: establishing an input layer, wherein the input layer is a two-dimensional cochlear feature map with the width of N and the height of M;
s00314: establishing two layers of convolution layers, wherein the size of convolution kernels of the first layer of convolution layers is 3x3, the number of convolution kernels is 8, and after the input layer is subjected to the first layer of convolution operation, a three-dimensional convolution characteristic graph with the output channel number of 8, the height of M and the width of N is used as an input layer of the second layer of convolution; the convolution kernel size of the second layer of convolution layer is 3x3, the number of the convolution kernels is 16, and after the second layer of convolution operation, a three-dimensional convolution characteristic diagram with 16 output channels, M height and N width is output; the convolutional layer activation function is ReLU, and the convolution operation formula is as follows:
z_j = Σ_i W_ij · x_i;
y_j = ReLU(z_j);
wherein W_ij is a classifier model weight parameter, x_i is the neuron input, z_j is the intermediate result, and y_j is the activation output of the neural network, which also serves as the input of the next layer;
s00315: establishing an LSTM layer, wherein a reshape operation is required before the second convolution layer is connected with the LSTM layer: the original 16 x M x N three-dimensional feature map is reshaped into a (16·M) x N two-dimensional feature map, where for the LSTM 16·M denotes the length of the input features and N denotes the time-sequence length;
s00316: the LSTM is a long-short term memory network, is a neural cell mathematical model with memory and mainly comprises three gate structures of a forgetting gate, an input gate and an output gate;
The first step determines what information will be discarded from the cell state; this is done by the forget gate, which reads the previous state h_{t-1} and the current input x_t and outputs, for each number in the cell state C_{t-1}, a value between 0 and 1, where 1 means complete retention and 0 means complete rejection; the forget gate mathematical model formula is as follows:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f);
wherein W_f is the neural network weight parameter of the forget gate, b_f is a bias term, and σ denotes the sigmoid function;
The second step determines how much new information to add to the cell state; this is done by the input gate: first, a sigmoid input-selection layer i_t decides which information needs to be updated, and a tanh layer produces the candidate content C̃_t for the update; then the old cell state C_{t-1} is updated to C_t by multiplying the old state by f_t and adding i_t * C̃_t.
The input gate mathematical model formula is as follows:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i);
C̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c);
C_t = f_t * C_{t-1} + i_t * C̃_t;
wherein W_i and W_c are the neural network weight parameters of the input gate, b_i and b_c are bias terms, and σ denotes the sigmoid function;
The third step determines what value is output; this is done by the output gate: first, a sigmoid layer decides which part of the cell state will be output; then the cell state is passed through tanh (to get values between -1 and 1) and multiplied by the output of the sigmoid gate, so that only the determined part is finally output; the output gate mathematical model formula is as follows:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o);
h_t = o_t * tanh(C_t);
wherein W_o is the neural network weight parameter of the output gate, b_o is a bias term, and σ denotes the sigmoid function;
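The three gate equations above can be exercised directly with a minimal NumPy LSTM-cell step; the weight shapes are the standard ones (each W of shape (H, H + d) for hidden size H and input size d), which the patent does not spell out, so they are an assumption of this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """One LSTM time step following the forget / input / output gate formulas above."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # input gate
    c_tilde = np.tanh(W_c @ z + b_c)         # candidate cell content
    c_t = f_t * c_prev + i_t * c_tilde       # updated cell state C_t
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o_t * np.tanh(c_t)                 # hidden output h_t
    return h_t, c_t
```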
s00317: establishing a classifier loss function layer, adopting a two-class softmax classifier; after the (16·M) x N two-dimensional feature map passes through the LSTM layer, N feature vectors h_t are output in time order, each h_t feature layer is connected to the two-class softmax, and N losses are output, with the specific formula:
loss_t = -log( e^{x_t[l]} / Σ_j e^{x_t[j]} );
wherein x_t = W_s · h_t + b_s is the output of the feature layer h_t fed into the softmax classifier and l is the ground-truth class label; during training, the network weights and bias terms are updated by the backpropagation algorithm.
In S0032, the method of voting on the N frame classification results using the majority-first voting principle is: inputting the cochlear speech features extracted for each frame into the deep learning classifier to obtain N frame classification results C_i, i = 1, 2, …, N, and computing the average
p = (1/N) · Σ_{i=1}^{N} C_i;
if p ≥ 0.5, the baby is crying; otherwise, the baby is not crying.
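A sketch of the majority-first voting rule above; the per-frame outputs C_i are assumed here to be hard 0/1 decisions taken from the classifier.

```python
import numpy as np

def vote_baby_cry(frame_predictions) -> bool:
    """frame_predictions: iterable of 0/1 decisions C_i over N adjacent frames."""
    p = float(np.mean(list(frame_predictions)))   # p = (1/N) * sum_i C_i
    return p >= 0.5                               # crying if at least half the frames vote 1
```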
The beneficial effects produced by the above technical solution are as follows: the invention provides a deep learning method for baby cry detection; aiming at the problem that traditional voice recognition is easily affected by environmental changes, the cochlear voice features adopted are voice feature parameters that better match human auditory perception, and a convolutional network with a long short-term memory recurrent neural network classifier is adopted as the acoustic inference model for baby cry detection, so the method can adapt to voice environments with a low signal-to-noise ratio; in addition, the long short-term memory recurrent neural network can make full use of contextual voice information, so the features have richer dimensions and a higher recognition rate than traditional methods.
In the description herein, references to the description of "one embodiment," "an example," "a specific example" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (9)

1. A deep learning method for detecting baby crying is characterized by comprising the following steps:
s001: collecting voice signals;
s002: framing the voice signal segment, and extracting cochlear voice features of each frame;
s003: carrying out baby cry detection to output a detection result, wherein the baby cry detection comprises the following steps:
s0031: establishing a deep learning classifier based on a convolutional network and a long-time and short-time memory recurrent neural network;
s0032: inputting the extracted adjacent N frames of cochlear voice features into a deep learning classifier to obtain N frames of classification results, and voting the N frames of classification results by using a majority-first voting principle to obtain a final baby crying detection result.
2. The method as claimed in claim 1, wherein the voice signal is collected in step S001 in the following manner:
s0011: inputting a voice signal by using a microphone device;
s0012: the corresponding speech signal is obtained by sample quantization.
3. The method as claimed in claim 2, wherein the sampling frequency of the sampling quantization is 16KHz, and the quantization precision is 16 bits.
4. The method as claimed in claim 1, wherein the frame length of the framing of the speech signal segment in step S002 is in the range of 20ms to 30ms, and the frame step length is in the range of 10ms to 15 ms.
5. The method as claimed in claim 1, wherein the step S002 of extracting the cochlear speech feature comprises the steps of:
s0021: constructing a Gammatone filter bank based on the human cochlear auditory model, wherein its time-domain expression form is as follows:
g(f, t) = k · t^(a-1) · e^(-2πbt) · cos(2πft + Φ), t ≥ 0;
wherein k is the filter gain, a is the filter order, f is the center frequency, Φ is the phase, b is the attenuation factor, the attenuation factor determines the bandwidth of the corresponding filter, the relationship between the attenuation factor and the center frequency is:
b=24.7(4.37f/1000+1);
s0022: filtering the voice signal with the Gammatone filter bank using an FFT-based overlap-add method to obtain the output response signal R(n, t), wherein n is the filter channel index (the number of channels is 64) and t is the time index (a natural number); the length of the output response signal is kept equal to that of the input signal;
s0023: the response energy within each frame is obtained by framing the output response signal R(n, t), yielding a cochleagram-like map, with the formula: G_m(i) = log(|R(i, m)|^(1/2));
wherein i represents the i-th Gammatone filter, i = 0, 1, 2, …, 63 (the filter bank has 64 filters); m represents the m-th frame, m = 0, 1, 2, …, M-1, where M is the number of frames after framing; each frame of the cochleagram-like map is called a Gammatone feature coefficient (GF), and one GF feature vector is composed of 64 frequency components.
6. The method as claimed in claim 5, wherein the Gammatone filter bank adopts 64-channel 4th-order Gammatone filters, with center frequencies between 50 Hz and 8000 Hz.
7. The method as claimed in claim 1, wherein the adjacent N frames of cochlear speech features in step S0032 are obtained by concatenating the one-dimensional cochlear speech features of preceding and following frames into an M × N two-dimensional feature matrix, recorded as:
F = [GF_{m-(N-1)/2}, …, GF_m, …, GF_{m+(N-1)/2}];
wherein F ∈ R^{M×N} and N is an odd number.
8. The method as claimed in claim 1, wherein the deep learning classifier in S0031 comprises two convolution layers and a long-term memory recurrent neural network layer; the specific method for training the deep learning classifier based on the convolutional network and the long-time and short-time memory recurrent neural network comprises the following steps:
s00311: establishing the training set of the deep learning classifier of the convolutional network and long short-term memory recurrent neural network:
{(X_i, L_i)}, i = 1, 2, …, n;
wherein n is the number of training samples in the training set; i represents the training sample index; X_i ∈ R^{M×N} is a 64x9 two-dimensional cochlear speech feature; the training label corresponding to X_i is L_i ∈ {0, 1}, where 0 represents that X_i is a sample of environmental background sound without baby crying and 1 represents that X_i is a sample of baby crying;
s00312: establishing the convolutional network and long short-term memory recurrent neural network deep learning classifier model, wherein the classifier model consists, in model inference order, of an input layer, two convolution layers, an LSTM layer and a softmax layer;
s00313: establishing an input layer, wherein the input layer is a two-dimensional cochlear feature map with the width of N and the height of M;
s00314: establishing two layers of convolution layers, wherein the size of convolution kernels of the first layer of convolution layers is 3x3, the number of convolution kernels is 8, and after the input layer is subjected to the first layer of convolution operation, a three-dimensional convolution characteristic graph with the output channel number of 8, the height of M and the width of N is used as an input layer of the second layer of convolution; the convolution kernel size of the second layer of convolution layer is 3x3, the number of the convolution kernels is 16, and after the second layer of convolution operation, a three-dimensional convolution characteristic diagram with 16 output channels, M height and N width is output; the convolutional layer activation function is ReLU, and the convolution operation formula is as follows:
z_j = Σ_i W_ij · x_i;
y_j = ReLU(z_j);
wherein W_ij is a classifier model weight parameter, x_i is the neuron input, z_j is the intermediate result, and y_j is the activation output of the neural network, which also serves as the input of the next layer;
s00315: establishing an LSTM layer, wherein a reshape operation is required before the second convolution layer is connected with the LSTM layer: the original 16 x M x N three-dimensional feature map is reshaped into a (16·M) x N two-dimensional feature map, where for the LSTM 16·M denotes the length of the input features and N denotes the time-sequence length;
s00316: the LSTM is a long-short term memory network, is a neural cell mathematical model with memory and consists of three gate structures, namely a forgetting gate, an input gate and an output gate;
the forget gate reads the previous state h_{t-1} and the current input x_t and outputs, for each number in the cell state C_{t-1}, a value between 0 and 1, where 1 means complete retention and 0 means complete rejection; the forget gate mathematical model formula is as follows:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f);
wherein W_f is the neural network weight parameter of the forget gate, b_f is a bias term, and σ denotes the sigmoid function;
the input gate mathematical model formula is as follows:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i);
C̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c);
C_t = f_t * C_{t-1} + i_t * C̃_t;
wherein W_i and W_c are the neural network weight parameters of the input gate, b_i and b_c are bias terms, and σ denotes the sigmoid function;
the mathematical model formula of the output gate is as follows:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o);
h_t = o_t * tanh(C_t);
wherein W_o is the neural network weight parameter of the output gate, b_o is a bias term, and σ denotes the sigmoid function;
s00317: establishing a classifier loss function layer, adopting a two-class softmax classifier; after the (16·M) x N two-dimensional feature map passes through the LSTM layer, N feature vectors h_t are output in time order, each h_t feature layer is connected to the two-class softmax, and N losses are output, with the specific formula:
loss_t = -log( e^{x_t[l]} / Σ_j e^{x_t[j]} );
wherein x_t = W_s · h_t + b_s is the output of the feature layer h_t fed into the softmax classifier and l is the ground-truth class label.
9. The method as claimed in claim 1, wherein the method for voting on the N frame classification results using the majority-first voting principle in S0032 comprises: inputting the cochlear speech features extracted for each frame into the deep learning classifier to obtain N frame classification results C_i, i = 1, 2, …, N, and computing the average
p = (1/N) · Σ_{i=1}^{N} C_i;
if p ≥ 0.5, the baby is crying; otherwise, the baby is not crying.
CN202010125193.9A 2020-02-27 2020-02-27 Deep learning method for detecting crying of baby Active CN111326179B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010125193.9A CN111326179B (en) 2020-02-27 2020-02-27 Deep learning method for detecting crying of baby

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010125193.9A CN111326179B (en) 2020-02-27 2020-02-27 Deep learning method for detecting crying of baby

Publications (2)

Publication Number Publication Date
CN111326179A true CN111326179A (en) 2020-06-23
CN111326179B CN111326179B (en) 2023-05-26

Family

ID=71172973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010125193.9A Active CN111326179B (en) 2020-02-27 2020-02-27 Deep learning method for detecting crying of baby

Country Status (1)

Country Link
CN (1) CN111326179B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102163427A (en) * 2010-12-20 2011-08-24 北京邮电大学 Method for detecting audio exceptional event based on environmental model
CN107818779A (en) * 2017-09-15 2018-03-20 北京理工大学 A kind of infant's crying sound detection method, apparatus, equipment and medium
CN109243493A (en) * 2018-10-30 2019-01-18 南京工程学院 Based on the vagitus emotion identification method for improving long memory network in short-term
CN109509484A (en) * 2018-12-25 2019-03-22 科大讯飞股份有限公司 A kind of prediction technique and device of baby crying reason
CN110428843A (en) * 2019-03-11 2019-11-08 杭州雄迈信息技术有限公司 A kind of voice gender identification deep learning method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KAREN SANTIAGO-SÁNCHEZ 等: "Type-2 Fuzzy Sets Applied to Pattern Matching for the Classification of Cries of Infants under Neurological Risk", 《ICIC 2009》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112382311A (en) * 2020-11-16 2021-02-19 谭昊玥 Infant crying intention identification method and device based on hybrid neural network
CN112382311B (en) * 2020-11-16 2022-08-19 谭昊玥 Infant crying intention identification method and device based on hybrid neural network
WO2022178316A1 (en) * 2021-02-18 2022-08-25 The Johns Hopkins University Processing input signals using machine learning for neural activation
CN117037849A (en) * 2021-02-26 2023-11-10 武汉星巡智能科技有限公司 Infant crying classification method, device and equipment based on feature extraction and classification
CN113392736A (en) * 2021-05-31 2021-09-14 五八到家有限公司 Monitoring method, system, equipment and medium for improving safety of home service

Also Published As

Publication number Publication date
CN111326179B (en) 2023-05-26

Similar Documents

Publication Publication Date Title
CN111326179B (en) Deep learning method for detecting crying of baby
CN109326302B (en) Voice enhancement method based on voiceprint comparison and generation of confrontation network
WO2021143327A1 (en) Voice recognition method, device, and computer-readable storage medium
CN110428843B (en) Voice gender recognition deep learning method
EP4002362A1 (en) Method and apparatus for training speech separation model, storage medium, and computer device
CN112581979B (en) Speech emotion recognition method based on spectrogram
CN109473120A (en) A kind of abnormal sound signal recognition method based on convolutional neural networks
CN108766419A (en) A kind of abnormal speech detection method based on deep learning
JP2654917B2 (en) Speaker independent isolated word speech recognition system using neural network
CN109243493B (en) Infant crying emotion recognition method based on improved long-time and short-time memory network
CN110853680A (en) double-BiLSTM structure with multi-input multi-fusion strategy for speech emotion recognition
CN111951824A (en) Detection method for distinguishing depression based on sound
CN105206270A (en) Isolated digit speech recognition classification system and method combining principal component analysis (PCA) with restricted Boltzmann machine (RBM)
CN108520753A (en) Voice lie detection method based on the two-way length of convolution memory network in short-term
CN111899757A (en) Single-channel voice separation method and system for target speaker extraction
CN112587153A (en) End-to-end non-contact atrial fibrillation automatic detection system and method based on vPPG signal
CN113643723A (en) Voice emotion recognition method based on attention CNN Bi-GRU fusion visual information
Świetlicka et al. Hierarchical ANN system for stuttering identification
CN115862684A (en) Audio-based depression state auxiliary detection method for dual-mode fusion type neural network
CN115602152B (en) Voice enhancement method based on multi-stage attention network
CN115346561B (en) Depression emotion assessment and prediction method and system based on voice characteristics
CN115881164A (en) Voice emotion recognition method and system
CN112466284A (en) Mask voice identification method
CN111723717A (en) Silent voice recognition method and system
CN115171878A (en) Depression detection method based on BiGRU and BiLSTM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 311422 4th floor, building 9, Yinhu innovation center, 9 Fuxian Road, Yinhu street, Fuyang District, Hangzhou City, Zhejiang Province

Patentee after: Zhejiang Xinmai Microelectronics Co.,Ltd.

Address before: 311400 4th floor, building 9, Yinhu innovation center, No.9 Fuxian Road, Yinhu street, Fuyang District, Hangzhou City, Zhejiang Province

Patentee before: Hangzhou xiongmai integrated circuit technology Co.,Ltd.