CN111326179B - Deep learning method for detecting crying of baby

Deep learning method for detecting crying of baby

Info

Publication number
CN111326179B
CN111326179B (Application CN202010125193.9A)
Authority
CN
China
Prior art keywords
layer
deep learning
convolution
voice
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010125193.9A
Other languages
Chinese (zh)
Other versions
CN111326179A (en)
Inventor
罗世操
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Xinmai Microelectronics Co ltd
Original Assignee
Hangzhou Xiongmai Integrated Circuit Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Xiongmai Integrated Circuit Technology Co Ltd filed Critical Hangzhou Xiongmai Integrated Circuit Technology Co Ltd
Priority to CN202010125193.9A
Publication of CN111326179A
Application granted
Publication of CN111326179B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques specially adapted for estimating an emotional state
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L25/45 Speech or voice analysis techniques characterised by the type of analysis window
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/08 Speech classification or search

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a deep learning method for detecting crying of infants, and relates to the technical field of voice signal processing. The method comprises the following steps: a. collecting voice signals; b. framing the voice signal segments and extracting cochlear voice features from each frame; c. inputting the adjacent N frames of voice features into a pre-trained infant crying detection deep learning model for inference to judge whether crying exists; d. voting on the N frame classification results using the majority priority voting principle to judge whether baby crying is present in the N frames. The cochlear voice features adopted by the invention are voice feature parameters that better match human auditory perception, and a convolutional network together with a long short-term memory recurrent neural network is adopted as the acoustic inference model for infant crying detection, so that the invention can adapt to voice environments with a low signal-to-noise ratio and achieves higher accuracy than traditional methods.

Description

Deep learning method for detecting crying of baby
Technical Field
The invention belongs to the technical field of voice signal processing, and particularly relates to a deep learning method for detecting crying of infants.
Background
In modern society, parents or grandparents are often too busy with work or housework to keep a constant watch over a newborn infant, and infants can express emotion and needs only by crying, so home care based on infant crying detection has great market demand.
A current voice signal recognition system is usually composed of three parts: voice signal preprocessing, feature extraction and classification. Feature extraction is the most important part, and its quality directly influences the recognition result. The voice features proposed by earlier researchers are mostly based on the prosodic features and voice quality features of speech and are all manually designed, so the robustness of such systems is low and easily affected by the environment.
A deep learning method for detecting the crying of a baby is therefore provided to solve the above problems.
Disclosure of Invention
The invention aims to provide a deep learning method for detecting crying of a baby, which detects baby crying from a voice signal.
In order to solve the technical problems, the invention is realized by the following technical scheme: the invention relates to a deep learning method for detecting crying of infants, which comprises the following steps:
s001: collecting voice signals;
s002: framing the voice signal segment, and extracting cochlear voice characteristics from each frame;
s003: performing infant crying detection and outputting a detection result, wherein the infant crying detection comprises the following steps:
s0031: establishing a deep learning classifier based on a convolutional network and a long-short-term memory recurrent neural network;
s0032: and inputting the extracted adjacent N frames of cochlea speech features into a deep learning classifier to obtain N frames of classification results, voting the N frames of classification results by applying a majority priority voting principle, and obtaining a final infant crying detection result.
Further, the manner of voice signal acquisition in the step S001 is as follows:
s0011: inputting a voice signal with a microphone device;
s0012: the corresponding speech signal is obtained by sample quantization.
Further, the sampling frequency of the sampling quantization is 16KHz, and the quantization precision is 16 bits.
Further, in the step S002, the frame length used for framing the voice signal segment ranges from 20 ms to 30 ms, and the frame step length ranges from 10 ms to 15 ms.
Further, the method for extracting the cochlear speech features in the step S002 comprises the following steps:
s0021: constructing a Gammatone filter bank based on a human ear cochlear auditory model, wherein the time domain expression form is as follows:
g(f,t) = k·t^(a-1)·e^(-2πbt)·cos(2πft + Φ), t ≥ 0;
wherein k is the filter gain, a is the filter order, f is the center frequency, Φ is the phase, b is the attenuation factor, the attenuation factor determines the bandwidth of the corresponding filter, and the relationship between the attenuation factor and the center frequency is:
b=24.7(4.37f/1000+1);
s0022: filtering the voice signal by utilizing an FFT-based overlap-add method to obtain an output response signal R(n, t), wherein n is the number of filter channels, t is the length of the output response signal and a natural number, and the length of the output response signal is kept equal to that of the input signal;
s0023: framing the output response signal R(n, t) and taking the response energy within each frame to obtain a cochleagram-like map, wherein the formula is as follows: G_m(i) = log( [|R(i, m)|]^(1/2) );
wherein i represents the i-th Gammatone filter, i = 0, 1, 2, …, N-1, where N = 64 is the number of filters in the bank; m represents the m-th frame, m = 0, 1, 2, …, M-1, M being the number of frames after framing; each frame of the cochleagram-like map is called a Gammatone feature coefficient GF, and one GF feature vector is composed of 8 frequency components.
Further, the Gammatone filter bank adopts 4th-order Gammatone filters with 64 channels, and the center frequencies lie between 50 Hz and 8000 Hz.
Further, the adjacent N frames of cochlear speech features in step S0032 refer to splicing the one-dimensional cochlear speech features of the preceding and following frames into an M×N two-dimensional feature matrix, recorded as:
F = [GF_1, GF_2, …, GF_N];
wherein F ∈ R^(M×N) and N is an odd number.
Further, the deep learning classifier in S0031 includes two convolution layers and a long-short-term memory recurrent neural network layer; the specific method for training the deep learning classifier based on the convolutional network and the long-short-term memory recurrent neural network comprises the following steps:
s00311: establishing the training set of the deep learning classifier of the convolutional network and the long short-term memory recurrent neural network:
D = {(X_i, L_i)}, i = 1, 2, …, n;
wherein n is the number of training samples in the training set; i represents the training sample index; X_i ∈ R^(M×N) is a 64×9 two-dimensional cochlear feature; the training label corresponding to X_i is L_i, L_i ∈ {0, 1}, where 0 indicates that X_i is an environmental background sound sample without baby crying and 1 indicates that X_i is a baby crying sample;
s00312: establishing a deep learning classifier model of a convolutional network and a long-short-term memory recurrent neural network, wherein the classifier model sequentially comprises an input layer, two convolutional layers, an LSTM layer and a softmax layer according to a model reasoning sequence;
s00313: establishing an input layer, wherein the input layer is a two-dimensional cochlea feature map with a width of N and a height of M;
s00314: establishing two layers of convolution layers, wherein the convolution kernel size of the first layer of convolution layer is 3x3, the number of convolution kernels is 8, and after the first layer of convolution operation is performed on the input layer, a three-dimensional convolution characteristic diagram with 8 output channels, M height and N width is used as the input layer of the second layer of convolution; the convolution kernel size of the second layer of convolution layers is 3x3, the number of convolution kernels is 16, and after the second layer of convolution operation, the three-dimensional convolution characteristic diagram with 16 output channels, M height and N width is output; the convolutional layer activation function is ReLU, and the convolutional operation formula is as follows:
z_j = Σ_i W_ij * x_i, where * denotes the convolution operation;
y_j = ReLU(z_j);
W_ij are the weight parameters of the classifier model, x_i is the neuron input, z_j is an intermediate result, and y_j is the neural network activation output, used as the input of the next layer;
s00315: establishing the LSTM layer; before the output of the second convolution layer is connected to the LSTM layer, a reshape operation is required, reshaping the original 16×M×N three-dimensional feature map into a (16·M)×N two-dimensional feature map; for the LSTM, 16·M represents the length of the input feature and N represents the time sequence length;
s00316: the LSTM is a long short-term memory network, a neural cell mathematical model with memory, consisting of three gate structures: a forget gate, an input gate and an output gate;
The forget gate reads the hidden state h_(t-1) passed from the previous state and the current input x_t, and outputs a value between 0 and 1 for each element of the cell state C_(t-1), where 1 means keep completely and 0 means discard completely; the forget gate mathematical model formula is as follows:
f_t = σ(W_f · [h_(t-1), x_t] + b_f);
wherein W_f is the neural network weight parameter of the forget gate, b_f is a bias term, and σ represents the sigmoid function;
the mathematical model formula of the input gate is as follows:
i_t = σ(W_i · [h_(t-1), x_t] + b_i);
C̃_t = tanh(W_c · [h_(t-1), x_t] + b_c);
C_t = f_t * C_(t-1) + i_t * C̃_t;
wherein W_i and W_c are the neural network weight parameters of the input gate, b_i and b_c are bias terms, and σ represents the sigmoid function;
the output gate mathematical model formula is as follows:
o_t = σ(W_o · [h_(t-1), x_t] + b_o);
h_t = o_t * tanh(C_t);
wherein W_o is the neural network weight parameter of the output gate, b_o is a bias term, and σ represents the sigmoid function;
s00317: establishing a classifier loss function layer, and adopting a two-class softmax classifier, wherein the 16-X MxN classifier is adoptedAfter the two-dimensional feature map passes through the LSTM layer, N h are output according to the time sequence t The feature layer is connected with the two kinds of softmax, outputs N loss, and the specific formula is as follows:
Figure BDA0002394195400000053
wherein x is t =W s ·h t +b s Is h t The feature layer is output after passing through the softmax classifier.
Further, the method for voting on the N frame classification results by applying the majority priority voting principle in S0032 is as follows: inputting the cochlear speech features extracted from each frame into the deep learning classifier to obtain N frame classification results C_i, i = 1, 2, …, N, and taking the average
p = (1/N) · Σ_(i=1)^(N) C_i;
if p ≥ 0.5, baby crying is indicated; otherwise, no baby crying is indicated.
The invention has the following beneficial effects:
Aiming at the fact that traditional voice recognition is easily affected by environmental changes, the cochlear voice features adopted by the invention are voice feature parameters that better match human auditory perception, and the convolutional network and long short-term memory recurrent neural network classifier are adopted as the acoustic inference model for infant crying detection, so that the invention can adapt to voice environments with a low signal-to-noise ratio; moreover, the long short-term memory recurrent neural network can make full use of contextual voice information, so the feature dimensions are richer and the invention achieves a higher recognition rate than traditional methods.
Of course, it is not necessary for any one product to practice the invention to achieve all of the advantages set forth above at the same time.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a block diagram of a system of the present invention;
FIG. 2 is a schematic view of cochlear speech feature extraction of the present invention;
fig. 3 is a schematic diagram of a convolutional network and long and short memory recurrent neural network classifier of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1-3, the invention discloses a deep learning method for detecting crying of infants, which comprises the following steps:
s001: collecting voice signals;
s002: framing the voice signal segment, and extracting cochlear voice characteristics from each frame;
s003: performing infant crying detection and outputting a detection result, wherein the infant crying detection comprises the following steps:
s0031: establishing a deep learning classifier based on a convolutional network and a long-short-term memory recurrent neural network;
s0032: and inputting the extracted adjacent N frames of cochlea speech features into a deep learning classifier to obtain N frames of classification results, voting the N frames of classification results by applying a majority priority voting principle, and obtaining a final infant crying detection result.
The manner of voice signal acquisition in step S001 is as follows:
s0011: inputting a voice signal with a microphone device;
s0012: the corresponding voice signal is obtained through sampling and quantization, the sampling frequency of the sampling and quantization is 16KHz, and the quantization precision is 16 bits.
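For illustration only (not part of the claimed method), the sampling step can be reproduced with Python's standard wave module; the file name is hypothetical and the normalization to floating point is an added convenience:

```python
import wave

import numpy as np


def load_pcm16(path):
    """Read a mono 16-bit PCM WAV file and return (samples, sample_rate)."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2, "expected 16-bit quantization"
        assert wf.getnchannels() == 1, "expected a mono microphone signal"
        rate = wf.getframerate()              # e.g. 16000 Hz as in the patent
        raw = wf.readframes(wf.getnframes())
    samples = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
    return samples, rate


# Usage (hypothetical file name):
# signal, fs = load_pcm16("baby_room.wav")
```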
In step S002, the voice signal segment is framed with a 20 ms window, i.e. a frame length of 320 points, and a window sliding step (stride) of 10 ms, i.e. 160 points; with the voice length denoted Length, the number of voice segment frames N is given by the following formula:
N = floor((Length - 320) / 160) + 1;
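A minimal framing sketch consistent with the 320-point window and 160-point stride above; the frame-count expression mirrors the formula given (illustrative code, not the patent's implementation):

```python
import numpy as np


def frame_signal(signal, frame_len=320, stride=160):
    """Split a 1-D signal into overlapping frames of `frame_len` samples."""
    length = len(signal)
    n_frames = (length - frame_len) // stride + 1   # N = floor((Length-320)/160) + 1
    frames = np.stack([signal[i * stride: i * stride + frame_len]
                       for i in range(n_frames)])
    return frames                                    # shape: (n_frames, frame_len)


# Example: 1 s of audio at 16 kHz -> 99 frames of 20 ms with a 10 ms hop
frames = frame_signal(np.zeros(16000))
print(frames.shape)   # (99, 320)
```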
the method for extracting the cochlear speech features in the step S002 comprises the following steps:
s0021: constructing a Gammatone filter bank based on a human ear cochlear auditory model, wherein the time domain expression form is as follows:
g(f,t) = k·t^(a-1)·e^(-2πbt)·cos(2πft + Φ), t ≥ 0;
wherein k is the filter gain, a is the filter order, f is the center frequency, phi is the phase, b is the attenuation factor, the attenuation factor determines the bandwidth of the corresponding filter, and the relationship between the attenuation factor and the center frequency is:
b=24.7(4.37f/1000+1);
the Gammatone filter bank adopts 4th-order Gammatone filters with 64 channels, and the center frequencies lie between 50 Hz and 8000 Hz;
s0022: filtering the voice signal by utilizing an FFT-based overlap-add method to obtain an output response signal R(n, t), wherein n is the number of filter channels, t is the length of the output response signal and a natural number, and the length of the output response signal is kept equal to that of the input signal;
s0023: framing the output response signal R(n, t) and taking the response energy within each frame to obtain a cochleagram-like map, wherein the formula is as follows: G_m(i) = log( [|R(i, m)|]^(1/2) );
wherein i represents the i-th Gammatone filter, i = 0, 1, 2, …, N-1, where N = 64 is the number of filters in the bank; m represents the m-th frame, m = 0, 1, 2, …, M-1, M being the number of frames after framing; each frame of the cochleagram-like map is called a Gammatone feature coefficient GF, and one GF feature vector is composed of 8 frequency components.
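The feature extraction of steps S0021-S0023 can be sketched as follows. This is an illustrative reconstruction, not the patented implementation: the ERB-scale spacing of the 64 center frequencies, the impulse-response length, unit gain (k = 1) with zero phase, and the interpretation of |R|(i, m) as the in-frame energy are all assumptions beyond what the text specifies; scipy.signal.fftconvolve stands in for the FFT-based overlap-add filtering.

```python
import numpy as np
from scipy.signal import fftconvolve


def gammatone_ir(fc, fs=16000, order=4, duration=0.064):
    """Impulse response g(f,t) = k*t^(a-1)*exp(-2*pi*b*t)*cos(2*pi*f*t),
    with b = 24.7*(4.37*f/1000 + 1) as in the patent (phase 0, gain k = 1)."""
    t = np.arange(int(duration * fs)) / fs
    b = 24.7 * (4.37 * fc / 1000.0 + 1.0)
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.max(np.abs(g))        # crude normalization (assumption)


def center_frequencies(n=64, fmin=50.0, fmax=8000.0):
    """64 center frequencies between 50 Hz and 8000 Hz, spaced on the ERB scale
    (the spacing rule is an assumption; the patent only gives the range)."""
    erb = lambda f: 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)
    erb_inv = lambda e: (10 ** (e / 21.4) - 1.0) * 1000.0 / 4.37
    return erb_inv(np.linspace(erb(fmin), erb(fmax), n))


def cochleagram(signal, fs=16000, frame_len=320, stride=160):
    """Filter with the 64-channel bank (FFT-based convolution, output length kept
    equal to the input), then take log(sqrt(energy)) per channel and frame: Gm(i)."""
    fcs = center_frequencies()
    n_frames = (len(signal) - frame_len) // stride + 1
    G = np.empty((len(fcs), n_frames))
    for i, fc in enumerate(fcs):
        r = fftconvolve(signal, gammatone_ir(fc, fs), mode="same")   # R(i, t)
        for m in range(n_frames):
            e = np.sum(r[m * stride: m * stride + frame_len] ** 2)   # in-frame energy
            G[i, m] = np.log(np.sqrt(e) + 1e-12)                     # Gm(i)
    return G                                                         # shape: (64, M)
```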
The adjacent N frames of cochlear speech features in step S0032 refer to splicing the one-dimensional cochlear speech features of the preceding and following frames into an M×N two-dimensional feature matrix, recorded as:
F = [GF_1, GF_2, …, GF_N];
wherein F ∈ R^(M×N) and N is an odd number.
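Assembling the M×N classifier inputs (M = 64, N = 9) from a cochleagram can be sketched as below; taking each frame together with its 4 preceding and 4 following neighbours and padding the edges by repetition are assumptions, since the text only states that adjacent frames are spliced into an M×N matrix:

```python
import numpy as np


def stack_context(G, n_context=9):
    """Stack each frame of a (64, M) cochleagram with its neighbours into a
    (64, n_context) matrix F, n_context odd; edges are padded by repetition."""
    assert n_context % 2 == 1
    half = n_context // 2
    padded = np.pad(G, ((0, 0), (half, half)), mode="edge")   # padding is an assumption
    m_frames = G.shape[1]
    return np.stack([padded[:, m:m + n_context] for m in range(m_frames)])


# Example: a 64 x 100 cochleagram yields 100 matrices of shape 64 x 9
F = stack_context(np.random.randn(64, 100))
print(F.shape)   # (100, 64, 9)
```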
Wherein, the deep learning classifier in S0031 comprises two convolution layers and a long and short memory recurrent neural network Layer (LSTM); the specific method for training the deep learning classifier based on the convolutional network and the long-short-term memory recurrent neural network comprises the following steps:
s00311: establishing the training set of the deep learning classifier of the convolutional network and the long short-term memory recurrent neural network:
D = {(X_i, L_i)}, i = 1, 2, …, n;
wherein n is the number of training samples in the training set; i represents the training sample index; X_i ∈ R^(M×N) is a 64×9 two-dimensional speech cochlear feature; the training label corresponding to X_i is L_i, L_i ∈ {0, 1}, where 0 indicates that X_i is an environmental background sound sample without baby crying and 1 indicates that X_i is a baby crying sample;
s00312: establishing a deep learning classifier model of a convolutional network and a long-short-term memory recurrent neural network, wherein the classifier model sequentially comprises an input layer, two convolutional layers, an LSTM layer and a softmax layer according to a model reasoning sequence;
s00313: establishing an input layer, wherein the input layer is a two-dimensional cochlea feature map with a width of N and a height of M;
s00314: establishing two layers of convolution layers, wherein the convolution kernel size of the first layer of convolution layer is 3x3, the number of convolution kernels is 8, and after the first layer of convolution operation is performed on the input layer, a three-dimensional convolution characteristic diagram with 8 output channels, M height and N width is used as the input layer of the second layer of convolution; the convolution kernel size of the second layer of convolution layers is 3x3, the number of convolution kernels is 16, and after the second layer of convolution operation, the three-dimensional convolution characteristic diagram with 16 output channels, M height and N width is output; the convolutional layer activation function is ReLU, and the convolutional operation formula is as follows:
z_j = Σ_i W_ij * x_i, where * denotes the convolution operation;
y_j = ReLU(z_j);
W_ij are the weight parameters of the classifier model, x_i is the neuron input, z_j is an intermediate result, and y_j is the neural network activation output, used as the input of the next layer;
s00315: establishing the LSTM layer; before the output of the second convolution layer is connected to the LSTM layer, a reshape operation is required, reshaping the original 16×M×N three-dimensional feature map into a (16·M)×N two-dimensional feature map; for the LSTM, 16·M represents the length of the input feature and N represents the time sequence length;
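A minimal PyTorch sketch of the layer order described in S00312-S00315 (input layer, two 3×3 convolution layers with 8 and 16 kernels, reshape to a (16·M)-dimensional feature per time step, LSTM over the N time steps, and a two-class output per step). The LSTM hidden size, the 'same' padding that keeps the M×N spatial size, and all variable names are assumptions not fixed by the text:

```python
import torch
import torch.nn as nn


class CryDetector(nn.Module):
    def __init__(self, m_bands=64, hidden=128):          # hidden size is an assumption
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)    # keeps M x N size
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        self.lstm = nn.LSTM(input_size=16 * m_bands, hidden_size=hidden,
                            batch_first=True)
        self.head = nn.Linear(hidden, 2)                  # two-class softmax layer

    def forward(self, x):                 # x: (batch, 1, M, N) cochlear feature maps
        x = self.relu(self.conv1(x))      # (batch, 8, M, N)
        x = self.relu(self.conv2(x))      # (batch, 16, M, N)
        b, c, m, n = x.shape
        x = x.reshape(b, c * m, n)        # (batch, 16*M, N): feature length x time
        x = x.permute(0, 2, 1)            # (batch, N, 16*M) for a batch-first LSTM
        h, _ = self.lstm(x)               # (batch, N, hidden): one h_t per time step
        return self.head(h)               # (batch, N, 2) logits, softmaxed in the loss


logits = CryDetector()(torch.randn(4, 1, 64, 9))
print(logits.shape)   # torch.Size([4, 9, 2])
```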
s00316: the LSTM is a long short-term memory network, a neural cell mathematical model with memory, mainly comprising three gate structures: a forget gate, an input gate and an output gate;
The first step is to determine what information will be forgotten from the cell state, which is done by the forget gate. The forget gate reads h_(t-1) passed from the previous state and the current input x_t, and outputs a value between 0 and 1 for each element of the cell state C_(t-1), where 1 means keep completely and 0 means discard completely; the forget gate mathematical model formula is as follows:
f_t = σ(W_f · [h_(t-1), x_t] + b_f);
wherein W_f is the neural network weight parameter of the forget gate, b_f is a bias term, and σ represents the sigmoid function;
The second step is to determine how much new information will be added to the cell state, which is accomplished by the input gate: a sigmoid layer i_t first decides which information needs to be updated, and a candidate layer C̃_t produces the content to be added; the old cell state C_(t-1) is then updated to C_t by multiplying the old state by f_t and adding i_t · C̃_t.
The mathematical model formula of the input gate is as follows:
i_t = σ(W_i · [h_(t-1), x_t] + b_i);
C̃_t = tanh(W_c · [h_(t-1), x_t] + b_c);
C_t = f_t * C_(t-1) + i_t * C̃_t;
wherein W_i and W_c are the neural network weight parameters of the input gate, b_i and b_c are bias terms, and σ represents the sigmoid function;
The third step is to decide what value to output, which is done by the output gate: a sigmoid layer is first run to determine which part of the cell state will be output; the cell state is then processed through tanh (yielding values between -1 and 1) and multiplied by the output of the sigmoid gate, so that only the determined part is output; the mathematical model formula of the output gate is as follows:
o_t = σ(W_o · [h_(t-1), x_t] + b_o);
h_t = o_t * tanh(C_t);
wherein W_o is the neural network weight parameter of the output gate, b_o is a bias term, and σ represents the sigmoid function;
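To make the gate formulas above concrete, one LSTM time step can be written directly in NumPy; this is a didactic sketch with randomly initialized weights and illustrative dimensions, not the trained model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following the forget/input/output gate equations.
    W maps the concatenated [h_(t-1), x_t] to each gate; b are the bias terms."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])            # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])            # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])        # candidate cell content
    c_t = f_t * c_prev + i_t * c_tilde            # C_t = f_t*C_(t-1) + i_t*C~_t
    o_t = sigmoid(W["o"] @ z + b["o"])            # output gate
    h_t = o_t * np.tanh(c_t)                      # h_t = o_t * tanh(C_t)
    return h_t, c_t

# Toy dimensions: 16*M = 1024 input features, hidden size 128 (both illustrative)
dim_in, dim_h = 1024, 128
rng = np.random.default_rng(0)
W = {k: 0.01 * rng.standard_normal((dim_h, dim_h + dim_in)) for k in "fico"}
b = {k: np.zeros(dim_h) for k in "fico"}
h, c = lstm_step(rng.standard_normal(dim_in), np.zeros(dim_h), np.zeros(dim_h), W, b)
print(h.shape, c.shape)   # (128,) (128,)
```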
s00317: establishing a classifier loss function layer, adopting a two-class softmax classifier, and outputting N h according to time sequence after a 16-X MxN two-dimensional feature map passes through an LSTM layer t The feature layer is connected with the two kinds of softmax, outputs N loss, and the specific formula is as follows:
Figure BDA0002394195400000111
wherein x is t =W s ·h t +b s Is h t The feature layer updates the network bias term and weight through the output after the softmax classifier by a reverse transfer algorithm during training.
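The per-time-step two-class softmax cross-entropy of S00317 can be sketched as follows (NumPy); treating the frame label as a one-hot vector shared by all N time steps is an assumption consistent with y_k ∈ {0, 1}:

```python
import numpy as np


def softmax_losses(X, y):
    """X: (N, 2) scores x_t = W_s*h_t + b_s for the N time steps;
    y: (2,) one-hot frame label. Returns the N cross-entropy losses."""
    X = X - X.max(axis=1, keepdims=True)            # numerical stability
    probs = np.exp(X) / np.exp(X).sum(axis=1, keepdims=True)
    return -(y * np.log(probs)).sum(axis=1)         # loss_n = -sum_k y_k log p_k


losses = softmax_losses(np.random.randn(9, 2), np.array([0.0, 1.0]))
print(losses.shape)   # (9,) -- one loss per time step, summed or averaged in training
```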
The method for voting on the N frame classification results by applying the majority priority voting principle in S0032 is as follows: inputting the cochlear speech features extracted from each frame into the deep learning classifier to obtain N frame classification results C_i, i = 1, 2, …, N, and taking the average
p = (1/N) · Σ_(i=1)^(N) C_i;
if p ≥ 0.5, baby crying is indicated; otherwise, no baby crying is indicated.
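The majority-priority vote then reduces to averaging the N per-frame decisions and thresholding at 0.5, for example:

```python
import numpy as np


def majority_vote(frame_decisions):
    """frame_decisions: iterable of N per-frame results C_i (1 = crying, 0 = not).
    Returns True if the average p = (1/N) * sum(C_i) is >= 0.5."""
    p = float(np.mean(frame_decisions))
    return p >= 0.5


print(majority_vote([1, 1, 0, 1, 0, 1, 1, 0, 1]))   # True: 6 of 9 frames voted "crying"
```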
The beneficial effects of the above technical scheme are as follows: aiming at the fact that traditional voice recognition is easily affected by environmental changes, the cochlear voice features adopted by the invention are voice feature parameters that better match human auditory perception, and the convolutional network and long short-term memory recurrent neural network classifier are adopted as the acoustic inference model for infant crying detection, so that the method can adapt to voice environments with a low signal-to-noise ratio; moreover, the long short-term memory recurrent neural network can make full use of contextual voice information, so the feature dimensions are richer and the method achieves a higher recognition rate than traditional methods.
In the description of the present specification, the descriptions of the terms "one embodiment," "example," "specific example," and the like, mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended only to assist in the explanation of the invention. The preferred embodiments are not exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention. The invention is limited only by the claims and the full scope and equivalents thereof.

Claims (6)

1. A method for deep learning of infant crying detection, which is characterized by comprising the following steps:
s001: collecting voice signals;
s002: framing the voice signal segment, and extracting cochlear voice characteristics from each frame;
s003: performing infant crying detection and outputting a detection result, wherein the infant crying detection comprises the following steps:
s0031: establishing a deep learning classifier based on a convolutional network and a long-short-term memory recurrent neural network;
s0032: inputting the extracted adjacent N frames of cochlea speech features into a deep learning classifier to obtain N frames of classification results, voting the N frames of classification results by applying a majority priority voting principle, and obtaining a final infant crying detection result;
the deep learning classifier in the step S0031 comprises two convolution layers and a long-short-time memory recurrent neural network layer; the specific method for training the deep learning classifier based on the convolutional network and the long-short-term memory recurrent neural network comprises the following steps:
s00311: establishing the training set of the deep learning classifier of the convolutional network and the long short-term memory recurrent neural network:
D = {(X_i, L_i)}, i = 1, 2, …, n;
wherein n is the number of training samples in the training set; i represents the training sample index; X_i ∈ R^(M×N) is a 64×9 two-dimensional cochlear feature; the training label corresponding to X_i is L_i, L_i ∈ {0, 1}, where 0 indicates that X_i is an environmental background sound sample without baby crying and 1 indicates that X_i is a baby crying sample;
s00312: establishing a deep learning classifier model of a convolutional network and a long-short-term memory recurrent neural network, wherein the classifier model sequentially comprises an input layer, two convolutional layers, an LSTM layer and a softmax layer according to a model reasoning sequence;
s00313: establishing an input layer, wherein the input layer is a two-dimensional cochlea feature map with a width of N and a height of M;
s00314: establishing two layers of convolution layers, wherein the convolution kernel size of the first layer of convolution layer is 3x3, the number of convolution kernels is 8, and after the first layer of convolution operation is performed on the input layer, a three-dimensional convolution characteristic diagram with 8 output channels, M height and N width is used as the input layer of the second layer of convolution; the convolution kernel size of the second layer of convolution layers is 3x3, the number of convolution kernels is 16, and after the second layer of convolution operation, the three-dimensional convolution characteristic diagram with 16 output channels, M height and N width is output; the convolutional layer activation function is ReLU, and the convolutional operation formula is as follows:
z_j = Σ_i W_ij * x_i, where * denotes the convolution operation;
y_j = ReLU(z_j);
W_ij are the weight parameters of the classifier model, x_i is the neuron input, z_j is an intermediate result, and y_j is the neural network activation output, used as the input of the next layer;
s00315: establishing the LSTM layer; before the output of the second convolution layer is connected to the LSTM layer, a reshape operation is required, reshaping the original 16×M×N three-dimensional feature map into a (16·M)×N two-dimensional feature map; for the LSTM, 16·M represents the length of the input feature and N represents the time sequence length;
s00316: the LSTM is a long short-term memory network, a neural cell mathematical model with memory, consisting of three gate structures: a forget gate, an input gate and an output gate;
the forget gate reads the hidden state h_(t-1) passed from the previous state and the current input x_t, and outputs a value between 0 and 1 for each element of the cell state C_(t-1), where 1 means keep completely and 0 means discard completely; the forget gate mathematical model formula is as follows:
f_t = σ(W_f · [h_(t-1), x_t] + b_f);
wherein W_f is the neural network weight parameter of the forget gate, b_f is a bias term, and σ represents the sigmoid function;
the mathematical model formula of the input gate is as follows:
i_t = σ(W_i · [h_(t-1), x_t] + b_i);
C̃_t = tanh(W_c · [h_(t-1), x_t] + b_c);
C_t = f_t * C_(t-1) + i_t * C̃_t;
wherein W_i and W_c are the neural network weight parameters of the input gate, b_i and b_c are bias terms, and σ represents the sigmoid function;
the output gate mathematical model formula is as follows:
o_t = σ(W_o · [h_(t-1), x_t] + b_o);
h_t = o_t * tanh(C_t);
wherein W_o is the neural network weight parameter of the output gate, b_o is a bias term, and σ represents the sigmoid function;
s00317: establishing a classifier loss function layer, adopting a two-class softmax classifier, and outputting N h according to time sequence after a 16-X MxN two-dimensional feature map passes through an LSTM layer t The feature layer is connected with the two kinds of softmax, outputs N loss, and the specific formula is as follows:
Figure FDA0004086358890000033
n=1、2、…、N;k=1、2;j=1、2;y k ∈{0,1};
wherein x is t =W s ·h t +b s Is h t The feature layer is output after passing through a softmax classifier;
the method for extracting the cochlear speech features in the step S002 comprises the following steps:
s0021: constructing a Gammatone filter bank based on a human ear cochlear auditory model, wherein the time domain expression form is as follows:
g(f,t) = k·t^(a-1)·e^(-2πbt)·cos(2πft + Φ), t ≥ 0;
wherein k is the filter gain, a is the filter order, f is the center frequency, Φ is the phase, b is the attenuation factor, t is the length of the output response signal, the attenuation factor determines the bandwidth of the corresponding filter, and the relationship between the attenuation factor and the center frequency is:
b=24.7(4.37f/1000+1);
s0022: filtering the voice signal by utilizing an FFT-based overlap-add method to obtain an output response signal R (n, t), wherein n is the number of channels of the filter, t is the length of the output response signal, and t is a natural number, and the length of the output response signal is equal to that of the input signal;
s0023: framing the output response signal R(n, t) and taking the response energy within each frame to obtain a cochleagram-like map, wherein the formula is as follows: G_m(i) = log( [|R(i, m)|]^(1/2) );
wherein i represents the i-th Gammatone filter, i = 0, 1, 2, …, N-1, where N = 64 is the number of filters in the bank; m represents the m-th frame, m = 0, 1, 2, …, M-1, M being the number of frames after framing; each frame of the cochleagram-like map is called a Gammatone feature coefficient GF, and one GF feature vector is composed of 8 frequency components;
the adjacent N frames of cochlear speech features in step S0032 refer to splicing the one-dimensional cochlear speech features of the preceding and following frames into an M×N two-dimensional feature matrix, recorded as:
F = [GF_1, GF_2, …, GF_N];
wherein F ∈ R^(M×N) and N is an odd number.
2. The deep learning method for infant crying detection according to claim 1, wherein the voice signal collection in step S001 is as follows:
s0011: inputting a voice signal with a microphone device;
s0012: the corresponding speech signal is obtained by sample quantization.
3. The deep learning method for detecting infant crying according to claim 2, wherein the sampling frequency of the sampling quantization is 16KHz, and the quantization precision is 16 bits.
4. The deep learning method for detecting infant crying as claimed in claim 1, wherein the frame length of the step S002 of framing the voice signal segment is 20 ms-30 ms, and the frame step length is 10 ms-15 ms.
5. The method for deep learning for detecting infant crying according to claim 1, wherein the Gammatone filter bank consists of 4th-order Gammatone filters with 64 channels, and the center frequencies lie between 50 Hz and 8000 Hz.
6. The deep learning method for detecting infant crying as claimed in claim 1, wherein the method for voting on the N frame classification results using the majority priority voting principle in S0032 is as follows: inputting the cochlear speech features extracted from each frame into the deep learning classifier to obtain N frame classification results C_i, i = 1, 2, …, N, and taking the average
p = (1/N) · Σ_(i=1)^(N) C_i;
if p ≥ 0.5, baby crying is indicated; otherwise, no baby crying is indicated.
CN202010125193.9A 2020-02-27 2020-02-27 Deep learning method for detecting crying of baby Active CN111326179B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010125193.9A CN111326179B (en) 2020-02-27 2020-02-27 Deep learning method for detecting crying of baby

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010125193.9A CN111326179B (en) 2020-02-27 2020-02-27 Deep learning method for detecting crying of baby

Publications (2)

Publication Number Publication Date
CN111326179A CN111326179A (en) 2020-06-23
CN111326179B true CN111326179B (en) 2023-05-26

Family

ID=71172973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010125193.9A Active CN111326179B (en) 2020-02-27 2020-02-27 Deep learning method for detecting crying of baby

Country Status (1)

Country Link
CN (1) CN111326179B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112382311B (en) * 2020-11-16 2022-08-19 谭昊玥 Infant crying intention identification method and device based on hybrid neural network
US20240115859A1 (en) * 2021-02-18 2024-04-11 The Johns Hopkins University Method and system for processing input signals using machine learning for neural activation
CN117037849A (en) * 2021-02-26 2023-11-10 武汉星巡智能科技有限公司 Infant crying classification method, device and equipment based on feature extraction and classification
CN113392736A (en) * 2021-05-31 2021-09-14 五八到家有限公司 Monitoring method, system, equipment and medium for improving safety of home service

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102163427A (en) * 2010-12-20 2011-08-24 北京邮电大学 Method for detecting audio exceptional event based on environmental model
CN107818779A (en) * 2017-09-15 2018-03-20 北京理工大学 A kind of infant's crying sound detection method, apparatus, equipment and medium
CN109243493A (en) * 2018-10-30 2019-01-18 南京工程学院 Based on the vagitus emotion identification method for improving long memory network in short-term
CN109509484A (en) * 2018-12-25 2019-03-22 科大讯飞股份有限公司 A kind of prediction technique and device of baby crying reason
CN110428843A (en) * 2019-03-11 2019-11-08 杭州雄迈信息技术有限公司 A kind of voice gender identification deep learning method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102163427A (en) * 2010-12-20 2011-08-24 北京邮电大学 Method for detecting audio exceptional event based on environmental model
CN107818779A (en) * 2017-09-15 2018-03-20 北京理工大学 A kind of infant's crying sound detection method, apparatus, equipment and medium
CN109243493A (en) * 2018-10-30 2019-01-18 南京工程学院 Based on the vagitus emotion identification method for improving long memory network in short-term
CN109509484A (en) * 2018-12-25 2019-03-22 科大讯飞股份有限公司 A kind of prediction technique and device of baby crying reason
CN110428843A (en) * 2019-03-11 2019-11-08 杭州雄迈信息技术有限公司 A kind of voice gender identification deep learning method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Karen Santiago-Sánchez et al.; "Type-2 Fuzzy Sets Applied to Pattern Matching for the Classification of Cries of Infants under Neurological Risk"; ICIC 2009; 2009-12-31 *

Also Published As

Publication number Publication date
CN111326179A (en) 2020-06-23

Similar Documents

Publication Publication Date Title
CN111326179B (en) Deep learning method for detecting crying of baby
CN110428843B (en) Voice gender recognition deep learning method
CN109326302B (en) Voice enhancement method based on voiceprint comparison and generation of confrontation network
US11908455B2 (en) Speech separation model training method and apparatus, storage medium and computer device
WO2021143327A1 (en) Voice recognition method, device, and computer-readable storage medium
CN110245608B (en) Underwater target identification method based on half tensor product neural network
CN108766419B (en) Abnormal voice distinguishing method based on deep learning
CN109841226A (en) A kind of single channel real-time noise-reducing method based on convolution recurrent neural network
CN112581979B (en) Speech emotion recognition method based on spectrogram
CN109427328B (en) Multichannel voice recognition method based on filter network acoustic model
CN110364143A (en) Voice awakening method, device and its intelligent electronic device
CN111461176A (en) Multi-mode fusion method, device, medium and equipment based on normalized mutual information
CN105206270A (en) Isolated digit speech recognition classification system and method combining principal component analysis (PCA) with restricted Boltzmann machine (RBM)
CN111899757A (en) Single-channel voice separation method and system for target speaker extraction
CN113924786B (en) Neural network model for cochlear mechanics and processing
CN113643723A (en) Voice emotion recognition method based on attention CNN Bi-GRU fusion visual information
CN111951824A (en) Detection method for distinguishing depression based on sound
CN112587153A (en) End-to-end non-contact atrial fibrillation automatic detection system and method based on vPPG signal
CN115602152B (en) Voice enhancement method based on multi-stage attention network
CN113763965A (en) Speaker identification method with multiple attention characteristics fused
CN113763966B (en) End-to-end text irrelevant voiceprint recognition method and system
Watrous Phoneme discrimination using connectionist networks
CN113974607A (en) Sleep snore detecting system based on impulse neural network
CN111723717A (en) Silent voice recognition method and system
CN112329819A (en) Underwater target identification method based on multi-network fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP03 Change of name, title or address
CP03 Change of name, title or address

Address after: 311422 4th floor, building 9, Yinhu innovation center, 9 Fuxian Road, Yinhu street, Fuyang District, Hangzhou City, Zhejiang Province

Patentee after: Zhejiang Xinmai Microelectronics Co.,Ltd.

Address before: 311400 4th floor, building 9, Yinhu innovation center, No.9 Fuxian Road, Yinhu street, Fuyang District, Hangzhou City, Zhejiang Province

Patentee before: Hangzhou xiongmai integrated circuit technology Co.,Ltd.