WO2019079972A1

WO2019079972A1 - Specific sound recognition method and apparatus, and storage medium

Info

Publication number: WO2019079972A1
Application number: PCT/CN2017/107505
Authority: WO
Inventors: 刘洪涛; 王伟; 孟亚彬
Original assignee: 深圳和而泰智能控制股份有限公司
Priority date: 2017-10-24
Filing date: 2017-10-24
Publication date: 2019-05-02
Also published as: CN109074822A; CN109074822B

Abstract

A specific sound recognition method and apparatus, and a storage medium. The method comprises: sampling a sound signal and obtaining a Mel-Frequency Cepstral Coefficient (MFCC) characteristic parameter matrix of the sound signal (201); extracting a characteristic parameter from the MFCC characteristic parameter matrix of the sound signal (202); and inputting the characteristic parameter into a pre-obtained specific sound characteristic model based on a depth neural network for recognition to determine whether the sound signal is a specific sound (203). The method and apparatus adopt a recognition algorithm based on an MFCC characteristic parameter and a depth neural network model, and the algorithm has low complexity and a small amount of calculation, so that the hardware requirement is low and the manufacturing cost of the product is reduced.

Description

Specific sound recognition method, device and storage medium

Technical field

The embodiments of the present application relate to sound processing technologies, and in particular, to a specific sound recognition method, device, and storage medium.

Background technique

In life, we can hear certain sounds with no actual semantics every day. Such as: snoring, coughing, sneezing, etc., although they have no actual semantics, they can accurately reflect people's physiological needs, state or material quality. For example, a doctor can distinguish people's health through the patient's snoring, coughing, sneezing and so on. This kind of specific sound content is relatively simple and repetitive, but it is an indispensable part of our life. It is of great significance to effectively identify and judge various specific sound signals.

At present, research has identified specific sounds by speech recognition technology. For example, there is a method for recognizing coughing sounds, combining the characteristics of coughing sounds with speech recognition technology to establish a cough model, and using a model matching method based on Dynamic Time Warping (DTW) to perform the coughing sound of a specific person. Identification.

In the process of implementing the present application, the inventors have found that at least the following problems exist in the related art: the existing specific voice recognition algorithm has a large amount of calculation and high requirements on hardware devices.

Summary of the invention

The purpose of the present application is to provide a specific voice recognition method, device and storage medium, which can identify a specific sound, and has a simple algorithm, a small amount of calculation, and low requirements on hardware devices.

To achieve the above objective, in a first aspect, an embodiment of the present application provides a specific voice recognition method, where the method includes:

Sampling the sound signal and acquiring a characteristic parameter matrix of the Mel frequency cepstral coefficient of the sound signal;

Extracting a feature parameter from a characteristic parameter matrix of a Mel frequency cepstral coefficient of the sound signal;

Entering the characteristic parameter into a pre-acquired deep neural network-based specific sound feature model Line recognition to determine if the sound signal is a particular sound.

Optionally, the method further includes: acquiring the specific sound feature model based on the depth neural network in advance.

Optionally, the pre-acquiring the specific sound feature model based on the deep neural network includes:

Collecting a preset number of specific sound sample signals and acquiring a Mel frequency cepstral coefficient characteristic parameter matrix of the specific sound sample signal;

Extracting the feature parameter from a Mel frequency cepstral coefficient feature parameter matrix of the specific sound sample signal;

Taking the characteristic parameters of the specific sound sample signal as input, training based on the depth neural network model to obtain the specific sound feature model based on the deep neural network.

Optionally, the extracting the feature parameter from a matrix of the frequency coefficient of the frequency coefficient of the cepstral coefficient of the specific sound sample signal comprises:

The Mel frequency cepstral coefficients of each signal frame in the characteristic parameter matrix of the Mel frequency cepstral coefficient of the specific sound sample signal are sequentially connected end to end to form a feature vector;

And dividing the feature vector from the feature vector header to the feature vector tail to segment the feature vector according to a preset step size, to obtain a feature parameter including a set of sub-feature vectors whose lengths are preset lengths, and each The sub-feature vectors have the same label, and the preset step size is an integral multiple of the length of the cepstral coefficient of each frame, and the preset length is an integer multiple of the length of the cepstral coefficient of each frame;

Extracting the feature parameters from the characteristic parameter matrix of the Mel frequency cepstral coefficient of the sound signal, including:

The Mel frequency cepstral coefficients of each signal frame in the characteristic parameter matrix of the Mel frequency cepstral coefficient of the sound signal are sequentially connected end to end to form a feature vector;

And dividing the feature vector from the feature vector header to the feature vector tail to segment the feature vector according to the preset step size, to obtain a set of sub-feature vectors each having a length of the preset length. Characteristic Parameters.

Optionally, the taking the feature parameter of the specific sound sample signal as an input, training the depth neural network model to obtain the specific sound feature model based on the deep neural network, including:

Taking the characteristic parameters of the specific sound sample signal as input, performing model training based on a deep confidence network algorithm, and obtaining respective initial parameters of the specific sound feature model based on the depth neural network;

Deep gradient neural network based gradient descent and back propagation algorithms for each of the initial parameters Fine-tuning, obtaining various parameters of a specific sound feature model based on deep neural networks.

Optionally, the step of inputting the feature parameter into a pre-acquired deep neural network-based specific sound feature model to determine whether the sound signal is a specific sound comprises:

Inputting a set of sub-feature vectors included in the feature parameter into a pre-acquired specific sound feature model based on a depth neural network, and obtaining a prediction result corresponding to a set of sub-feature vectors;

If the positive prediction result is more than the negative prediction result among the prediction results, it is confirmed that the sound signal is a specific sound, otherwise, it is confirmed that the sound signal is not a specific sound.

Optionally, the specific sound includes any one of a coughing sound, a snoring sound, and a sneezing sound.

In a second aspect, the embodiment of the present application further provides a specific voice recognition device, where the device includes:

a sampling and feature parameter obtaining module, configured to sample a sound signal and obtain a characteristic parameter matrix of a Mel frequency cepstral coefficient of the sound signal;

a feature parameter extraction module, configured to extract a feature parameter from a Mel frequency cepstral coefficient feature parameter matrix of the sound signal;

a feature matching module, configured to confirm whether the feature parameter matches a pre-acquired deep neural network-based specific sound feature model;

And a confirmation module, configured to confirm that the sound signal is a specific sound if the feature parameter matches a pre-acquired deep neural network-based specific sound feature model.

Optionally, the device further includes:

a feature model preset module, configured to pre-acquire the specific sound feature model based on the depth neural network;

The feature model preset module is specifically configured to:

In a third aspect, the embodiment of the present application further provides a specific voice recognition device, where the specific voice recognition device includes:

a sound input unit for receiving a sound signal;

a signal processing unit, configured to perform signal processing on the sound signal;

The signal processing unit is connected to an operation processing unit built in or externally to a specific sound recognition device, and the operation processing unit includes:

At least one processor; and,

a memory communicatively coupled to the at least one processor; wherein

The memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method described above.

In a fourth aspect, the embodiment of the present application further provides a storage medium, where the storage medium stores executable instructions, when the executable instructions are executed by a specific voice recognition device, causing the specific voice recognition device to perform the foregoing method. .

In a fifth aspect, the embodiment of the present application further provides a program product, where the program product includes a program stored on a storage medium, where the program includes program instructions, when the program instruction is executed by a specific voice recognition device, The particular voice recognition device performs the method described above.

The specific voice recognition method, device and storage medium provided by the embodiments of the present application adopt a recognition algorithm based on the characteristic parameters of the Mel frequency cepstrum coefficient and the deep neural network model, and the algorithm has low complexity and small calculation amount, so that the hardware requirement is low. Reduced product manufacturing costs.

DRAWINGS

The one or more embodiments are exemplified by the accompanying drawings in the accompanying drawings, and FIG. The figures in the drawings do not constitute a scale limitation unless otherwise stated.

1 is a schematic structural diagram of an application environment of each embodiment of the present application;

2 is a schematic flowchart of pre-acquiring a specific sound feature model based on a deep neural network in a specific voice recognition method provided by an embodiment of the present application;

3 is a schematic diagram of a Meyer frequency filtering process in the MFCC coefficient calculation process;

Figure 4 is a time-amplitude diagram of a coughing sound signal;

5 is a schematic diagram of the step of extracting a feature parameter to divide a feature vector into individual sub-feature vectors;

Figure 6 is a schematic diagram of a general deep neural network structure;

7 is a schematic diagram of a general deep confidence network structure;

FIG. 8 is a flowchart of a step of extracting feature parameters in a specific voice recognition method according to an embodiment of the present application; schematic diagram;

9 is a schematic flowchart of steps of training a specific sound feature model based on a deep neural network in a specific voice recognition method provided by an embodiment of the present application;

10 is a schematic flowchart of a specific voice recognition method provided by an embodiment of the present application;

11 is a schematic structural diagram of a specific voice recognition apparatus according to an embodiment of the present application;

FIG. 12 is a schematic structural diagram of a specific voice recognition apparatus according to an embodiment of the present application; FIG.

FIG. 13 is a schematic structural diagram of a specific voice recognition device according to an embodiment of the present application.

Detailed ways

The technical solutions in the embodiments of the present application are clearly and completely described in the following with reference to the accompanying drawings in the embodiments of the present application. It is a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without departing from the inventive scope are the scope of the present application.

The embodiment of the present application proposes a specific sound recognition scheme based on the Mel Frequency Cepstral Coefficients (MFCC) characteristic parameter and the Deep Neural Network (DNN) algorithm, which is applicable to the application shown in FIG. 1 . surroundings. The application environment includes a user 10 and a specific voice recognition device 20 for receiving a sound from the user 10 and identifying the sound to determine if the sound is a particular sound.

Further, after recognizing that the sound is a specific sound, the specific recognition device 20 can also record and process the specific sound to output the situation information that the user 10 issues a specific sound. The situation information for the particular sound may include the number of times a particular sound, the length of a particular sound, and the decibel of a particular sound. For example, a counter may be included in a specific sound recognition device for counting a specific sound when a specific sound is detected; by including a timer in a specific sound recognition device for detecting a specific sound, The duration of the particular sound is counted; the decibel detection means can be included in the particular sound recognition device for detecting the decibel of the particular sound when a particular sound is detected.

The recognition principle of the specific sound in the embodiment of the present application is similar to the principle of the speech recognition. The input sound is processed and then input into the sound model to be recognized, thereby obtaining the recognition result. It can be divided into two phases, namely a specific sound model training phase and a specific sound recognition phase. Specific sound model The training phase mainly collects a certain number of specific sound sample signals, calculates the MFCC feature parameter matrix of the specific sound sample signal, extracts the feature parameters from the MFCC feature parameter matrix, and trains the feature parameters based on the DNN algorithm to obtain specific sound features. model. In the specific voice recognition stage, the MFCC feature parameter matrix is calculated for the sound signal that needs to be judged, and the corresponding feature parameter is extracted from the MFCC feature parameter matrix of the sound signal, and then the feature parameter is input into the specific sound feature model for identification, Determine if the sound signal is a specific sound. The identification process mainly includes steps of preprocessing, feature extraction, model training, pattern matching and decision.

Wherein, in the pre-processing step, sampling a specific sound sample signal and calculating a MFCC feature parameter matrix of the specific sound sample signal. In the feature extraction step, feature parameters are extracted from the MFCC feature parameter matrix. In the model training step, the feature parameters extracted from the MFCC feature parameter matrix of the specific sound sample signal are taken as inputs, and a specific sound feature model based on the deep neural network is trained. In the pattern matching and decision step, a specific sound feature model is utilized to identify whether the new sound signal is a particular sound. Wherein, identifying whether the new sound signal is a specific sound comprises: first calculating a MFCC feature parameter matrix of the sound signal, and then extracting a characteristic parameter of the sound signal from the MFCC feature parameter matrix, and then inputting the characteristic parameter of the sound signal into the specific sound feature The model is identified to determine if the sound signal is a particular sound.

The combination of MFCC and DNN to identify specific sounds can simplify the complexity of the algorithm, reduce the amount of computation, and significantly improve the accuracy of specific voice recognition.

The embodiment of the present application provides a specific voice recognition method, which can be used in the specific voice recognition device 20, where the specific voice recognition method needs to obtain a specific sound feature model based on the DNN in advance, and the specific sound feature model based on the DNN can be It is pre-configured, and can also be trained by the method in the following steps 101 to 103. After the DNN-based specific sound feature model is trained, the specific sound can be identified based on the specific sound feature model based on the DNN, and further Alternatively, if the accuracy of the DNN-based specific sound feature model for identifying a particular sound is unacceptable due to scene change or other reasons, the DNN-based specific sound feature model may be reconfigured or trained.

Wherein, as shown in FIG. 2, the pre-obtaining DNN-based specific sound feature model includes:

Step 101: Acquire a preset number of specific sound sample signals and acquire a Mel frequency cepstral coefficient characteristic parameter matrix of the specific sound sample signal;

A specific sound sample signal s(n) is sampled, and an MFCC feature parameter matrix of the specific sound sample signal is obtained according to the specific sound sample signal. Mel frequency cepstrum coefficient is mainly used for sound data Feature extraction and reduction of operational dimensions. For example, for a frame with 512 dimensions (sampling points), after processing by MFCC, the most important 40-dimensional data can be extracted, and the purpose of dimensionality reduction is also achieved. The calculation of the Mel frequency cepstral coefficient generally includes: pre-emphasis, framing, windowing, fast Fourier transform, mel filter bank and discrete cosine transform.

Obtaining the MFCC feature parameter matrix of the specific sound sample signal includes the following steps:

1 pre-emphasis

The purpose of pre-emphasis is to raise the high-frequency portion, flatten the spectrum of the signal, and maintain the spectrum in the same frequency-to-noise ratio in the entire frequency band from low frequency to high frequency. At the same time, it is also to eliminate the effect of the vocal cords and lips during the process of occurrence, to compensate for the high-frequency part of the sound sample system that is suppressed by the sound system, and to highlight the high-frequency formant. The implementation method is that the sampled specific sound sample signal s(n) is pre-emphasized by a first-order finite-length unit impulse response (FIR) high-pass digital filter, and the transfer function is:

H(z)=1-a·z ^-1 (1)

Where z is the input signal, the time domain representation is the specific sound sample signal s(n), and a is the pre-emphasis coefficient, which is generally a constant from 0.9 to 1.0.

2 frames

Each P sample points in a particular sound sample signal s(n) are grouped into one unit of observation, called a frame. The value of P can be 256 or 512, and the time covered is about 20 to 30 ms. In order to avoid the change of the adjacent two frames is too large, there may be an overlapping area between two adjacent frames, the overlapping area contains G sampling points, and the value of G may be about 1/2 or 1/3 of P. The sampling frequency of the specific sound sample signal may be 8KHz or 16KHz. In the case of 8KHz, if the frame length is 256 sampling points, the corresponding time length is 256/8000×1000=32ms.

3 window

Multiply each frame by a Hamming window to increase the continuity of the left and right ends of the frame. Suppose the signal after the frame is S(n), n=0,1...,P-1,P is the size of the frame, then multiply the Hamming window: S'(n)=S(n)×W( n), among them,

Where l is the window length.

4 Fast Fourier Transform (FFT)

Since the signal is usually difficult to see the characteristics of the signal in the time domain, it is usually converted to the energy distribution in the frequency domain to observe, and different energy distributions can represent the characteristics of different sounds. Therefore, after multiplying the Hamming window, each frame must also undergo a fast Fourier transform to obtain the energy distribution in the spectrum. Performing fast Fourier transform on each frame signal after the frame is windowed to obtain the spectrum of each frame. The modulo square of the spectrum of the particular sound sample signal yields the power spectrum of the particular sound sample signal.

5 triangle band pass filter

The energy spectrum is filtered through a set of Mel scale triangular filter banks. Define a filter bank with M filters (the number of filters is close to the number of critical bands). The filter used is a triangular filter with a center frequency of f(m), m=1, 2,. .., M. M can take 22-26. The interval between each f(m) decreases as the value of m decreases, and widens as the value of m increases. Please refer to FIG. 3.

The frequency response of the triangular filter is defined as:

among them

6 discrete cosine transform

Calculate the logarithmic energy of each filter bank output as:

The MFCC is obtained by discrete cosine transform (DCT) for the logarithmic energy s(m):

Step 102: Extract the feature parameter from a Mel frequency cepstral coefficient characteristic parameter matrix of the specific sound sample signal;

It can be seen from equation (5) that MFCC is an N*L coefficient matrix, where N is the number of sound signal frames and L is the MFCC length. Due to the high dimension of the MFCC feature parameter matrix and the inconsistent length of the sound signal, the number of matrix rows N is different. The MFCC feature parameter matrix cannot be used as a direct input to obtain a specific sound feature model based on DNN. Therefore, further extraction from the MFCC feature parameter matrix is needed. Characteristic parameter. The purpose of extracting the feature parameters is to extract the characteristics of the specific sound sample signal to mark the segment of the specific sound sample signal, and use the feature parameter as an input to train the specific sound feature model based on the DNN. The feature parameters can be extracted from the MFCC feature parameter matrix in combination with the time domain or frequency domain characteristics of the particular sound signal.

Taking a specific sound signal as a coughing sound signal, for example, please refer to FIG. 4. FIG. 4 is a time-amplitude diagram (time domain diagram) of the coughing sound signal. As can be seen from FIG. 4, the coughing sound signal is generated in a short process. Obviously sudden, the duration of a single coughing sound is usually less than 550ms, and even patients with severe throat and bronchial diseases generally maintain a duration of 1000ms. From the energy point of view, the energy of the coughing sound signal is mainly concentrated in the first half of the signal. Therefore, after the MFCC calculation process, the main characteristic information of the cough sound sample signal is basically concentrated in the first half of the cough sound sample signal. Entering the characteristic parameters of the depth neural network should cover the main information of the cough sound sample signal as much as possible, and ensure that the feature parameters extracted from the MFCC feature parameter matrix are useful information, not redundant information.

In the MFCC characteristic parameter matrix of the coughing sound sample signal, the characteristic parameter of the cough sound sample signal of the previous fixed frame number may be selected as the input of the deep neural network, and the main characteristic information of the cough sound sample signal is basically concentrated on the cough sound sample signal. In the first half of the section, the cough sound sample signal of the fixed number of frames should contain as much as possible the first half of each cough sound sample signal. In order to make full use of the data, the remaining feature data in the MFCC feature parameter matrix can also be used as the input of the deep neural network. The MFCC feature parameter matrix can be segmented according to the fixed frame number, and then the segmented data are used together as the input of the deep neural network. .

Specifically, as shown in FIG. 8, the feature parameters are extracted from the characteristic parameter matrix of the Mel frequency cepstral coefficient of the specific sound sample signal, including:

Step 1021: The Mel frequency cepstral coefficients of each signal frame in the characteristic parameter matrix of the Mel frequency cepstral coefficient of the specific sound sample signal are sequentially connected end to end to form a vector;

Step 1022: The vector is segmented from the vector header to the vector tail according to a preset step size (unit is a frame), and the obtained vector includes a set length of a preset length (ie, a fixed number of frames). Characteristic parameters of the subvectors, each subvector having the same label.

The MFCC feature parameter matrix frame and the frame are connected in series to form a vector X, with a preset length e as a basic unit, and a predetermined step d is moved from the vector X header to the tail to form a set of data Xi of the same label, wherein , i = 1, 2, ..., m, m represents the signal of each specific sound sample after the segmentation process The number of subfeature vectors included. See Figure 5 for the specific processing.

In a practical application, if the specific sound is a coughing sound, the number of frames in the first half of the general coughing sound signal can be statistically calculated, and then the preset length is taken according to the number of frames, and the preset step size can be combined with the actual application. Value. If the specific sound is other sounds, such as humming or snoring, the preset length and the preset step size may also be used according to the time domain and the frequency domain characteristics.

By dividing the MFCC feature parameter matrix of a specific sound sample signal into a plurality of fixed-length sub-feature vectors, the sub-feature vector is adapted to the requirement of the input data of the deep neural network, and can be directly used as an input of the deep neural network. Moreover, each sub-feature vector of the plurality of sub-feature vectors is set to the same label, that is, a set of sub-feature vectors are used to express the same specific sound sample signal, thereby increasing the number of data samples and avoiding loss of information when extracting the feature parameters. . Using the above sub-feature vectors and their corresponding labels, a specific sound feature model based on the deep neural network is established, and the specific sound feature model is used to identify the specific sound, which reduces the false recognition rate and improves the accuracy of the specific sound recognition. When the specific voice recognition method provided by the embodiment of the present application is used for identifying a coughing sound, the recognition rate of the coughing voice can reach 95% or more without increasing the calculation amount.

Step 103: Taking a feature parameter of the specific sound sample signal as an input, training based on a depth neural network model to obtain the specific sound feature model based on the depth neural network.

DNN is an extension of shallow neural networks. It utilizes the expression of multi-layer neural networks in function, and has very good feature extraction, learning and generalization ability for nonlinear and high-dimensional data processing. The DNN model generally includes an input layer, a hidden layer, and an output layer. Please refer to FIG. 6, where the first layer is the input layer, the middle layer is the hidden layer, and the last layer is the output layer (Figure 6 shows only three hidden layers). , in fact, will include more hidden layers), the layers are fully connected, that is, any one of the neurons in the Qth layer must be connected to any one of the Q+1th layers.

Each connection established between neurons has a linear weight, and each neuron in each layer has an offset (except for the input layer). The linear weight of the kth neuron of the l-1th layer to the jth neuron of the 1st layer is defined as w ^l _jk , where the superscript l represents the number of layers in which the linear weight is located, and the subscript corresponds to the output. The first layer index j and the input l-1 layer index k, for example, the linear weight of the fourth neuron of the second layer to the second neuron of the third layer is defined as w ³ ₂₄ . The offset of the i-th neuron of the first layer is b ^l _i , where the superscript l represents the number of layers, and the subscript i represents the index of the neuron where the bias is located, for example, the third layer of the second layer. The corresponding bias of each neuron is defined as b ² ₃ .

A series of w ^l _jk and b ^l _i can be randomly initialized, and the feature parameters of the specific sound sample signal are used as the data of the input layer by using the forward propagation algorithm, and then the first hidden layer is calculated by the input layer, and then the first layer is used. The hidden layer calculates the second hidden layer, and so on, until the output layer. Then use the back propagation algorithm to fine tune w ^l _jk and b ^l _i to obtain a specific sound feature model based on the deep neural network.

It is also possible to first obtain the initial parameters w ^l _jk and b ^l _i by using a Deep Belief Network (DBN) algorithm, and then use the gradient descent and back propagation algorithms to fine tune w ^l _jk and b ^l _i , The values of the final w ^l _jk and b ^l _i are obtained. That is, referring to FIG. 9, the feature parameter of the specific sound sample signal is input as input, and the training based on the depth neural network model to obtain the specific sound feature model based on the depth neural network includes:

Step 1031: Taking a characteristic parameter of the specific sound sample signal as an input, performing model training based on a deep confidence network algorithm, and obtaining each initial parameter of the specific sound feature model based on the depth neural network;

DBN is a deep learning model that pre-processes the model layer by layer in an unsupervised way. This unsupervised preprocessing method is the Restricted Boltzmann machine (RBM). As shown in Figure 7(b), the DBN is a stack of RBMs. As shown in Figure 7(a), RBM is a two-layer structure, v is the visible layer, h is the hidden layer, and the connection between the visible layer and the hidden layer is non-directional (values can be seen from the visible layer -> hidden layer or Implicit layer -> visible layer arbitrarily transmitted) and fully connected. Wherein, the visible layer v and the hidden layer h are connected by a linear weight, and the linear weight of the i-th neuron of the visible layer and the j-th neuron of the hidden layer is defined as w _ij , and the i-th neuron corresponding to the visible layer corresponds to The offset is b _i , the j-th neuron of the hidden layer corresponds to the offset a _j , and the subscripts i and j represent the index of the neuron.

RBM performs a one-step Gibbs sampling by comparing the divergence algorithm, and optimizes the weights w _ij , b _i and a _j to obtain another state expression of the input sample data (ie, the characteristic parameter of the specific sound sample signal) v. h, the output h1 of the RBM can be used as the input of the next RBM, continue to optimize in the same way to obtain the hidden state h2, and so on, the multi-layer DBN model can use the layer-by-layer preprocessing method for the weights w _ij , b _i and a _j is initialized, and each layer's features are an expression of the first layer of data v. After this unsupervised preprocessing, various initial parameters are obtained.

Specifically, the RBM is an energy model, and the energy of the entire RBM is expressed by the following formula (6).

Where E is the total energy of the RBM model, v is the visible layer data, h is the hidden layer data, θ is the model parameter, m is the number of visible layer neurons, n is the number of hidden layer neurons, and b is the visible layer offset. a indicates the hidden layer offset.

The RBM model samples based on the conditional probability of visible layer data and hidden layer data. For the Bernoulli-Bernoulli RBM model, the conditional probability formulas are formula (7) and formula (8), respectively.

Where σ represents the activation function sigmoid function, σ(x)=(1+e ^-x ) ^-1 .

According to the above formula, the contrast divergence algorithm is used to sample the Gibbs of the RBM, and the samples of the joint distribution of v and h are obtained, and then the parameters are optimized by maximizing the likelihood logarithm function (9) of the observed samples.

Δw _ij ≈<v _i h _j > ₀ -<v _i h _j > ₁

(10)

The optimization parameters adopt a one-step contrast divergence algorithm, which uses the mean field approximation method to directly generate sampling samples, and uses the formula (10) to iteratively optimize the parameters multiple times, and finally obtains the weights between the neurons and the bias of the neurons. Initial parameters. Where N represents the number of visible layer neurons in the RBM model, that is, the dimension of the input data of the RBM model.

Step 1032: Perform fine-tuning of each of the initial parameters based on a gradient neural network-based gradient descent and backpropagation algorithm to obtain various parameters of a specific sound feature model based on the deep neural network.

After the optimization process of the DBN is completed, the weights w and the biases b of the neurons between the layers (input layer, hidden layer and output layer) based on the DNN specific sound feature model are obtained, and the final multi-class logistic regression layer ( Softmax) uses a random initialization method, and then the DNN fine-tunes the specific sound feature model using a supervised gradient descent algorithm.

Specifically, in a supervised manner, the DNN-specific sound feature model is fine-tuned by optimizing the parameters (Equation 12) by minimizing the cost function (Equation 11).

Where J represents the cost function, h _W,b (x) represents the output of the DNN, and y represents the label corresponding to the input data.

Where α represents the learning rate and takes values from 0.5 to 0.01.

The partial derivative of each node of the deep neural network calculated in the above formula (12) can adopt the back propagation algorithm of the formula (13).

Where δ represents sensitivity and a represents the output value of each neuron node. When l represents the output layer,

When l represents another layer

Where σ represents the activation function. Then, through multiple iterations, the formula (13) is updated, and the entire DNN model is optimized layer by layer. Finally, various parameters are obtained, and a trained DNN-based specific sound feature model is obtained.

Through the combination of DBN-based unsupervised learning and supervised learning methods, compared with the randomly initialized deep neural network, after unsupervised preprocessing and supervised learning, the obtained DNN model is significantly better than the performance of ordinary deep neural networks. The MFCC feature parameter of the specific sound sample signal is used as the input of the DNN model to obtain a specific sound feature model based on the DNN, and the specific sound feature model is used to identify the specific sound, thereby effectively improving the recognition rate of the specific sound.

FIG. 10 is a schematic flowchart of a specific voice recognition method according to an embodiment of the present disclosure. As shown in FIG. 10, the specific voice recognition method includes:

Step 201: sampling a sound signal and acquiring a characteristic parameter matrix of a Mel frequency cepstral coefficient of the sound signal;

In a practical application, a sound input unit (for example, a microphone) may be disposed on a specific sound recognition device 20 to collect a sound signal, and the sound signal is amplified, filtered, and the like, and then converted into a digital signal. The digital signal may be sampled and processed in an operation processing unit local to the specific voice recognition device 20, or may be uploaded to a cloud server, a smart terminal, or other server for processing through a network.

For the technical details of obtaining the characteristic parameter matrix of the Mel frequency cepstral coefficient of the sound signal, refer to step 101, and details are not described herein again.

Step 202: Extract a feature parameter from a characteristic parameter matrix of a Mel frequency cepstral coefficient of the sound signal.

For the specific calculation method for extracting the feature parameters from the characteristic parameter matrix of the frequency coefficient of the frequency coefficient of the sound signal, refer to step 102, and details are not described herein again.

Step 203: Input the feature parameter into a pre-acquired deep neural network-based specific sound feature model to determine whether the sound signal is a specific sound.

Specifically, the feature parameter is input into a pre-acquired deep neural network-based specific sound feature model to determine whether the sound signal is a specific sound, including:

When the characteristic parameter of the sound signal is input to the trained DNN-based specific sound feature model, it is obtained whether the sound signal is a predicted result of the specific sound. Since the characteristic parameter of the same sound signal contains a plurality of sub-feature vectors, each sub-feature vector will obtain a prediction result, so that each sound signal will obtain a plurality of prediction results, and the prediction results represent whether the sound signal is a specific sound. may. A specific sound feature model based on DNN votes all prediction results of the same sound signal, that is, in the prediction result of all sub-feature vectors, if the positive prediction result is more than the negative prediction result, the sound signal is confirmed as a specific sound. If the positive prediction result is less than the negative prediction result, it is confirmed that the sound signal is not a specific sound.

The specific voice recognition method provided by the embodiment of the present application can identify a specific sound, so that the specific sound condition sent by the user can be monitored by monitoring the sound emitted by the user, without the user wearing any detecting component. And because of the identification based on MFCC feature parameters and DNN model The algorithm has low algorithm complexity and low computational complexity, which has low hardware requirements and reduces product manufacturing costs.

It should be noted that the specific voice recognition method based on the MFCC feature parameter and the DNN model provided by the embodiment of the present application is also applicable to identify the snoring sound, the sneezing sound, the breathing sound, the laugh sound, the firecracker sound, in addition to the cough sound. And other specific sounds such as crying.

Correspondingly, as shown in FIG. 11 , the embodiment of the present application further provides a specific voice recognition device for a specific voice recognition device 20, where the device includes:

The sampling and feature parameter obtaining module 301 is configured to sample the sound signal and obtain a characteristic parameter matrix of the Mel frequency cepstral coefficient of the sound signal;

The feature parameter extraction module 302 is configured to extract feature parameters from the Mel frequency cepstral coefficient feature parameter matrix of the sound signal;

The identification module 303 is configured to input the feature parameter into a pre-acquired deep neural network-based specific sound feature model to determine whether the sound signal is a specific sound.

The specific voice recognition device provided by the embodiment of the present application can identify a specific sound, so that the specific sound condition sent by the user can be monitored by monitoring the sound emitted by the user, without the user wearing any detecting component. Because the recognition algorithm based on MFCC feature parameters and DNN model is adopted, the algorithm has low complexity and less calculation, which has low hardware requirements and reduces product manufacturing costs.

Optionally, in other embodiments of the device, as shown in FIG. 12, the device further includes:

The feature model preset module 304 is configured to acquire the specific sound feature model based on the depth neural network in advance.

Optionally, in some embodiments of the device, the feature model preset module 304 is specifically configured to:

Optionally, in some embodiments of the device, the feature model preset module 304 is further configured to:

The feature parameter extraction module 302 is also specifically configured to:

Based on the gradient descent and back propagation algorithms of the deep neural network, each of the initial parameters is fine-tuned to obtain various parameters of a specific sound feature model based on the deep neural network.

Optionally, in some embodiments of the apparatus, the identification module 303 is specifically configured to:

Optionally, in certain embodiments of the device, the particular sound comprises any one of cough, snoring, and sneezing.

It should be noted that the foregoing apparatus can perform the method provided by the embodiment of the present application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided by the embodiments of the present application.

The embodiment of the present application also provides a specific voice recognition device. As shown in FIG. 8, the specific voice recognition device 20 includes a voice input unit 21, a signal processing unit 22, and an operation processing unit 23. The sound input unit 21 is configured to receive a sound signal, and the sound input unit may be, for example, a microphone or the like. a signal processing unit 22, configured to perform signal processing on the sound signal; the signal processing unit 22 may The obtained digital signal is sent to the arithmetic processing unit 23 by analog signal processing such as amplification, filtering, and digital-to-analog conversion of the sound signal.

The signal processing unit 22 is connected to an arithmetic processing unit 23 built in or externally to a specific sound recognition device (FIG. 13 is described by way of example in which the arithmetic processing unit is built in a specific sound recognition device), and the arithmetic processing unit 23 can be built in a specific sound. The identification device 20 may be external to the specific voice recognition device 20, and the operation processing unit 23 may also be a remotely set server, for example, a cloud server or a smart terminal that is communicably connected to the specific voice recognition device 20 through a network. Or other servers.

The operation processing unit 23 includes:

At least one processor 232 (illustrated by a processor in FIG. 13) and a memory 231, the processor 232 and the memory 231 may be connected by a bus or the like, and the bus connection is taken as an example in FIG.

The memory 231 is configured to store a non-volatile software program, a non-volatile computer executable program, and a software module, such as a program instruction/module corresponding to a specific sound recognition method in the embodiment of the present application (for example, as shown in FIG. 11) Sampling and feature parameter acquisition module 301). The processor 232 executes various functional applications and data processing by executing non-volatile software programs, instructions, and modules stored in the memory 231, that is, implementing the specific sound recognition method of the above-described method embodiments.

The memory 231 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application required for at least one function; the storage data area may store data created according to the use of the specific sound recognition device, and the like. Further, the memory 231 may include a high speed random access memory, and may also include a nonvolatile memory such as at least one magnetic disk storage device, a flash memory device, or other nonvolatile solid state storage device. In some embodiments, memory 231 can optionally include memory remotely located relative to processor 232, which can be connected to a particular voice recognition device over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The one or more modules are stored in the memory 231, and when executed by the one or more processors 232, perform a specific sound recognition method in any of the above method embodiments, for example, performing FIG. 2 described above Method steps 101-103, method steps 1021 to 1022 in FIG. 8, method steps 1031 to 1032 in FIG. 9, step 201 to step 203 in FIG. 10; implementing modules 301-303 and FIG. 12 in FIG. The function of modules 301-304 in .

The specific voice recognition device provided by the embodiment of the present application can identify a specific sound, so that the specific sound condition sent by the user can be monitored by monitoring the sound emitted by the user, without the user wearing any detecting component. And because of the identification based on MFCC feature parameters and DNN model The algorithm has low algorithm complexity and low computational complexity, which has low hardware requirements and reduces product manufacturing costs.

The specific voice recognition device can perform the method provided by the embodiment of the present application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided by the embodiments of the present application.

Embodiments of the present application provide a storage medium storing computer executable instructions that are executed by one or more processors (eg, one processor 232 in FIG. 13), such that The one or more processors may perform the specific sound recognition method in any of the above method embodiments, for example, performing the method steps 101-103 of FIG. 2 described above, the method steps 1021 to 1022 of FIG. 8, and the method of FIG. Method steps 1031 to 1032, steps 201 to 203 in FIG. 10; functions of modules 301-303 in FIG. 11 and modules 301-304 in FIG. 12 are implemented.

The embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, ie may be located in one Places, or they can be distributed to multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment.

Through the description of the above embodiments, those skilled in the art can clearly understand that the embodiments can be implemented by means of software plus a general hardware platform, and of course, by hardware. A person skilled in the art can understand that all or part of the process of implementing the above embodiments can be completed by a computer program to instruct related hardware, and the program can be stored in a computer readable storage medium. When executed, the flow of an embodiment of the methods as described above may be included. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, and are not limited thereto; in the idea of the present application, the technical features in the above embodiments or different embodiments may also be combined. The steps may be carried out in any order, and there are many other variations of the various aspects of the present application as described above, which are not provided in the details for the sake of brevity; although the present application has been described in detail with reference to the foregoing embodiments, The skilled person should understand that the technical solutions described in the foregoing embodiments may be modified, or some of the technical features may be equivalently replaced; and the modifications or substitutions do not deviate from the embodiments of the present application. The scope of the technical solution.

Claims

A specific voice recognition method, characterized in that the method comprises:

Sampling the sound signal and acquiring a characteristic parameter matrix of the Mel frequency cepstral coefficient of the sound signal;

Extracting a feature parameter from a characteristic parameter matrix of a Mel frequency cepstral coefficient of the sound signal;

The feature parameter is input to a pre-acquired deep neural network-based specific sound feature model to determine whether the sound signal is a specific sound.
The specific voice recognition method according to claim 1, wherein the method further comprises: acquiring the specific sound feature model based on the depth neural network in advance.
The specific sound recognition method according to claim 2, wherein the pre-acquiring the specific sound feature model based on the depth neural network comprises:

Collecting a preset number of specific sound sample signals and acquiring a Mel frequency cepstral coefficient characteristic parameter matrix of the specific sound sample signal;

Extracting the feature parameter from a Mel frequency cepstral coefficient feature parameter matrix of the specific sound sample signal;

Taking the characteristic parameters of the specific sound sample signal as input, training based on the depth neural network model to obtain the specific sound feature model based on the deep neural network.
The specific voice recognition method according to claim 3, wherein the extracting the feature parameters from a matrix of the frequency coefficient of the frequency coefficient of the specific frequency sample signal includes:

The Mel frequency cepstral coefficients of each signal frame in the characteristic parameter matrix of the Mel frequency cepstral coefficient of the specific sound sample signal are sequentially connected end to end to form a feature vector;

And dividing the feature vector from the feature vector header to the feature vector tail to segment the feature vector according to a preset step size, to obtain a feature parameter including a set of sub-feature vectors whose lengths are preset lengths, and each The sub-feature vectors have the same label, and the preset step size is an integral multiple of the length of the cepstral coefficient of each frame, and the preset length is an integer multiple of the length of the cepstral coefficient of each frame;

Extracting the feature parameters from the characteristic parameter matrix of the Mel frequency cepstral coefficient of the sound signal, including:

The Mel frequency cepstral coefficients of each signal frame in the characteristic parameter matrix of the Mel frequency cepstral coefficient of the sound signal are sequentially connected end to end to form a feature vector;

And passing the feature vector from the feature vector header to the feature vector tail at the preset step size The feature vector is segmented to obtain a feature parameter including a set of sub-feature vectors each having the predetermined length.
The specific voice recognition method according to claim 4, wherein the taking the feature parameter of the specific sound sample signal as an input, training based on a depth neural network model to obtain the specific sound feature based on the depth neural network Models, including:

Taking the characteristic parameters of the specific sound sample signal as input, performing model training based on a deep confidence network algorithm, and obtaining respective initial parameters of the specific sound feature model based on the depth neural network;

Based on the gradient descent and back propagation algorithms of the deep neural network, the respective initial parameters are fine-tuned to obtain various parameters of the specific sound feature model based on the deep neural network.
The specific voice recognition method according to claim 4, wherein the character parameter is input into a pre-acquired deep neural network-based specific sound feature model to determine whether the sound signal is a specific sound, include:

Inputting a set of sub-feature vectors included in the feature parameter into a pre-acquired specific sound feature model based on a depth neural network, and obtaining a prediction result corresponding to a set of sub-feature vectors;

If the positive prediction result is more than the negative prediction result among the prediction results, it is confirmed that the sound signal is a specific sound, otherwise, it is confirmed that the sound signal is not a specific sound.
The specific sound recognition method according to any one of claims 1 to 6, wherein the specific sound includes any one of a cough sound, a click sound, and a sneeze sound.
A specific voice recognition device, characterized in that the device comprises:

a sampling and feature parameter obtaining module, configured to sample a sound signal and obtain a characteristic parameter matrix of a Mel frequency cepstral coefficient of the sound signal;

a feature parameter extraction module, configured to extract a feature parameter from a Mel frequency cepstral coefficient feature parameter matrix of the sound signal;

a feature matching module, configured to confirm whether the feature parameter matches a pre-acquired deep neural network-based specific sound feature model;

And a confirmation module, configured to confirm that the sound signal is a specific sound if the feature parameter matches a pre-acquired deep neural network-based specific sound feature model.
The specific voice recognition device according to claim 8, wherein the device further comprises:

a feature model preset module, configured to pre-acquire the specific sound feature based on the deep neural network model;

The feature model preset module is specifically configured to:

Collecting a preset number of specific sound sample signals and acquiring a Mel frequency cepstral coefficient characteristic parameter matrix of the specific sound sample signal;

Extracting the feature parameter from a Mel frequency cepstral coefficient feature parameter matrix of the specific sound sample signal;

Taking the characteristic parameters of the specific sound sample signal as input, training based on the depth neural network model to obtain the specific sound feature model based on the deep neural network.
A specific voice recognition device, characterized in that the specific voice recognition device comprises:

a sound input unit for receiving a sound signal;

a signal processing unit, configured to perform analog signal processing on the sound signal;

The signal processing unit is connected to an operation processing unit built in or externally to a specific sound recognition device, and the operation processing unit includes:

At least one processor; and,

a memory communicatively coupled to the at least one processor; wherein

The memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method of any of claims 1-6 method.
A storage medium, wherein the storage medium stores executable instructions that, when executed by a specific sound recognition device, cause the specific sound recognition device to perform any one of claims 1-7 Methods.