WO2019079972A1 - Specific sound recognition method and apparatus, and storage medium - Google Patents

Specific sound recognition method and apparatus, and storage medium

Info

Publication number
WO2019079972A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
sound
specific sound
specific
signal
Prior art date
Application number
PCT/CN2017/107505
Other languages
French (fr)
Chinese (zh)
Inventor
刘洪涛
王伟
孟亚彬
Original Assignee
深圳和而泰智能控制股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳和而泰智能控制股份有限公司 filed Critical 深圳和而泰智能控制股份有限公司
Priority to PCT/CN2017/107505 priority Critical patent/WO2019079972A1/en
Priority to CN201780009004.8A priority patent/CN109074822B/en
Publication of WO2019079972A1 publication Critical patent/WO2019079972A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/66 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition

Definitions

  • the embodiments of the present application relate to sound processing technologies, and in particular, to a specific sound recognition method, device, and storage medium.
  • the inventors have found that at least the following problems exist in the related art: the existing specific voice recognition algorithm has a large amount of calculation and high requirements on hardware devices.
  • the purpose of the present application is to provide a specific voice recognition method, device and storage medium, which can identify a specific sound, and has a simple algorithm, a small amount of calculation, and low requirements on hardware devices.
  • an embodiment of the present application provides a specific voice recognition method, where the method includes:
  • the method further includes: acquiring the specific sound feature model based on the deep neural network in advance.
  • the pre-acquiring the specific sound feature model based on the deep neural network includes:
  • the extracting of the feature parameter from the Mel frequency cepstral coefficient characteristic parameter matrix of the specific sound sample signal comprises:
  • the Mel frequency cepstral coefficients of each signal frame in the characteristic parameter matrix of the Mel frequency cepstral coefficient of the specific sound sample signal are sequentially connected end to end to form a feature vector;
  • Extracting the feature parameters from the characteristic parameter matrix of the Mel frequency cepstral coefficient of the sound signal including:
  • the Mel frequency cepstral coefficients of each signal frame in the characteristic parameter matrix of the Mel frequency cepstral coefficient of the sound signal are sequentially connected end to end to form a feature vector;
  • the taking of the feature parameters of the specific sound sample signal as an input and training of a deep neural network model to obtain the specific sound feature model based on the deep neural network includes:
  • the step of inputting the feature parameter into a pre-acquired deep neural network-based specific sound feature model to determine whether the sound signal is a specific sound comprises:
  • if, among the prediction results, the positive prediction results outnumber the negative prediction results, it is confirmed that the sound signal is a specific sound; otherwise, it is confirmed that the sound signal is not a specific sound.
  • the specific sound includes any one of a coughing sound, a snoring sound, and a sneezing sound.
  • the embodiment of the present application further provides a specific voice recognition device, where the device includes:
  • a sampling and feature parameter obtaining module configured to sample a sound signal and obtain a characteristic parameter matrix of a Mel frequency cepstral coefficient of the sound signal
  • a feature parameter extraction module configured to extract a feature parameter from a Mel frequency cepstral coefficient feature parameter matrix of the sound signal
  • a feature matching module configured to confirm whether the feature parameter matches a pre-acquired deep neural network-based specific sound feature model
  • a confirmation module configured to confirm that the sound signal is a specific sound if the feature parameter matches a pre-acquired deep neural network-based specific sound feature model.
  • the device further includes:
  • a feature model preset module configured to pre-acquire the specific sound feature model based on the deep neural network
  • the feature model preset module is specifically configured to:
  • the embodiment of the present application further provides a specific voice recognition device, where the specific voice recognition device includes:
  • a sound input unit for receiving a sound signal
  • a signal processing unit configured to perform signal processing on the sound signal
  • the signal processing unit is connected to an operation processing unit built into or external to the specific sound recognition device, and the operation processing unit includes:
  • at least one processor and a memory communicatively connected to the at least one processor, wherein
  • the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method described above.
  • the embodiment of the present application further provides a storage medium, where the storage medium stores executable instructions which, when executed by a specific sound recognition device, cause the specific sound recognition device to perform the foregoing method.
  • the embodiment of the present application further provides a program product, where the program product includes a program stored on a storage medium, the program including program instructions which, when executed by a specific sound recognition device, cause the specific sound recognition device to perform the method described above.
  • the specific sound recognition method, device, and storage medium provided by the embodiments of the present application adopt a recognition algorithm based on Mel frequency cepstral coefficient feature parameters and a deep neural network model; the algorithm has low complexity and a small amount of calculation, so the hardware requirements are low and product manufacturing costs are reduced.
  • FIG. 1 is a schematic structural diagram of an application environment of each embodiment of the present application.
  • FIG. 2 is a schematic flowchart of pre-acquiring a specific sound feature model based on a deep neural network in a specific voice recognition method provided by an embodiment of the present application;
  • FIG. 3 is a schematic diagram of the Mel frequency filtering process in the MFCC coefficient calculation;
  • Figure 4 is a time-amplitude diagram of a coughing sound signal
  • FIG. 5 is a schematic diagram of the feature parameter extraction step dividing a feature vector into sub-feature vectors;
  • Figure 6 is a schematic diagram of a general deep neural network structure
  • FIG. 7 is a schematic diagram of a general deep belief network structure;
  • FIG. 8 is a schematic flowchart of the step of extracting feature parameters in a specific sound recognition method according to an embodiment of the present application;
  • FIG. 9 is a schematic flowchart of steps of training a specific sound feature model based on a deep neural network in a specific voice recognition method provided by an embodiment of the present application.
  • FIG. 10 is a schematic flowchart of a specific voice recognition method provided by an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of a specific voice recognition apparatus according to an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of a specific voice recognition apparatus according to an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a specific voice recognition device according to an embodiment of the present application.
  • the embodiment of the present application proposes a specific sound recognition scheme based on the Mel Frequency Cepstral Coefficients (MFCC) characteristic parameters and the Deep Neural Network (DNN) algorithm, which is applicable to the application environment shown in FIG. 1.
  • the application environment includes a user 10 and a specific sound recognition device 20 for receiving a sound from the user 10 and identifying the sound to determine whether the sound is a specific sound.
  • the specific sound recognition device 20 can also record and process the specific sound to output situation information about the specific sounds made by the user 10.
  • the situation information for the particular sound may include the number of times the particular sound occurs, the duration of the particular sound, and the decibel level of the particular sound.
  • a counter may be included in the specific sound recognition device to count the occurrences of the specific sound when it is detected; a timer may be included to measure the duration of the specific sound when it is detected; and a decibel detection means may be included to measure the decibel level of the specific sound when it is detected.
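  • For illustration only, such situation information could be accumulated with a structure like the following minimal Python sketch (the class and field names are hypothetical assumptions, not part of the patent):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SituationInfo:
    """Accumulates the situation information described above."""
    count: int = 0                               # occurrences of the specific sound
    durations_ms: List[float] = field(default_factory=list)
    decibels: List[float] = field(default_factory=list)

    def record(self, duration_ms: float, decibel: float) -> None:
        # Called each time the recognizer confirms a specific sound.
        self.count += 1
        self.durations_ms.append(duration_ms)
        self.decibels.append(decibel)
```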
  • the recognition principle of the specific sound in the embodiment of the present application is similar to the principle of the speech recognition.
  • the input sound is processed and then input into the sound model to be recognized, thereby obtaining the recognition result. It can be divided into two phases, namely a specific sound model training phase and a specific sound recognition phase.
  • The specific sound model training phase mainly collects a certain number of specific sound sample signals, calculates the MFCC feature parameter matrix of each specific sound sample signal, extracts the feature parameters from the MFCC feature parameter matrix, and trains on the feature parameters with the DNN algorithm to obtain the specific sound feature model.
  • In the specific sound recognition phase, the MFCC feature parameter matrix is calculated for the sound signal to be judged, the corresponding feature parameters are extracted from the MFCC feature parameter matrix of the sound signal, and the feature parameters are then input into the specific sound feature model for identification, to determine whether the sound signal is a specific sound.
  • the identification process mainly includes steps of preprocessing, feature extraction, model training, pattern matching and decision.
  • In the pre-processing step, a specific sound sample signal is sampled and the MFCC feature parameter matrix of the specific sound sample signal is calculated.
  • In the feature extraction step, feature parameters are extracted from the MFCC feature parameter matrix.
  • In the model training step, the feature parameters extracted from the MFCC feature parameter matrix of the specific sound sample signal are taken as inputs, and a specific sound feature model based on the deep neural network is trained.
  • In the pattern matching and decision step, the specific sound feature model is used to identify whether a new sound signal is a specific sound.
  • Identifying whether the new sound signal is a specific sound comprises: first calculating the MFCC feature parameter matrix of the sound signal, then extracting the characteristic parameters of the sound signal from that matrix, and then inputting the characteristic parameters into the specific sound feature model for identification, to determine whether the sound signal is a specific sound.
  • Combining MFCC and DNN to identify specific sounds can reduce the complexity of the algorithm, cut the amount of computation, and significantly improve the accuracy of specific sound recognition.
  • The embodiment of the present application provides a specific sound recognition method, which can be used in the specific sound recognition device 20. The method requires a DNN-based specific sound feature model to be obtained in advance; this model can be pre-configured, or it can be trained by the method in the following steps 101 to 103. After the DNN-based specific sound feature model is trained, specific sounds can be identified based on it; further, if the accuracy with which the model identifies a particular sound becomes unacceptable due to scene changes or other reasons, the model can be reconfigured or retrained.
  • the pre-obtaining DNN-based specific sound feature model includes:
  • Step 101 Acquire a preset number of specific sound sample signals and acquire a Mel frequency cepstral coefficient characteristic parameter matrix of the specific sound sample signal;
  • A specific sound sample signal s(n) is sampled, and the MFCC feature parameter matrix of the specific sound sample signal is obtained from it.
  • Mel frequency cepstral coefficients are mainly used for feature extraction from sound data and for reducing the operational dimensionality. For example, from a frame with 512 dimensions (sampling points), the most important 40 dimensions can be extracted after MFCC processing, which also achieves the purpose of dimensionality reduction.
  • The calculation of the Mel frequency cepstral coefficients generally includes: pre-emphasis, framing, windowing, fast Fourier transform, Mel filter bank, and discrete cosine transform.
  • Obtaining the MFCC feature parameter matrix of the specific sound sample signal includes the following steps:
  • Pre-emphasis: the purpose of pre-emphasis is to boost the high-frequency portion and flatten the spectrum of the signal, keeping the signal-to-noise ratio the same across the entire band from low frequency to high frequency. At the same time, it eliminates the effect of the vocal cords and lips during sound production, compensates for the high-frequency part of the sound signal that is suppressed by the sound production system, and highlights the high-frequency formants.
  • In implementation, the sampled specific sound sample signal s(n) is passed through a first-order finite impulse response (FIR) high-pass digital filter whose transfer function is H(z) = 1 - a·z^(-1).
  • In the time domain this amounts to s'(n) = s(n) - a·s(n-1), where s(n) is the specific sound sample signal and a is the pre-emphasis coefficient, generally a constant between 0.9 and 1.0.
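  • As a minimal sketch (Python with NumPy; the coefficient value 0.97 is an assumption within the 0.9 to 1.0 range stated above), the pre-emphasis step can be written as:

```python
import numpy as np

def pre_emphasis(s: np.ndarray, a: float = 0.97) -> np.ndarray:
    """Apply the first-order FIR high-pass filter s'(n) = s(n) - a*s(n-1).

    The default a = 0.97 is a common choice within the 0.9 to 1.0 range
    mentioned above, not a value fixed by this patent.
    """
    return np.append(s[0], s[1:] - a * s[:-1])
```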
  • Framing: every P sample points of the specific sound sample signal s(n) are grouped into one observation unit, called a frame.
  • the value of P can be 256 or 512, and the time covered is about 20 to 30 ms.
  • To avoid excessive variation between two adjacent frames, adjacent frames are made to overlap; the overlapping area contains G sampling points, and the value of G may be about 1/2 or 1/3 of P.
  • Each frame must also undergo a fast Fourier transform to obtain its energy distribution over the spectrum: the fast Fourier transform is performed on each framed and windowed signal to obtain the spectrum of each frame.
  • Taking the squared modulus of the spectrum of the specific sound sample signal yields the power spectrum of the specific sound sample signal.
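  • The framing, windowing, and power-spectrum steps above can be sketched as follows (Python with NumPy; the Hamming window and the P and G values are assumptions, since the text fixes only their typical ranges):

```python
import numpy as np

def frames_power_spectrum(s: np.ndarray, P: int = 512, G: int = 256,
                          n_fft: int = 512) -> np.ndarray:
    """Split s into frames of P samples overlapping by G samples, window
    each frame, and return the per-frame power spectrum (assumes len(s) >= P).
    """
    step = P - G                                  # hop between frame starts
    n_frames = 1 + (len(s) - P) // step
    window = np.hamming(P)                        # window type is an assumption
    frames = np.stack([s[i * step : i * step + P] for i in range(n_frames)])
    spectrum = np.fft.rfft(frames * window, n=n_fft, axis=1)  # FFT per frame
    return (np.abs(spectrum) ** 2) / n_fft        # squared modulus -> power
```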
  • The energy spectrum is then filtered through a Mel-scale triangular filter bank: a bank of M filters is defined, the number of filters being close to the number of critical bands.
  • The spacing between the center frequencies f(m) decreases as m decreases and widens as m increases; please refer to FIG. 3.
  • The frequency response of the mth triangular filter is defined as:
  • H_m(k) = 0 for k < f(m-1); H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) <= k <= f(m); H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) <= k <= f(m+1); H_m(k) = 0 for k > f(m+1).
  • The MFCC is obtained by applying the discrete cosine transform (DCT) to the logarithmic energies s(m): C(n) = sum over m = 0..M-1 of s(m)·cos(pi·n·(m + 0.5)/M), n = 1, 2, ..., L.
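  • Taken together, the whole MFCC computation is available in common audio libraries; a hedged sketch using librosa follows (the file name, sample rate, and coefficient count are illustrative assumptions):

```python
import librosa

# "cough_sample.wav", sr=16000 and n_mfcc=13 are illustrative assumptions.
y, sr = librosa.load("cough_sample.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=512, hop_length=256)
# librosa returns an L x N array (coefficients x frames); transpose to get
# the N x L layout (frames x coefficients) used in the text above.
mfcc_matrix = mfcc.T
```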
  • Step 102 Extract the feature parameter from a Mel frequency cepstral coefficient characteristic parameter matrix of the specific sound sample signal
  • The MFCC feature parameter matrix is an N*L coefficient matrix, where N is the number of sound signal frames and L is the MFCC length per frame. The MFCC feature parameter matrix has a high dimension, and because sound signals differ in length, the number of matrix rows N differs from signal to signal.
  • The MFCC feature parameter matrix therefore cannot be used directly as the input for obtaining the DNN-based specific sound feature model, and the characteristic parameters need to be further extracted from it.
  • The purpose of extracting the feature parameters is to capture the characteristics of the specific sound sample signal so as to mark its segments, and to use the feature parameters as the input for training the DNN-based specific sound feature model.
  • the feature parameters can be extracted from the MFCC feature parameter matrix in combination with the time domain or frequency domain characteristics of the particular sound signal.
  • FIG. 4 is a time-amplitude diagram (time domain diagram) of the coughing sound signal.
  • The coughing sound signal is produced in a short, clearly sudden process: the duration of a single cough is usually less than 550 ms, and even for patients with severe throat and bronchial diseases the duration generally stays within 1000 ms. From the energy point of view, the energy of the coughing sound signal is mainly concentrated in the first half of the signal, so after the MFCC calculation the main characteristic information of the cough sound sample signal is basically concentrated in the first half of the cough sound sample signal. The characteristic parameters input into the deep neural network should cover the main information of the cough sound sample signal as far as possible, ensuring that the feature parameters extracted from the MFCC feature parameter matrix carry useful information rather than redundant information.
  • Since the main characteristic information of the cough sound sample signal is basically concentrated in the first half of the signal, the characteristic parameters of the first fixed number of frames of the cough sound sample signal may be selected as the input of the deep neural network.
  • The fixed number of frames should be chosen so that it covers as much as possible of the first half of each cough sound sample signal.
  • The remaining feature data in the MFCC feature parameter matrix can also be used as input of the deep neural network.
  • To this end, the MFCC feature parameter matrix can be segmented according to the fixed frame number, and the segmented pieces of data are then used together as the input of the deep neural network.
  • the feature parameters are extracted from the characteristic parameter matrix of the Mel frequency cepstral coefficient of the specific sound sample signal, including:
  • Step 1021 The Mel frequency cepstral coefficients of each signal frame in the characteristic parameter matrix of the Mel frequency cepstral coefficient of the specific sound sample signal are sequentially connected end to end to form a vector;
  • Step 1022 The vector is segmented from the vector head to the vector tail according to a preset step size (in units of frames), obtaining a feature parameter that comprises a set of subvectors each of a preset length (i.e., a fixed number of frames), every subvector carrying the same label.
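  • A minimal sketch of steps 1021 and 1022 (Python with NumPy; the preset length and preset step values below are illustrative assumptions, since the patent only requires both to be integer multiples of one frame's MFCC length):

```python
import numpy as np

def extract_sub_features(mfcc_matrix: np.ndarray,
                         preset_len_frames: int = 20,
                         step_frames: int = 5) -> np.ndarray:
    """Step 1021: concatenate the per-frame MFCCs end to end into one
    feature vector. Step 1022: cut that vector into equal-length
    sub-feature vectors with a sliding step."""
    n_frames, L = mfcc_matrix.shape
    feature_vector = mfcc_matrix.reshape(-1)      # frames joined end to end
    win, hop = preset_len_frames * L, step_frames * L
    subvectors = [feature_vector[i : i + win]
                  for i in range(0, len(feature_vector) - win + 1, hop)]
    return np.stack(subvectors)                   # one row per sub-feature vector
```

  • Every row returned by this sketch would then carry the same label as the sound sample it came from, as described below.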
  • When the specific sound is a coughing sound, the number of frames in the first half of a typical coughing sound signal can be determined statistically, the preset length is then set according to that number of frames, and the preset step size can be chosen in combination with the actual application.
  • When the specific sound is another sound, such as humming or snoring, the preset length and the preset step size may likewise be chosen according to its time domain and frequency domain characteristics.
  • The sub-feature vectors obtained in this way meet the requirements on the input data of the deep neural network and can be used directly as its inputs.
  • Each sub-feature vector in the set is given the same label, that is, a set of sub-feature vectors is used to express the same specific sound sample signal, which increases the number of data samples and avoids loss of information when extracting the feature parameters.
  • a specific sound feature model based on the deep neural network is established, and the specific sound feature model is used to identify the specific sound, which reduces the false recognition rate and improves the accuracy of the specific sound recognition.
  • the recognition rate of the coughing sound can reach 95% or more without increasing the amount of calculation.
  • Step 103 Taking the feature parameters of the specific sound sample signal as input, train a deep neural network model to obtain the specific sound feature model based on the deep neural network.
  • The DNN is an extension of shallow neural networks. It exploits the expressive power of multi-layer neural networks and has very good feature extraction, learning, and generalization ability for nonlinear, high-dimensional data.
  • The DNN model generally includes an input layer, hidden layers, and an output layer. Please refer to FIG. 6, where the first layer is the input layer, the middle layers are hidden layers, and the last layer is the output layer (FIG. 6 shows only three hidden layers; in practice more hidden layers may be included). The layers are fully connected, that is, every neuron in the Qth layer is connected to every neuron in the (Q+1)th layer.
  • Each connection established between neurons has a linear weight, and every neuron in every layer except the input layer has a bias.
  • The linear weight from the kth neuron of the (l-1)th layer to the jth neuron of the lth layer is defined as w^l_jk, where the superscript l denotes the layer in which the weight terminates and the subscripts give the output-layer index j followed by the input (l-1)th-layer index k; for example, the linear weight from the 4th neuron of the 2nd layer to the 2nd neuron of the 3rd layer is defined as w^3_24.
  • The bias of the ith neuron of the lth layer is b^l_i, where the superscript l denotes the layer number and the subscript i denotes the index of the neuron carrying the bias; for example, the bias of the 3rd neuron of the 2nd layer is defined as b^2_3.
  • During training, a set of w^l_jk and b^l_i values can be randomly initialized; using the forward propagation algorithm, the feature parameters of the specific sound sample signal are taken as the data of the input layer, the first hidden layer is computed from the input layer, the second hidden layer from the first hidden layer, and so on up to the output layer.
  • The back propagation algorithm is then used to fine-tune w^l_jk and b^l_i to obtain the specific sound feature model based on the deep neural network.
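  • As a sketch of the forward pass just described (random initialization, then layer-by-layer computation; Python with NumPy, with illustrative layer sizes and a sigmoid activation, both assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
layer_sizes = [800, 256, 256, 2]      # illustrative; not fixed by the patent

# Randomly initialize the linear weights w^l_jk and biases b^l_i.
weights = [rng.normal(0.0, 0.01, (m, n))
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def forward(x: np.ndarray) -> np.ndarray:
    """Propagate the feature parameters x through the layers in turn."""
    a = x
    for w, b in zip(weights, biases):
        a = sigmoid(a @ w + b)        # each layer computes the next layer's input
    return a                          # the patent's output layer uses Softmax
```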
  • Taking the feature parameters of the specific sound sample signal as input, training a deep neural network model to obtain the specific sound feature model based on the deep neural network includes:
  • Step 1031 Taking the characteristic parameters of the specific sound sample signal as input, perform model training based on a deep belief network algorithm to obtain the initial parameters of the specific sound feature model based on the deep neural network;
  • The deep belief network (DBN) is a deep learning model that pre-trains the model layer by layer in an unsupervised way.
  • The unit of this unsupervised pre-training is the Restricted Boltzmann Machine (RBM); the DBN is a stack of RBMs.
  • An RBM is a two-layer structure: v is the visible layer and h is the hidden layer; the connections between the visible layer and the hidden layer are undirected (values can be passed in either direction, visible layer -> hidden layer or hidden layer -> visible layer) and fully connected.
  • The visible layer v and the hidden layer h are connected by linear weights: the linear weight between the ith neuron of the visible layer and the jth neuron of the hidden layer is defined as w_ij, the bias of the ith neuron of the visible layer is b_i, and the bias of the jth neuron of the hidden layer is a_j, where the subscripts i and j denote the neuron indices.
  • The RBM performs one-step Gibbs sampling via the contrastive divergence algorithm and optimizes the weights w_ij and the biases b_i and a_j, obtaining another state expression h of the input sample data v (that is, the characteristic parameters of the specific sound sample signal). The output h1 of this RBM can be used as the input of the next RBM, which is optimized in the same way to obtain the hidden state h2, and so on. A multi-layer DBN model can thus use this layer-by-layer pre-training to initialize the weights w_ij and the biases b_i and a_j, with the features of each layer being an expression of the original input data v. After this unsupervised pre-training, the initial parameters are obtained.
  • The RBM is an energy model; the energy of the entire RBM is expressed by formula (6):
  • E(v, h | theta) = -sum_{i=1..m} b_i·v_i - sum_{j=1..n} a_j·h_j - sum_{i=1..m} sum_{j=1..n} v_i·w_ij·h_j (6)
  • where E is the total energy of the RBM model, v is the visible layer data, h is the hidden layer data, theta = {w, a, b} is the set of model parameters, m is the number of visible layer neurons, n is the number of hidden layer neurons, b denotes the visible layer biases, and a denotes the hidden layer biases.
  • The RBM model samples based on the conditional probabilities of the visible layer data and the hidden layer data, given by formulas (7) and (8):
  • P(h_j = 1 | v) = sigma(a_j + sum_i v_i·w_ij) (7)
  • P(v_i = 1 | h) = sigma(b_i + sum_j h_j·w_ij) (8)
  • where sigma denotes the sigmoid activation function, sigma(x) = (1 + e^(-x))^(-1).
  • The contrastive divergence algorithm is used for Gibbs sampling of the RBM to obtain samples of the joint distribution of v and h, and the parameters are then optimized by maximizing the log-likelihood function (9) of the observed samples.
  • The optimization uses a one-step contrastive divergence algorithm, which applies the mean field approximation to generate the sampling samples directly and iteratively optimizes the parameters several times with formula (10), finally yielding the initial values of the weights between the neurons and the biases of the neurons. Here N represents the number of visible layer neurons of the RBM model, that is, the dimension of the input data of the RBM model.
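  • A hedged sketch of one CD-1 update for a single binary RBM (Python with NumPy; the learning rate is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, a, lr=0.1):
    """One contrastive-divergence (CD-1) update.

    v0: batch of visible data (batch x m); W: m x n weights;
    b: visible biases (m,); a: hidden biases (n,). lr is an assumption.
    """
    ph0 = sigmoid(v0 @ W + a)                     # P(h=1|v0), formula (7)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + b)                   # mean-field reconstruction, formula (8)
    ph1 = sigmoid(pv1 @ W + a)
    # Update from the data vs. reconstruction statistics.
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    b += lr * (v0 - pv1).mean(axis=0)
    a += lr * (ph0 - ph1).mean(axis=0)
    return W, b, a
```

  • The hidden activations of a trained RBM would then serve as the visible data of the next RBM in the stack, as described above.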
  • Step 1032 Fine-tune the initial parameters based on the gradient descent and back propagation algorithms of the deep neural network, to obtain the parameters of the specific sound feature model based on the deep neural network.
  • Through the pre-training, the weights w and biases b of the neurons between the layers (input layer, hidden layers, and output layer) of the DNN-based specific sound feature model are obtained; the final multi-class logistic regression layer (Softmax) uses random initialization, and the DNN then fine-tunes the specific sound feature model using a supervised gradient descent algorithm.
  • The DNN-based specific sound feature model is fine-tuned by minimizing the cost function (formula (11)) and updating the parameters accordingly, e.g. W := W - alpha·dJ/dW and b := b - alpha·dJ/db (formula (12)).
  • In these formulas, J represents the cost function, h_{W,b}(x) represents the output of the DNN, y represents the label corresponding to the input data, and alpha represents the learning rate, taking values from 0.5 to 0.01.
  • The partial derivatives with respect to each node of the deep neural network required in formula (12) can be computed with the back propagation algorithm of formula (13), in which delta represents the sensitivity, a represents the output value of each neuron node, l denotes the layer (the last layer being the output layer), and f represents the activation function.
  • The parameters are updated iteratively according to formula (13), and the entire DNN model is optimized layer by layer; the final parameters are thus obtained, yielding a trained DNN-based specific sound feature model.
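  • One supervised update of the fine-tuning stage might look like the following sketch (Python with NumPy; sigmoid layers and a squared-error cost are assumptions, and the patent's Softmax output layer would change only the last-layer sensitivity):

```python
import numpy as np

def backprop_update(x, y, weights, biases, lr=0.1):
    """One gradient-descent step: forward pass, sensitivities delta per
    layer (cf. formula (13)), then weight/bias updates (cf. formula (12))."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    activations = [x]
    for w, b in zip(weights, biases):                 # forward pass
        activations.append(sigmoid(activations[-1] @ w + b))
    # Output-layer sensitivity: (h_{W,b}(x) - y) * f'(z), with f'(z) = a*(1-a).
    delta = (activations[-1] - y) * activations[-1] * (1 - activations[-1])
    for l in range(len(weights) - 1, -1, -1):         # backpropagate
        grad_w = activations[l].T @ delta / len(x)
        grad_b = delta.mean(axis=0)
        if l > 0:                                     # sensitivity of the layer below
            delta = (delta @ weights[l].T) * activations[l] * (1 - activations[l])
        weights[l] -= lr * grad_w                     # update, cf. formula (12)
        biases[l] -= lr * grad_b
    return weights, biases
```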
  • The DNN model obtained in this way performs significantly better than an ordinary deep neural network.
  • the MFCC feature parameter of the specific sound sample signal is used as the input of the DNN model to obtain a specific sound feature model based on the DNN, and the specific sound feature model is used to identify the specific sound, thereby effectively improving the recognition rate of the specific sound.
  • FIG. 10 is a schematic flowchart of a specific sound recognition method according to an embodiment of the present application. As shown in FIG. 10, the specific sound recognition method includes:
  • Step 201 sampling a sound signal and acquiring a characteristic parameter matrix of a Mel frequency cepstral coefficient of the sound signal;
  • A sound input unit (for example, a microphone) may be disposed on the specific sound recognition device 20 to collect the sound signal; the sound signal is amplified, filtered, and otherwise conditioned, and then converted into a digital signal.
  • the digital signal may be sampled and processed in an operation processing unit local to the specific voice recognition device 20, or may be uploaded to a cloud server, a smart terminal, or other server for processing through a network.
  • For the technical details of obtaining the Mel frequency cepstral coefficient characteristic parameter matrix of the sound signal, refer to step 101; they are not repeated here.
  • Step 202 Extract a feature parameter from a characteristic parameter matrix of a Mel frequency cepstral coefficient of the sound signal.
  • For the specific calculation method for extracting the feature parameters from the Mel frequency cepstral coefficient characteristic parameter matrix of the sound signal, refer to step 102; it is not repeated here.
  • Step 203 Input the feature parameter into a pre-acquired deep neural network-based specific sound feature model to determine whether the sound signal is a specific sound.
  • the feature parameter is input into a pre-acquired deep neural network-based specific sound feature model to determine whether the sound signal is a specific sound, including:
  • If, among the prediction results, the positive prediction results outnumber the negative prediction results, it is confirmed that the sound signal is a specific sound; otherwise, it is confirmed that the sound signal is not a specific sound.
  • When the characteristic parameters of the sound signal are input into the trained DNN-based specific sound feature model, a prediction of whether the sound signal is the specific sound is obtained. Since the characteristic parameter of one sound signal contains a plurality of sub-feature vectors and each sub-feature vector yields one prediction result, each sound signal yields a plurality of prediction results, each representing the likelihood that the sound signal is the specific sound.
  • The DNN-based specific sound feature model votes over all the prediction results of the same sound signal: among the prediction results of all the sub-feature vectors, if the positive prediction results outnumber the negative prediction results, the sound signal is confirmed to be the specific sound; if the positive prediction results are fewer than the negative prediction results, the sound signal is confirmed not to be the specific sound.
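  • The voting rule can be sketched as follows (Python; model_predict stands in for the trained DNN-based specific sound feature model and is a hypothetical callable returning 1 for a positive prediction and 0 for a negative one):

```python
import numpy as np

def is_specific_sound(subvectors, model_predict) -> bool:
    """Vote over the per-subvector predictions of one sound signal."""
    predictions = np.array([model_predict(v) for v in subvectors])
    positives = int(predictions.sum())
    negatives = len(predictions) - positives
    return positives > negatives      # majority of positive predictions wins
```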
  • The specific sound recognition method provided by the embodiments of the present application can identify a specific sound, so that the specific sounds made by a user can be monitored simply by monitoring the sounds the user emits, without the user wearing any detecting component. Moreover, because the recognition algorithm is based on the MFCC feature parameters and the DNN model, it has low complexity and a small amount of calculation, which lowers the hardware requirements and reduces product manufacturing costs.
  • In addition to the cough sound, the specific sound recognition method based on the MFCC feature parameters and the DNN model provided by the embodiments of the present application is also applicable to identifying other specific sounds such as snoring, sneezing, breathing, laughing, firecrackers, and crying.
  • The embodiment of the present application further provides a specific sound recognition apparatus, which can be used in the specific sound recognition device 20, where the apparatus includes:
  • the sampling and feature parameter obtaining module 301 is configured to sample the sound signal and obtain a characteristic parameter matrix of the Mel frequency cepstral coefficient of the sound signal;
  • the feature parameter extraction module 302 is configured to extract feature parameters from the Mel frequency cepstral coefficient feature parameter matrix of the sound signal
  • the identification module 303 is configured to input the feature parameter into a pre-acquired deep neural network-based specific sound feature model to determine whether the sound signal is a specific sound.
  • The specific sound recognition apparatus provided by the embodiment of the present application can identify a specific sound, so that the specific sounds made by a user can be monitored simply by monitoring the sounds the user emits, without the user wearing any detecting component. Because a recognition algorithm based on the MFCC feature parameters and the DNN model is adopted, the algorithm has low complexity and a small amount of calculation, which lowers the hardware requirements and reduces product manufacturing costs.
  • the device further includes:
  • the feature model preset module 304 is configured to acquire the specific sound feature model based on the deep neural network in advance.
  • the feature model preset module 304 is specifically configured to:
  • the feature model preset module 304 is further configured to:
  • the Mel frequency cepstral coefficients of each signal frame in the characteristic parameter matrix of the Mel frequency cepstral coefficient of the specific sound sample signal are sequentially connected end to end to form a feature vector;
  • the feature parameter extraction module 302 is also specifically configured to:
  • the Mel frequency cepstral coefficients of each signal frame in the characteristic parameter matrix of the Mel frequency cepstral coefficient of the sound signal are sequentially connected end to end to form a feature vector;
  • the feature model preset module 304 is further configured to:
  • each of the initial parameters is fine-tuned to obtain various parameters of a specific sound feature model based on the deep neural network.
  • the identification module 303 is specifically configured to:
  • if, among the prediction results, the positive prediction results outnumber the negative prediction results, it is confirmed that the sound signal is a specific sound; otherwise, it is confirmed that the sound signal is not a specific sound.
  • the particular sound comprises any one of cough, snoring, and sneezing.
  • the foregoing apparatus can perform the method provided by the embodiment of the present application, and has the corresponding functional modules and beneficial effects of the execution method.
  • the specific voice recognition device 20 includes a voice input unit 21, a signal processing unit 22, and an operation processing unit 23.
  • the sound input unit 21 is configured to receive a sound signal, and the sound input unit may be, for example, a microphone or the like.
  • The signal processing unit 22 is connected to an operation processing unit 23 built into or external to the specific sound recognition device (FIG. 13 takes as an example the case where the operation processing unit is built into the specific sound recognition device). The operation processing unit 23 may be built into the specific sound recognition device 20 or may be external to it; the operation processing unit 23 may also be a remotely located server, for example, a cloud server, a smart terminal, or another server communicably connected to the specific sound recognition device 20 through a network.
  • the operation processing unit 23 includes:
  • At least one processor 232 (one processor is illustrated in FIG. 13) and a memory 231; the processor 232 and the memory 231 may be connected by a bus or other means, and a bus connection is taken as the example in FIG. 13.
  • The memory 231 is configured to store non-volatile software programs, non-volatile computer executable programs, and software modules, such as the program instructions/modules corresponding to the specific sound recognition method in the embodiment of the present application (for example, the sampling and feature parameter acquisition module 301 shown in FIG. 11).
  • the processor 232 executes various functional applications and data processing by executing non-volatile software programs, instructions, and modules stored in the memory 231, that is, implementing the specific sound recognition method of the above-described method embodiments.
  • the memory 231 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application required for at least one function; the storage data area may store data created according to the use of the specific sound recognition device, and the like. Further, the memory 231 may include a high speed random access memory, and may also include a nonvolatile memory such as at least one magnetic disk storage device, a flash memory device, or other nonvolatile solid state storage device. In some embodiments, memory 231 can optionally include memory remotely located relative to processor 232, which can be connected to a particular voice recognition device over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • The one or more modules are stored in the memory 231 and, when executed by the one or more processors 232, perform the specific sound recognition method in any of the above method embodiments, for example, performing method steps 101-103 in FIG. 2 described above, method steps 1021 to 1022 in FIG. 8, method steps 1031 to 1032 in FIG. 9, and steps 201 to 203 in FIG. 10, and implementing the functions of modules 301-303 in FIG. 11 and modules 301-304 in FIG. 12.
  • The specific sound recognition device provided by the embodiment of the present application can identify a specific sound, so that the specific sounds made by a user can be monitored simply by monitoring the sounds the user emits, without the user wearing any detecting component. Moreover, because the recognition algorithm is based on the MFCC feature parameters and the DNN model, it has low complexity and a small amount of calculation, which lowers the hardware requirements and reduces product manufacturing costs.
  • the specific voice recognition device can perform the method provided by the embodiment of the present application, and has the corresponding functional modules and beneficial effects of the execution method.
  • Embodiments of the present application provide a storage medium storing computer executable instructions which, when executed by one or more processors (for example, the processor 232 in FIG. 13), cause the one or more processors to perform the specific sound recognition method in any of the above method embodiments, for example, performing method steps 101-103 in FIG. 2 described above, method steps 1021 to 1022 in FIG. 8, method steps 1031 to 1032 in FIG. 9, and steps 201 to 203 in FIG. 10, and implementing the functions of modules 301-303 in FIG. 11 and modules 301-304 in FIG. 12.
  • The embodiments described above are merely illustrative: the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • the embodiments can be implemented by means of software plus a general hardware platform, and of course, by hardware.
  • A person skilled in the art can understand that all or part of the process of implementing the above embodiments can be completed by a computer program instructing related hardware; the program can be stored in a computer readable storage medium and, when executed, may include the flow of an embodiment of the methods described above.
  • the storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Image Analysis (AREA)

Abstract

A specific sound recognition method and apparatus, and a storage medium. The method comprises: sampling a sound signal and obtaining a Mel Frequency Cepstral Coefficient (MFCC) characteristic parameter matrix of the sound signal (201); extracting a characteristic parameter from the MFCC characteristic parameter matrix of the sound signal (202); and inputting the characteristic parameter into a pre-obtained specific sound characteristic model based on a deep neural network for recognition, to determine whether the sound signal is a specific sound (203). The method and apparatus adopt a recognition algorithm based on an MFCC characteristic parameter and a deep neural network model; the algorithm has low complexity and a small amount of calculation, so the hardware requirements are low and the manufacturing cost of the product is reduced.

Description

Specific sound recognition method, device and storage medium
Technical field
The embodiments of the present application relate to sound processing technologies, and in particular to a specific sound recognition method, device, and storage medium.
Background
In daily life, we hear certain sounds that carry no actual semantics, such as snoring, coughing, and sneezing. Although they have no actual semantics, they can accurately reflect people's physiological needs and state, or the quality of materials. For example, a doctor can assess people's health from a patient's snoring, coughing, sneezing, and so on. Such specific sounds are relatively simple and repetitive in content, yet they are an indispensable part of our life, so effectively identifying and judging various specific sound signals is of great significance.
At present, research has attempted to identify specific sounds through speech recognition technology. For example, there is a recognition method for cough sounds that combines the characteristics of cough sounds with speech recognition technology to establish a cough model, and uses a model matching method based on the Dynamic Time Warping (DTW) algorithm to recognize the isolated cough sounds of a specific person.
In the process of implementing the present application, the inventors found that the related art has at least the following problem: the existing specific sound recognition algorithms involve a large amount of calculation and place high requirements on hardware devices.
Summary of the invention
The purpose of the present application is to provide a specific sound recognition method, device, and storage medium that can identify a specific sound with a simple algorithm, a small amount of calculation, and low requirements on hardware devices.
To achieve the above objective, in a first aspect, an embodiment of the present application provides a specific sound recognition method, the method including:
sampling a sound signal and acquiring a Mel frequency cepstral coefficient characteristic parameter matrix of the sound signal;
extracting feature parameters from the Mel frequency cepstral coefficient characteristic parameter matrix of the sound signal;
inputting the feature parameters into a pre-acquired specific sound feature model based on a deep neural network for recognition, to determine whether the sound signal is a specific sound.
Optionally, the method further includes: acquiring the specific sound feature model based on the deep neural network in advance.
Optionally, the acquiring in advance of the specific sound feature model based on the deep neural network includes:
collecting a preset number of specific sound sample signals and acquiring the Mel frequency cepstral coefficient characteristic parameter matrix of the specific sound sample signals;
extracting the feature parameters from the Mel frequency cepstral coefficient characteristic parameter matrix of the specific sound sample signals;
taking the feature parameters of the specific sound sample signals as input, training a deep neural network model to obtain the specific sound feature model based on the deep neural network.
Optionally, the extracting of the feature parameters from the Mel frequency cepstral coefficient characteristic parameter matrix of the specific sound sample signal includes:
connecting the Mel frequency cepstral coefficients of the signal frames in the Mel frequency cepstral coefficient characteristic parameter matrix of the specific sound sample signal end to end in order, to form a feature vector;
segmenting the feature vector from its head to its tail according to a preset step size, to obtain a feature parameter comprising a set of sub-feature vectors each of a preset length, each sub-feature vector having the same label, the preset step size being an integer multiple of the length of one frame's Mel frequency cepstral coefficients, and the preset length being an integer multiple of the length of one frame's Mel frequency cepstral coefficients;
the extracting of feature parameters from the Mel frequency cepstral coefficient characteristic parameter matrix of the sound signal includes:
connecting the Mel frequency cepstral coefficients of the signal frames in the Mel frequency cepstral coefficient characteristic parameter matrix of the sound signal end to end in order, to form a feature vector;
segmenting the feature vector from its head to its tail according to the preset step size, to obtain a feature parameter comprising a set of sub-feature vectors each of the preset length.
Optionally, the taking of the feature parameters of the specific sound sample signal as input and training of a deep neural network model to obtain the specific sound feature model based on the deep neural network includes:
taking the feature parameters of the specific sound sample signal as input, performing model training based on a deep belief network algorithm, and obtaining the initial parameters of the specific sound feature model based on the deep neural network;
fine-tuning the initial parameters based on the gradient descent and back propagation algorithms of the deep neural network, to obtain the parameters of the specific sound feature model based on the deep neural network.
Optionally, the inputting of the feature parameters into the pre-acquired specific sound feature model based on the deep neural network for recognition, to determine whether the sound signal is a specific sound, includes:
inputting the set of sub-feature vectors contained in the feature parameter into the pre-acquired specific sound feature model based on the deep neural network, to obtain the prediction results corresponding to the set of sub-feature vectors;
if, among the prediction results, the positive prediction results outnumber the negative prediction results, confirming that the sound signal is a specific sound; otherwise, confirming that the sound signal is not a specific sound.
Optionally, the specific sound includes any one of a coughing sound, a snoring sound, and a sneezing sound.
In a second aspect, an embodiment of the present application further provides a specific sound recognition apparatus, the apparatus including:
a sampling and feature parameter acquisition module, configured to sample a sound signal and acquire a Mel frequency cepstral coefficient characteristic parameter matrix of the sound signal;
a feature parameter extraction module, configured to extract feature parameters from the Mel frequency cepstral coefficient characteristic parameter matrix of the sound signal;
a feature matching module, configured to confirm whether the feature parameters match a pre-acquired specific sound feature model based on a deep neural network;
a confirmation module, configured to confirm that the sound signal is a specific sound if the feature parameters match the pre-acquired specific sound feature model based on the deep neural network.
Optionally, the apparatus further includes:
a feature model preset module, configured to acquire the specific sound feature model based on the deep neural network in advance;
the feature model preset module being specifically configured to:
collect a preset number of specific sound sample signals and acquire the Mel frequency cepstral coefficient characteristic parameter matrix of the specific sound sample signals;
extract the feature parameters from the Mel frequency cepstral coefficient characteristic parameter matrix of the specific sound sample signals;
take the feature parameters of the specific sound sample signals as input and train a deep neural network model to obtain the specific sound feature model based on the deep neural network.
第三方面,本申请实施例还提供了一种特定声音识别设备,所述特定声音识别设备包括:In a third aspect, the embodiment of the present application further provides a specific voice recognition device, where the specific voice recognition device includes:
声音输入单元,用于接收声音信号; a sound input unit for receiving a sound signal;
信号处理单元,用于对所述声音信号进行信号处理;a signal processing unit, configured to perform signal processing on the sound signal;
所述信号处理单元与内置或者外置于特定声音识别设备的运算处理单元相连,所述运算处理单元包括:The signal processing unit is connected to an operation processing unit built in or externally to a specific sound recognition device, and the operation processing unit includes:
至少一个处理器;以及,At least one processor; and,
与所述至少一个处理器通信连接的存储器;其中,a memory communicatively coupled to the at least one processor; wherein
所述存储器存储有可被所述至少一个处理器执行的指令,所述指令被所述至少一个处理器执行,以使所述至少一个处理器能够执行上述的方法。The memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method described above.
第四方面，本申请实施例还提供了一种存储介质，所述存储介质存储有可执行指令，所述可执行指令被特定声音识别设备执行时，使所述特定声音识别设备执行上述的方法。In a fourth aspect, an embodiment of the present application further provides a storage medium storing executable instructions which, when executed by a specific sound recognition device, cause the specific sound recognition device to perform the foregoing method.
第五方面，本申请实施例还提供了一种程序产品，所述程序产品包括存储在存储介质上的程序，所述程序包括程序指令，当所述程序指令被特定声音识别设备执行时，使所述特定声音识别设备执行上述的方法。In a fifth aspect, an embodiment of the present application further provides a program product, the program product comprising a program stored on a storage medium, the program comprising program instructions which, when executed by a specific sound recognition device, cause the specific sound recognition device to perform the foregoing method.
本申请实施例提供的特定声音识别方法、设备和存储介质，采用基于梅尔频率倒谱系数特征参数和深度神经网络模型的识别算法，算法复杂度低、计算量少，从而对硬件要求低，降低了产品制造成本。The specific sound recognition method, device and storage medium provided by the embodiments of the present application adopt a recognition algorithm based on Mel frequency cepstral coefficient feature parameters and a deep neural network model; the algorithm has low complexity and a small amount of computation, which lowers the hardware requirements and reduces the product manufacturing cost.
附图说明DRAWINGS
一个或多个实施例通过与之对应的附图中的图片进行示例性说明，这些示例性说明并不构成对实施例的限定，附图中具有相同参考数字标号的元件表示为类似的元件，除非有特别申明，附图中的图不构成比例限制。One or more embodiments are exemplarily illustrated by the figures in the corresponding drawings; these exemplary illustrations do not constitute a limitation on the embodiments. Elements with the same reference numerals in the drawings denote similar elements, and unless otherwise stated, the figures in the drawings are not drawn to scale.
图1是本申请各实施例的应用环境的结构示意图;1 is a schematic structural diagram of an application environment of each embodiment of the present application;
图2是本申请实施例提供的特定声音识别方法中预先获取基于深度神经网络的特定声音特征模型的流程示意图;2 is a schematic flowchart of pre-acquiring a specific sound feature model based on a deep neural network in a specific voice recognition method provided by an embodiment of the present application;
图3是MFCC系数计算过程中梅尔频率滤波处理示意图；FIG. 3 is a schematic diagram of the Mel frequency filtering process in the MFCC coefficient calculation;
图4是咳嗽声音信号的时间-幅度图;Figure 4 is a time-amplitude diagram of a coughing sound signal;
图5是提取特征参数步骤将特征向量分割成各个子特征向量的示意图;5 is a schematic diagram of the step of extracting a feature parameter to divide a feature vector into individual sub-feature vectors;
图6是一般深度神经网络结构的示意图;Figure 6 is a schematic diagram of a general deep neural network structure;
图7是一般深度置信网络结构的示意图;7 is a schematic diagram of a general deep confidence network structure;
图8是本申请实施例提供的特定声音识别方法中提取特征参数步骤的流程示意图；FIG. 8 is a schematic flowchart of the step of extracting feature parameters in the specific sound recognition method provided by an embodiment of the present application;
图9是本申请实施例提供的特定声音识别方法中训练基于深度神经网络的特定声音特征模型步骤的流程示意图;9 is a schematic flowchart of steps of training a specific sound feature model based on a deep neural network in a specific voice recognition method provided by an embodiment of the present application;
图10是本申请实施例提供的特定声音识别方法的流程示意图;10 is a schematic flowchart of a specific voice recognition method provided by an embodiment of the present application;
图11是本申请实施例提供的特定声音识别装置的结构示意图;11 is a schematic structural diagram of a specific voice recognition apparatus according to an embodiment of the present application;
图12是本申请实施例提供的特定声音识别装置的结构示意图;FIG. 12 is a schematic structural diagram of a specific voice recognition apparatus according to an embodiment of the present application; FIG.
图13是本申请实施例提供的特定声音识别设备的结构示意图。FIG. 13 is a schematic structural diagram of a specific voice recognition device according to an embodiment of the present application.
具体实施方式DETAILED DESCRIPTION
为使本申请实施例的目的、技术方案和优点更加清楚，下面将结合本申请实施例中的附图，对本申请实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例是本申请一部分实施例，而不是全部的实施例。基于本申请中的实施例，本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例，都属于本申请保护的范围。To make the objectives, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, rather than all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
本申请实施例提出一种基于梅尔频率倒谱系数(Mel Frequency Cepstral Coefficients,MFCC)特征参数和深度神经网络(Deep Neural Network,DNN)算法的特定声音识别方案，适用于图1所示的应用环境。所述应用环境包括用户10和特定声音识别设备20，特定声音识别设备20用于接收用户10发出的声音，并对该声音进行识别，以确定该声音是否为特定声音。The embodiments of the present application propose a specific sound recognition scheme based on Mel Frequency Cepstral Coefficients (MFCC) feature parameters and a Deep Neural Network (DNN) algorithm, applicable to the application environment shown in FIG. 1. The application environment includes a user 10 and a specific sound recognition device 20; the specific sound recognition device 20 is configured to receive a sound made by the user 10 and recognize the sound to determine whether it is a specific sound.
进一步的，在识别出该声音为特定声音之后，所述特定声音识别设备20还可以对特定声音进行记录和处理，以输出用户10发出特定声音的情况信息。该特定声音的情况信息可以包括特定声音的次数、特定声音的时长以及特定声音的分贝。例如，可以在特定声音识别设备中包括计数器，用于在检测到特定声音时对特定声音进行计数统计；可以在特定声音识别设备中包括计时器，用于在检测到特定声音时对特定声音的持续时长进行统计；可以在特定声音识别设备中包括分贝检测装置，用于在检测到特定声音时检测该特定声音的分贝。Further, after recognizing that the sound is a specific sound, the specific sound recognition device 20 may also record and process the specific sound so as to output information about the specific sounds made by the user 10. This information may include the number of occurrences of the specific sound, the duration of the specific sound, and the decibel level of the specific sound. For example, a counter may be included in the specific sound recognition device for counting the specific sound each time it is detected; a timer may be included for accumulating the duration of the specific sound when it is detected; and a decibel detection means may be included for measuring the decibel level of the specific sound when it is detected.
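As a purely illustrative sketch (not part of the claimed embodiments) of how such a counter, timer and decibel detector might be combined in code: the class name and the dBFS convention below are assumptions, not anything specified by this application.

```python
# Hypothetical sketch only: accumulate count, duration and decibel level of
# detected specific-sound events, mirroring the counter/timer/decibel units.
import math

class SpecificSoundMonitor:
    def __init__(self):
        self.count = 0                 # counter: number of detected events
        self.total_duration_s = 0.0    # timer: accumulated event duration
        self.last_db = None            # decibel detector: level of last event

    def on_event(self, samples, sample_rate):
        """Call once each time the recognizer confirms a specific sound."""
        self.count += 1
        self.total_duration_s += len(samples) / sample_rate
        rms = math.sqrt(sum(x * x for x in samples) / len(samples))
        self.last_db = 20 * math.log10(max(rms, 1e-12))  # dB re full scale
```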
本申请实施例对特定声音的识别原理与语音识别的原理相似,都是将输入的声音经过处理后将其输入声音模型进行识别,从而得到识别结果。其可分为两个阶段,分别为特定声音模型训练阶段和特定声音识别阶段。特定声音模型 训练阶段主要是采集一定数量的特定声音样本信号,计算特定声音样本信号的MFCC特征参数矩阵,从MFCC特征参数矩阵中提取特征参数,将所述特征参数基于DNN算法进行模型训练,得到特定声音特征模型。在特定声音识别阶段,对需要判断的声音信号,计算其MFCC特征参数矩阵,并从声音信号的MFCC特征参数矩阵中提取对应的特征参数,然后将该特征参数输入特定声音特征模型进行识别,以确定该声音信号是否为特定声音。其识别过程主要包括预处理、特征提取、模型训练、模式匹配及判决等步骤。The recognition principle of the specific sound in the embodiment of the present application is similar to the principle of the speech recognition. The input sound is processed and then input into the sound model to be recognized, thereby obtaining the recognition result. It can be divided into two phases, namely a specific sound model training phase and a specific sound recognition phase. Specific sound model The training phase mainly collects a certain number of specific sound sample signals, calculates the MFCC feature parameter matrix of the specific sound sample signal, extracts the feature parameters from the MFCC feature parameter matrix, and trains the feature parameters based on the DNN algorithm to obtain specific sound features. model. In the specific voice recognition stage, the MFCC feature parameter matrix is calculated for the sound signal that needs to be judged, and the corresponding feature parameter is extracted from the MFCC feature parameter matrix of the sound signal, and then the feature parameter is input into the specific sound feature model for identification, Determine if the sound signal is a specific sound. The identification process mainly includes steps of preprocessing, feature extraction, model training, pattern matching and decision.
其中,在预处理步骤,包括采样特定声音样本信号以及计算所述特定声音样本信号的MFCC特征参数矩阵。在特征提取步骤,从MFCC特征参数矩阵中提取特征参数。在模型训练步骤,将从特定声音样本信号的MFCC特征参数矩阵中提取的特征参数作为输入,训练出基于深度神经网络的特定声音特征模型。在模式匹配及判决步骤,利用特定声音特征模型来识别新的声音信号是否为特定声音。其中,识别新的声音信号是否为特定声音,包括:首先计算声音信号的MFCC特征参数矩阵,然后从MFCC特征参数矩阵中提取声音信号的特征参数,再将该声音信号的特征参数输入特定声音特征模型进行识别,以确定该声音信号是否为特定声音。Wherein, in the pre-processing step, sampling a specific sound sample signal and calculating a MFCC feature parameter matrix of the specific sound sample signal. In the feature extraction step, feature parameters are extracted from the MFCC feature parameter matrix. In the model training step, the feature parameters extracted from the MFCC feature parameter matrix of the specific sound sample signal are taken as inputs, and a specific sound feature model based on the deep neural network is trained. In the pattern matching and decision step, a specific sound feature model is utilized to identify whether the new sound signal is a particular sound. Wherein, identifying whether the new sound signal is a specific sound comprises: first calculating a MFCC feature parameter matrix of the sound signal, and then extracting a characteristic parameter of the sound signal from the MFCC feature parameter matrix, and then inputting the characteristic parameter of the sound signal into the specific sound feature The model is identified to determine if the sound signal is a particular sound.
MFCC结合DNN识别特定声音的方案可以简化算法的复杂度,减少计算量,并能够显著提高特定声音识别的准确性。The combination of MFCC and DNN to identify specific sounds can simplify the complexity of the algorithm, reduce the amount of computation, and significantly improve the accuracy of specific voice recognition.
本申请实施例提供了一种特定声音识别方法,可以用于上述的特定声音识别设备20,所述特定声音识别方法需要预先获得基于DNN的特定声音特征模型,该基于DNN的特定声音特征模型可以是预先配置的,也可以通过下述步骤101至步骤103中的方法训练得到,在训练得到基于DNN的特定声音特征模型后,后续可基于该基于DNN的特定声音特征模型识别特定声音,更进一步地,若由于场景变换或其它原因导致该基于DNN的特定声音特征模型用于识别特定声音时准确率不合格,可重新配置或训练基于DNN的特定声音特征模型。The embodiment of the present application provides a specific voice recognition method, which can be used in the specific voice recognition device 20, where the specific voice recognition method needs to obtain a specific sound feature model based on the DNN in advance, and the specific sound feature model based on the DNN can be It is pre-configured, and can also be trained by the method in the following steps 101 to 103. After the DNN-based specific sound feature model is trained, the specific sound can be identified based on the specific sound feature model based on the DNN, and further Alternatively, if the accuracy of the DNN-based specific sound feature model for identifying a particular sound is unacceptable due to scene change or other reasons, the DNN-based specific sound feature model may be reconfigured or trained.
其中,如图2所示,所述预先获得基于DNN的特定声音特征模型包括:Wherein, as shown in FIG. 2, the pre-obtaining DNN-based specific sound feature model includes:
步骤101:采集预设数量的特定声音样本信号并获取所述特定声音样本信号的梅尔频率倒谱系数特征参数矩阵;Step 101: Acquire a preset number of specific sound sample signals and acquire a Mel frequency cepstral coefficient characteristic parameter matrix of the specific sound sample signal;
采样得到特定声音样本信号s(n)，并根据所述特定声音样本信号获取所述特定声音样本信号的MFCC特征参数矩阵。梅尔频率倒谱系数主要用于声音数据特征提取和降低运算维度。例如：对于一帧有512维(采样点)的数据，经过MFCC处理后可以提取出最重要的40维数据，同时也达到了降维的目的。梅尔频率倒谱系数计算一般包括：预加重、分帧、加窗、快速傅里叶变换、梅尔滤波器组和离散余弦变换。A specific sound sample signal s(n) is obtained by sampling, and the MFCC feature parameter matrix of the specific sound sample signal is obtained from it. Mel frequency cepstral coefficients are mainly used for sound data feature extraction and for reducing the computational dimensionality. For example, for a frame of 512-dimensional data (sampling points), the most important 40 dimensions can be extracted after MFCC processing, which also achieves dimensionality reduction. The calculation of Mel frequency cepstral coefficients generally includes: pre-emphasis, framing, windowing, fast Fourier transform, Mel filter bank and discrete cosine transform.
获取所述特定声音样本信号的MFCC特征参数矩阵,具体包括以下步骤:Obtaining the MFCC feature parameter matrix of the specific sound sample signal includes the following steps:
①预加重① Pre-emphasis
预加重的目的是提升高频部分，使信号的频谱变得平坦，保持在低频到高频的整个频带中，能用同样的信噪比求频谱。同时，也是为了消除发生过程中声带和嘴唇的效应，来补偿特定声音样本信号受到发音系统所抑制的高频部分，也为了突出高频的共振峰。其实现方法是将经采样后的特定声音样本信号s(n)通过一个一阶有限长单位冲激响应(Finite Impulse Response,FIR)高通数字滤波器来进行预加重，其传递函数为：The purpose of pre-emphasis is to boost the high-frequency part so that the spectrum of the signal becomes flatter and the spectrum can be obtained with the same signal-to-noise ratio over the whole band from low to high frequencies. It also cancels the effect of the vocal cords and lips during sound production, compensates for the high-frequency part of the specific sound sample signal suppressed by the articulation system, and emphasizes the high-frequency formants. This is implemented by passing the sampled specific sound sample signal s(n) through a first-order finite impulse response (FIR) high-pass digital filter, whose transfer function is:
$$H(z)=1-a\cdot z^{-1} \qquad (1)$$
其中,z表示输入信号,时域表示即为特定声音样本信号s(n),a表示预加重系数,一般取0.9~1.0中的常数。Where z is the input signal, the time domain representation is the specific sound sample signal s(n), and a is the pre-emphasis coefficient, which is generally a constant from 0.9 to 1.0.
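In the time domain, the filter of equation (1) amounts to y(n) = s(n) − a·s(n−1). A minimal NumPy sketch; the default a = 0.97 is a common choice within the stated 0.9~1.0 range, not a value fixed by this application:

```python
import numpy as np

def pre_emphasis(s, a=0.97):
    """First-order FIR high-pass pre-emphasis, H(z) = 1 - a*z^-1 (equation (1))."""
    return np.append(s[0], s[1:] - a * s[:-1])
```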
②分帧② Framing
将特定声音样本信号s(n)中每P个采样点集合成一个观测单位，称为帧。P的值可以取256或512，涵盖的时间约为20~30ms左右。为了避免相邻两帧的变化过大，可以让两相邻帧之间有一段重叠区域，此重叠区域包含了G个取样点，G的值可以约为P的1/2或1/3。特定声音样本信号的采样频率可以为8KHz或16KHz，以8KHz来说，若帧长度为256个采样点，则对应的时间长度是256/8000×1000=32ms。Every P sampling points of the specific sound sample signal s(n) are grouped into one observation unit, called a frame. P may be 256 or 512, covering roughly 20~30 ms. To avoid excessive variation between two adjacent frames, an overlapping region of G sampling points may be kept between them, where G may be about 1/2 or 1/3 of P. The sampling frequency of the specific sound sample signal may be 8 kHz or 16 kHz; at 8 kHz, a frame of 256 sampling points corresponds to a duration of 256/8000×1000 = 32 ms.
③加窗③ Windowing
将每一帧乘以汉明窗，以增加帧左端和右端的连续性。假设分帧后的信号为S(n)，n=0,1…,P-1，P为帧的大小，那么乘上汉明窗后：S′(n)=S(n)×W(n)，其中，Each frame is multiplied by a Hamming window to increase the continuity between the left and right ends of the frame. Suppose the framed signal is S(n), n = 0,1,…,P−1, where P is the frame size; after multiplying by the Hamming window, S′(n) = S(n)×W(n), where
$$W(n)=0.54-0.46\cos\!\left(\frac{2\pi n}{l-1}\right),\quad 0\le n\le l-1 \qquad (2)$$
其中,l表示窗长。Where l is the window length.
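A sketch of framing and windowing with NumPy, using the example values from the text (P = 256 samples per frame with half-frame overlap, i.e. a hop of 128); these defaults are illustrative assumptions:

```python
import numpy as np

def frame_and_window(s, frame_len=256, hop=128):
    """Split s into P-sample frames (overlap G = frame_len - hop), then apply
    the Hamming window W(n) of equation (2) to each frame."""
    assert len(s) >= frame_len, "signal shorter than one frame"
    n_frames = 1 + (len(s) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
    return s[idx] * w
```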
④快速傅里叶变换(Fast Fourier Transform,FFT)④ Fast Fourier Transform (FFT)
由于信号在时域上的变换通常很难看出信号的特性，所以通常将它转换为频域上的能量分布来观察，不同的能量分布，就能代表不同声音的特性。所以在乘上汉明窗后，每帧还必须再经过快速傅里叶变换以得到在频谱上的能量分布。对分帧加窗后的各帧信号进行快速傅里叶变换得到各帧的频谱。并对特定声音样本信号的频谱取模平方得到特定声音样本信号的功率谱。Since it is usually hard to see a signal's characteristics from its time-domain form, the signal is usually converted into an energy distribution in the frequency domain for observation; different energy distributions represent the characteristics of different sounds. Therefore, after multiplication by the Hamming window, each frame must also undergo a fast Fourier transform to obtain its energy distribution over the spectrum. A fast Fourier transform is performed on each framed and windowed signal to obtain the spectrum of each frame, and the power spectrum of the specific sound sample signal is obtained by taking the squared modulus of its spectrum.
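A corresponding sketch: FFT each windowed frame and take the squared modulus to obtain the per-frame power spectrum (n_fft = 512 is an assumed FFT size matching the 512-point frame example):

```python
import numpy as np

def power_spectrum(frames, n_fft=512):
    """Squared modulus of the FFT of each windowed frame, i.e. the energy
    distribution over the spectrum described above."""
    return np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
```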
⑤三角带通滤波器滤波⑤ Triangular band-pass filtering
将能量谱通过一组梅尔尺度的三角形滤波器组进行滤波。定义一个有M个滤波器的滤波器组(滤波器的个数和临界带的个数相近),采用的滤波器为三角滤波器,中心频率为f(m),m=1,2,...,M。M可以取22-26。各f(m)之间的间隔随着m值的减小而缩小,随着m值的增大而增宽,请参照图3。The energy spectrum is filtered through a set of Mel scale triangular filter banks. Define a filter bank with M filters (the number of filters is close to the number of critical bands). The filter used is a triangular filter with a center frequency of f(m), m=1, 2,. .., M. M can take 22-26. The interval between each f(m) decreases as the value of m decreases, and widens as the value of m increases. Please refer to FIG. 3.
三角滤波器的频率响应定义为:The frequency response of the triangular filter is defined as:
$$H_m(k)=\begin{cases}0, & k<f(m-1)\\[2pt] \dfrac{k-f(m-1)}{f(m)-f(m-1)}, & f(m-1)\le k\le f(m)\\[2pt] \dfrac{f(m+1)-k}{f(m+1)-f(m)}, & f(m)\le k\le f(m+1)\\[2pt] 0, & k>f(m+1)\end{cases} \qquad (3)$$
其中，各三角滤波器满足 where the triangular filters satisfy
$$\sum_{m=0}^{M-1}H_m(k)=1$$
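A sketch of building the M triangular Mel-scale filters of equation (3); M = 26 (from the stated 22-26 range), 8 kHz sampling and a 512-point FFT are assumed example values. The center frequencies f(m) are equally spaced on the Mel scale, so their spacing narrows at low frequencies and widens at high frequencies, as in FIG. 3:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, sample_rate=8000):
    """M triangular filters H_m(k) with Mel-spaced center frequencies f(m)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                          n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):                # rising edge of triangle
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):               # falling edge of triangle
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb
```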
⑥离散余弦变换⑥ Discrete Cosine Transform (DCT)
计算每个滤波器组输出的对数能量为:Calculate the logarithmic energy of each filter bank output as:
$$s(m)=\ln\!\left(\sum_{k=0}^{N-1}\left|X_a(k)\right|^{2}H_m(k)\right),\quad 0\le m\le M \qquad (4)$$
其中，X_a(k)为分帧加窗后第a帧信号经快速傅里叶变换得到的频谱。where X_a(k) is the spectrum of the a-th framed and windowed signal obtained by the fast Fourier transform.
对对数能量s(m)经离散余弦变换(Discrete Cosine Transform,DCT)得到MFCC：The MFCC is obtained by applying a discrete cosine transform (DCT) to the logarithmic energy s(m):
$$C(n)=\sum_{m=0}^{M-1}s(m)\cos\!\left(\frac{\pi n(m+0.5)}{M}\right),\quad n=1,2,\dots,L \qquad (5)$$
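A sketch of equations (4) and (5): filter-bank log-energies followed by the DCT, producing the N×L MFCC matrix referred to below (L = 13 is an assumed typical choice, not a value fixed by the text):

```python
import numpy as np

def mfcc_from_power(pow_frames, fb, n_ceps=13):
    """pow_frames: per-frame power spectra; fb: the M triangular filters H_m(k)."""
    s = np.log(pow_frames @ fb.T + 1e-10)        # eq. (4); eps guards log(0)
    M = s.shape[1]
    n = np.arange(1, n_ceps + 1)[:, None]        # n = 1..L
    m = np.arange(M)[None, :]
    basis = np.cos(np.pi * n * (m + 0.5) / M)    # eq. (5) cosine basis
    return s @ basis.T                           # N x L MFCC feature matrix
```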
步骤102:从所述特定声音样本信号的梅尔频率倒谱系数特征参数矩阵中提取所述特征参数;Step 102: Extract the feature parameter from a Mel frequency cepstral coefficient characteristic parameter matrix of the specific sound sample signal;
由式(5)可知，MFCC为一个N*L的系数矩阵，其中，N为声音信号帧数，L为MFCC长度。由于MFCC特征参数矩阵维度较高，且声音信号长度不一致导致矩阵行数N不同，MFCC特征参数矩阵无法作为直接输入获得基于DNN的特定声音特征模型，因此，需要进一步的从MFCC特征参数矩阵中提取特征参数。提取特征参数的目的是提取出特定声音样本信号的特性来标示该段特定声音样本信号，并以该特征参数作为输入，训练基于DNN的特定声音特征模型。可以结合特定声音信号的时域或频域特性，从MFCC特征参数矩阵中提取特征参数。From equation (5), the MFCC is an N*L coefficient matrix, where N is the number of signal frames and L is the MFCC length. Because the MFCC feature parameter matrix has a high dimensionality, and inconsistent signal lengths lead to different numbers of rows N, it cannot be used directly as input for obtaining the DNN-based specific sound feature model; feature parameters therefore need to be further extracted from it. The purpose of extracting the feature parameters is to capture the characteristics of the specific sound sample signal so as to label that signal, and to use the feature parameters as input to train the DNN-based specific sound feature model. The feature parameters can be extracted from the MFCC feature parameter matrix in combination with the time-domain or frequency-domain characteristics of the specific sound signal.
以特定声音信号为咳嗽声音信号为例，请参考图4，图4为咳嗽声音信号的时间-幅度图(时域图)，从图4可以看出，咳嗽声音信号的发生过程很短，具有明显的突发性，单声咳嗽声音所持续的时长通常小于550ms，甚至患上严重的咽喉和支气管疾病的病人，他们的单声咳嗽声音的时长也一般维持在1000ms左右。从能量上看，咳嗽声音信号的能量主要集中在信号的前半部分。因此，MFCC计算处理后，咳嗽声音样本信号的主要特性信息基本集中在咳嗽声音样本信号的前半部分。输入深度神经网络的特征参数，应该尽可能多的涵盖咳嗽声音样本信号的主要信息，保证从MFCC特征参数矩阵中提取的特征参数是有用信息，而不是冗余信息。Taking a coughing sound signal as an example of the specific sound signal, FIG. 4 is a time-amplitude diagram (time-domain diagram) of a coughing sound signal. As can be seen from FIG. 4, a coughing sound occurs over a very short, markedly sudden process: a single cough usually lasts less than 550 ms, and even for patients with severe throat and bronchial diseases the duration of a single cough generally stays around 1000 ms. In terms of energy, the energy of a coughing sound signal is mainly concentrated in the first half of the signal. Therefore, after the MFCC computation, the main characteristic information of a cough sound sample signal is essentially concentrated in its first half. The feature parameters input into the deep neural network should cover as much of the main information of the cough sound sample signal as possible, ensuring that the feature parameters extracted from the MFCC feature parameter matrix carry useful rather than redundant information.
可以在咳嗽声音样本信号的MFCC特征参数矩阵中，选择前面固定帧数的咳嗽声音样本信号的特征参数，作为深度神经网络的输入，鉴于咳嗽声音样本信号的主要特性信息基本集中在咳嗽声音样本信号的前半部分，该固定帧数的咳嗽声音样本信号应尽量包含各个咳嗽声音样本信号的前半部分。为了充分利用数据，MFCC特征参数矩阵中剩余的特征数据也可以作为深度神经网络的输入，可以根据该固定帧数对MFCC特征参数矩阵进行分割，然后将分割后的数据一起作为深度神经网络的输入。In the MFCC feature parameter matrix of the cough sound sample signal, the feature parameters of a fixed number of leading frames may be selected as the input of the deep neural network; since the main characteristic information is essentially concentrated in the first half of the cough sound sample signal, this fixed number of frames should cover the first half of each cough sound sample signal as far as possible. To make full use of the data, the remaining feature data in the MFCC feature parameter matrix may also serve as input to the deep neural network: the matrix may be segmented according to this fixed number of frames, and the segmented data used together as the input of the deep neural network.
具体的,如图8所示,从所述特定声音样本信号的梅尔频率倒谱系数特征参数矩阵中提取特征参数,包括:Specifically, as shown in FIG. 8, the feature parameters are extracted from the characteristic parameter matrix of the Mel frequency cepstral coefficient of the specific sound sample signal, including:
步骤1021:将特定声音样本信号的梅尔频率倒谱系数特征参数矩阵中各信号帧的梅尔频率倒谱系数依次首尾相连组成一向量;Step 1021: The Mel frequency cepstral coefficients of each signal frame in the characteristic parameter matrix of the Mel frequency cepstral coefficient of the specific sound sample signal are sequentially connected end to end to form a vector;
步骤1022：将所述向量按预设步长(单位为帧)从所述向量头部到所述向量尾部对所述向量进行分割，获得包括一组长度均为预设长度(即固定帧数)的子向量的特征参数，每个子向量具有相同的标签。Step 1022: Segment the vector from its head to its tail with a preset step size (in frames) to obtain feature parameters comprising a set of sub-vectors, each of a preset length (i.e., a fixed number of frames), with every sub-vector carrying the same label.
即将MFCC特征参数矩阵帧与帧之间串联起来形成一个向量X，以预设长度e为基本单位，以预设步长d从向量X首部移动到尾部，形成一组标签相同的数据X_i，其中，i=1,2,...,m，m表示经过分割处理后每个特定声音样本信号所包含的子特征向量的数量。其具体处理过程请参见图5。That is, the rows of the MFCC feature parameter matrix are concatenated frame by frame into one vector X; taking a preset length e as the basic unit, a window is slid from the head of X to its tail with a preset step size d, forming a set of identically labelled data X_i, where i = 1,2,...,m and m is the number of sub-feature vectors contained in each specific sound sample signal after segmentation. The specific process is shown in FIG. 5.
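A minimal sketch of this segmentation; here e (sub-vector length) and d (step) are counted in coefficients, i.e. integer multiples of the per-frame MFCC length L:

```python
import numpy as np

def split_into_subvectors(mfcc, e, d):
    """Concatenate the N x L MFCC matrix row by row into one vector X, then
    slide from head to tail with step d to get fixed-length sub-vectors X_i."""
    x = mfcc.reshape(-1)                                    # frames joined end to end
    subs = [x[i:i + e] for i in range(0, len(x) - e + 1, d)]
    return np.stack(subs)        # m sub-vectors; all share the signal's label
```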
在实际应用中，如果特定声音为咳嗽声音，可以统计计算出一般咳嗽声音信号的前半段的帧数，然后根据该帧数为所述预设长度取值，预设步长可以结合实际应用进行取值。如果特定声音为其他声音，例如鼾声或者喷嚏声等，也可以根据其时域与频域特性为预设长度和预设步长取值。In practical applications, if the specific sound is a coughing sound, the number of frames in the first half of a typical coughing sound signal can be computed statistically, and the preset length can be set according to that number of frames; the preset step size can be chosen according to the actual application. If the specific sound is another sound, such as snoring or sneezing, the preset length and preset step size can likewise be chosen according to its time-domain and frequency-domain characteristics.
通过将特定声音样本信号的MFCC特征参数矩阵分割成多个固定长度的子特征向量，使该子特征向量适应了深度神经网络输入数据一致的要求，可以直接作为深度神经网络的输入。而且，将多个子特征向量中的各个子特征向量设置成相同的标签，即用一组子特征向量来表达同一特定声音样本信号，增加了数据样本的数量，避免了特征参数提取时信息的损失。利用上述子特征向量及其对应的标签，建立基于深度神经网络的特定声音特征模型，并利用该特定声音特征模型识别特定声音，降低了误识别率，提高了特定声音识别的准确率。本申请实施例提供的特定声音识别方法在用于识别咳嗽声音时，在不增加计算量的基础上，咳嗽声音的识别率可以达到95%以上。By dividing the MFCC feature parameter matrix of a specific sound sample signal into multiple fixed-length sub-feature vectors, the sub-feature vectors satisfy the deep neural network's requirement for inputs of consistent size and can be used directly as its input. Moreover, giving all sub-feature vectors the same label, i.e., expressing the same specific sound sample signal with a set of sub-feature vectors, increases the number of data samples and avoids the loss of information during feature parameter extraction. Using these sub-feature vectors and their labels, a deep-neural-network-based specific sound feature model is built and used to recognize specific sounds, which lowers the false recognition rate and improves the accuracy of specific sound recognition. When the specific sound recognition method provided by the embodiments of the present application is used to recognize coughing sounds, the recognition rate can reach 95% or more without increasing the amount of computation.
步骤103:将所述特定声音样本信号的特征参数作为输入,训练基于深度神经网络模型,以获取所述基于深度神经网络的特定声音特征模型。Step 103: Taking a feature parameter of the specific sound sample signal as an input, training based on a depth neural network model to obtain the specific sound feature model based on the depth neural network.
DNN是对浅层神经网络的拓展，在功能上利用了多层神经网络的表达，对非线性、高维数据的处理有非常好的特征提取、学习以及泛化能力。DNN模型一般包括输入层、隐藏层和输出层，请参照图6，其中，第一层是输入层，中间的是隐藏层，最后一层是输出层(图6只示出了三层隐藏层，实际上会包括更多的隐藏层)，其层与层之间是全连接的，即第Q层的任意一个神经元一定与第Q+1层的任意一个神经元相连。The DNN is an extension of the shallow neural network; it functionally exploits the expressive power of a multi-layer neural network and has very good feature extraction, learning and generalization abilities for nonlinear, high-dimensional data. A DNN model generally includes an input layer, hidden layers and an output layer. Referring to FIG. 6, the first layer is the input layer, the middle layers are hidden layers, and the last layer is the output layer (FIG. 6 shows only three hidden layers; in practice more may be included). The layers are fully connected, i.e., any neuron of layer Q is connected to every neuron of layer Q+1.
每条建立在神经元之间的连接都有一个线性权重，每层的每个神经元都有一个偏置(输入层除外)。第l-1层的第k个神经元到第l层的第j个神经元的线性权重定义为$w^{l}_{jk}$，其中，上标l代表线性权重所在的层数，而下标对应的是输出的第l层索引j和输入的第l-1层索引k，例如，第二层的第4个神经元到第三层的第2个神经元的线性权重定义为$w^{3}_{24}$。第l层的第i个神经元对应的偏置为$b^{l}_{i}$，其中，上标l代表所在的层数，下标i代表偏置所在的神经元的索引，例如，第二层的第三个神经元对应的偏置定义为$b^{2}_{3}$。Each connection between neurons has a linear weight, and every neuron in each layer (except the input layer) has a bias. The linear weight from the k-th neuron of layer l−1 to the j-th neuron of layer l is defined as $w^{l}_{jk}$, where the superscript l denotes the layer of the weight and the subscripts denote the output index j in layer l and the input index k in layer l−1; for example, the weight from the 4th neuron of the second layer to the 2nd neuron of the third layer is $w^{3}_{24}$. The bias of the i-th neuron of layer l is $b^{l}_{i}$, where the superscript l denotes the layer and the subscript i the index of the neuron; for example, the bias of the third neuron of the second layer is $b^{2}_{3}$.
可以随机初始化选择一系列$w^{l}_{jk}$和$b^{l}_{i}$，利用前向传播算法，将特定声音样本信号的特征参数作为输入层的数据，然后用输入层计算出第一个隐藏层，再用第一个隐藏层计算出第二个隐藏层，依次类推，直到输出层。然后再利用反向传播算法，对$w^{l}_{jk}$和$b^{l}_{i}$进行微调，获得最终基于深度神经网络的特定声音特征模型。A series of $w^{l}_{jk}$ and $b^{l}_{i}$ can be randomly initialized; using the forward propagation algorithm, the feature parameters of the specific sound sample signal are taken as the input-layer data, the first hidden layer is computed from the input layer, the second hidden layer from the first, and so on up to the output layer. The back-propagation algorithm is then used to fine-tune $w^{l}_{jk}$ and $b^{l}_{i}$, yielding the final deep-neural-network-based specific sound feature model.
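A sketch of the forward pass just described; the sigmoid activation matches the σ used in the equations below, and the shapes of `weights` and `biases` are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Layer-by-layer forward propagation: a^(l+1) = sigma(W^(l) a^(l) + b^(l)).
    weights[l][j, k] plays the role of w^l_jk in the text."""
    activations = [x]
    for W, b in zip(weights, biases):
        activations.append(sigmoid(W @ activations[-1] + b))
    return activations            # the last entry is the output layer
```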
也可以先通过基于深度置信网络(Deep Belief Network,DBN)算法获得各个初始参数$w^{l}_{jk}$和$b^{l}_{i}$，然后再利用梯度下降和反向传播算法，对$w^{l}_{jk}$和$b^{l}_{i}$进行微调，获得最终$w^{l}_{jk}$和$b^{l}_{i}$的取值。即请参照图9，所述将所述特定声音样本信号的特征参数作为输入，训练基于深度神经网络模型，以获取所述基于深度神经网络的特定声音特征模型包括：Alternatively, the initial parameters $w^{l}_{jk}$ and $b^{l}_{i}$ may first be obtained by a Deep Belief Network (DBN) algorithm and then fine-tuned with the gradient descent and back-propagation algorithms to obtain their final values. That is, referring to FIG. 9, taking the feature parameters of the specific sound sample signal as input and training the deep neural network model to obtain the deep-neural-network-based specific sound feature model includes:
步骤1031：将所述特定声音样本信号的特征参数作为输入，基于深度置信网络算法进行模型训练，获得所述基于深度神经网络的特定声音特征模型的各个初始参数；Step 1031: Taking the feature parameters of the specific sound sample signal as input, performing model training based on a deep belief network algorithm to obtain the initial parameters of the deep-neural-network-based specific sound feature model;
DBN是一种深度学习模型，用非监督的方式对模型逐层做预处理，这种非监督的预处理方式就是受限玻尔兹曼机(Restricted Boltzmann machine,RBM)。如图7(b)所示，DBN是由一系列RBM堆叠而成的。如图7(a)所示，RBM是双层结构，v表示可见层，h表示隐藏层，可见层和隐藏层之间的连接是无方向性(值可以从可见层->隐含层或隐含层->可见层任意传输)且全连接的。其中，可见层v和隐藏层h之间通过线性权重连接，可见层的第i个神经元和隐藏层的第j个神经元的线性权重定义为$w_{ij}$，可见层的第i个神经元对应的偏置为$b_{i}$，隐藏层的第j个神经元对应的偏置为$a_{j}$，下标i和j代表神经元的索引。The DBN is a deep learning model that pre-processes the model layer by layer in an unsupervised way; this unsupervised pre-processing is the Restricted Boltzmann Machine (RBM). As shown in FIG. 7(b), a DBN is a stack of RBMs. As shown in FIG. 7(a), an RBM is a two-layer structure: v denotes the visible layer and h the hidden layer; the connections between them are undirected (values can flow either visible->hidden or hidden->visible) and fully connected. The visible layer v and hidden layer h are connected by linear weights: the weight between the i-th visible neuron and the j-th hidden neuron is $w_{ij}$, the bias of the i-th visible neuron is $b_{i}$, the bias of the j-th hidden neuron is $a_{j}$, and the subscripts i and j denote neuron indices.
RBM通过对比散度算法进行一步吉布斯(Gibbs)采样，优化权重$w_{ij}$、$b_{i}$和$a_{j}$，就可以得到输入样本数据(即特定声音样本信号的特征参数)v的另一种状态表达h，RBM的输出h1可以作为下一个RBM的输入，用同一种方式继续优化得到隐藏状态h2，以此类推，多层的DBN模型可以通过逐层预处理的方式对权重$w_{ij}$、$b_{i}$和$a_{j}$进行初始化，每一层的特征都是第一层数据v的一种表达方式，经过这种非监督的预处理后，获得各项初始参数。The RBM performs one step of Gibbs sampling via the contrastive divergence algorithm and optimizes the weights $w_{ij}$, $b_{i}$ and $a_{j}$ to obtain another state representation h of the input sample data v (i.e., the feature parameters of the specific sound sample signal). The output h1 of one RBM can serve as the input of the next RBM, which is optimized in the same way to obtain the hidden state h2, and so on; the multi-layer DBN model thus initializes the weights $w_{ij}$, $b_{i}$ and $a_{j}$ by layer-wise pre-processing, each layer's features being one representation of the first-layer data v. After this unsupervised pre-processing, the initial parameters are obtained.
具体的,RBM是一种能量模型,整个RBM的能量表示如下式(6)所示。Specifically, the RBM is an energy model, and the energy of the entire RBM is expressed by the following formula (6).
$$E(v,h;\theta)=-\sum_{i=1}^{m}\sum_{j=1}^{n}w_{ij}v_{i}h_{j}-\sum_{i=1}^{m}b_{i}v_{i}-\sum_{j=1}^{n}a_{j}h_{j} \qquad (6)$$
其中,E表示RBM模型的总能量,v表示可见层数据,h表示隐藏层数据,θ表示模型参数,m表示可见层神经元数量,n表示隐藏层神经元数量,b表示可见层偏置,a表示隐藏层偏置。Where E is the total energy of the RBM model, v is the visible layer data, h is the hidden layer data, θ is the model parameter, m is the number of visible layer neurons, n is the number of hidden layer neurons, and b is the visible layer offset. a indicates the hidden layer offset.
RBM模型根据可见层数据和隐藏层数据的条件概率进行采样,对于伯努利-伯努利RBM模型,条件概率公式分别为公式(7)和公式(8),The RBM model samples based on the conditional probability of visible layer data and hidden layer data. For the Bernoulli-Bernoulli RBM model, the conditional probability formulas are formula (7) and formula (8), respectively.
$$P(h_{j}=1\mid v;\theta)=\sigma\!\left(\sum_{i=1}^{m}w_{ij}v_{i}+a_{j}\right) \qquad (7)$$
$$P(v_{i}=1\mid h;\theta)=\sigma\!\left(\sum_{j=1}^{n}w_{ij}h_{j}+b_{i}\right) \qquad (8)$$
其中，σ表示激活函数sigmoid函数，$\sigma(x)=(1+e^{-x})^{-1}$。Where σ denotes the sigmoid activation function, $\sigma(x)=(1+e^{-x})^{-1}$.
根据以上公式利用对比散度算法对RBM进行Gibbs采样，得到v和h联合分布的样本，然后通过最大化观测样本的似然对数函数(9)优化参数，According to the above formulas, Gibbs sampling of the RBM is performed using the contrastive divergence algorithm to obtain samples of the joint distribution of v and h, and the parameters are then optimized by maximizing the log-likelihood function (9) of the observed samples:
$$L(\theta)=\sum_{v}\log P(v;\theta) \qquad (9)$$
$$\Delta w_{ij}\approx\langle v_{i}h_{j}\rangle_{0}-\langle v_{i}h_{j}\rangle_{1} \qquad (10)$$
优化参数采用一步对比散度算法，采用平均场逼近的方式直接生成采样样本，利用公式(10)多次迭代优化参数，最终获得各神经元之间的权重、以及神经元的偏置等各项初始参数。其中，N代表RBM模型可见层神经元的数量，亦即RBM模型输入数据的维度。The parameters are optimized with the one-step contrastive divergence algorithm, in which sampling samples are generated directly by mean-field approximation; the parameters are iteratively optimized by applying equation (10) multiple times, finally yielding the initial parameters such as the weights between neurons and the neuron biases. Here N denotes the number of visible-layer neurons of the RBM model, i.e., the dimensionality of the RBM input data.
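A sketch of one CD-1 update following equations (7), (8) and (10); the learning rate and the Bernoulli sampling of h are illustrative assumptions (the mean-field approximation mentioned in the text is used for the reconstruction):

```python
import numpy as np

rng = np.random.default_rng(0)

def cd1_step(v0, W, a, b, lr=0.1):
    """One contrastive-divergence (CD-1) update for a Bernoulli-Bernoulli RBM.
    W: (n_visible, n_hidden) weights; a: hidden biases; b: visible biases."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    ph0 = sigmoid(v0 @ W + a)                            # P(h=1|v), eq. (7)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)     # one Gibbs sampling step
    pv1 = sigmoid(h0 @ W.T + b)                          # P(v=1|h), eq. (8)
    ph1 = sigmoid(pv1 @ W + a)                           # mean-field reconstruction
    W += lr * (np.outer(v0, ph0) - np.outer(pv1, ph1))   # eq. (10)
    b += lr * (v0 - pv1)
    a += lr * (ph0 - ph1)
    return W, a, b
```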
步骤1032：基于深度神经网络的梯度下降和反向传播算法，对各个所述初始参数进行微调，获得基于深度神经网络的特定声音特征模型的各个参数。Step 1032: Fine-tune each of the initial parameters based on the gradient descent and back-propagation algorithms of the deep neural network to obtain the parameters of the deep-neural-network-based specific sound feature model.
DBN的优化过程完成后，获得基于DNN特定声音特征模型的各层(输入层、隐藏层和输出层)神经元之间的权重w和神经元的偏置b，最后的多类别逻辑回归层(softmax)采用随机的初始化方式，然后，DNN采用有监督的梯度下降算法对该特定声音特征模型进行微调。After the DBN optimization is completed, the weights w between the neurons of each layer (input, hidden and output layers) of the DNN-based specific sound feature model and the neuron biases b are obtained; the final multi-class logistic regression (softmax) layer is randomly initialized, and the DNN then fine-tunes the specific sound feature model with a supervised gradient descent algorithm.
具体的,利用有监督的方式,通过最小化代价函数(公式11)的方式优化参数(公式12)微调整个DNN特定声音特征模型。Specifically, in a supervised manner, the DNN-specific sound feature model is fine-tuned by optimizing the parameters (Equation 12) by minimizing the cost function (Equation 11).
$$J(W,b;x,y)=\frac{1}{2}\left\lVert h_{W,b}(x)-y\right\rVert^{2} \qquad (11)$$
其中，J表示代价函数，$h_{W,b}(x)$表示DNN的输出，y表示输入数据对应的标签。Where J denotes the cost function, $h_{W,b}(x)$ the output of the DNN, and y the label corresponding to the input data.
$$W_{ij}^{(l)}=W_{ij}^{(l)}-\alpha\frac{\partial}{\partial W_{ij}^{(l)}}J(W,b),\qquad b_{i}^{(l)}=b_{i}^{(l)}-\alpha\frac{\partial}{\partial b_{i}^{(l)}}J(W,b) \qquad (12)$$
其中,α表示学习率,取值0.5~0.01。Where α represents the learning rate and takes values from 0.5 to 0.01.
上述公式(12)中计算深度神经网络各个节点的偏导数可以采用公式(13)的反向传播算法。The partial derivative of each node of the deep neural network calculated in the above formula (12) can adopt the back propagation algorithm of the formula (13).
$$\frac{\partial}{\partial W_{ij}^{(l)}}J(W,b)=a_{j}^{(l)}\delta_{i}^{(l+1)},\qquad \frac{\partial}{\partial b_{i}^{(l)}}J(W,b)=\delta_{i}^{(l+1)} \qquad (13)$$
其中，δ表示灵敏度，a表示每个神经元节点的输出值。当l表示输出层时，Where δ denotes the sensitivity and a denotes the output value of each neuron node. When l is the output layer,
$$\delta^{(n_{l})}=-\left(y-a^{(n_{l})}\right)\odot\sigma'\!\left(z^{(n_{l})}\right)$$
当l表示其他层时，when l is any other layer,
$$\delta^{(l)}=\left(\left(W^{(l)}\right)^{T}\delta^{(l+1)}\right)\odot\sigma'\!\left(z^{(l)}\right)$$
其中，σ表示激活函数，$z^{(l)}$表示第l层的加权输入。然后通过多次迭代，更新公式(13)，逐层优化整个DNN模型，最终获得各个参数，得到训练好的基于DNN的特定声音特征模型。where σ denotes the activation function and $z^{(l)}$ the weighted input of layer l. Then, through multiple iterations of the update of equation (13), the whole DNN model is optimized layer by layer; the final parameters yield the trained DNN-based specific sound feature model.
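A sketch of one supervised fine-tuning step implementing equations (11)-(13) on a single labelled sample; the squared-error cost and sigmoid activations are assumptions consistent with the formulas above:

```python
import numpy as np

def finetune_step(x, y, weights, biases, lr=0.1):
    """One gradient-descent update of all layers (eq. (12) with alpha = lr)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    # forward pass, keeping every layer's output a^(l)
    acts = [x]
    for W, b in zip(weights, biases):
        acts.append(sigmoid(W @ acts[-1] + b))
    # output-layer sensitivity: delta = (a - y) * sigma'(z), sigma' = a(1-a)
    delta = -(y - acts[-1]) * acts[-1] * (1 - acts[-1])
    for l in range(len(weights) - 1, -1, -1):
        grad_W = np.outer(delta, acts[l])                      # eq. (13)
        # back-propagate before updating, so the pre-update weights are used
        prev = (weights[l].T @ delta) * acts[l] * (1 - acts[l]) if l > 0 else None
        weights[l] -= lr * grad_W                              # eq. (12)
        biases[l] -= lr * delta
        delta = prev
    return weights, biases
```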
通过基于DBN的非监督学习和监督学习方法的结合,相对于随机初始化的深度神经网络,经过无监督预处理后进行监督学习,获得的DNN模型有明显优于普通深度神经网络的性能。以特定声音样本信号的MFCC特征参数作为DNN模型的输入进行建模获得基于DNN的特定声音特征模型,再利用该特定声音特征模型对特定声音进行识别,有效提高了特定声音的识别率。Through the combination of DBN-based unsupervised learning and supervised learning methods, compared with the randomly initialized deep neural network, after unsupervised preprocessing and supervised learning, the obtained DNN model is significantly better than the performance of ordinary deep neural networks. The MFCC feature parameter of the specific sound sample signal is used as the input of the DNN model to obtain a specific sound feature model based on the DNN, and the specific sound feature model is used to identify the specific sound, thereby effectively improving the recognition rate of the specific sound.
图10是本申请实施例提供的特定声音识别方法的流程示意图,如图10所示,所述特定声音识别方法包括:FIG. 10 is a schematic flowchart of a specific voice recognition method according to an embodiment of the present disclosure. As shown in FIG. 10, the specific voice recognition method includes:
步骤201:采样声音信号并获取所述声音信号的梅尔频率倒谱系数特征参数矩阵; Step 201: sampling a sound signal and acquiring a characteristic parameter matrix of a Mel frequency cepstral coefficient of the sound signal;
在实际应用中,可以在特定声音识别设备20上设置声音输入单元(例如麦克风)来采集声音信号,对声音信号进行放大、滤波等处理后转换成数字信号。该数字信号可以在特定声音识别设备20本地的运算处理单元中进行采样及其他计算处理,也可以通过网络上传到云端服务器、智能终端或者其他服务器中进行处理。In a practical application, a sound input unit (for example, a microphone) may be disposed on a specific sound recognition device 20 to collect a sound signal, and the sound signal is amplified, filtered, and the like, and then converted into a digital signal. The digital signal may be sampled and processed in an operation processing unit local to the specific voice recognition device 20, or may be uploaded to a cloud server, a smart terminal, or other server for processing through a network.
其中,获取声音信号的梅尔频率倒谱系数特征参数矩阵的技术细节请参照步骤101,在此不再赘述。For the technical details of obtaining the characteristic parameter matrix of the Mel frequency cepstral coefficient of the sound signal, refer to step 101, and details are not described herein again.
步骤202:从所述声音信号的梅尔频率倒谱系数特征参数矩阵中提取特征参数;Step 202: Extract a feature parameter from a characteristic parameter matrix of a Mel frequency cepstral coefficient of the sound signal.
其中,从声音信号的梅尔频率倒谱系数特征参数矩阵中提取特征参数的具体计算方法请参照步骤102,在此不再赘述。For the specific calculation method for extracting the feature parameters from the characteristic parameter matrix of the frequency coefficient of the frequency coefficient of the sound signal, refer to step 102, and details are not described herein again.
步骤203:将所述特征参数输入预先获取的基于深度神经网络的特定声音特征模型进行识别,以确定所述声音信号是否为特定声音。Step 203: Input the feature parameter into a pre-acquired deep neural network-based specific sound feature model to determine whether the sound signal is a specific sound.
具体的,将所述特征参数输入预先获取的基于深度神经网络的特定声音特征模型进行识别,以确定所述声音信号是否为特定声音,包括:Specifically, the feature parameter is input into a pre-acquired deep neural network-based specific sound feature model to determine whether the sound signal is a specific sound, including:
将所述特征参数包含的一组子特征向量输入预先获取的基于深度神经网络的特定声音特征模型,获得一组子特征向量对应的预测结果;Inputting a set of sub-feature vectors included in the feature parameter into a pre-acquired specific sound feature model based on a depth neural network, and obtaining a prediction result corresponding to a set of sub-feature vectors;
如果所述预测结果中,肯定的预测结果多于否定的预测结果,则确认所述声音信号为特定声音,否则,确认所述声音信号不是特定声音。If the positive prediction result is more than the negative prediction result among the prediction results, it is confirmed that the sound signal is a specific sound, otherwise, it is confirmed that the sound signal is not a specific sound.
当声音信号的特征参数输入训练好的基于DNN的特定声音特征模型时，就会得到该声音信号是否为特定声音的预测结果。由于同一个声音信号的特征参数包含多个子特征向量，每一个子特征向量都会得到一个预测结果，这样每一个声音信号就会得到多个预测结果，这些预测结果代表了声音信号是否是特定声音的可能。基于DNN的特定声音特征模型会对同一个声音信号的所有预测结果进行投票，即所有子特征向量的预测结果中，如果肯定的预测结果多于否定的预测结果，则确认该声音信号为特定声音；如果肯定的预测结果少于否定的预测结果，则确认该声音信号不是特定声音。When the feature parameters of a sound signal are input into the trained DNN-based specific sound feature model, a prediction of whether the sound signal is the specific sound is obtained. Since the feature parameters of one sound signal comprise multiple sub-feature vectors and each sub-feature vector yields one prediction, each sound signal obtains multiple predictions, which together indicate the likelihood that the sound signal is the specific sound. The DNN-based specific sound feature model votes over all predictions for the same sound signal: among the predictions for all sub-feature vectors, if the positive predictions outnumber the negative ones, the sound signal is confirmed as the specific sound; if the positive predictions are fewer than the negative ones, the sound signal is confirmed not to be the specific sound.
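A sketch of this majority vote; `model_predict` is a hypothetical callable mapping one sub-feature vector to 1 (specific sound) or 0 (not):

```python
import numpy as np

def is_specific_sound(model_predict, sub_vectors):
    """Vote over the per-sub-vector predictions of one sound signal."""
    votes = np.array([model_predict(x) for x in sub_vectors])
    positives = int(votes.sum())
    return positives > len(votes) - positives  # positive majority => specific sound
```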
本申请实施例提供的特定声音识别方法，能对特定声音进行识别，从而能够通过监测使用者发出的声音对使用者发出的特定声音情况进行监测，无需使用者佩戴任何检测部件。且由于采用基于MFCC特征参数和DNN模型的识别算法，算法复杂度低、计算量少，从而对硬件要求低，降低了产品制造成本。The specific sound recognition method provided by the embodiments of the present application can recognize a specific sound, so that the specific sounds made by a user can be monitored by monitoring the sounds the user makes, without the user wearing any detection component. Moreover, because a recognition algorithm based on MFCC feature parameters and a DNN model is adopted, the algorithm has low complexity and a small amount of computation, which lowers the hardware requirements and reduces the product manufacturing cost.
需要说明的是,本申请实施例提供的基于MFCC特征参数和DNN模型的特定声音识别方法,除用于识别咳嗽声音之外,同样适用于识别鼾声、喷嚏声、呼吸声、笑声、鞭炮声和哭声等其他特定声音。It should be noted that the specific voice recognition method based on the MFCC feature parameter and the DNN model provided by the embodiment of the present application is also applicable to identify the snoring sound, the sneezing sound, the breathing sound, the laugh sound, the firecracker sound, in addition to the cough sound. And other specific sounds such as crying.
相应的,如图11所示,本申请实施例还提供了一种特定声音识别装置,用于特定声音识别设备20,所述装置包括:Correspondingly, as shown in FIG. 11 , the embodiment of the present application further provides a specific voice recognition device for a specific voice recognition device 20, where the device includes:
采样及特征参数获取模块301,用于采样声音信号并获取所述声音信号的梅尔频率倒谱系数特征参数矩阵;The sampling and feature parameter obtaining module 301 is configured to sample the sound signal and obtain a characteristic parameter matrix of the Mel frequency cepstral coefficient of the sound signal;
特征参数提取模块302,用于从所述声音信号的梅尔频率倒谱系数特征参数矩阵中提取特征参数;The feature parameter extraction module 302 is configured to extract feature parameters from the Mel frequency cepstral coefficient feature parameter matrix of the sound signal;
识别模块303,用于将所述特征参数输入预先获取的基于深度神经网络的特定声音特征模型进行识别,以确定所述声音信号是否为特定声音。The identification module 303 is configured to input the feature parameter into a pre-acquired deep neural network-based specific sound feature model to determine whether the sound signal is a specific sound.
本申请实施例提供的特定声音识别装置,能对特定声音进行识别,从而能够通过监测使用者发出的声音对使用者发出的特定声音情况进行监测,无需使用者佩戴任何检测部件。且由于采用基于MFCC特征参数和DNN模型的识别算法,算法复杂度低、计算量少,从而对硬件要求低,降低了产品制造成本。The specific voice recognition device provided by the embodiment of the present application can identify a specific sound, so that the specific sound condition sent by the user can be monitored by monitoring the sound emitted by the user, without the user wearing any detecting component. Because the recognition algorithm based on MFCC feature parameters and DNN model is adopted, the algorithm has low complexity and less calculation, which has low hardware requirements and reduces product manufacturing costs.
可选的,在所述装置的其他实施例中,如图12所示,所述装置还包括:Optionally, in other embodiments of the device, as shown in FIG. 12, the device further includes:
特征模型预设模块304,用于预先获取所述基于深度神经网络的特定声音特征模型。The feature model preset module 304 is configured to acquire the specific sound feature model based on the depth neural network in advance.
可选的,在所述装置的某些实施例中,特征模型预设模块304具体用于:Optionally, in some embodiments of the device, the feature model preset module 304 is specifically configured to:
采集预设数量的特定声音样本信号并获取所述特定声音样本信号的梅尔频率倒谱系数特征参数矩阵;Collecting a preset number of specific sound sample signals and acquiring a Mel frequency cepstral coefficient characteristic parameter matrix of the specific sound sample signal;
从所述特定声音样本信号的梅尔频率倒谱系数特征参数矩阵中提取所述特征参数;Extracting the feature parameter from a Mel frequency cepstral coefficient feature parameter matrix of the specific sound sample signal;
将所述特定声音样本信号的特征参数作为输入,训练基于深度神经网络模型,以获取所述基于深度神经网络的特定声音特征模型。Taking the characteristic parameters of the specific sound sample signal as input, training based on the depth neural network model to obtain the specific sound feature model based on the deep neural network.
可选的,在所述装置的某些实施例中,特征模型预设模块304还具体用于:Optionally, in some embodiments of the device, the feature model preset module 304 is further configured to:
将特定声音样本信号的梅尔频率倒谱系数特征参数矩阵中各信号帧的梅尔频率倒谱系数依次首尾相连组成一特征向量; The Mel frequency cepstral coefficients of each signal frame in the characteristic parameter matrix of the Mel frequency cepstral coefficient of the specific sound sample signal are sequentially connected end to end to form a feature vector;
将所述特征向量按预设步长从所述特征向量头部到所述特征向量尾部对所述特征向量进行分割,获得包括一组长度均为预设长度的子特征向量的特征参数,每个子特征向量具有相同的标签,所述预设步长为每帧梅尔频率倒谱系数长度的整数倍,所述预设长度为所述每帧梅尔频率倒谱系数长度的整数倍;And dividing the feature vector from the feature vector header to the feature vector tail to segment the feature vector according to a preset step size, to obtain a feature parameter including a set of sub-feature vectors whose lengths are preset lengths, and each The sub-feature vectors have the same label, and the preset step size is an integral multiple of the length of the cepstral coefficient of each frame, and the preset length is an integer multiple of the length of the cepstral coefficient of each frame;
特征参数提取模块302还具体用于:The feature parameter extraction module 302 is also specifically configured to:
将声音信号的梅尔频率倒谱系数特征参数矩阵中各信号帧的梅尔频率倒谱系数依次首尾相连组成一特征向量;The Mel frequency cepstral coefficients of each signal frame in the characteristic parameter matrix of the Mel frequency cepstral coefficient of the sound signal are sequentially connected end to end to form a feature vector;
将所述特征向量按所述预设步长从所述特征向量头部到所述特征向量尾部对所述特征向量进行分割,获得包括一组长度均为所述预设长度的子特征向量的特征参数。And dividing the feature vector from the feature vector header to the feature vector tail to segment the feature vector according to the preset step size, to obtain a set of sub-feature vectors each having a length of the preset length. Characteristic Parameters.
可选的,在所述装置的某些实施例中,特征模型预设模块304还具体用于:Optionally, in some embodiments of the device, the feature model preset module 304 is further configured to:
将所述特定声音样本信号的特征参数作为输入,基于深度置信网络算法进行模型训练,获得所述基于深度神经网络的特定声音特征模型的各个初始参数;Taking the characteristic parameters of the specific sound sample signal as input, performing model training based on a deep confidence network algorithm, and obtaining respective initial parameters of the specific sound feature model based on the depth neural network;
基于深度神经网络的梯度下降和反向传播算法,对各个所述初始参数进行微调,获得基于深度神经网络的特定声音特征模型的各个参数。Based on the gradient descent and back propagation algorithms of the deep neural network, each of the initial parameters is fine-tuned to obtain various parameters of a specific sound feature model based on the deep neural network.
可选的,在所述装置的某些实施例中,识别模块303具体用于:Optionally, in some embodiments of the apparatus, the identification module 303 is specifically configured to:
将所述特征参数包含的一组子特征向量输入预先获取的基于深度神经网络的特定声音特征模型,获得一组子特征向量对应的预测结果;Inputting a set of sub-feature vectors included in the feature parameter into a pre-acquired specific sound feature model based on a depth neural network, and obtaining a prediction result corresponding to a set of sub-feature vectors;
如果所述预测结果中,肯定的预测结果多于否定的预测结果,则确认所述声音信号为特定声音,否则,确认所述声音信号不是特定声音。If the positive prediction result is more than the negative prediction result among the prediction results, it is confirmed that the sound signal is a specific sound, otherwise, it is confirmed that the sound signal is not a specific sound.
可选的,在所述装置的某些实施例中,所述特定声音包括咳嗽声、鼾声和喷嚏声中的任意一种。Optionally, in certain embodiments of the device, the particular sound comprises any one of cough, snoring, and sneezing.
需要说明的是,上述装置可执行本申请实施例所提供的方法,具备执行方法相应的功能模块和有益效果。未在本实施例中详尽描述的技术细节,可参见本申请实施例所提供的方法。It should be noted that the foregoing apparatus can perform the method provided by the embodiment of the present application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided by the embodiments of the present application.
本申请实施例还提供了一种特定声音识别设备，如图13所示，特定声音识别设备20包括声音输入单元21、信号处理单元22和运算处理单元23。其中：声音输入单元21，用于接收声音信号，所述声音输入单元可以例如是麦克风等。信号处理单元22，用于对所述声音信号进行信号处理；所述信号处理单元22可以对所述声音信号进行放大、滤波、模数转换等模拟信号处理，将获得的数字信号发送给运算处理单元23。An embodiment of the present application further provides a specific sound recognition device. As shown in FIG. 13, the specific sound recognition device 20 includes a sound input unit 21, a signal processing unit 22 and an operation processing unit 23. The sound input unit 21 is configured to receive a sound signal and may be, for example, a microphone. The signal processing unit 22 is configured to perform signal processing on the sound signal; it may perform processing such as amplification, filtering and analog-to-digital conversion on the sound signal, and send the resulting digital signal to the operation processing unit 23.
所述信号处理单元22与内置或者外置于特定声音识别设备的运算处理单元23相连(图13以运算处理单元内置在特定声音识别设备中为例说明),运算处理单元23可以内置在特定声音识别设备20内部,也可以外置在特定声音识别设备20外部,所述运算处理单元23还可以是远程设置的服务器,例如可以是通过网络与特定声音识别设备20通信连接的云端服务器、智能终端或者其他服务器。The signal processing unit 22 is connected to an arithmetic processing unit 23 built in or externally to a specific sound recognition device (FIG. 13 is described by way of example in which the arithmetic processing unit is built in a specific sound recognition device), and the arithmetic processing unit 23 can be built in a specific sound. The identification device 20 may be external to the specific voice recognition device 20, and the operation processing unit 23 may also be a remotely set server, for example, a cloud server or a smart terminal that is communicably connected to the specific voice recognition device 20 through a network. Or other servers.
所述运算处理单元23包括:The operation processing unit 23 includes:
至少一个处理器232(图13中以一个处理器举例说明)和存储器231,处理器232和存储器231可以通过总线或者其他方式连接,图13中以通过总线连接为例。At least one processor 232 (illustrated by a processor in FIG. 13) and a memory 231, the processor 232 and the memory 231 may be connected by a bus or the like, and the bus connection is taken as an example in FIG.
存储器231用于存储非易失性软件程序、非易失性计算机可执行程序以及软件模块,如本申请实施例中的特定声音识别方法对应的程序指令/模块(例如,附图11所示的采样及特征参数获取模块301)。处理器232通过运行存储在存储器231中的非易失性软件程序、指令以及模块,从而执行各种功能应用以及数据处理,即实现上述方法实施例的特定声音识别方法。The memory 231 is configured to store a non-volatile software program, a non-volatile computer executable program, and a software module, such as a program instruction/module corresponding to a specific sound recognition method in the embodiment of the present application (for example, as shown in FIG. 11) Sampling and feature parameter acquisition module 301). The processor 232 executes various functional applications and data processing by executing non-volatile software programs, instructions, and modules stored in the memory 231, that is, implementing the specific sound recognition method of the above-described method embodiments.
存储器231可以包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需要的应用程序;存储数据区可存储根据特定声音识别装置使用所创建的数据等。此外,存储器231可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他非易失性固态存储器件。在一些实施例中,存储器231可选包括相对于处理器232远程设置的存储器,这些远程存储器可以通过网络连接至特定声音识别装置。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。The memory 231 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application required for at least one function; the storage data area may store data created according to the use of the specific sound recognition device, and the like. Further, the memory 231 may include a high speed random access memory, and may also include a nonvolatile memory such as at least one magnetic disk storage device, a flash memory device, or other nonvolatile solid state storage device. In some embodiments, memory 231 can optionally include memory remotely located relative to processor 232, which can be connected to a particular voice recognition device over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
所述一个或者多个模块存储在所述存储器231中,当被所述一个或者多个处理器232执行时,执行上述任意方法实施例中的特定声音识别方法,例如,执行以上描述的图2中的方法步骤101-103,图8中的方法步骤1021至1022,图9中的方法步骤1031至1032,图10中的步骤201至步骤203;实现图11中的模块301-303、图12中的模块301-304的功能。The one or more modules are stored in the memory 231, and when executed by the one or more processors 232, perform a specific sound recognition method in any of the above method embodiments, for example, performing FIG. 2 described above Method steps 101-103, method steps 1021 to 1022 in FIG. 8, method steps 1031 to 1032 in FIG. 9, step 201 to step 203 in FIG. 10; implementing modules 301-303 and FIG. 12 in FIG. The function of modules 301-304 in .
本申请实施例提供的特定声音识别设备，能对特定声音进行识别，从而能够通过监测使用者发出的声音对使用者发出的特定声音情况进行监测，无需使用者佩戴任何检测部件。且由于采用基于MFCC特征参数和DNN模型的识别算法，算法复杂度低、计算量少，从而对硬件要求低，降低了产品制造成本。The specific sound recognition device provided by the embodiments of the present application can recognize a specific sound, so that the specific sounds made by a user can be monitored by monitoring the sounds the user makes, without the user wearing any detection component. Moreover, because a recognition algorithm based on MFCC feature parameters and a DNN model is adopted, the algorithm has low complexity and a small amount of computation, which lowers the hardware requirements and reduces the product manufacturing cost.
上述特定声音识别设备可执行本申请实施例所提供的方法,具备执行方法相应的功能模块和有益效果。未在本实施例中详尽描述的技术细节,可参见本申请实施例所提供的方法。The specific voice recognition device can perform the method provided by the embodiment of the present application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided by the embodiments of the present application.
本申请实施例提供了一种存储介质,所述存储介质存储有计算机可执行指令,该计算机可执行指令被一个或多个处理器执行(例如图13中的一个处理器232),可使得上述一个或多个处理器可执行上述任意方法实施例中的特定声音识别方法,例如,执行以上描述的图2中的方法步骤101-103,图8中的方法步骤1021至1022,图9中的方法步骤1031至1032,图10中的步骤201至步骤203;实现图11中的模块301-303、图12中的模块301-304的功能。Embodiments of the present application provide a storage medium storing computer executable instructions that are executed by one or more processors (eg, one processor 232 in FIG. 13), such that The one or more processors may perform the specific sound recognition method in any of the above method embodiments, for example, performing the method steps 101-103 of FIG. 2 described above, the method steps 1021 to 1022 of FIG. 8, and the method of FIG. Method steps 1031 to 1032, steps 201 to 203 in FIG. 10; functions of modules 301-303 in FIG. 11 and modules 301-304 in FIG. 12 are implemented.
以上所描述的实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。The embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, ie may be located in one Places, or they can be distributed to multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
通过以上的实施例的描述,本领域普通技术人员可以清楚地了解到各实施例可借助软件加通用硬件平台的方式来实现,当然也可以通过硬件。本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程是可以通过计算机程序来指令相关的硬件来完成,所述的程序可存储于一计算机可读取存储介质中,该程序在执行时,可包括如上述各方法的实施例的流程。其中,所述的存储介质可为磁碟、光盘、只读存储记忆体(Read-Only Memory,ROM)或随机存储记忆体(Random Access Memory,RAM)等。Through the description of the above embodiments, those skilled in the art can clearly understand that the embodiments can be implemented by means of software plus a general hardware platform, and of course, by hardware. A person skilled in the art can understand that all or part of the process of implementing the above embodiments can be completed by a computer program to instruct related hardware, and the program can be stored in a computer readable storage medium. When executed, the flow of an embodiment of the methods as described above may be included. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Within the concept of the present application, the technical features of the above embodiments or of different embodiments may be combined, the steps may be carried out in any order, and many other variations of the different aspects of the present application as described above exist, which are not detailed here for the sake of brevity. Although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and such modifications or replacements do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (11)

  1. A specific sound recognition method, characterized in that the method comprises:
    sampling a sound signal and obtaining a Mel frequency cepstral coefficient feature parameter matrix of the sound signal;
    extracting feature parameters from the Mel frequency cepstral coefficient feature parameter matrix of the sound signal; and
    inputting the feature parameters into a pre-acquired deep neural network based specific sound feature model for recognition, so as to determine whether the sound signal is a specific sound.
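For illustration only (not part of the claims): a minimal Python sketch of the first claimed step, assuming the third-party librosa library for sampling and MFCC computation; the function name, sampling rate, and coefficient count are illustrative choices, since the claim fixes none of them.

```python
import librosa  # assumed dependency; the claim does not name a library

def mfcc_feature_matrix(wav_path, sr=16000, n_mfcc=13):
    """Sample a sound signal and return its Mel frequency cepstral
    coefficient feature parameter matrix: one row of n_mfcc
    coefficients per signal frame."""
    signal, rate = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=rate, n_mfcc=n_mfcc)
    return mfcc.T  # shape: (n_frames, n_mfcc)
```

Claims 4 and 6 below spell out the feature extraction and decision steps; sketches of those follow the respective claims.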
  2. The specific sound recognition method according to claim 1, characterized in that the method further comprises: pre-acquiring the deep neural network based specific sound feature model.
  3. The specific sound recognition method according to claim 2, characterized in that the pre-acquiring the deep neural network based specific sound feature model comprises:
    collecting a preset number of specific sound sample signals and obtaining a Mel frequency cepstral coefficient feature parameter matrix of the specific sound sample signals;
    extracting the feature parameters from the Mel frequency cepstral coefficient feature parameter matrix of the specific sound sample signals; and
    training a deep neural network based model with the feature parameters of the specific sound sample signals as input, so as to obtain the deep neural network based specific sound feature model.
  4. The specific sound recognition method according to claim 3, characterized in that the extracting the feature parameters from the Mel frequency cepstral coefficient feature parameter matrix of the specific sound sample signals comprises:
    connecting the Mel frequency cepstral coefficients of the signal frames in the Mel frequency cepstral coefficient feature parameter matrix of a specific sound sample signal end to end in sequence to form a feature vector;
    segmenting the feature vector at a preset step size from the head of the feature vector to the tail of the feature vector, so as to obtain feature parameters comprising a group of sub-feature vectors each of a preset length, wherein each sub-feature vector carries the same label, the preset step size is an integer multiple of the length of the Mel frequency cepstral coefficients of each frame, and the preset length is an integer multiple of the length of the Mel frequency cepstral coefficients of each frame;
    and the extracting feature parameters from the Mel frequency cepstral coefficient feature parameter matrix of the sound signal comprises:
    connecting the Mel frequency cepstral coefficients of the signal frames in the Mel frequency cepstral coefficient feature parameter matrix of the sound signal end to end in sequence to form a feature vector; and
    segmenting the feature vector at the preset step size from the head of the feature vector to the tail of the feature vector, so as to obtain feature parameters comprising a group of sub-feature vectors each of the preset length.
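For illustration only: a sketch of claim 4's segmentation, assuming the MFCC matrix has one row per signal frame. The hypothetical parameters step_frames and win_frames express the claimed step size and preset length as frame counts, which makes both of them integer multiples of the per-frame coefficient length by construction.

```python
import numpy as np

def segment_features(mfcc_matrix, step_frames=1, win_frames=10):
    """Connect the per-frame MFCCs end to end into one feature vector,
    then slide a window of win_frames frames over it in steps of
    step_frames frames, yielding a group of equal-length sub-feature
    vectors as described in claim 4."""
    n_frames, n_mfcc = mfcc_matrix.shape
    flat = mfcc_matrix.reshape(-1)        # head-to-tail feature vector
    step = step_frames * n_mfcc           # integer multiple of frame length
    win = win_frames * n_mfcc             # integer multiple of frame length
    if flat.size < win:
        return np.empty((0, win))         # signal shorter than one window
    return np.stack([flat[i:i + win]
                     for i in range(0, flat.size - win + 1, step)])
```

During training, every sub-feature vector cut from one sample signal would inherit that sample's label, as the claim states.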
  5. The specific sound recognition method according to claim 4, characterized in that the training a deep neural network based model with the feature parameters of the specific sound sample signals as input, so as to obtain the deep neural network based specific sound feature model, comprises:
    performing model training based on a deep belief network algorithm with the feature parameters of the specific sound sample signals as input, so as to obtain initial parameters of the deep neural network based specific sound feature model; and
    fine-tuning the initial parameters based on gradient descent and back propagation algorithms of the deep neural network, so as to obtain the parameters of the deep neural network based specific sound feature model.
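For illustration only: claim 5's two-stage training pretrains stacked restricted Boltzmann machines (a deep belief network) to initialize the DNN, then fine-tunes by gradient descent and backpropagation. The numpy sketch below shows a single RBM layer trained with one-step contrastive divergence; the class shape and learning rate are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """One restricted Boltzmann machine layer. Training one RBM per DNN
    layer, greedily from the bottom up, yields the "initial parameters"
    of claim 5; W and b_hid then initialize the corresponding DNN layer
    before backpropagation fine-tuning."""
    def __init__(self, n_visible, n_hidden, lr=0.01):
        self.W = rng.normal(0.0, 0.01, (n_visible, n_hidden))
        self.b_vis = np.zeros(n_visible)
        self.b_hid = np.zeros(n_hidden)
        self.lr = lr

    def cd1(self, v0):
        """One contrastive-divergence (CD-1) update on a batch v0 of
        shape (batch, n_visible)."""
        p_h0 = sigmoid(v0 @ self.W + self.b_hid)          # hidden probs
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        p_v1 = sigmoid(h0 @ self.W.T + self.b_vis)        # reconstruction
        p_h1 = sigmoid(p_v1 @ self.W + self.b_hid)
        n = v0.shape[0]
        self.W += self.lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / n
        self.b_vis += self.lr * (v0 - p_v1).mean(axis=0)
        self.b_hid += self.lr * (p_h0 - p_h1).mean(axis=0)
```

Fine-tuning would then proceed as ordinary supervised training of the initialized network on the labeled sub-feature vectors.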
  6. The specific sound recognition method according to claim 4, characterized in that the inputting the feature parameters into the pre-acquired deep neural network based specific sound feature model for recognition, so as to determine whether the sound signal is a specific sound, comprises:
    inputting the group of sub-feature vectors contained in the feature parameters into the pre-acquired deep neural network based specific sound feature model to obtain prediction results corresponding to the group of sub-feature vectors; and
    if, among the prediction results, the positive prediction results outnumber the negative prediction results, confirming that the sound signal is a specific sound; otherwise, confirming that the sound signal is not a specific sound.
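For illustration only: claim 6's decision rule, assuming a trained model object exposing a predict method that returns one score in [0, 1] per sub-feature vector; the method name and threshold are assumptions, not part of the claim.

```python
import numpy as np

def is_specific_sound(model, sub_vectors, threshold=0.5):
    """Run every sub-feature vector through the specific sound feature
    model and declare the whole signal a specific sound only if the
    positive predictions outnumber the negative ones (claim 6)."""
    scores = np.asarray(model.predict(sub_vectors))  # assumed API: one score per vector
    positives = int((scores > threshold).sum())
    negatives = len(scores) - positives
    return positives > negatives
```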
  7. The specific sound recognition method according to any one of claims 1-6, characterized in that the specific sound comprises any one of a cough sound, a snoring sound, and a sneeze sound.
  8. A specific sound recognition apparatus, characterized in that the apparatus comprises:
    a sampling and feature parameter obtaining module, configured to sample a sound signal and obtain a Mel frequency cepstral coefficient feature parameter matrix of the sound signal;
    a feature parameter extraction module, configured to extract feature parameters from the Mel frequency cepstral coefficient feature parameter matrix of the sound signal;
    a feature matching module, configured to confirm whether the feature parameters match a pre-acquired deep neural network based specific sound feature model; and
    a confirmation module, configured to confirm that the sound signal is a specific sound if the feature parameters match the pre-acquired deep neural network based specific sound feature model.
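For illustration only: the apparatus of claim 8 maps naturally onto a thin class composing the earlier sketches. This reuses mfcc_feature_matrix, segment_features, and is_specific_sound from the sketches above; the module-to-method mapping is the claim's, everything else is an assumption.

```python
class SpecificSoundRecognizer:
    """Claim 8's four modules folded into one object: sampling and
    feature parameter obtaining, feature parameter extraction, feature
    matching, and confirmation."""
    def __init__(self, model, step_frames=1, win_frames=10):
        self.model = model            # pre-acquired DNN-based feature model
        self.step_frames = step_frames
        self.win_frames = win_frames

    def recognize(self, wav_path):
        feat = mfcc_feature_matrix(wav_path)             # sampling/obtaining module
        subs = segment_features(feat, self.step_frames,
                                self.win_frames)         # extraction module
        return is_specific_sound(self.model, subs)       # matching + confirmation
```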
  9. The specific sound recognition apparatus according to claim 8, characterized in that the apparatus further comprises:
    a feature model presetting module, configured to pre-acquire the deep neural network based specific sound feature model;
    wherein the feature model presetting module is specifically configured to:
    collect a preset number of specific sound sample signals and obtain a Mel frequency cepstral coefficient feature parameter matrix of the specific sound sample signals;
    extract the feature parameters from the Mel frequency cepstral coefficient feature parameter matrix of the specific sound sample signals; and
    train a deep neural network based model with the feature parameters of the specific sound sample signals as input, so as to obtain the deep neural network based specific sound feature model.
  10. A specific sound recognition device, characterized in that the specific sound recognition device comprises:
    a sound input unit, configured to receive a sound signal; and
    a signal processing unit, configured to perform analog signal processing on the sound signal;
    wherein the signal processing unit is connected to an arithmetic processing unit built into or external to the specific sound recognition device, and the arithmetic processing unit comprises:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-6.
  11. A storage medium, characterized in that the storage medium stores executable instructions which, when executed by a specific sound recognition device, cause the specific sound recognition device to perform the method according to any one of claims 1-7.
PCT/CN2017/107505 2017-10-24 2017-10-24 Specific sound recognition method and apparatus, and storage medium WO2019079972A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2017/107505 WO2019079972A1 (en) 2017-10-24 2017-10-24 Specific sound recognition method and apparatus, and storage medium
CN201780009004.8A CN109074822B (en) 2017-10-24 2017-10-24 Specific voice recognition method, apparatus and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/107505 WO2019079972A1 (en) 2017-10-24 2017-10-24 Specific sound recognition method and apparatus, and storage medium

Publications (1)

Publication Number Publication Date
WO2019079972A1 (en)

Family

ID=64678057

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/107505 WO2019079972A1 (en) 2017-10-24 2017-10-24 Specific sound recognition method and apparatus, and storage medium

Country Status (2)

Country Link
CN (1) CN109074822B (en)
WO (1) WO2019079972A1 (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109545226B (en) * 2019-01-04 2022-11-22 平安科技(深圳)有限公司 Voice recognition method, device and computer readable storage medium
CN109767784B (en) * 2019-01-31 2020-02-07 龙马智芯(珠海横琴)科技有限公司 Snore identification method and device, storage medium and processor
CN110111797A (en) * 2019-04-04 2019-08-09 湖北工业大学 Method for distinguishing speek person based on Gauss super vector and deep neural network
CN110338797A (en) * 2019-08-12 2019-10-18 苏州小蓝医疗科技有限公司 A kind of intermediate frequency snore stopper data processing method based on the sound of snoring and blood oxygen
CN110558944A (en) * 2019-09-09 2019-12-13 成都智能迭迦科技合伙企业(有限合伙) Heart sound processing method and device, electronic equipment and computer readable storage medium
CN110767239A (en) * 2019-09-20 2020-02-07 平安科技(深圳)有限公司 Voiceprint recognition method, device and equipment based on deep learning
CN110933235B (en) * 2019-11-06 2021-07-27 杭州哲信信息技术有限公司 Noise identification method in intelligent calling system based on machine learning
CN111009261B (en) * 2019-12-10 2022-11-15 Oppo广东移动通信有限公司 Arrival reminding method, device, terminal and storage medium
CN111243619B (en) * 2020-01-06 2023-09-22 平安科技(深圳)有限公司 Training method and device for speech signal segmentation model and computer equipment
CN111488485B (en) * 2020-04-16 2023-11-17 北京雷石天地电子技术有限公司 Music recommendation method based on convolutional neural network, storage medium and electronic device
CN112418173A (en) * 2020-12-08 2021-02-26 北京声智科技有限公司 Abnormal sound identification method and device and electronic equipment
CN113241093A (en) * 2021-04-02 2021-08-10 深圳达实智能股份有限公司 Method and device for recognizing voice in emergency state of subway station and electronic equipment
CN115064244A (en) * 2022-08-16 2022-09-16 深圳市奋达智能技术有限公司 Method and system for reminding medicine taking for needleless injection based on voice recognition

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101976564A (en) * 2010-10-15 2011-02-16 中国林业科学研究院森林生态环境与保护研究所 Method for identifying insect voice
CN103325382A (en) * 2013-06-07 2013-09-25 大连民族学院 Method for automatically identifying Chinese national minority traditional instrument audio data
CN104706321A (en) * 2015-02-06 2015-06-17 四川长虹电器股份有限公司 MFCC heart sound type recognition method based on improvement
CN106251880A (en) * 2015-06-03 2016-12-21 创心医电股份有限公司 Identify method and the system of physiological sound
CN106340309A (en) * 2016-08-23 2017-01-18 南京大空翼信息技术有限公司 Dog bark emotion recognition method and device based on deep learning
US20170103776A1 (en) * 2015-10-12 2017-04-13 Gwangju Institute Of Science And Technology Sound Detection Method for Recognizing Hazard Situation
CN106847293A (en) * 2017-01-19 2017-06-13 内蒙古农业大学 Facility cultivation sheep stress behavior acoustical signal monitoring method
CN107910020A (en) * 2017-10-24 2018-04-13 深圳和而泰智能控制股份有限公司 Sound of snoring detection method, device, equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6487650B2 (en) * 2014-08-18 2019-03-20 日本放送協会 Speech recognition apparatus and program
CN105679316A (en) * 2015-12-29 2016-06-15 深圳微服机器人科技有限公司 Voice keyword identification method and apparatus based on deep neural network
CN105702250B (en) * 2016-01-06 2020-05-19 福建天晴数码有限公司 Speech recognition method and device
CN106710599A (en) * 2016-12-02 2017-05-24 深圳撒哈拉数据科技有限公司 Particular sound source detection method and particular sound source detection system based on deep neural network

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113126027A (en) * 2019-12-31 2021-07-16 财团法人工业技术研究院 Method for positioning specific sound source
CN112185347A (en) * 2020-09-27 2021-01-05 北京达佳互联信息技术有限公司 Language identification method, language identification device, server and storage medium
CN112382302A (en) * 2020-12-02 2021-02-19 漳州立达信光电子科技有限公司 Baby cry identification method and terminal equipment
CN112541533A (en) * 2020-12-07 2021-03-23 阜阳师范大学 Modified vehicle identification method based on neural network and feature fusion
CN112668556A (en) * 2021-01-21 2021-04-16 广州联智信息科技有限公司 Breath sound identification method and system
CN112668556B (en) * 2021-01-21 2024-06-07 广东白云学院 Breathing sound identification method and system
CN113516154A (en) * 2021-04-09 2021-10-19 北京小米移动软件有限公司 Method, device and storage medium for identifying human voice dubbing type in media file
CN113111786B (en) * 2021-04-15 2024-02-09 西安电子科技大学 Underwater target identification method based on small sample training diagram convolutional network
CN113111786A (en) * 2021-04-15 2021-07-13 西安电子科技大学 Underwater target identification method based on small sample training image convolutional network
CN113571092A (en) * 2021-07-14 2021-10-29 东软集团股份有限公司 Method for identifying abnormal sound of engine and related equipment thereof
CN113571092B (en) * 2021-07-14 2024-05-17 东软集团股份有限公司 Engine abnormal sound identification method and related equipment thereof
CN113782048A (en) * 2021-09-24 2021-12-10 科大讯飞股份有限公司 Multi-modal voice separation method, training method and related device
CN114005460A (en) * 2021-10-28 2022-02-01 广州艾美网络科技有限公司 Method and device for separating voice of music file
CN114398925A (en) * 2021-12-31 2022-04-26 厦门大学 Multi-feature-based ship radiation noise sample length selection method and system
EP4226883A1 (en) * 2022-02-15 2023-08-16 Koninklijke Philips N.V. Apparatuses and methods for use with a treatment device
WO2023156174A1 (en) 2022-02-15 2023-08-24 Koninklijke Philips N.V. Apparatuses and methods for use with a treatment device
CN116264620A (en) * 2023-04-21 2023-06-16 深圳市声菲特科技技术有限公司 Live broadcast recorded audio data acquisition and processing method and related device

Also Published As

Publication number Publication date
CN109074822A (en) 2018-12-21
CN109074822B (en) 2023-04-21

Similar Documents

Publication Publication Date Title
WO2019079972A1 (en) Specific sound recognition method and apparatus, and storage medium
Lokesh et al. An automatic tamil speech recognition system by using bidirectional recurrent neural network with self-organizing map
Sailor et al. Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification.
CN108369813B (en) Specific voice recognition method, apparatus and storage medium
Lidy et al. CQT-based Convolutional Neural Networks for Audio Scene Classification.
Peddinti et al. A time delay neural network architecture for efficient modeling of long temporal contexts.
Sainath et al. Learning filter banks within a deep neural network framework
CN108701469B (en) Cough sound recognition method, device, and storage medium
Bhattacharjee A comparative study of LPCC and MFCC features for the recognition of Assamese phonemes
CN106782511A (en) Amendment linear depth autoencoder network audio recognition method
Srinivasan et al. Artificial neural network based pathological voice classification using MFCC features
Boulmaiz et al. Robust acoustic bird recognition for habitat monitoring with wireless sensor networks
Leonid et al. Retracted article: statistical–model based voice activity identification for human-elephant conflict mitigation
Imtiaz et al. Isolated word automatic speech recognition (ASR) system using MFCC, DTW & KNN
Deb et al. Detection of common cold from speech signals using deep neural network
Pellegrini et al. Inferring phonemic classes from CNN activation maps using clustering techniques
Shetty et al. Classification of healthy and pathological voices using MFCC and ANN
Chattopadhyay et al. Optimizing speech emotion recognition using manta-ray based feature selection
Al Bashit et al. A mel-filterbank and MFCC-based neural network approach to train the Houston toad call detection system design
Peng et al. An acoustic signal processing system for identification of queen-less beehives
Wang Supervised speech separation using deep neural networks
Vecchiotti et al. Convolutional neural networks with 3-d kernels for voice activity detection in a multiroom environment
Cakir Multilabel sound event classification with neural networks
Mendelev et al. Robust voice activity detection with deep maxout neural networks
Abbiyansyah et al. Voice recognition on humanoid robot darwin OP using Mel frequency cepstrum coefficients (MFCC) feature and artificial neural networks (ANN) method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17929708

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 10.09.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 17929708

Country of ref document: EP

Kind code of ref document: A1