Disclosure of Invention
The present application aims to provide a specific sound recognition method, device, and storage medium that can recognize a variety of specific sounds with a unified algorithm, and that offer a simple algorithm, a small amount of computation, and low hardware requirements.
To achieve the above object, in a first aspect, an embodiment of the present application provides a specific sound recognition method for a specific sound recognition device, where the method includes:
collecting a sound signal and acquiring a Mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal;
extracting signal characteristics from a mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal;
determining whether the signal features match a specific sound feature model obtained in advance and based on a support vector data description algorithm;
and if the signal features match the specific sound feature model, determining that the sound signal is the specific sound.
Optionally, the method further includes:
obtaining, in advance, the specific sound feature model based on the support vector data description algorithm.
Optionally, the obtaining, in advance, the specific sound feature model based on the support vector data description algorithm includes:
collecting a preset number of specific sound sample signals and acquiring a Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signals;
extracting the signal feature from a mel-frequency cepstrum coefficient feature parameter matrix of the specific sound sample signal;
and training a support vector data description algorithm model with the signal features of the specific sound sample signals as input, so as to obtain the specific sound feature model based on the support vector data description algorithm.
Optionally, the specific sound includes any one of a cough sound, a snore sound, a breath sound, a laughing sound, a firecracker sound, and a crying sound.
Optionally, the signal features include one or more of the following sub-signal features: an energy feature, a local feature, a global frequency domain feature, and a zero-crossing rate feature.
Optionally, the specific sound feature model based on the support vector data description algorithm includes one or more of the following sub-feature models: an energy feature model based on the support vector data description algorithm, a local feature model based on the support vector data description algorithm, a global frequency domain feature model based on the support vector data description algorithm, and a zero-crossing rate feature model based on the support vector data description algorithm;
if the specific sound feature model based on the support vector data description algorithm includes multiple sub-feature models based on the support vector data description algorithm, the determining whether the signal features match the pre-obtained specific sound feature model based on the support vector data description algorithm includes:
determining, for each sub-signal feature among the signal features, whether it matches the corresponding one of the multiple pre-obtained sub-feature models based on the support vector data description algorithm.
In a second aspect, an embodiment of the present application provides a specific sound recognition apparatus, including:
a sound input unit for receiving a sound signal;
a signal processing unit for performing analog signal processing on the sound signal;
the signal processing unit is connected to an arithmetic processing unit disposed inside or outside the specific sound recognition device, and the arithmetic processing unit includes:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform:
collecting a sound signal and acquiring a Mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal;
extracting signal characteristics from a mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal;
determining whether the signal features match a specific sound feature model obtained in advance and based on a support vector data description algorithm;
and if the signal features match the specific sound feature model, determining that the sound signal is the specific sound.
Optionally, the at least one processor is further caused to perform:
obtaining, in advance, the specific sound feature model based on the support vector data description algorithm.
Optionally, the obtaining, in advance, the specific sound feature model based on the support vector data description algorithm includes:
collecting a preset number of specific sound sample signals and acquiring a Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signals;
extracting the signal feature from a mel-frequency cepstrum coefficient feature parameter matrix of the specific sound sample signal;
and training a support vector data description algorithm model with the signal features of the specific sound sample signals as input, so as to obtain the specific sound feature model based on the support vector data description algorithm.
Optionally, the specific sound includes any one of a cough sound, a snore sound, a breath sound, a laughing sound, a firecracker sound, and a crying sound.
Optionally, the signal features include one or more of the following sub-signal features: an energy feature, a local feature, a global frequency domain feature, and a zero-crossing rate feature.
Optionally, the specific sound feature model based on the support vector data description algorithm includes one or more of the following sub-feature models: an energy feature model based on the support vector data description algorithm, a local feature model based on the support vector data description algorithm, a global frequency domain feature model based on the support vector data description algorithm, and a zero-crossing rate feature model based on the support vector data description algorithm;
if the specific sound feature model based on the support vector data description algorithm includes multiple sub-feature models based on the support vector data description algorithm, the determining whether the signal features match the pre-obtained specific sound feature model based on the support vector data description algorithm includes:
determining, for each sub-signal feature among the signal features, whether it matches the corresponding one of the multiple pre-obtained sub-feature models based on the support vector data description algorithm.
In a third aspect, the present application further provides a storage medium storing executable instructions which, when executed by a specific sound recognition device, cause the specific sound recognition device to execute the above method.
In a fourth aspect, embodiments of the present application further provide a program product including a program stored on a storage medium, the program including program instructions that, when executed by a specific sound recognition apparatus, cause the specific sound recognition apparatus to perform the above-mentioned method.
The specific sound recognition method, device, and storage medium provided by the embodiments of the present application use a recognition algorithm based on MFCC characteristic parameters and an SVDD model to recognize specific sounds. The scheme is applicable to a variety of different specific sounds, has low algorithm complexity and a small amount of computation, places low requirements on hardware, and reduces the manufacturing cost of products.
Detailed Description
In order to make the objects, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments given herein without creative effort shall fall within the protection scope of the present application.
The embodiments of the present application provide a specific sound recognition scheme based on Mel-Frequency Cepstral Coefficient (MFCC) characteristic parameters and a Support Vector Data Description (SVDD) algorithm model, and the scheme is suitable for the application environment shown in fig. 1. The specific sound includes crying, snoring, coughing, breathing, laughing, firecracker sounds, and other sounds with repetitive characteristics. The application environment includes a user 10 and a specific sound recognition device 20, and the specific sound recognition device 20 is configured to receive a sound uttered by the user 10 and recognize it to determine whether it is a specific sound.
Further, after recognizing that the sound is a specific sound, the specific sound recognition device 20 may also record and process the specific sound to output condition information about the specific sound uttered by the user 10. The condition information may include the number of occurrences of the specific sound, its duration, and its decibel level. For example, a counter may be included in the specific sound recognition device to count occurrences of the specific sound when it is detected; a timer may be included to measure the duration of the specific sound when it is detected; and a decibel detection means may be included to detect the decibel level of the specific sound when it is detected.
The recognition principle for a specific sound is similar to that of speech recognition: a recognition result is obtained by comparing the processed input sound with a sound model. The process can be divided into two stages, a specific sound model training stage and a specific sound recognition stage. The model training stage mainly includes collecting a certain number of specific sound samples, calculating the MFCC characteristic parameters of the specific sound signals, extracting signal features from the MFCC characteristic parameters, and training a model on the signal features based on the SVDD algorithm to obtain a reference feature model of the specific sound. In the recognition stage, the MFCC characteristic parameters of the sound to be judged are calculated, the signal features corresponding to the feature models are extracted, and whether the signal features match the feature models is judged; if so, the sound is judged to be the specific sound, and if not, it is judged to be a non-specific sound. The recognition process mainly includes preprocessing, feature extraction, model training, pattern matching, and judgment.
In the preprocessing step, a specific sound signal is sampled and its MFCC coefficients are calculated. In the feature extraction step, signal features are selected from the MFCC coefficient matrix of the specific sound. In the model training step, an SVDD model corresponding to each feature signal extracted from the MFCC coefficient matrix of the specific sound is trained. In the pattern matching and judgment step, the MFCC coefficient matrix of the sound signal is first calculated, the signal features of the sound signal are then extracted from the MFCC coefficient matrix, and whether the signal features match the SVDD model is judged; if so, the sound signal is judged to be the specific sound signal, otherwise it is judged not to be.
Recognizing specific sounds by combining MFCC with SVDD reduces algorithm complexity and the amount of computation, and can significantly improve the accuracy of specific sound recognition.
An embodiment of the present application provides a specific sound recognition method, which can be used in the specific sound recognition device 20 described above. The method requires obtaining, in advance, a specific sound feature model based on the support vector data description algorithm, that is, a specific sound feature model based on an SVDD model. As shown in fig. 5, obtaining the specific sound feature model based on the support vector data description algorithm in advance includes:
Step 101: collecting a preset number of specific sound sample signals and acquiring a mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signals;
A specific sound sample signal s(n) is obtained by sampling, and the mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signal is acquired from it. Mel frequency cepstrum coefficients are mainly used for sound data feature extraction and for reducing the dimensionality of the computation. For example, for a frame of data with 512 dimensions (sampling points), the most important 40 dimensions can be extracted after MFCC processing, which also achieves dimension reduction. The mel frequency cepstrum coefficient calculation typically includes: pre-emphasis, framing, windowing, fast Fourier transform, mel filter bank filtering, and discrete cosine transform.
The method for acquiring the characteristic parameter matrix of the mel frequency cepstrum coefficient of the specific sound sample signal comprises the following steps:
(1) pre-emphasis
Pre-emphasis boosts the high frequency part to flatten the spectrum of the signal, keeping the spectrum across the whole band from low to high frequency so that it can be obtained with the same signal-to-noise ratio. It also serves to eliminate the vocal cord and lip effects introduced during sound generation, to compensate the high frequency part of the sound signal that is suppressed by the articulatory system, and to highlight the high frequency formants. It is realized by passing the sampled specific sound sample signal s(n) through a first-order Finite Impulse Response (FIR) high-pass digital filter, whose transfer function is:
H(z) = 1 − a·z⁻¹    (1)
where z represents the input signal, whose time domain representation is the specific sound sample signal s(n), and a represents the pre-emphasis coefficient, which is generally a constant between 0.9 and 1.0.
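As an illustration, a minimal numpy sketch of this pre-emphasis step follows (the default coefficient a = 0.97 is an assumed value within the stated 0.9 to 1.0 range, not a value fixed by this application):

```python
import numpy as np

def pre_emphasis(s, a=0.97):
    """First-order FIR high-pass filtering per equation (1):
    y[n] = s[n] - a * s[n-1]."""
    s = np.asarray(s, dtype=float)
    return np.append(s[0], s[1:] - a * s[:-1])
```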
(2) Framing
Every P sampling points of the specific sound sample signal s(n) are grouped into an observation unit called a frame. The value of P may be 256 or 512, covering about 20 to 30 ms. To avoid excessive variation between two adjacent frames, an overlap region may be formed between them; the overlap region includes M sampling points, where M is typically about 1/2 or 1/3 of P. The sampling frequency of the sound signal is usually 8 kHz or 16 kHz; at 8 kHz, a frame length of 256 sampling points corresponds to a duration of 256/8000 × 1000 = 32 ms.
(3) Windowing
Each frame is multiplied by a Hamming window to increase the continuity between the left and right ends of the frame. Assuming that the signal after framing is S(n), n = 0, 1, …, P−1, where P is the frame size, the windowed signal is S′(n) = S(n) × W(n), where the Hamming window is defined as:
W(n) = 0.54 − 0.46·cos(2πn/(l − 1)),  0 ≤ n ≤ l − 1    (2)
where l represents the window length.
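A combined numpy sketch of the framing and windowing steps (frame_len = 256 and hop = 128, i.e. a 1/2-frame overlap, are assumed values; the signal is assumed to be at least one frame long):

```python
import numpy as np

def frame_and_window(s, frame_len=256, hop=128):
    """Split the signal into overlapping frames (hop = P/2, i.e. an
    overlap of M = P/2 samples) and multiply each frame by a Hamming
    window, W(n) = 0.54 - 0.46*cos(2*pi*n/(l-1)), per equation (2)."""
    n_frames = 1 + (len(s) - frame_len) // hop
    frames = np.stack([s[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    n = np.arange(frame_len)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))
    return frames * window
```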
(4) Fast Fourier Transform (FFT)
Since the characteristics of a signal are usually difficult to observe in the time domain, the signal is usually transformed into an energy distribution in the frequency domain for observation; different energy distributions can represent the characteristics of different sounds. After multiplication by the Hamming window, each frame must therefore undergo a fast Fourier transform to obtain its energy distribution over the spectrum. A fast Fourier transform is performed on each framed and windowed signal to obtain the spectrum of each frame, and the power spectrum of the sound signal is obtained by taking the squared magnitude of its spectrum.
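The corresponding power spectrum computation, as a short numpy sketch (n_fft = 512 is an assumed value consistent with the frame sizes discussed above):

```python
import numpy as np

def power_spectrum(frames, n_fft=512):
    """FFT each windowed frame and take the squared magnitude; only the
    first n_fft/2 + 1 bins are kept, since the spectrum of a real
    signal is conjugate-symmetric."""
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)
    return np.abs(spectrum) ** 2
```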
(5) Triangular band-pass filtering
The energy spectrum is filtered through a set of mel-scale triangular filter banks. A filter bank with M filters is defined (the number of filters is close to the number of critical bands); the filters used are triangular filters with center frequencies f(m), m = 1, 2, …, M, and M may be 22 to 26. The spacing between adjacent f(m) narrows as m decreases and widens as m increases, see fig. 4.
The frequency response of the triangular filter is defined as:
H_m(k) = 0,  for k < f(m−1)
H_m(k) = (k − f(m−1)) / (f(m) − f(m−1)),  for f(m−1) ≤ k ≤ f(m)    (3)
H_m(k) = (f(m+1) − k) / (f(m+1) − f(m)),  for f(m) < k ≤ f(m+1)
H_m(k) = 0,  for k > f(m+1)
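For illustration, a minimal numpy sketch of constructing such a filter bank (the hertz-to-mel conversion 2595·log10(1 + f/700) is the standard formula; M = 26, n_fft = 512, and fs = 8000 are assumed values for the example):

```python
import numpy as np

def mel_filter_bank(n_filters=26, n_fft=512, fs=8000):
    """Build M triangular filters with center frequencies f(m) spaced
    uniformly on the mel scale (denser at low frequencies), per the
    frequency response in equation (3)."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):          # rising slope
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling slope
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank
```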
(6) Discrete cosine transform
Calculating the logarithmic energy output by each filter bank as:
s(m) = ln( Σ_k |X(k)|²·H_m(k) ),  m = 1, 2, …, M    (4)
where X(k) denotes the spectrum of a frame of the signal obtained by the fast Fourier transform.
the logarithmic energy s (m) is subjected to Discrete Cosine Transform (DCT) to obtain an MFCC coefficient:
Step 102: extracting signal features from the mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signal;
wherein the signal features may include one or more sub-features of an energy feature, a local feature, a global frequency domain feature, and a zero-crossing rate feature.
As can be seen from equation (5), the MFCC coefficients form an N × L coefficient matrix, where N is the number of frames of the sound signal and L is the MFCC coefficient length. The MFCC coefficient matrix has high dimensionality, and the number of rows N differs from signal to signal because sound signal lengths are not uniform, so the MFCC coefficient matrix cannot be used directly as input for obtaining an SVDD model. Valid features must therefore be further extracted from the MFCC coefficient matrix for direct input into the SVDD model.
To extract the valid features, the MFCC coefficient matrix needs to be reduced in dimension. Directly reducing the dimensionality of the MFCC matrix, however, may lose valid features of the specific sound signal; instead, the valid features can be extracted from the MFCC coefficient matrix by combining the time domain and frequency domain characteristics of the specific sound signal.
Taking a cough sound signal as an example of a specific sound signal, refer to fig. 2, which is a time-amplitude (time domain) diagram of a cough sound signal. As can be seen from fig. 2, a cough sound occurs over a very short period and is distinctly paroxysmal; the duration of a single cough is usually less than 550 ms, and even for patients with severe throat and bronchial diseases it usually remains around 1000 ms. In terms of energy, the energy of the cough sound signal is concentrated primarily in the first half of the signal. Therefore, the energy coefficients of the signal segment where energy is relatively concentrated can be selected as the energy feature to characterize the cough sound sample signal; for example, the group of energy coefficients of the first 1/2 of the cough sound sample signal is selected as the energy feature, the energy feature is used as input, and an SVDD model is established to recognize the sound signal.
Since the cough sound sample signals differ in length, the number of rows N of the parameter matrix differs, and the energy coefficient sequences therefore also differ in length. The energy coefficients accordingly need to be normalized to the same length.
Specifically, the extracting the energy feature from the mel-frequency cepstrum coefficient feature parameter matrix of the cough sound sample signal includes:
selecting, from the mel frequency cepstrum coefficient characteristic parameter matrix of the cough sound sample signal, the energy coefficients of the consecutive frames, at a preset proportion of the signal, whose sum of energy coefficients is the largest;
and normalizing the energy coefficients of the consecutive frames of the cough sound sample signal to a preset length based on a DTW algorithm, so as to obtain the energy feature of the cough sound sample signal.
In a specific application, combined with the energy distribution of cough sound signals, the consecutive frames with the largest sum of energy coefficients at the preset proportion may be, for example, the first 1/2, the first 4/7, or the first 5/9 of the cough sound sample signal. The preset length can be set according to the actual application.
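A minimal sketch of this energy feature extraction (the per-frame energy coefficients could be, for example, the log filter-bank energies; ratio = 1/2 and target_len = 32 are assumed values, and plain linear interpolation stands in here for the DTW-based length normalization described above):

```python
import numpy as np

def energy_feature(energy_coeffs, ratio=0.5, target_len=32):
    """Pick the contiguous run of frames (a preset proportion of the
    signal) whose energy coefficients sum highest, then normalize it to
    a fixed length. NOTE: the application describes a DTW-based length
    normalization; linear interpolation is used here as a stand-in."""
    e = np.asarray(energy_coeffs, dtype=float)
    win = max(1, int(round(len(e) * ratio)))
    sums = np.convolve(e, np.ones(win), mode='valid')  # sliding-window sums
    start = int(np.argmax(sums))
    seg = e[start : start + win]
    # Stretch or compress the segment to target_len points.
    return np.interp(np.linspace(0, len(seg) - 1, target_len),
                     np.arange(len(seg)), seg)
```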
As can be seen from fig. 2, most cough sound signals (about 90%) have a substantially consistent trend: the signal energy decreases rapidly after the cough pulse occurs, decreasing quickly for a dry cough and slightly more slowly for a wet cough. The change trend of the cough sound signal therefore characterizes it well, so global frequency domain features (which reflect the change trend of the signal) can be extracted from the MFCC coefficient matrix of the cough sound signal and used as input to establish an SVDD model for recognizing the sound signal.
Specifically, the global frequency domain feature of the cough sound sample signal may be obtained by performing dimension reduction processing on a mel frequency cepstrum coefficient feature parameter matrix of the cough sound sample signal by using a linear discriminant analysis algorithm (LDA).
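As one possible reading of this step, a sketch using scikit-learn's LinearDiscriminantAnalysis follows. LDA is a supervised method, so the sketch assumes that frames from specific-sound samples and from background noise have been given class labels; the labeling scheme and the fixed-length resampling of the projected trend are assumptions, not details fixed by this application.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fit_lda(specific_frames, noise_frames):
    """Fit LDA on individual MFCC frames labeled 1 (specific sound) or
    0 (noise); with two classes, LDA yields a single component."""
    X = np.vstack([specific_frames, noise_frames])
    y = np.r_[np.ones(len(specific_frames)), np.zeros(len(noise_frames))]
    return LinearDiscriminantAnalysis(n_components=1).fit(X, y)

def global_frequency_feature(mfcc_matrix, lda, target_len=16):
    """Project each frame of the N x L MFCC matrix onto the LDA axis,
    then resample the projected curve (the signal's trend) to a fixed
    length so signals of different lengths are comparable."""
    proj = lda.transform(mfcc_matrix).ravel()
    return np.interp(np.linspace(0, len(proj) - 1, target_len),
                     np.arange(len(proj)), proj)
```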
Fig. 3 is a time-frequency diagram (spectrogram) of the cough sound signal. As can be seen from fig. 3, the spectral energy is also concentrated at the beginning of the signal, and the frequency distribution range is wide (generally concentrated in 200-6000 Hz). Therefore, the MFCC coefficients of the few frames with concentrated spectral energy in the cough sound sample signal can be selected as local features to characterize the cough sound signal, and the local features used as input to establish an SVDD model for recognizing the sound signal. Specifically, the local features may be obtained by selecting the few frames with the most concentrated energy from the cough sound sample signal, assigning different weights to the MFCC coefficients of these frames, and summing them. Since the weight of a frame's mel frequency cepstrum coefficients is positively correlated with the frame's energy coefficient, the weight values can be determined from the energy coefficients of the cough sound sample signal. Namely: the mel frequency cepstrum coefficients of the consecutive S2 frames with the largest sum of energy coefficients are selected from the mel frequency cepstrum coefficient characteristic parameter matrix of the cough sound sample signal, where S2 is a positive integer; the weights of the mel frequency cepstrum coefficients of these S2 frames are then determined from the energy coefficients of the S2 frames, and the mel frequency cepstrum coefficients of the S2 frames are weighted and summed according to these weights, thereby obtaining the local feature of the cough sound sample signal.
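A minimal sketch of this local feature computation (S2 = 5 frames is an assumed value; taking the weights proportional to the frame energy coefficients is one simple way to realize the positive correlation described above):

```python
import numpy as np

def local_feature(mfcc_matrix, energy_coeffs, s2=5):
    """Find the S2 consecutive frames with the largest energy sum,
    weight each frame's MFCC row by its normalized energy coefficient,
    and sum the weighted rows into one L-dimensional local feature."""
    e = np.asarray(energy_coeffs, dtype=float)
    sums = np.convolve(e, np.ones(s2), mode='valid')  # sliding-window sums
    start = int(np.argmax(sums))
    seg_e = e[start : start + s2]
    weights = seg_e / seg_e.sum()   # weight positively correlated with energy
    return weights @ mfcc_matrix[start : start + s2]
```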
From the above analysis, if the specific sound is a cough sound signal, the energy feature, the local feature, and the global frequency domain feature all reflect the characteristics of the cough sound signal, and one or more of these sub-signal features are selected from the MFCC coefficient matrix of the cough sound sample signal. Using the one or more sub-signal features as input to establish an SVDD model for recognizing the sound signal greatly improves the accuracy of cough sound recognition and reduces the false recognition rate. When the energy feature, the local feature, and the global frequency domain feature are all extracted from the MFCC coefficient matrix of the cough sound sample signal and used together as input to train the SVDD model, the recognition rate for cough sounds can exceed 95%.
Other dimension reduction methods may also be used to reduce the dimensionality of the MFCC coefficients of a specific sound sample signal, such as DTW or Principal Component Analysis (PCA). When the PCA algorithm is used to reduce the dimensionality of the MFCC coefficients of cough sound sample signals and the SVDD model is trained with the reduced parameters, however, the resulting SVDD model separates cough sounds from noise poorly: the cough recognition rate is about 85%, and the noise false recognition rate reaches 65%.
For other specific sound signals, such as crying, breathing, laughing, snoring, and firecracker sounds, signal features can likewise be selectively extracted from the MFCC coefficient matrix according to their time domain and frequency domain characteristics.
Step 103: and training a support vector data description algorithm model by taking the signal characteristics of the specific sound sample signal as input so as to obtain the specific sound characteristic model based on the support vector data description algorithm.
When the specific sound is a cough sound, the energy feature, the local feature, and the global frequency domain feature are each used as input to train an SVDD model: an SVDD model for the energy feature (the energy feature model), an SVDD model for the local feature (the local feature model), and an SVDD model for the global frequency domain feature (the global frequency domain feature model). A specific sound feature model based on the support vector data description algorithm, composed of the energy feature model, the local feature model, and the global frequency domain feature model, is thereby obtained.
The basic principle of SVDD is to compute a spherical decision boundary for the input samples, dividing the whole space into two parts: the space inside the boundary, regarded as the accepted part, and the space outside the boundary, regarded as the rejected part. This gives SVDD the ability to classify using only one class of samples (one-class classification).
Specifically, the optimization goal of SVDD is to find a minimum sphere with center a and radius R:
min F(R, a, ξ_i) = R² + C·Σ_i ξ_i
such that the sphere satisfies, for all input data x_i:
‖x_i − a‖² ≤ R² + ξ_i,  ξ_i ≥ 0
(For data x_i of more than 3 dimensions, the spherical surface is a hypersphere; a hypersphere is the generalization of a sphere to spaces of more than 3 dimensions, corresponding to a curve in 2-dimensional space and an ordinary sphere in 3-dimensional space.)
Satisfying this condition means that the data points in the training data set are all contained in the sphere, up to the slack variables ξ_i, where x_i represents the input sample data, i.e., the signal features of the specific sound sample signals.
Now that there are an optimization objective and a constraint, the Lagrange multiplier method can be adopted to solve the problem:
L(R, a, α_i, γ_i, ξ_i) = R² + C·Σ_i ξ_i − Σ_i α_i·(R² + ξ_i − ‖x_i − a‖²) − Σ_i γ_i·ξ_i    (6)
where α_i ≥ 0 and γ_i ≥ 0. Taking the partial derivatives of L with respect to the parameters R, a, and ξ_i respectively, and setting each derivative equal to 0, yields:
Σ_i α_i = 1    (7)
a = Σ_i α_i·x_i    (8)
C − α_i − γ_i = 0    (9)
the dual problems can be obtained by substituting the above (7), (8) and (9) into the formula (6):
The vector inner products above can be replaced by a kernel function K, i.e.:
max_α  Σ_i α_i·K(x_i, x_i) − Σ_i Σ_j α_i·α_j·K(x_i, x_j)    (11)
Through the above calculation process, the values of the center a and the radius R can be obtained, and the SVDD model is thereby determined. By applying this calculation process to each feature in turn, the centers a1, a2, and a3 and the radii R1, R2, and R3 of the SVDD models corresponding to the energy feature model, the local feature model, and the global frequency domain feature model can be obtained by training, which completes the training process.
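For illustration, the following is a minimal sketch that solves the dual problem (10)/(11) numerically with a general-purpose optimizer. The RBF kernel and the values of C and gamma are assumptions for the example (C must be at least 1/n for the constraint Σ_i α_i = 1 to be feasible); in practice a dedicated quadratic programming solver would normally be used.

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(X, Y, gamma=0.5):
    """Gaussian (RBF) kernel matrix: K(x_i, y_j) = exp(-gamma*||x_i - y_j||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def train_svdd(X, C=0.2, gamma=0.5):
    """Maximize the dual (11): sum_i a_i*K(x_i,x_i) - sum_ij a_i*a_j*K(x_i,x_j),
    subject to 0 <= a_i <= C and sum_i a_i = 1."""
    n = len(X)
    K = rbf_kernel(X, X, gamma)
    neg_dual = lambda a: -(a @ np.diag(K)) + a @ K @ a  # minimize the negative
    res = minimize(neg_dual, np.full(n, 1.0 / n),
                   bounds=[(0.0, C)] * n,
                   constraints=[{'type': 'eq', 'fun': lambda a: a.sum() - 1.0}],
                   method='SLSQP')
    alpha = res.x
    # ||x_i - a||^2 for every training point, via the kernel expansion.
    d2 = np.diag(K) - 2.0 * K @ alpha + alpha @ K @ alpha
    # R^2 is the distance of a boundary support vector (0 < alpha_i < C).
    on_boundary = (alpha > 1e-6) & (alpha < C - 1e-6)
    R2 = d2[on_boundary].mean() if on_boundary.any() else d2.max()
    return {'X': X, 'alpha': alpha, 'gamma': gamma, 'R2': R2}
```

Running this sketch once per feature type (energy, local, global frequency domain) yields the three sub-models, with the centers a1, a2, and a3 represented implicitly as Σ_i α_i·x_i and the radii R1, R2, and R3 given by the returned R2 values.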
In the training process, on the one hand, the size and range of the hypersphere are controlled so that it contains as many sample points as possible; on the other hand, the radius of the hypersphere is required to be as small as possible so that the optimal classification effect is achieved.
Specifically, taking the cough sound signal as an example, each model corresponds to a hypersphere. On the premise of containing all the specific sound signals, the hypersphere boundary is optimized to minimize its radius, finally yielding a cough signal feature model based on the support vector data description algorithm that best meets the requirement, so that a high accuracy rate is achieved when this model is used to recognize the signal features extracted from sound signals.
As shown in fig. 6, the specific sound recognition method includes:
Step 201: sampling a sound signal and acquiring a mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal;
In practical applications, a sound input unit (e.g., a microphone) may be disposed on the specific sound recognition device 20 to collect the sound signal, which is then amplified, filtered, and converted into a digital signal. The digital signal may be sampled and further processed in an arithmetic processing unit local to the specific sound recognition device 20, or may be uploaded via a network to a cloud server, an intelligent terminal, or another server for processing.
For details of the technique for obtaining the mel-frequency cepstrum coefficient characteristic parameter matrix of the sound signal, please refer to step 101, which is not described herein again.
Step 202: extracting signal features from the mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal;
Specifically, when cough sounds are to be recognized, the energy feature, the local feature, and the global frequency domain feature are extracted from the characteristic parameter matrix. For other specific sounds, signal features may be selected from the characteristic parameter matrix according to the time domain and frequency domain characteristics of the sound signal. For the specific method of calculating the energy feature, the local feature, and the global frequency domain feature of the sound signal, refer to step 102, which is not repeated here.
Step 203: determining whether the signal features match a specific sound feature model obtained in advance and based on a support vector data description algorithm;
Specifically, when identifying cough sounds, it is determined whether the energy feature, the local feature, and the global frequency domain feature obtained in step 202 conform to the energy feature model, the local feature model, and the global frequency domain feature model in the cough sound feature model respectively, that is, whether the energy feature conforms to the energy feature model, whether the local feature conforms to the local feature model, and whether the global frequency domain feature conforms to the global frequency domain feature model. As discussed in step 103, the energy feature model, the local feature model, and the global frequency domain feature model are hypersphere models with centers a1, a2, and a3 and radii R1, R2, and R3 respectively. To judge whether the energy feature, the local feature, and the global frequency domain feature conform to their feature models, the distances D1, D2, and D3 from these features to the centers a1, a2, and a3 can be calculated respectively; the sound signal is judged to be a cough sound only when all three features are within the boundaries of their SVDD models (i.e., D1 < R1, D2 < R2, and D3 < R3).
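As a continuation of the train_svdd sketch above (and under the same assumptions), the matching step can be illustrated as follows: the squared distance from a test feature z to the center a = Σ_i α_i·x_i is computed through the kernel expansion, and the sound is accepted only if every sub-feature falls inside its hypersphere (D1 < R1, D2 < R2, D3 < R3):

```python
import numpy as np

def svdd_distance2(z, model):
    """||z - a||^2 = K(z,z) - 2*sum_i a_i*K(z,x_i) + sum_ij a_i*a_j*K(x_i,x_j);
    for the RBF kernel, K(z, z) = 1."""
    X, alpha, gamma = model['X'], model['alpha'], model['gamma']
    k_zx = np.exp(-gamma * ((X - z) ** 2).sum(axis=1))
    K = np.exp(-gamma * ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    return 1.0 - 2.0 * (alpha @ k_zx) + alpha @ K @ alpha

def is_specific_sound(features, models):
    """features and models are dicts keyed by sub-feature name, e.g.
    'energy', 'local', 'global'; every sub-feature must lie inside the
    boundary of its corresponding SVDD sub-model."""
    return all(svdd_distance2(features[name], models[name]) < models[name]['R2']
               for name in models)
```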
Step 204: if the signal features match the specific sound feature model, confirming that the sound signal is the specific sound.
The specific sound recognition method provided by the embodiment of the present application uses a recognition algorithm based on MFCC characteristic parameters and an SVDD model to recognize specific sounds. It is applicable to a variety of different specific sounds, has low algorithm complexity and a small amount of computation, places low requirements on hardware, and reduces the manufacturing cost of products.
Accordingly, an embodiment of the present application further provides a specific sound recognition apparatus, which can be used in the specific sound recognition device 20 described above. The apparatus includes:
a sampling and characteristic parameter obtaining module 301, configured to sample a sound signal and obtain a mel-frequency cepstrum coefficient characteristic parameter matrix of the sound signal;
a signal feature extraction module 302, configured to extract a signal feature from a mel-frequency cepstrum coefficient feature parameter matrix of the sound signal;
a feature matching module 303, configured to determine whether the signal features match a pre-obtained specific sound feature model based on a support vector data description algorithm;
a confirming module 304, configured to confirm that the sound signal is a specific sound if the signal features match a pre-obtained specific sound feature model based on a support vector data description algorithm.
The specific sound recognition apparatus provided by the embodiment of the present application uses a recognition algorithm based on MFCC characteristic parameters and an SVDD model to recognize specific sounds. It is applicable to a variety of different specific sounds, has low algorithm complexity and a small amount of computation, places low requirements on hardware, and reduces the manufacturing cost of products.
Optionally, in another embodiment of the apparatus, the apparatus further includes:
the feature model presetting module is used for acquiring the specific sound feature model based on the support vector data description algorithm in advance;
the feature model presetting module is specifically configured to:
collecting a preset number of specific sound sample signals and acquiring a Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signals;
extracting the signal feature from a mel-frequency cepstrum coefficient feature parameter matrix of the specific sound sample signal;
and training a support vector data description algorithm model with the signal features of the specific sound sample signals as input, so as to obtain the specific sound feature model based on the support vector data description algorithm.
Optionally, in some embodiments of the apparatus, the specific sound includes any one of cough, snore, breath, laughing, firecracker, and crying;
optionally, the signal feature includes one or more sub-signal features of an energy feature, a local feature, a global frequency domain feature and a zero-crossing rate feature.
Optionally, in some embodiments of the apparatus, the specific sound feature model based on the support vector data description algorithm includes one or more of: an energy feature model based on the support vector data description algorithm, a local feature model based on the support vector data description algorithm, a global frequency domain feature model based on the support vector data description algorithm, and a zero-crossing rate feature model based on the support vector data description algorithm;
if the specific sound feature model based on the support vector data description algorithm includes a plurality of sub-feature models based on the support vector data description algorithm, the determining whether the signal features match the pre-obtained specific sound feature model based on the support vector data description algorithm includes:
determining, for each sub-signal feature among the signal features, whether it matches the corresponding pre-obtained sub-feature model based on the support vector data description algorithm.
It should be noted that the above apparatus can execute the method provided by the embodiments of the present application, and has the corresponding functional modules and beneficial effects for executing the method. For technical details not described in detail in this embodiment, reference may be made to the method provided by the embodiments of the present application.
The embodiment of the present application also provides a specific sound recognition device. As shown in fig. 8, the specific sound recognition device 20 includes a sound input unit 21, a signal processing unit 22, and an arithmetic processing unit 23. The sound input unit 21 is configured to receive a sound signal and may be, for example, a microphone. The signal processing unit 22 is configured to perform signal processing on the sound signal; it may perform amplification, filtering, and analog-to-digital conversion on the sound signal, and send the resulting digital signal to the arithmetic processing unit 23.
The signal processing unit 22 is connected to the arithmetic processing unit 23 (fig. 8 illustrates the arithmetic processing unit built into the specific sound recognition device; it may instead be external to it). The arithmetic processing unit 23 may be built into the specific sound recognition device 20 or located outside it, and may also be a remote server, for example a cloud server, an intelligent terminal, or another server communicatively connected with the specific sound recognition device 20 through a network.
The arithmetic processing unit 23 includes:
at least one processor 232 (one processor is illustrated in fig. 8) and a memory 231, and the processor 232 and the memory 231 may be connected by a bus or other means, and the bus connection is illustrated in fig. 8 as an example.
The memory 231 is used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the specific sound recognition method in the embodiments of the present application (e.g., the sampling and characteristic parameter obtaining module 301 shown in fig. 7). By running the non-volatile software programs, instructions, and modules stored in the memory 231, the processor 232 executes various functional applications and data processing, i.e., implements the specific sound recognition method of the above method embodiments.
The memory 231 may include a storage program area and a storage data area, where the storage program area may store an operating system and an application program required by at least one function, and the storage data area may store data created according to the use of the specific sound recognition device, and the like. Further, the memory 231 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 231 optionally includes memory located remotely from the processor 232, and such remote memory may be connected to the specific sound recognition device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
When executed by the one or more processors 232, the one or more modules stored in the memory 231 perform the specific sound recognition method of any of the above method embodiments, for example, performing method steps 101 to 103 in fig. 5 and method steps 201 to 204 in fig. 6 described above, and implementing the functions of modules 301 to 304 in fig. 7.
The specific sound recognition device 20 provided in the embodiment of the present application may be used to recognize different specific sounds, such as crying, snoring, cough, breathing, laughing, firecracker, and the like. In actual use, the operation mode of the specific sound recognition apparatus 20 needs to be switched before a different specific sound is recognized. For example, when cough sounds are recognized, in the feature extraction step, energy features, local features and global frequency domain features of sounds to be detected are extracted, in the model training step, energy feature models, local feature models and global frequency domain feature models are generated through training, and in the mode matching step, energy features and energy feature models, local features and local feature models, global frequency domain features and global frequency domain feature models are correspondingly matched. For other sounds, one or more sub-features of the energy feature, the local feature, the global frequency domain feature and the zero-crossing rate feature can be extracted according to the time domain and frequency domain characteristics of other sounds, each sub-feature model is correspondingly established, and then each sub-feature is matched with the corresponding sub-feature model in the pattern matching step.
The specific sound recognition device provided by the embodiment of the present application uses a recognition algorithm based on MFCC characteristic parameters and an SVDD model to recognize specific sounds. It is applicable to a variety of different specific sounds, has low algorithm complexity and a small amount of computation, places low requirements on hardware, and reduces the manufacturing cost of products.
The specific sound recognition device can execute the method provided by the embodiments of the present application, and has the corresponding functional modules and beneficial effects for executing the method. For technical details not described in detail in this embodiment, reference may be made to the method provided by the embodiments of the present application.
The present embodiments provide a storage medium storing computer-executable instructions which, when executed by one or more processors (e.g., one processor 232 in fig. 8), may cause the one or more processors to perform the specific sound recognition method of any of the above method embodiments, for example, performing method steps 101 to 103 in fig. 5 and method steps 201 to 204 in fig. 6 described above, and implementing the functions of modules 301 to 304 in fig. 7.
The above-described embodiments are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Through the above description of the embodiments, it is clear to those skilled in the art that the embodiments may be implemented by software plus a general hardware platform, and may also be implemented by hardware. Those skilled in the art will appreciate that all or part of the processes in the methods of the embodiments described above can be implemented by hardware related to instructions of a computer program, and the computer program can be stored in a computer readable storage medium, and when executed, the computer program can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; within the idea of the invention, also technical features in the above embodiments or in different embodiments may be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the invention as described above, which are not provided in detail for the sake of brevity; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.