CN111916107A - Training method of audio classification model, and audio classification method and device - Google Patents

Training method of audio classification model, and audio classification method and device

Info

Publication number
CN111916107A
CN111916107A (application CN202010673260.0A)
Authority
CN
China
Prior art keywords
audio
audio data
training
calculating
classification model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010673260.0A
Other languages
Chinese (zh)
Inventor
何维祯
Current Assignee
TP Link Technologies Co Ltd
Original Assignee
TP Link Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by TP Link Technologies Co Ltd filed Critical TP Link Technologies Co Ltd
Priority: CN202010673260.0A
Publication: CN111916107A
Legal status: Withdrawn

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 — characterised by the analysis technique
    • G10L25/30 — characterised by the analysis technique using neural networks
    • G10L25/48 — specially adapted for particular use

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a training method of an audio classification model, which comprises the following steps: calculating audio data in a pre-collected audio sample set according to a preset Gaussian probability model to obtain an audio training set; performing Fourier transform on the audio data in the audio training set to obtain the corresponding frequency spectrum; and inputting the frequency spectrum into a preset GRU neural network for back-propagation training until the loss converges, to obtain a trained audio classification model. The embodiment of the invention also discloses an audio classification method and device, which can effectively alleviate the false detections caused by environmental noise in the prior art.

Description

Training method of audio classification model, and audio classification method and device
Technical Field
The invention relates to the technical field of audio classification, in particular to a training method of an audio classification model, and an audio classification method and device.
Background
With the popularization of smart-home devices, modern households place ever higher demands on home security and living convenience. Audio classification and recognition on smart-home devices can greatly improve both, enabling functions such as baby-cry detection on a home camera, alarms triggered by the sound of an elderly person falling, voice recognition, and human-computer interaction. Accordingly, audio classification and detection techniques are becoming key technologies in the smart home (IoT).
Common audio classification methods fall mainly into template matching and feature-extraction-based machine learning. Template matching models the audio distribution with a probabilistic model, typically a Gaussian. In practice, however, sound is affected by many factors, including the environment, speech rate and spoken-language style, so the actual distribution does not fit a Gaussian well, and the accuracy of this method is hard to guarantee. Feature-extraction-based machine learning depends heavily on the quality of the training set; in practical applications it is difficult to cover so many audio categories, so its applicability is limited. Both methods are also strongly affected by environmental noise, which frequently causes false detections.
Disclosure of Invention
The embodiment of the invention provides a training method of an audio classification model, and a method and a device for classifying audio, which can effectively solve the problem that false detection often occurs due to the influence of environmental noise in the prior art.
An embodiment of the present invention provides a method for training an audio classification model, including:
calculating audio data in a pre-collected audio sample set according to a preset Gaussian probability model to obtain an audio training set; wherein the audio training set comprises the denoised audio data;
performing Fourier transform on the audio data in the audio training set to obtain a corresponding frequency spectrum;
and inputting the frequency spectrum into a preset GRU neural network for back-propagation training until the loss converges, to obtain a trained audio classification model.
As an improvement of the above scheme, the calculating, according to a preset Gaussian probability model, of the audio data in a pre-collected audio sample set to obtain an audio training set specifically includes:
calculating the audio features of each frame of audio data in the audio sample set;
fitting each audio feature according to the Gaussian probability model to obtain a corresponding amplitude spectrum of the background noise;
and subtracting the amplitude spectrum of the corresponding background noise from the amplitude spectrum of each frame of audio data in the audio sample set to obtain a first audio amplitude spectrum, and storing the first audio amplitude spectrum in the audio training set.
As an improvement of the above, the audio features include: fundamental frequency and short-time energy;
correspondingly, the fitting of each audio feature according to the Gaussian probability model to obtain a corresponding amplitude spectrum of the background noise specifically includes:
respectively fitting a fundamental frequency and short-time energy according to the Gaussian probability model to respectively obtain a first Gaussian distribution curve corresponding to the fundamental frequency and a second Gaussian distribution curve corresponding to the short-time energy;
obtaining a range of fundamental frequency through the first Gaussian distribution curve, and obtaining a range of short-time energy through the second Gaussian distribution curve;
and fitting a corresponding amplitude spectrum of the background noise according to the range of the fundamental frequency and the range of the short-time energy.
As an improvement of the above scheme, before the calculating of the audio data in the pre-collected audio sample set according to the preset Gaussian probability model to obtain the audio training set, the method further includes:
and preprocessing the acquired original audio data to obtain the audio sample set.
As an improvement of the above scheme, after the audio data in the pre-collected audio sample set is calculated according to the preset Gaussian probability model to obtain the audio training set, and before the audio data in the audio training set is Fourier transformed to obtain the corresponding spectrum, the method further includes:
and sequentially performing framing processing, windowing processing and overlapping processing on the audio data in the audio training set.
Another embodiment of the present invention provides an audio classification method, including:
acquiring audio data to be processed, and calculating a magnitude spectrum corresponding to the audio data to be processed;
inputting the audio data to be processed into the trained audio classification model, and calculating the corresponding audio classification result; wherein the trained audio classification model is obtained by: calculating the audio data in a pre-collected audio sample set according to a preset Gaussian probability model to obtain an audio training set; and inputting the audio training set into a preset GRU neural network for back-propagation training until the loss converges, to obtain the trained audio classification model.
As an improvement of the above scheme, the inputting the audio data to be processed into the trained audio classification model, and calculating to obtain a corresponding audio classification result specifically includes:
calculating the audio features of the audio data to be processed;
fitting each audio feature according to the Gaussian probability model to obtain a corresponding amplitude spectrum of the background noise;
subtracting the amplitude spectrum of the background noise from the amplitude spectrum corresponding to the audio data to be processed to obtain a first audio amplitude spectrum;
and inputting the first audio magnitude spectrum into a preset GRU neural network, and calculating to obtain a corresponding audio classification result.
As an improvement of the above scheme, after the audio data to be processed is input into the trained audio classification model and a corresponding audio classification result is obtained by calculation, the method further includes:
and responding to the received prompt instruction of audio classification, and sending out corresponding prompt information according to the audio classification result.
Another embodiment of the present invention provides an apparatus for training an audio classification model, including:
the denoising module is used for calculating audio data in a pre-collected audio sample set according to a preset Gaussian probability model to obtain an audio training set; wherein the audio training set comprises the denoised audio data;
the computing module is used for carrying out Fourier transform on the audio data in the audio training set to obtain a corresponding frequency spectrum;
and the training module is used for inputting the frequency spectrum into a preset GRU neural network for backward propagation training until loss convergence so as to obtain a trained audio classification model.
Another embodiment of the present invention provides an audio classification apparatus, including:
the acquisition module is used for acquiring audio data to be processed and calculating a magnitude spectrum corresponding to the audio data to be processed;
the classification module is used for inputting the audio data to be processed into the trained audio classification model and calculating the corresponding audio classification result; wherein the trained audio classification model is obtained by: calculating the audio data in a pre-collected audio sample set according to a preset Gaussian probability model to obtain an audio training set; and inputting the audio training set into a preset GRU neural network for back-propagation training until the loss converges, to obtain the trained audio classification model.
Compared with the prior art, in the training method of the audio classification model and the audio classification method and device disclosed by the embodiments of the invention, the audio data in the pre-collected audio sample set is processed according to the preset Gaussian probability model, so that the audio data is denoised and an audio training set is obtained. The audio data in the training set is then Fourier transformed to obtain the corresponding frequency spectrum, which is input into the preset GRU neural network for back-propagation training until the loss converges, yielding the trained audio classification model. Denoising the pre-collected audio data reduces the influence of environmental noise, so the training data is more accurate, the classification results of the model are more accurate, and the type of the audio can be better identified.
Drawings
FIG. 1 is a schematic flowchart of a method for training an audio classification model according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a GRU neural network according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of step S10 in the method for training an audio classification model according to an embodiment of the present invention;
FIG. 4 is a schematic flowchart of an audio classification method according to an embodiment of the present invention;
FIG. 5 is a schematic flowchart of step S20' in the audio classification method according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an apparatus for training an audio classification model according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an audio classification apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a method for training an audio classification model according to an embodiment of the present invention.
An embodiment of the present invention provides a method for training an audio classification model, including:
s10, calculating audio data in a pre-collected audio sample set according to a preset Gaussian probability model to obtain an audio training set; wherein the training set of audio comprises: and denoising the audio data.
It should be noted that a real environment contains many noise sources; the combined noise is a mixture from many different origins, and large-scale statistics show that its distribution approaches a Gaussian. A Gaussian probability model is therefore used for denoising.
In this embodiment, if M minutes of audio are captured in the current environment, the M minutes of sound are divided into N-second segments, with adjacent segments overlapping by N/4 seconds, to form the pre-collected audio sample set. It will be appreciated that the captured raw audio is processed in this manner.
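The segmentation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function name and parameterization are assumptions.

```python
import numpy as np

def segment_audio(samples, sr, seg_seconds):
    """Split a long recording into N-second segments whose adjacent
    segments overlap by a quarter of the segment length."""
    seg_len = int(seg_seconds * sr)
    hop = seg_len - seg_len // 4          # step so segments overlap by N/4
    return [samples[i:i + seg_len]
            for i in range(0, len(samples) - seg_len + 1, hop)]

# Example: 100 samples at 10 Hz, 2-second segments (20 samples, hop 15).
segs = segment_audio(np.zeros(100), sr=10, seg_seconds=2)
```

With these toy numbers the 100-sample recording yields six overlapping 20-sample segments.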
Specifically, the audio features of the audio data are fitted with the Gaussian probability model to obtain the Gaussian distribution of the background noise, from which the denoised audio data is obtained.
And S20, performing Fourier transform on the audio data in the audio training set to obtain a corresponding frequency spectrum.
In this embodiment, a fast Fourier transform (implemented with, for example, scipy or numpy) is applied to obtain the frequency spectrum corresponding to the audio data, which can then be input into the GRU neural network for training, yielding an audio classification model that classifies audio data more effectively.
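As a minimal sketch of this step with numpy (the sample rate, frame length and test tone are illustrative assumptions, not values from the patent):

```python
import numpy as np

sr = 8000                                  # assumed sample rate (Hz)
n = int(0.020 * sr)                        # one 20 ms frame = 160 samples
t = np.arange(n) / sr
frame = np.sin(2 * np.pi * 440 * t)        # 440 Hz test tone

spectrum = np.fft.rfft(frame)              # complex half-spectrum via FFT
mag = np.abs(spectrum)                     # magnitude spectrum
peak_hz = np.argmax(mag) * sr / n          # frequency of the strongest bin
```

The strongest bin lands within one bin width (sr/n = 50 Hz) of the 440 Hz tone; `scipy.fft.rfft` behaves the same way.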
And S30, inputting the frequency spectrum into a preset GRU neural network for backward propagation training until loss convergence to obtain a trained audio classification model.
It should be noted that, referring to fig. 2, the GRU neural network includes two gates: an update gate and a reset gate (zt and rt in the figure, respectively). The update gate controls the extent to which the state information from the previous time step is carried into the current state; a larger update gate value means more of the previous state is carried in. The reset gate controls how much of the previous state is written into the current candidate state h̃t; the smaller the reset gate, the less previous-state information is written.
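One GRU time step can be written out in numpy as below. This is a generic GRU cell, not code from the patent; the gate convention follows the description above (larger update gate keeps more of the previous state), and note that some references use the opposite sign convention for z. All parameter names and sizes are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    """One GRU time step. `params` holds weight matrices and biases for
    the update gate (z), reset gate (r) and candidate state, in order."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z = sigmoid(Wz @ x + Uz @ h_prev + bz)             # update gate z_t
    r = sigmoid(Wr @ x + Ur @ h_prev + br)             # reset gate r_t
    h_cand = np.tanh(Wh @ x + Uh @ (r * h_prev) + bh)  # candidate state
    # Larger z keeps more previous state, matching the text above.
    return z * h_prev + (1.0 - z) * h_cand

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
params = tuple(rng.normal(size=s) for s in
               [(n_hid, n_in), (n_hid, n_hid), (n_hid,)] * 3)
h = gru_step(rng.normal(size=n_in), np.zeros(n_hid), params)
```

Starting from a zero hidden state, the output is bounded by the tanh of the candidate state.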
Specifically, the frequency spectrum is input into the preset GRU neural network and trained by back-propagation: the partial derivatives of all parameters are calculated, the parameter matrices are updated, and this is iterated until the loss converges. In this embodiment, the model is trained by minimizing the loss function and is then stored.
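The iterate-until-the-loss-converges loop has the following shape. A softmax classifier over magnitude spectra stands in for the GRU here (the GRU forward/backward pass is omitted for brevity); the data, sizes and learning rate are all illustrative assumptions, but the structure — forward pass, loss, gradients, parameter update, convergence check — is the one described above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 16))              # 64 toy "spectra", 16 bins each
y = (X[:, 0] > 0).astype(int)              # toy 2-class labels
W = np.zeros((16, 2))                      # parameter matrix

prev_loss = np.inf
for step in range(2000):
    logits = X @ W                                        # forward pass
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)                     # softmax
    loss = -np.log(p[np.arange(len(y)), y]).mean()        # cross-entropy
    if abs(prev_loss - loss) < 1e-8:                      # loss converged
        break
    prev_loss = loss
    grad = p.copy()
    grad[np.arange(len(y)), y] -= 1                       # d(loss)/d(logits)
    W -= 0.1 * (X.T @ grad) / len(y)                      # parameter update
```

After convergence the classifier separates the toy classes almost perfectly.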
In summary, the audio data in the pre-collected audio sample set is processed according to the preset Gaussian probability model, denoising it and yielding the audio training set. The audio data in the training set is then Fourier transformed to obtain the corresponding frequency spectrum, which is input into the preset GRU neural network for back-propagation training until the loss converges, producing the trained audio classification model. Denoising the pre-collected audio data reduces the influence of environmental noise, so the training data is more accurate, the classification results of the model are more accurate, and the type of the audio can be better identified. Moreover, the trained audio classification model has good robustness and generalization, a low dependence on the audio data set, and can achieve good classification results even on a smaller-scale data set.
As an improvement of the above scheme, step S10, calculating the audio data in the pre-collected audio sample set according to the preset Gaussian probability model to obtain the audio training set, specifically includes:
and S100, calculating the audio characteristics of the audio data in the audio sample set of each frame.
Referring to fig. 3, the audio features include fundamental frequency and short-time energy, and may further include short-time power, short-time zero-crossing rate, and the like.
And S101, respectively fitting each audio characteristic according to the Gaussian probability model to obtain a corresponding amplitude spectrum of the background noise.
Specifically, after the audio features are calculated, each audio feature is fitted to obtain a corresponding Gaussian distribution, from which the corresponding amplitude spectrum of the background noise is obtained.
And S102, subtracting the amplitude spectrum of the corresponding background noise from the amplitude spectrum of each frame of audio data in the audio sample set to obtain a first audio amplitude spectrum, and storing the first audio amplitude spectrum in the audio training set.
Specifically, the audio data is Fourier transformed into a frequency spectrum, which comprises a phase spectrum and a magnitude spectrum. The original magnitude spectrum and the magnitude spectrum of the background noise are subtracted to obtain the first audio magnitude spectrum, i.e. the denoised audio data, thereby avoiding noise interference in the training-set audio data.
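The magnitude subtraction itself is a one-liner; clipping at zero is an added safeguard (not stated in the patent) so magnitudes stay non-negative:

```python
import numpy as np

def denoise_magnitude(frame_mag, noise_mag):
    """First audio magnitude spectrum: the frame's magnitude spectrum
    minus the fitted background-noise magnitude, clipped at zero."""
    return np.maximum(frame_mag - noise_mag, 0.0)

# Bins where the noise estimate exceeds the frame magnitude go to zero.
out = denoise_magnitude(np.array([3.0, 1.0, 2.0]), np.array([1.0, 2.0, 0.5]))
```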
In this embodiment, the audio features include: fundamental frequency and short-time energy.
Correspondingly, step S101, fitting each audio feature according to the Gaussian probability model to obtain the corresponding amplitude spectrum of the background noise, specifically includes:
and S1010, respectively fitting a fundamental frequency and short-time energy according to the Gaussian probability model to respectively obtain a first Gaussian distribution curve corresponding to the fundamental frequency and a second Gaussian distribution curve corresponding to the short-time energy.
And S1011, obtaining a range of fundamental frequency through the first Gaussian distribution curve, and obtaining a range of short-time energy through the second Gaussian distribution curve.
And S1012, fitting a corresponding amplitude spectrum of the background noise according to the range of the fundamental frequency and the range of the short-time energy.
In this embodiment, the fundamental frequency and the short-time energy of each frame of audio are calculated, and are stored in the data queue of the fundamental frequency feature and the data queue of the short-time energy feature, respectively, and the gaussian probability model is used to fit the fundamental frequency and the short-time energy, respectively, so as to obtain the average value of the fundamental frequency, the range of the fundamental frequency, the average value of the short-time energy, and the range of the short-time energy. And fitting a background noise amplitude spectrum according to the fundamental frequency range and the short-time energy range.
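The two per-frame features and the Gaussian fit can be sketched as follows. The autocorrelation pitch estimator is an illustrative stand-in (the patent does not name its F0 method), and fitting a single Gaussian to a feature queue reduces to estimating its sample mean and standard deviation:

```python
import numpy as np

def short_time_energy(frame):
    # Short-time energy: sum of squared samples in one frame.
    return float(np.sum(np.asarray(frame, dtype=np.float64) ** 2))

def fundamental_frequency(frame, sr):
    # Crude autocorrelation pitch estimate searching 50-500 Hz.
    # (Assumed estimator; the patent does not specify one.)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min, lag_max = sr // 500, sr // 50
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
    return sr / lag

def gaussian_fit(feature_queue):
    # Mean and standard deviation define the fitted Gaussian curve;
    # the feature "range" can then be taken as mean +/- a few sigma.
    v = np.asarray(feature_queue, dtype=np.float64)
    return v.mean(), v.std()
```

On a pure 100 Hz tone the autocorrelation peak sits at one period, recovering the fundamental frequency exactly.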
As an improvement of the above scheme, before the calculating of the audio data in the pre-collected audio sample set according to the preset Gaussian probability model to obtain the audio training set, the method further includes:
and S9, preprocessing the acquired original audio data to obtain the audio sample set.
Specifically, the original audio is divided into frames and windowed, and an overlap of a preset duration is kept between adjacent audio segments to prevent spectral-energy leakage and preserve spectral continuity. It is understood that the preset duration can be set according to the user's needs and is not limited here.
In this embodiment, each signal segment is divided into 20 ms frames, with a 5 ms overlap between adjacent frames.
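The 20 ms framing with 5 ms overlap can be sketched as below. The Hamming window is an assumed choice; the patent says windowing is applied but does not name the window:

```python
import numpy as np

def frame_audio(samples, sr, frame_ms=20, overlap_ms=5):
    """Split audio into 20 ms frames with a 5 ms overlap between
    adjacent frames, applying a window to each frame."""
    frame_len = int(sr * frame_ms / 1000)
    hop = frame_len - int(sr * overlap_ms / 1000)     # advance per frame
    window = np.hamming(frame_len)                    # assumed window
    frames = [samples[i:i + frame_len] * window
              for i in range(0, len(samples) - frame_len + 1, hop)]
    return np.array(frames)

# Example: 100 samples at 1 kHz -> 20-sample frames, 15-sample hop.
frames = frame_audio(np.ones(100), sr=1000)
```

With these numbers the signal yields six overlapping, windowed frames.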
As an improvement of the above scheme, after the audio data in the pre-collected audio sample set is calculated according to the preset Gaussian probability model to obtain the audio training set, and before the audio data in the audio training set is Fourier transformed to obtain the corresponding spectrum, the method further includes:
and sequentially performing framing processing, windowing processing and overlapping processing on the audio data in the audio training set.
In this embodiment, the audio data in the training set is divided into several classes, each containing 40-60 five-second audio clips, which are then framed, windowed and overlapped in turn.
Fig. 4 is a flowchart illustrating a method for classifying audio according to an embodiment of the present invention.
An embodiment of the present invention provides an audio classification method, including:
s10', obtaining the audio data to be processed, and calculating the corresponding amplitude spectrum of the audio data to be processed.
Specifically, a Fourier transform is performed on the audio data to be processed to obtain the corresponding frequency spectrum, from which the magnitude spectrum is obtained.
S20', inputting the audio data to be processed into the trained audio classification model, and calculating the corresponding audio classification result; wherein the trained audio classification model is obtained by: calculating the audio data in a pre-collected audio sample set according to a preset Gaussian probability model to obtain an audio training set; and inputting the audio training set into a preset GRU neural network for back-propagation training until the loss converges, to obtain the trained audio classification model.
The audio class may be an infant cry, a human-computer interaction sound, or the like. The audio classification method may be integrated in the processor of an electronic device, or connected to the electronic device as an external chip. The processor is connected, in a wired or wireless manner, to a sound collector such as a microphone that captures environmental sounds and uploads them to the processor, which processes the audio data and thereby classifies it.
The electronic device may be a desktop computer, a notebook, a palm computer, a mobile phone, a cloud server, or other computing devices. The electronic device may include, but is not limited to, a processor, a memory. For example, the electronic device may also include input output devices, network access devices, buses, and the like.
The embodiment of the invention discloses an audio classification method in which the trained audio classification model denoises and classifies the audio data to be processed, reducing the influence of environmental noise on the audio data.
Referring to fig. 5, as an improvement of the foregoing scheme, step S20', inputting the audio data to be processed into the trained audio classification model and calculating the corresponding audio classification result, specifically includes:
s200', calculating the audio characteristics of the audio data to be processed. Wherein the audio features include: fundamental frequency and short-term energy.
And S201', fitting each audio feature according to the Gaussian probability model to obtain a corresponding amplitude spectrum of the background noise.
S202', subtracting the amplitude spectrum of the background noise from the amplitude spectrum corresponding to the audio data to be processed to obtain a first audio amplitude spectrum.
S203', inputting the first audio magnitude spectrum into a preset GRU neural network, and calculating to obtain a corresponding audio classification result.
In this embodiment, the amplitude spectrum of the background noise is fitted from the fundamental-frequency range and the short-time-energy range, and is then subtracted from the amplitude spectrum corresponding to the audio data to be processed to obtain the denoised amplitudes, reducing the interference of noise with the classification result.
As an improvement of the above scheme, after the audio data to be processed is input into the trained audio classification model and a corresponding audio classification result is obtained by calculation, the method further includes:
s30', in response to receiving the prompt instruction of audio classification, sending out corresponding prompt information according to the audio classification result.
In this embodiment, the audio classification results are labeled in advance; that is, different classification results may be mapped to different prompts, or only labeled classification results trigger a prompt. For example, if the audio is classified as an infant's cry, a prompt instruction is issued: an alarm can be sounded through a buzzer, and the user can be notified by SMS, email or an APP. The user may also assign alarm levels to classification results as needed; for example, baby crying and an elderly person falling may be assigned to a first level and everything else to a second level, with different prompts for different levels. This is not limited here.
Fig. 6 is a schematic structural diagram of an apparatus for training an audio classification model according to an embodiment of the present invention.
An embodiment of the present invention provides a training apparatus for an audio classification model, including:
the denoising module 10 is configured to calculate audio data in a pre-collected audio sample set according to a preset gaussian probability model to obtain an audio training set; wherein the training set of audio comprises: and denoising the audio data.
And the calculating module 20 is configured to perform fourier transform on the audio data in the training set of the audio to obtain a corresponding frequency spectrum.
And the training module 30 is configured to input the frequency spectrum into a preset GRU neural network for back propagation training until loss convergence, so as to obtain a trained audio classification model.
The embodiment of the invention provides a training apparatus for an audio classification model. The audio data in the pre-collected audio sample set is processed according to the preset Gaussian probability model, denoising it and yielding the audio training set. The audio data in the training set is then Fourier transformed to obtain the corresponding frequency spectrum, which is input into the preset GRU neural network for back-propagation training until the loss converges, producing the trained audio classification model. Denoising the pre-collected audio data reduces the influence of environmental noise, so the training data is more accurate, the classification results of the model are more accurate, and the type of the audio can be better identified. Moreover, the trained audio classification model has good robustness and generalization, a low dependence on the audio data set, and can achieve good classification results even on a smaller-scale data set.
As an improvement of the foregoing solution, the denoising module 10 specifically includes:
and the audio feature calculating module is used for calculating audio features of each frame of audio data in the audio sample set.
And the first fitting module is used for fitting each audio feature according to the Gaussian probability model so as to obtain a corresponding magnitude spectrum of the background noise.
And the first processing module is used for subtracting the corresponding background-noise magnitude spectrum from the magnitude spectrum of each frame of audio data in the audio sample set to obtain a first audio magnitude spectrum, and storing the first audio magnitude spectrum in the audio training set.
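The denoising performed by these modules amounts to spectral subtraction: estimate a background-noise magnitude spectrum and subtract it from each frame. A minimal NumPy sketch, assuming that frames whose short-time energy falls below the mean of the fitted energy distribution are treated as background noise (the threshold choice and this implementation are illustrative, not the patented fitting procedure):

```python
import numpy as np

def estimate_noise_spectrum(frame_mags):
    """Estimate a background-noise magnitude spectrum from a set of frames.

    Fits a Gaussian to the per-frame short-time energy via its sample mean,
    labels frames at or below the mean as background noise (an assumed
    threshold), and averages their magnitude spectra.
    frame_mags: array of shape (n_frames, n_bins).
    """
    energy = (frame_mags ** 2).sum(axis=1)      # short-time energy per frame
    noise_frames = frame_mags[energy <= energy.mean()]
    return noise_frames.mean(axis=0)

def spectral_subtract(frame_mags, noise_mag):
    """Subtract the noise magnitude spectrum from each frame, floored at zero
    so no negative magnitudes remain."""
    return np.maximum(frame_mags - noise_mag, 0.0)
```

On a toy set of frames where a minority of frames carry extra signal energy, the low-energy frames are driven to (near) zero while the signal frames keep their excess magnitude.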
As an improvement of the foregoing solution, the apparatus further comprises:
and the preprocessing module is used for preprocessing the acquired original audio data to obtain the audio sample set.
Fig. 7 is a schematic structural diagram of an audio classification apparatus according to an embodiment of the present invention.
An embodiment of the present invention provides an audio classification apparatus, including:
the acquisition module 10' is used for acquiring audio data to be processed and calculating a magnitude spectrum corresponding to the audio data to be processed.
The classification module 20' is configured to input the audio data to be processed into the trained audio classification model and calculate a corresponding audio classification result; wherein the trained audio classification model is obtained by: calculating audio data in a pre-collected audio sample set according to a preset Gaussian probability model to obtain an audio training set; and inputting the audio training set into a preset GRU neural network for backpropagation training until the loss converges, so as to obtain the trained audio classification model.
The embodiment of the invention discloses an audio classification apparatus which denoises and classifies the audio data to be processed through the trained audio classification model, thereby reducing the influence of environmental noise on the audio data.
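The magnitude spectrum the acquisition module computes can be sketched as framing the signal with overlap, windowing each frame, and taking the magnitude of its Fourier transform (the frame length, hop size, and Hamming window below are assumed parameter choices, not values given in the patent):

```python
import numpy as np

def magnitude_spectrogram(x, frame_len=256, hop=128):
    """Frame a 1-D signal with 50% overlap, apply a Hamming window to each
    frame, and return the per-frame FFT magnitude spectrum.

    Returns an array of shape (n_frames, frame_len // 2 + 1).
    """
    n_frames = 1 + (len(x) - frame_len) // hop
    win = np.hamming(frame_len)
    frames = np.stack([x[i * hop : i * hop + frame_len] * win
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # one-sided magnitude spectrum
```

For a pure sine at 1 kHz sampled at 8 kHz, the energy concentrates in bin `f * frame_len / fs` (bin 32 with the defaults above), which is a quick sanity check of the framing and transform.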
If the integrated modules/units of the audio classification apparatus are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments may be implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like.
It should be noted that the above-described apparatus embodiments are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the present invention, the connection relationships between modules indicate communication connections between them, which may be specifically implemented as one or more communication buses or signal lines. A person of ordinary skill in the art can understand and implement the embodiments without inventive effort.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (10)

1. A method for training an audio classification model, comprising:
calculating audio data in a pre-collected audio sample set according to a preset Gaussian probability model to obtain an audio training set; wherein the audio training set comprises: denoised audio data;
performing a Fourier transform on the audio data in the audio training set to obtain a corresponding frequency spectrum;
and inputting the frequency spectrum into a preset GRU neural network for backpropagation training until the loss converges, so as to obtain a trained audio classification model.
2. The method for training an audio classification model according to claim 1, wherein the calculating audio data in a pre-collected audio sample set according to a preset Gaussian probability model to obtain an audio training set specifically comprises:
calculating audio features of each frame of audio data in the audio sample set;
fitting each audio feature according to the Gaussian probability model to obtain a corresponding magnitude spectrum of the background noise;
and subtracting the corresponding background-noise magnitude spectrum from the magnitude spectrum of each frame of audio data in the audio sample set to obtain a first audio magnitude spectrum, and storing the first audio magnitude spectrum in the audio training set.
3. The method for training an audio classification model according to claim 2, wherein
the audio features comprise: a fundamental frequency and a short-time energy;
correspondingly, the fitting each audio feature according to the Gaussian probability model to obtain a corresponding magnitude spectrum of the background noise specifically comprises:
fitting the fundamental frequency and the short-time energy respectively according to the Gaussian probability model to obtain a first Gaussian distribution curve corresponding to the fundamental frequency and a second Gaussian distribution curve corresponding to the short-time energy;
obtaining a range of the fundamental frequency from the first Gaussian distribution curve, and obtaining a range of the short-time energy from the second Gaussian distribution curve;
and fitting a corresponding magnitude spectrum of the background noise according to the range of the fundamental frequency and the range of the short-time energy.
4. The method for training an audio classification model according to claim 1, wherein before the calculating audio data in a pre-collected audio sample set according to a preset Gaussian probability model to obtain an audio training set, the method further comprises:
preprocessing the acquired original audio data to obtain the audio sample set.
5. The method for training an audio classification model according to claim 1, wherein after the calculating audio data in the pre-collected audio sample set according to the preset Gaussian probability model to obtain the audio training set, and before the performing a Fourier transform on the audio data in the audio training set to obtain a corresponding frequency spectrum, the method further comprises:
sequentially performing framing, windowing, and overlapping processing on the audio data in the audio training set.
6. An audio classification method, comprising:
acquiring audio data to be processed, and calculating a magnitude spectrum corresponding to the audio data to be processed;
inputting the audio data to be processed into the trained audio classification model, and calculating a corresponding audio classification result; wherein the trained audio classification model is obtained by: calculating audio data in a pre-collected audio sample set according to a preset Gaussian probability model to obtain an audio training set; and inputting the audio training set into a preset GRU neural network for backpropagation training until the loss converges, so as to obtain the trained audio classification model.
7. The audio classification method according to claim 6, wherein the inputting the audio data to be processed into the trained audio classification model and calculating a corresponding audio classification result specifically comprises:
calculating audio features of the audio data to be processed;
fitting each audio feature according to the Gaussian probability model to obtain a corresponding magnitude spectrum of the background noise;
subtracting the magnitude spectrum of the background noise from the magnitude spectrum corresponding to the audio data to be processed to obtain a first audio magnitude spectrum;
and inputting the first audio magnitude spectrum into the preset GRU neural network, and calculating a corresponding audio classification result.
8. The audio classification method according to claim 6, wherein after the inputting the audio data to be processed into the trained audio classification model and calculating the corresponding audio classification result, the method further comprises:
in response to a received audio classification prompt instruction, sending out corresponding prompt information according to the audio classification result.
9. An apparatus for training an audio classification model, comprising:
a denoising module, configured to calculate audio data in a pre-collected audio sample set according to a preset Gaussian probability model to obtain an audio training set; wherein the audio training set comprises: denoised audio data;
a calculating module, configured to perform a Fourier transform on the audio data in the audio training set to obtain a corresponding frequency spectrum;
and a training module, configured to input the frequency spectrum into a preset GRU neural network for backpropagation training until the loss converges, so as to obtain a trained audio classification model.
10. An audio classification apparatus, comprising:
an acquisition module, configured to acquire audio data to be processed and calculate a magnitude spectrum corresponding to the audio data to be processed;
and a classification module, configured to input the audio data to be processed into the trained audio classification model and calculate a corresponding audio classification result; wherein the trained audio classification model is obtained by: calculating audio data in a pre-collected audio sample set according to a preset Gaussian probability model to obtain an audio training set; and inputting the audio training set into a preset GRU neural network for backpropagation training until the loss converges, so as to obtain the trained audio classification model.
CN202010673260.0A 2020-07-14 2020-07-14 Training method of audio classification model, and audio classification method and device Withdrawn CN111916107A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010673260.0A CN111916107A (en) 2020-07-14 2020-07-14 Training method of audio classification model, and audio classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010673260.0A CN111916107A (en) 2020-07-14 2020-07-14 Training method of audio classification model, and audio classification method and device

Publications (1)

Publication Number Publication Date
CN111916107A true CN111916107A (en) 2020-11-10

Family

ID=73280036

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010673260.0A Withdrawn CN111916107A (en) 2020-07-14 2020-07-14 Training method of audio classification model, and audio classification method and device

Country Status (1)

Country Link
CN (1) CN111916107A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112750467A (en) * 2021-01-19 2021-05-04 珠海格力电器股份有限公司 Infant cry analysis method, model training method, device and electronic equipment

Similar Documents

Publication Publication Date Title
CN110288978B (en) Speech recognition model training method and device
JP6393730B2 (en) Voice identification method and apparatus
JP6099556B2 (en) Voice identification method and apparatus
EP3806089A1 (en) Mixed speech recognition method and apparatus, and computer readable storage medium
CN110047512B (en) Environmental sound classification method, system and related device
CN111862951B (en) Voice endpoint detection method and device, storage medium and electronic equipment
CN111354371B (en) Method, device, terminal and storage medium for predicting running state of vehicle
CN112786029B (en) Method and apparatus for training VAD using weakly supervised data
CN110600059A (en) Acoustic event detection method and device, electronic equipment and storage medium
CN110491373A (en) Model training method, device, storage medium and electronic equipment
CN113571078B (en) Noise suppression method, device, medium and electronic equipment
CN112382302A (en) Baby cry identification method and terminal equipment
CN110580897B (en) Audio verification method and device, storage medium and electronic equipment
CN112420049A (en) Data processing method, device and storage medium
CN111916107A (en) Training method of audio classification model, and audio classification method and device
CN113889091A (en) Voice recognition method and device, computer readable storage medium and electronic equipment
CN112185425A (en) Audio signal processing method, device, equipment and storage medium
CN113539243A (en) Training method of voice classification model, voice classification method and related device
CN113077812A (en) Speech signal generation model training method, echo cancellation method, device and equipment
CN111862991A (en) Method and system for identifying baby crying
CN113488077B (en) Method and device for detecting baby crying in real scene and readable medium
CN113744734A (en) Voice wake-up method and device, electronic equipment and storage medium
US20220093089A1 (en) Model constructing method for audio recognition
CN114302301A (en) Frequency response correction method and related product
US11996091B2 (en) Mixed speech recognition method and apparatus, and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20201110