CN111105812A - Audio feature extraction method and device, training method and electronic equipment - Google Patents
Audio feature extraction method and device, training method and electronic equipment
- Publication number
- CN111105812A CN111105812A CN201911409010.XA CN201911409010A CN111105812A CN 111105812 A CN111105812 A CN 111105812A CN 201911409010 A CN201911409010 A CN 201911409010A CN 111105812 A CN111105812 A CN 111105812A
- Authority
- CN
- China
- Prior art keywords
- audio
- spectrum
- fitting
- frequency
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
Abstract
The invention discloses an audio feature extraction method and device, a training method, and an electronic device. The method comprises the following steps: acquiring audio to be extracted according to a preset window length, and dividing the audio into M audio frames according to a preset frame length; calculating the frequency spectrum corresponding to each audio frame; obtaining a fitting slope and a fitting intercept for each spectrum by applying a linear fitting algorithm to its N frequency-domain points; calculating the spectral flatness of each spectrum according to a preset calculation formula; dividing each spectrum into m spectral bands, calculating the logarithmic spectrum of each band, and from these calculating the spectral contrast of each spectrum; obtaining the feature quantity of each audio frame from its fitting slope, fitting intercept, spectral flatness, and spectral contrast; and extracting the audio features of the audio from the feature quantities. When the extracted audio features are used in detection scenarios such as infant-cry detection, they help improve the accuracy of audio detection.
Description
Technical Field
The invention relates to the technical field of audio processing, and in particular to an audio feature extraction method and device, a training method, and an electronic device.
Background
As labor costs rise, the cost in money or time of caring for an infant has grown, and nursing and home-security products such as infant monitors that can recognize an infant's cry are increasingly popular with parents. When such a product detects an infant crying, it automatically alerts a caregiver or parent so that the infant can be attended to in time.
In the prior art, infant crying is usually detected from the energy characteristics of the captured audio: when the energy characteristics of the detected audio match those of an infant cry, the audio is judged to contain crying and a caregiver or parent is alerted.
However, the detection environment is often complex. The audio may contain other sounds from the surroundings, that is, environmental noise that is not an infant cry. When the energy characteristics of such noise resemble those of an infant cry, detection based on audio energy alone may mistake the noise for crying, so the detection accuracy is low.
Disclosure of Invention
The technical problem addressed by the embodiments of the present invention is to provide an audio feature extraction method and device, a training method, and an electronic device that extract audio features from the fitting slope, fitting intercept, spectral contrast, and spectral flatness of the audio. When used in audio detection scenarios such as infant-cry detection, these features support finer, multi-dimensional detection and help improve detection accuracy.
In order to solve the above technical problem, in a first aspect, the present invention provides an audio feature extraction method, including:
acquiring audio to be extracted according to a preset window length, and dividing the audio to be extracted into M audio frames according to a preset frame length, wherein M is greater than 1;
calculating a frequency spectrum corresponding to each audio frame; wherein the frequency spectrum comprises N frequency domain points, N > 1;
obtaining a fitting slope and a fitting intercept corresponding to each frequency spectrum based on a linear fitting algorithm according to the N frequency domain points of each frequency spectrum;
calculating to obtain the spectrum flatness of each spectrum according to the spectrums and a preset calculation formula;
dividing each frequency spectrum into m sections of frequency spectrum bands, and calculating to obtain a logarithmic frequency spectrum corresponding to each section of frequency spectrum band; m is greater than 1;
obtaining the spectrum contrast of each spectrum according to the m sections of log spectrums corresponding to each spectrum;
obtaining the characteristic quantity of each audio frame according to the fitting slope, the fitting intercept, the spectral flatness and the spectral contrast of each audio frame;
and extracting the audio features of the audio to be extracted according to the feature quantity of the M frames of the audio frames.
Further, the linear fitting algorithm is a linear least square algorithm, and the obtaining of the fitting slope and the fitting intercept corresponding to each frequency spectrum based on the linear fitting algorithm according to the N frequency domain points of each frequency spectrum specifically includes:
selecting frequency domain points with corresponding frequencies within a preset frequency range from the N frequency domain points of each frequency spectrum;
and performing linear fitting on the selected frequency domain points with the corresponding frequencies within a preset frequency range based on a linear least square algorithm to obtain a fitting slope and a fitting intercept corresponding to each frequency spectrum.
Further, the dividing each spectrum into m sections of spectrum bands, and calculating to obtain a logarithmic spectrum corresponding to each section of spectrum band specifically includes:
dividing each frequency spectrum into m sections of frequency spectrum bands, and respectively carrying out K-L conversion processing on each section of frequency spectrum band;
obtaining a logarithmic spectrum corresponding to the spectrum band after each section of the spectrum band is subjected to the K-L conversion processing according to the following formula:
s_i(f'') = 10 × log10(s_i(f'));
wherein s_i(f') is the i-th spectral band after the K-L transformation, s_i(f'') is the logarithmic spectrum of s_i(f'), and 1 ≤ i ≤ m.
Further, the obtaining the spectrum contrast of each spectrum according to the m segments of log spectrums corresponding to each spectrum specifically includes:
for each section of the log spectrum, acquiring a spectrum peak value and a spectrum valley value of the log spectrum, and calculating a peak-valley difference value between the spectrum peak value and the spectrum valley value;
and for each spectrum, calculating the average value of m peak-valley difference values of the corresponding m sections of log spectrum to obtain the spectrum contrast of the spectrum.
Further, the calculation formula is as follows:
Flatness(s(f)) = (∏_{n=1}^{N} x(n))^(1/N) / ((1/N) ∑_{n=1}^{N} x(n))
wherein s(f) is the frequency spectrum; Flatness(s(f)) is the spectral flatness corresponding to the spectrum s(f); N is the number of frequency-domain points included in the spectrum; and x(n) is the amplitude of the n-th frequency-domain point of the spectrum s(f).
Further, the method further comprises:
calculating to obtain a Mel cepstrum coefficient of each audio frame;
then, the obtaining the feature quantity of each audio frame according to the fitting slope, the fitting intercept, the spectral flatness, and the spectral contrast of each audio frame specifically includes:
and obtaining the characteristic quantity of each audio frame according to the fitting slope, the fitting intercept, the spectral flatness, the spectral contrast and the Mel cepstrum coefficient of each audio frame.
In order to solve the above technical problem, in a second aspect, the present invention further provides an audio feature extraction apparatus, including:
the audio frame acquisition module is used for acquiring audio to be extracted according to a preset window length and dividing the audio to be extracted into M audio frames according to a preset frame length, wherein M is greater than 1;
the first calculating module is used for calculating the frequency spectrum corresponding to each audio frame; wherein the frequency spectrum comprises N frequency domain points, N > 1;
the fitting module is used for obtaining a fitting slope and a fitting intercept corresponding to each frequency spectrum based on a linear fitting algorithm according to the N frequency domain points of each frequency spectrum;
the spectrum flatness calculation module is used for calculating and obtaining the spectrum flatness of each spectrum according to the spectrums and a preset calculation formula;
the second calculation module is used for dividing each spectrum into m sections of spectrum bands and calculating to obtain a logarithmic spectrum corresponding to each section of spectrum band; m is greater than 1;
the spectrum contrast calculation module is used for obtaining the spectrum contrast of each spectrum according to the m sections of log spectrums corresponding to each spectrum;
a feature quantity obtaining module, configured to obtain a feature quantity of each audio frame according to the fitting slope, the fitting intercept, the spectral flatness, and the spectral contrast of each audio frame;
and the extraction module is used for extracting the audio features of the audio to be extracted according to the feature quantity of the M frames of the audio frames.
In order to solve the above technical problem, in a third aspect, the present invention further provides a method for training an audio classification model, where the method includes:
constructing an audio classification initial model; the audio classification initial model is used for classifying the audio;
obtaining a plurality of training audios corresponding to each classification result; each training audio is pre-allocated with a classification identifier matched with the corresponding classification result;
taking the training audio as the audio to be extracted, and extracting the audio feature corresponding to each training audio according to the audio feature extraction method of any one of claims 1 to 6;
carrying out standardization processing on the audio features corresponding to each training audio, and constructing a training sample set according to each audio feature after standardization processing and the matched classification identification;
and training the audio classification initial model according to the training sample set to obtain an audio classification model.
Further, the audio features corresponding to each of the training audios are:
X = [A_1, A_2, …, A_M]^T, A_i = [a_i1, a_i2, …, a_iq]
wherein X is the audio feature corresponding to the training audio; A_i is the feature quantity of the i-th audio frame in the training audio, 1 ≤ i ≤ M; and q is the number of elements of the feature quantity, q > 1.
Then, the normalizing the audio features corresponding to each of the training audios specifically includes:
normalizing the audio features corresponding to each training audio according to the following formula:
X' = [A'_1, A'_2, …, A'_M]^T, a'_ik = (a_ik − a_k-mean) / std(a_k)
wherein X' is the audio feature after the normalization processing; A'_i is the feature quantity of the i-th audio frame after normalization; a_k-mean is the mean of the k-th element over the feature quantities of the M audio frames of the training audio, 1 ≤ k ≤ q; and std(a_k) is the standard deviation of the k-th element over the feature quantities of the M audio frames of the training audio.
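The per-element standardization just described can be sketched in pure Python as follows; the function name and the list-of-lists representation of the M × q feature matrix are illustrative, and the scale factor is taken as the standard deviation of each element across the M frames:

```python
# Per-element (z-score) standardization of an M x q feature matrix:
# each element k is shifted by its mean and scaled by its standard
# deviation across the M frames, as in the formula above.
import math

def standardize(features):
    """features: list of M frames, each a list of q feature elements."""
    m = len(features)
    q = len(features[0])
    means = [sum(f[k] for f in features) / m for k in range(q)]
    stds = [math.sqrt(sum((f[k] - means[k]) ** 2 for f in features) / m)
            for k in range(q)]
    return [[(frame[k] - means[k]) / stds[k] if stds[k] else 0.0
             for k in range(q)]
            for frame in features]
```

A constant element (zero standard deviation) is mapped to 0.0 here; the patent does not specify how that case is handled.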
In order to solve the above technical problem, in a fourth aspect, the present invention further provides an electronic device, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the audio feature extraction method according to any one of the first aspects.
The audio feature extraction method and device, training method, and electronic device provided by the invention extract audio features that include the fitting slope, fitting intercept, spectral flatness, and spectral contrast of the audio. Compared with features consisting only of audio energy, these features distinguish different audio through higher-dimensional information. When used for audio detection, noise and the target sound can be better recognized and classified, which helps improve detection accuracy. For example, when used for infant-cry detection, environmental noise and infant crying can be better distinguished, improving the accuracy of cry detection.
Drawings
FIG. 1 is a flow chart of an audio feature extraction method according to a preferred embodiment of the present invention;
fig. 2 is a schematic structural diagram of a preferred embodiment of an audio feature extraction apparatus provided in the present invention;
FIG. 3 is a flowchart illustrating a method for training an audio classification model according to a preferred embodiment of the present invention;
fig. 4 is a schematic structural diagram of a preferred embodiment of an electronic device provided in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
Referring to fig. 1, fig. 1 is a schematic flow chart of a preferred embodiment of an audio feature extraction method according to the present invention; specifically, the method comprises the following steps:
s1, acquiring the audio to be extracted according to the preset window length, and dividing the audio to be extracted into M audio frames according to the preset frame length, wherein M is greater than 1;
s2, calculating the frequency spectrum corresponding to each audio frame; wherein the frequency spectrum comprises N frequency domain points, N > 1;
s3, obtaining a fitting slope and a fitting intercept corresponding to each frequency spectrum based on a linear fitting algorithm according to the N frequency domain points of each frequency spectrum;
s4, calculating and obtaining the spectrum flatness of each spectrum according to the spectrums and a preset calculation formula;
s5, dividing each frequency spectrum into m sections of frequency spectrum bands, and calculating to obtain a logarithmic frequency spectrum corresponding to each section of frequency spectrum band; m is greater than 1;
s6, obtaining the spectrum contrast of each spectrum according to the m sections of log spectrums corresponding to each spectrum;
s7, obtaining the characteristic quantity of each audio frame according to the fitting slope, the fitting intercept, the spectral flatness and the spectral contrast of each audio frame;
and S8, extracting the audio features of the audio to be extracted according to the feature quantity of the M frames of the audio frames.
In a specific implementation, the input audio to be detected is acquired according to a preset window length — for example, each segment of audio to be extracted is cut out with a 5 s window — and divided into M audio frames according to a preset frame length, M > 1; for example, 5 s of audio is divided into at least 200 audio frames with a 25 ms frame length. Preferably, adjacent audio frames overlap, for example by 1/4 of a frame length, in which case the number of audio frames will be greater than 200.
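The framing step above can be sketched as follows; the function name is illustrative, and the 16 kHz sample rate in the usage note is an assumption, since the patent does not state one:

```python
def split_into_frames(samples, frame_len, overlap):
    """Split one window of samples into frames of frame_len samples,
    with `overlap` samples shared between adjacent frames."""
    hop = frame_len - overlap
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames
```

At a 16 kHz sample rate, a 5 s window is 80000 samples and a 25 ms frame is 400 samples; with a 100-sample (1/4-frame) overlap this yields 266 frames, consistent with the "greater than 200" figure above.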
Each audio frame has a number of time-domain points. These are mapped to the frequency domain by a Fourier transform, yielding the frequency spectrum of each frame, which comprises N frequency-domain points. Preferably the spectrum is obtained with the fast Fourier transform; because the spectrum of a real signal obtained this way is symmetric, the N frequency-domain points may be taken from only the first half of the spectrum. That is, if each frame has C time-domain points, then N = C/2.
Performing linear fitting on the N frequency domain points of each frequency spectrum based on a linear fitting algorithm to obtain a fitting slope and a fitting intercept corresponding to each frequency spectrum; and calculating to obtain the spectral flatness of each frequency spectrum according to the frequency spectrums and a preset calculation formula.
Further, dividing each frequency spectrum into m sections of frequency spectrum bands, and calculating to obtain a logarithmic frequency spectrum corresponding to each section of frequency spectrum band; m is greater than 1. And the spectrum contrast of each spectrum is obtained by calculating according to the corresponding m sections of log spectrums.
The spectra of all audio frames are traversed in this way to obtain the fitting slopes, fitting intercepts, spectral flatness, and spectral contrast of all M audio frames of the audio to be extracted. The feature quantity of each audio frame is obtained from these four values, and the audio features of the audio to be extracted are then formed from the feature quantities of the M frames.
The audio feature extraction method provided by the invention extracts audio features that include the fitting slope, fitting intercept, spectral flatness, and spectral contrast of the audio. Compared with features consisting only of audio energy, these features distinguish different audio through higher-dimensional information. When used for audio detection, noise and the target sound can be better recognized and classified, which helps improve detection accuracy. For example, when used for infant-cry detection, environmental noise and infant crying can be better distinguished, improving the accuracy of cry detection.
Preferably, the linear fitting algorithm is a linear least square algorithm, and the obtaining of the fitting slope and the fitting intercept corresponding to each frequency spectrum based on the linear fitting algorithm according to the N frequency domain points of each frequency spectrum specifically includes:
selecting frequency domain points with corresponding frequencies within a preset frequency range from the N frequency domain points of each frequency spectrum;
and performing linear fitting on the selected frequency domain points with the corresponding frequencies within a preset frequency range based on a linear least square algorithm to obtain a fitting slope and a fitting intercept corresponding to each frequency spectrum.
In this embodiment, not all frequency-domain points of the spectrum are fitted; instead, only the points within a preset frequency range are selected for linear fitting. The preset range is set according to the target sound. For example, when the audio features are used to detect infant crying, since the frequency of an infant's cry falls within a certain range, only the signal within that range is fitted — the preset frequency range may be set to 250 Hz to 600 Hz. Linear fitting with a linear least squares algorithm then yields the fitting slope and fitting intercept of each audio frame.
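The band-limited least-squares fit can be sketched as follows; the function name and argument layout are illustrative, and the 250–600 Hz defaults echo the example range given in the text:

```python
def fit_band(freqs, mags, f_lo=250.0, f_hi=600.0):
    """Closed-form least-squares line through the (frequency, magnitude)
    points whose frequency lies in [f_lo, f_hi].
    Returns (slope, intercept)."""
    pts = [(f, m) for f, m in zip(freqs, mags) if f_lo <= f <= f_hi]
    n = len(pts)
    sx = sum(f for f, _ in pts)
    sy = sum(m for _, m in pts)
    sxx = sum(f * f for f, _ in pts)
    sxy = sum(f * m for f, m in pts)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept
```

Points at 100 Hz and 700 Hz below are ignored by the band filter, and only the in-band points determine the line.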
Preferably, the dividing each spectrum into m segments of spectrum bands and obtaining a logarithmic spectrum corresponding to each segment of spectrum band by calculation specifically includes:
dividing each frequency spectrum into m sections of frequency spectrum bands, and respectively carrying out K-L conversion processing on each section of frequency spectrum band;
obtaining a logarithmic spectrum corresponding to the spectrum band after each section of the spectrum band is subjected to the K-L conversion processing according to the following formula:
s_i(f'') = 10 × log10(s_i(f'))    (1)
wherein s_i(f') is the i-th spectral band after the K-L transformation, s_i(f'') is the logarithmic spectrum of s_i(f'), and 1 ≤ i ≤ m.
In this embodiment, the correlation between different spectral bands is removed by the K-L transformation, so that the spectra in different band ranges are mutually uncorrelated. The peak-to-valley difference of each spectral band is then reflected independently in that band's contrast, and the spectral contrast of the whole spectrum is obtained by averaging.
Specifically, assume the obtained spectrum is s(f), s_i(f) is the i-th spectral band in s(f), 1 ≤ i ≤ m, and let u be the mean of the spectrum. The covariance matrix used for the K-L transform is:
C = E[(s(f) − u)(s(f) − u)^T]    (2)
Solve for the eigenvalues λ_i and eigenvectors φ_i of the covariance matrix:
C φ_i = λ_i φ_i
The K-L transform turns the spectrum s(f) into F:
F = Φ^T (s(f) − u)
wherein Φ = [φ_1, φ_2, …, φ_m].
Each spectral band s_i(f) after the K-L transformation is obtained from F; the logarithmic spectrum of each band is then obtained through formula (1).
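The K-L (Karhunen–Loève) transform above can be sketched with NumPy. The patent does not spell out how the expectation E[·] is estimated; averaging across frames is an assumption here, and the function name and matrix layout are illustrative:

```python
import numpy as np

def kl_transform(band_matrix):
    """K-L transform of spectral bands.
    band_matrix: (num_frames, m) array, one m-band spectrum row per frame.
    Returns the decorrelated bands F for each frame."""
    u = band_matrix.mean(axis=0)                     # mean spectrum u
    centered = band_matrix - u                       # s(f) - u
    cov = centered.T @ centered / len(band_matrix)   # C = E[(s-u)(s-u)^T]
    eigvals, phi = np.linalg.eigh(cov)               # C phi_i = lambda_i phi_i
    return centered @ phi                            # rows are F = Phi^T (s-u)
```

By construction the transformed bands are mutually uncorrelated: the covariance matrix of the output is diagonal (up to floating-point error).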
Preferably, the obtaining the spectrum contrast of each spectrum according to the m segments of log spectrums corresponding to each spectrum specifically includes:
for each section of the log spectrum, acquiring a spectrum peak value and a spectrum valley value of the log spectrum, and calculating a peak-valley difference value between the spectrum peak value and the spectrum valley value;
and for each spectrum, calculating the average value of m peak-valley difference values of the corresponding m sections of log spectrum to obtain the spectrum contrast of the spectrum.
In this embodiment, the average of the m peak-to-valley differences of the m log-spectral bands is used as the spectral contrast of the spectrum.
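The peak-to-valley averaging just described is a few lines of Python (the function name is illustrative):

```python
def spectral_contrast(log_bands):
    """Spectral contrast of one spectrum: for each of the m log-spectral
    bands take (peak - valley), then average the m differences."""
    diffs = [max(band) - min(band) for band in log_bands]
    return sum(diffs) / len(diffs)
```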
Preferably, the calculation formula is:
Flatness(s(f)) = (∏_{n=1}^{N} x(n))^(1/N) / ((1/N) ∑_{n=1}^{N} x(n))
wherein s(f) is the frequency spectrum; Flatness(s(f)) is the spectral flatness corresponding to spectrum s(f); N is the number of frequency-domain points included in the spectrum; and x(n) is the amplitude of the n-th frequency-domain point of the spectrum s(f).
The present embodiment calculates the spectral flatness of the obtained spectrum by the above calculation formula.
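A minimal sketch of the flatness calculation, assuming the conventional geometric-mean-over-arithmetic-mean definition of spectral flatness; the geometric mean is computed in the log domain to avoid overflow for long spectra:

```python
import math

def spectral_flatness(mags):
    """Spectral flatness: geometric mean of the N magnitudes divided by
    their arithmetic mean (1.0 for a perfectly flat spectrum, approaching
    0 for a peaky, tone-like one). Assumes all magnitudes are positive."""
    n = len(mags)
    log_gm = sum(math.log(x) for x in mags) / n   # log of geometric mean
    gm = math.exp(log_gm)
    am = sum(mags) / n
    return gm / am
```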
Preferably, the method further comprises:
calculating to obtain a Mel cepstrum coefficient of each audio frame;
then, the obtaining the feature quantity of each audio frame according to the fitting slope, the fitting intercept, the spectral flatness, and the spectral contrast of each audio frame specifically includes:
and obtaining the characteristic quantity of each audio frame according to the fitting slope, the fitting intercept, the spectral flatness, the spectral contrast and the Mel cepstrum coefficient of each audio frame.
In this embodiment, the audio features of the audio further include mel-frequency cepstrum coefficients, which further increases the information dimensionality of the audio features and is further beneficial to improving the accuracy of audio identification and classification.
In summary, the audio feature extraction method provided by the invention proceeds as follows: the audio to be extracted is acquired according to a preset window length and divided into M audio frames according to a preset frame length, M > 1; the spectrum of each frame, comprising N frequency-domain points, is calculated; a fitting slope and fitting intercept are obtained for each spectrum by applying a linear fitting algorithm to its N frequency-domain points; the spectral flatness of each spectrum is calculated with a preset calculation formula; each spectrum is divided into m spectral bands, m > 1, and the logarithmic spectrum of each band is calculated; the spectral contrast of each spectrum is obtained from its m logarithmic spectral bands; the feature quantity of each audio frame is obtained from its fitting slope, fitting intercept, spectral flatness, and spectral contrast; and the audio features of the audio to be extracted are obtained from the feature quantities of the M frames.
The audio feature extraction method provided by the invention extracts audio features that include the fitting slope, fitting intercept, spectral flatness, and spectral contrast of the audio. Compared with features consisting only of audio energy, these features distinguish different audio through higher-dimensional information. When used for audio detection, noise and the target sound can be better recognized and classified, which helps improve detection accuracy. For example, when used for infant-cry detection, environmental noise and infant crying can be better distinguished, improving the accuracy of cry detection.
Example two
Fig. 2 shows a schematic structural diagram of an audio feature extraction apparatus according to a preferred embodiment of the present invention; specifically, the apparatus comprises:
the audio frame obtaining module 11 is configured to obtain an audio to be extracted according to a preset window length, and divide the audio to be extracted into M audio frames according to a preset frame length, where M is greater than 1;
a first calculating module 12, configured to calculate a frequency spectrum corresponding to each audio frame; wherein the frequency spectrum comprises N frequency domain points, N > 1;
the fitting module 13 is configured to obtain a fitting slope and a fitting intercept corresponding to each frequency spectrum based on a linear fitting algorithm according to the N frequency domain points of each frequency spectrum;
the spectrum flatness calculation module 14 is configured to calculate and obtain a spectrum flatness of each spectrum according to the spectrum and a preset calculation formula;
the second calculating module 15 is configured to divide each spectrum into m segments of spectrum bands, and calculate to obtain a logarithmic spectrum corresponding to each segment of spectrum band; m is greater than 1;
the spectrum contrast calculation module 16 is configured to obtain a spectrum contrast of each spectrum according to the m segments of log spectrums corresponding to each spectrum;
a feature quantity obtaining module 17, configured to obtain a feature quantity of each audio frame according to the fitting slope, the fitting intercept, the spectral flatness, and the spectral contrast of each audio frame;
and the extracting module 18 is configured to extract the audio features of the audio to be extracted according to the feature quantities of the M frames of the audio frames.
Preferably, the fitting module 13 is specifically configured to:
selecting frequency domain points with corresponding frequencies within a preset frequency range from the N frequency domain points of each frequency spectrum;
and performing linear fitting on the selected frequency domain points with the corresponding frequencies within a preset frequency range based on a linear least square algorithm to obtain a fitting slope and a fitting intercept corresponding to each frequency spectrum.
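A minimal sketch of the least-squares fitting step, assuming a hypothetical frequency range of 100–4000 Hz (the patent's preset range is not specified) and an exactly linear synthetic spectrum:

```python
import numpy as np

def fit_spectrum(freqs, mags, f_lo=100.0, f_hi=4000.0):
    """Least-squares line through the magnitude points whose frequency lies
    within [f_lo, f_hi]; returns (fitting slope, fitting intercept).
    The range [100, 4000] Hz is an illustrative assumption."""
    mask = (freqs >= f_lo) & (freqs <= f_hi)
    slope, intercept = np.polyfit(freqs[mask], mags[mask], deg=1)
    return slope, intercept

freqs = np.linspace(0.0, 8000.0, 257)   # frequencies of the N spectrum points
mags = 2.0 - 0.0001 * freqs             # synthetic, exactly linear spectrum
slope, intercept = fit_spectrum(freqs, mags)
```

On the synthetic line above, the fit recovers the slope (−0.0001) and intercept (2.0) that generated it; on a real spectrum the two values summarize the overall spectral tilt.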
Preferably, the second calculating module 15 is specifically configured to:
dividing each frequency spectrum into m sections of frequency spectrum bands, and respectively carrying out K-L conversion processing on each section of frequency spectrum band;
obtaining a logarithmic spectrum corresponding to the spectrum band after each section of the spectrum band is subjected to the K-L conversion processing according to the following formula:
si(f″) = 10 × log10 si(f′)   (1);
where si(f′) is the ith spectral band after the K-L transformation, si(f″) is the logarithmic spectrum corresponding to si(f′), and 1 ≤ i ≤ m.
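A hedged sketch of the band splitting and dB conversion of formula (1); the K-L (Karhunen-Loève) transform step is omitted here, and the flat dummy spectrum and band count are illustrative only:

```python
import numpy as np

def band_log_spectra(spectrum, m):
    """Split a spectrum into m roughly equal bands and apply the
    10*log10 conversion of formula (1) to each band.
    (The K-L transform that precedes this step is not shown.)"""
    bands = np.array_split(spectrum, m)
    return [10.0 * np.log10(b) for b in bands]

spectrum = np.full(8, 100.0)            # flat dummy spectrum, 8 points
logs = band_log_spectra(spectrum, 4)    # 4 bands; 10*log10(100) = 20 dB each
```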
Preferably, the spectral contrast calculation module 16 is specifically configured to:
for each section of the log spectrum, acquiring a spectrum peak value and a spectrum valley value of the log spectrum, and calculating a peak-valley difference value between the spectrum peak value and the spectrum valley value;
and for each spectrum, calculating the average value of m peak-valley difference values of the corresponding m sections of log spectrum to obtain the spectrum contrast of the spectrum.
Preferably, the calculation formula is:
Flatness(s(f)) = (x(1) × x(2) × … × x(N))^(1/N) / ((1/N) × (x(1) + x(2) + … + x(N)))   (2);
where s(f) is the frequency spectrum; Flatness(s(f)) is the spectral flatness corresponding to the spectrum s(f); N is the number of frequency domain points included in the spectrum; and x(n) is the amplitude of the nth frequency domain point of the spectrum s(f).
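A small sketch of the spectral flatness computation (ratio of geometric to arithmetic mean of the N amplitudes), computed in the log domain for numerical stability; the dummy spectra are illustrative:

```python
import numpy as np

def spectral_flatness(mags):
    """Geometric mean over arithmetic mean of the magnitude points.
    Equals 1.0 for a perfectly flat spectrum, approaches 0 for a peaky one."""
    mags = np.asarray(mags, dtype=float)
    geo = np.exp(np.mean(np.log(mags)))   # geometric mean via log domain
    return geo / np.mean(mags)

flat = spectral_flatness(np.full(16, 3.0))        # flat spectrum -> 1.0
peaky = spectral_flatness([100.0] + [1e-6] * 15)  # one dominant peak -> near 0
```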
Preferably, the apparatus further comprises a third calculation module, and the third calculation module is specifically configured to:
calculating to obtain a Mel cepstrum coefficient of each audio frame;
then, the feature quantity obtaining module 17 is specifically configured to:
and obtaining the characteristic quantity of each audio frame according to the fitting slope, the fitting intercept, the spectral flatness, the spectral contrast and the Mel cepstrum coefficient of each audio frame.
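The per-frame feature quantity can be sketched as a simple concatenation of the four scalars with the Mel cepstrum coefficients; the 13-coefficient MFCC length and the sample values are illustrative assumptions:

```python
import numpy as np

def frame_feature(slope, intercept, flatness, contrast, mfcc):
    """Concatenate the per-frame scalars and the MFCC vector into one
    feature quantity for the frame (q = 4 + len(mfcc) elements)."""
    return np.concatenate([[slope, intercept, flatness, contrast], mfcc])

# Dummy values standing in for one frame's computed features.
feat = frame_feature(-0.001, 2.0, 0.8, 17.5, np.zeros(13))  # q = 17 elements
```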
The audio feature extraction device provided by the invention can extract audio features including the fitting slope, fitting intercept, spectral flatness, and spectral contrast of the audio. Compared with audio features consisting only of audio energy, the extracted audio features can distinguish different audios through information of more dimensions. When the device is used for audio detection, noise and the target detection sound can be better identified and classified, which helps improve the accuracy of audio detection. For example, when the device is used for detecting baby crying, environmental noise and baby crying can be better identified and classified, which helps improve the accuracy of baby-cry detection.
It should be noted that, the audio feature extraction device provided in the embodiment of the present invention is configured to execute the steps of the audio feature extraction method described in the above embodiment, and working principles and beneficial effects of the two are in one-to-one correspondence, so that details are not repeated.
It will be understood by those skilled in the art that the schematic diagram of the audio feature extraction apparatus is merely an example and does not constitute a limitation of the apparatus; the apparatus may include more or fewer components than those shown in the drawings, combine certain components, or use different components. For example, the audio feature extraction apparatus may further include an input and output device, a network access device, a bus, and the like.
Example three
Referring to fig. 3, fig. 3 is a schematic flowchart of a preferred embodiment of a method for training an audio classification model according to the present invention; specifically, the method comprises the following steps:
s9, constructing an audio classification initial model; the audio classification initial model is used for classifying the audio;
s10, obtaining a plurality of training audios corresponding to each classification result; each training audio is pre-allocated with a classification identifier matched with the corresponding classification result;
s11, taking the training audio as the audio to be extracted, and extracting the audio feature corresponding to each training audio according to the audio feature extraction method provided in any one of the above embodiments;
s12, carrying out standardization processing on the audio features corresponding to each training audio, and constructing a training sample set according to each audio feature after standardization processing and the matched classification identification;
and S13, training the audio classification initial model according to the training sample set to obtain an audio classification model.
According to the training method of the audio classification model provided by the embodiment of the invention, the audio feature corresponding to each training audio is extracted according to the audio feature extraction method provided in the above embodiment, and the audio classification model trained on these audio features can be used to detect and classify audio, improving the accuracy of audio classification.
It should be noted that, when actually detecting audio, the input audio is likewise subjected to audio feature extraction according to the audio feature extraction method provided in the above embodiment; the extracted audio features are then normalized and input into the audio classification model for processing and classification, so as to obtain a classification result.
Preferably, the audio classification model of the present invention may be an SVM model. If the audio classification model is an SVM model, during specific training the training audio of the training sample set may be divided into K parts, one part being selected for training each time, with K rounds of training performed in total, so as to obtain a spatial hyperplane for audio classification.
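To keep this sketch dependency-free, the following uses a minimal nearest-centroid classifier as a stand-in for the SVM training step; the "baby cry" and "environmental noise" clusters are randomly generated dummy feature vectors, and the 17-element feature dimension is an assumption, not the patent's value:

```python
import numpy as np

rng = np.random.default_rng(0)
X_cry = rng.normal(loc=1.0, size=(40, 17))     # dummy "baby cry" features
X_noise = rng.normal(loc=-1.0, size=(40, 17))  # dummy "environmental noise"

# One centroid per classification identifier (1 = cry, 0 = noise).
centroids = {1: X_cry.mean(axis=0), 0: X_noise.mean(axis=0)}

def classify(x):
    """Return the label of the nearest class centroid."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

# Training-set accuracy of the stand-in classifier on the dummy clusters.
acc = np.mean([classify(x) == 1 for x in X_cry] +
              [classify(x) == 0 for x in X_noise])
```

A real implementation would replace the centroid rule with an SVM (for instance scikit-learn's `SVC`) trained over the K splits described above.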
Preferably, each of the training audios has corresponding audio features:
X = [A1, A2, …, AM]
Ai = [ai1, ai2, …, aiq]
where X is the audio feature corresponding to the training audio; Ai is the feature quantity of the ith audio frame in the training audio, 1 ≤ i ≤ M; q is the number of elements of the feature quantity, q > 1.
Then, the normalizing the audio features corresponding to each of the training audios specifically includes:
normalizing the audio features corresponding to each training audio according to the following formulas:
X′ = [A′1, A′2, …, A′M]   (3);
a′ik = (aik − ak-mean) / std(ak)   (4);
where X′ is the audio feature after standardization; A′i is the feature quantity of the ith audio frame in the training audio after standardization; ak-mean is the mean of the kth element over the feature quantities of the M audio frames of the training audio, 1 ≤ k ≤ q; and std(ak) is the variance of the kth element in the feature quantities of the M audio frames of the training audio.
The present invention normalizes the audio features by the above equations (3) and (4).
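A compact sketch of this per-element standardization across the M frames, using the standard deviation (the quantity the text calls std(ak)) and a tiny dummy feature matrix:

```python
import numpy as np

def standardize_features(X):
    """Standardize each of the q elements across the M frames:
    subtract the element's mean and divide by its standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std

X = np.array([[1.0, 10.0],
              [3.0, 30.0]])    # M = 2 frames, q = 2 elements (dummy values)
Xn = standardize_features(X)   # each column now has mean 0, std 1
```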
It should be noted that the training method for the audio classification model provided in the embodiment of the present invention has the same or corresponding technical features as the audio feature extraction method provided in the above embodiment, and the working principles and beneficial effects of the two methods are similar, so that further description is omitted.
Example four
Fig. 4 is a schematic structural diagram of an electronic device according to a preferred embodiment of the present invention. Specifically, the electronic device includes a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, and the processor implements the audio feature extraction method according to any one of the embodiments provided above when executing the computer program.
Specifically, the electronic device may include one or more processors and memories, and may be a computer, a mobile phone, a tablet, or another device capable of performing sound detection; when the audio features are used to detect baby crying, it may also be a baby monitor or similar device.
The electronic device of the embodiment includes: a processor, a memory, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the steps in the audio feature extraction method provided in the foregoing embodiment are implemented, for example, in step S1 shown in fig. 1, the audio to be extracted is obtained according to a preset window length, and the audio to be extracted is divided into M audio frames according to a preset frame length, where M > 1. Or, the processor implements the functions of the modules in the apparatus embodiments when executing the computer program, for example, implements an audio frame obtaining module 11, configured to obtain an audio to be extracted according to a preset window length, and divide the audio to be extracted into M audio frames according to a preset frame length, where M > 1.
Illustratively, the computer program can be divided into one or more modules/units (e.g., computer program 1, computer program 2, shown in FIG. 4), which are stored in the memory and executed by the processor to implement the invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program in the electronic device. For example, the computer program may be divided into an audio frame obtaining module 11, a first calculating module 12, a fitting module 13, a spectral flatness calculating module 14, a second calculating module 15, a spectral contrast calculating module 16, a feature quantity obtaining module 17, and an extracting module 18, and each module has the following specific functions:
the audio frame obtaining module 11 is configured to obtain an audio to be extracted according to a preset window length, and divide the audio to be extracted into M audio frames according to a preset frame length, where M is greater than 1;
a first calculating module 12, configured to calculate a frequency spectrum corresponding to each audio frame; wherein the frequency spectrum comprises N frequency domain points, N > 1;
the fitting module 13 is configured to obtain a fitting slope and a fitting intercept corresponding to each frequency spectrum based on a linear fitting algorithm according to the N frequency domain points of each frequency spectrum;
the spectrum flatness calculation module 14 is configured to calculate and obtain a spectrum flatness of each spectrum according to the spectrum and a preset calculation formula;
the second calculating module 15 is configured to divide each spectrum into m segments of spectrum bands, and calculate to obtain a logarithmic spectrum corresponding to each segment of spectrum band; m is greater than 1;
the spectrum contrast calculation module 16 is configured to obtain a spectrum contrast of each spectrum according to the m segments of log spectrums corresponding to each spectrum;
a feature quantity obtaining module 17, configured to obtain a feature quantity of each audio frame according to the fitting slope, the fitting intercept, the spectral flatness, and the spectral contrast of each audio frame;
and the extracting module 18 is configured to extract the audio features of the audio to be extracted according to the feature quantities of the M frames of the audio frames.
The Processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The processor is the control center of the electronic device and connects the various parts of the overall electronic device using various interfaces and wires.
The memory may be used to store the computer programs and/or modules, and the processor implements various functions of the electronic device by running or executing the computer programs and/or modules stored in the memory and calling data stored in the memory. The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required by at least one function (such as a sound playing function or an image playing function), and the data storage area may store data created according to use of the device (such as audio data or a phone book). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash memory card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or another non-volatile solid-state storage device.
Wherein, if the integrated module/unit of the electronic device is implemented in the form of a software functional unit and sold or used as a stand-alone product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the audio feature extraction method provided by the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the audio feature extraction method provided by any of the above embodiments may be implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
It should be noted that the above-mentioned electronic device may include, but is not limited to, a processor and a memory. Those skilled in the art will understand that the structural diagram of fig. 4 is only an example of the above-mentioned electronic device and does not constitute a limitation of the electronic device; the device may include more or fewer components than those shown in the drawings, combine certain components, or use different components.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.
Claims (10)
1. A method of audio feature extraction, the method comprising:
acquiring audio to be extracted according to a preset window length, and dividing the audio to be extracted into M audio frames according to a preset frame length, wherein M is greater than 1;
calculating a frequency spectrum corresponding to each audio frame; wherein the frequency spectrum comprises N frequency domain points, N > 1;
obtaining a fitting slope and a fitting intercept corresponding to each frequency spectrum based on a linear fitting algorithm according to the N frequency domain points of each frequency spectrum;
calculating to obtain the spectrum flatness of each spectrum according to the spectrums and a preset calculation formula;
dividing each frequency spectrum into m sections of frequency spectrum bands, and calculating to obtain a logarithmic frequency spectrum corresponding to each section of frequency spectrum band; m is greater than 1;
obtaining the spectrum contrast of each spectrum according to the m sections of log spectrums corresponding to each spectrum;
obtaining the characteristic quantity of each audio frame according to the fitting slope, the fitting intercept, the spectral flatness and the spectral contrast of each audio frame;
and extracting the audio features of the audio to be extracted according to the feature quantity of the M frames of the audio frames.
2. The audio feature extraction method according to claim 1, wherein the linear fitting algorithm is a linear least squares algorithm, and the obtaining of the fitting slope and the fitting intercept corresponding to each of the frequency spectrums based on the linear fitting algorithm according to the N frequency domain points of each of the frequency spectrums specifically includes:
selecting frequency domain points with corresponding frequencies within a preset frequency range from the N frequency domain points of each frequency spectrum;
and performing linear fitting on the selected frequency domain points with the corresponding frequencies within a preset frequency range based on a linear least square algorithm to obtain a fitting slope and a fitting intercept corresponding to each frequency spectrum.
3. The audio feature extraction method according to claim 1, wherein the dividing each spectrum into m spectral bands and obtaining a logarithmic spectrum corresponding to each spectral band by calculation specifically comprises:
dividing each frequency spectrum into m sections of frequency spectrum bands, and respectively carrying out K-L conversion processing on each section of frequency spectrum band;
obtaining a logarithmic spectrum corresponding to the spectrum band after each section of the spectrum band is subjected to the K-L conversion processing according to the following formula:
si(f″) = 10 × log10 si(f′);
where si(f′) is the ith spectral band after the K-L transformation, si(f″) is the logarithmic spectrum corresponding to si(f′), and 1 ≤ i ≤ m.
4. The method for extracting audio features according to claim 1, wherein the obtaining the spectral contrast of each spectrum according to the m segments of log spectrums corresponding to each spectrum specifically comprises:
for each section of the log spectrum, acquiring a spectrum peak value and a spectrum valley value of the log spectrum, and calculating a peak-valley difference value between the spectrum peak value and the spectrum valley value;
and for each spectrum, calculating the average value of m peak-valley difference values of the corresponding m sections of log spectrum to obtain the spectrum contrast of the spectrum.
5. The audio feature extraction method of claim 1, wherein the calculation formula is:
Flatness(s(f)) = (x(1) × x(2) × … × x(N))^(1/N) / ((1/N) × (x(1) + x(2) + … + x(N)));
where s(f) is the frequency spectrum; Flatness(s(f)) is the spectral flatness corresponding to the spectrum s(f); N is the number of frequency domain points included in the spectrum; and x(n) is the amplitude of the nth frequency domain point of the spectrum s(f).
6. The audio feature extraction method of claim 1, further comprising:
calculating to obtain a Mel cepstrum coefficient of each audio frame;
then, the obtaining the feature quantity of each audio frame according to the fitting slope, the fitting intercept, the spectral flatness, and the spectral contrast of each audio frame specifically includes:
and obtaining the characteristic quantity of each audio frame according to the fitting slope, the fitting intercept, the spectral flatness, the spectral contrast and the Mel cepstrum coefficient of each audio frame.
7. An audio feature extraction apparatus, characterized in that the apparatus comprises:
the audio frame acquisition module is used for acquiring audio to be extracted according to a preset window length and dividing the audio to be extracted into M audio frames according to a preset frame length, wherein M is greater than 1;
the first calculating module is used for calculating the frequency spectrum corresponding to each audio frame; wherein the frequency spectrum comprises N frequency domain points, N > 1;
the fitting module is used for obtaining a fitting slope and a fitting intercept corresponding to each frequency spectrum based on a linear fitting algorithm according to the N frequency domain points of each frequency spectrum;
the spectrum flatness calculation module is used for calculating and obtaining the spectrum flatness of each spectrum according to the spectrums and a preset calculation formula;
the second calculation module is used for dividing each spectrum into m sections of spectrum bands and calculating to obtain a logarithmic spectrum corresponding to each section of spectrum band; m is greater than 1;
the spectrum contrast calculation module is used for obtaining the spectrum contrast of each spectrum according to the m sections of log spectrums corresponding to each spectrum;
a feature quantity obtaining module, configured to obtain a feature quantity of each audio frame according to the fitting slope, the fitting intercept, the spectral flatness, and the spectral contrast of each audio frame;
and the extraction module is used for extracting the audio features of the audio to be extracted according to the feature quantity of the M frames of the audio frames.
8. A method for training an audio classification model, the method comprising:
constructing an audio classification initial model; the audio classification initial model is used for classifying the audio;
obtaining a plurality of training audios corresponding to each classification result; each training audio is pre-allocated with a classification identifier matched with the corresponding classification result;
taking the training audio as the audio to be extracted, and extracting the audio feature corresponding to each training audio according to the audio feature extraction method of any one of claims 1 to 6;
carrying out standardization processing on the audio features corresponding to each training audio, and constructing a training sample set according to each audio feature after standardization processing and the matched classification identification;
and training the audio classification initial model according to the training sample set to obtain an audio classification model.
9. The method for training an audio classification model according to claim 8, wherein the audio features corresponding to each of the training audios are:
X = [A1, A2, …, AM]
Ai = [ai1, ai2, …, aiq]
where X is the audio feature corresponding to the training audio; Ai is the feature quantity of the ith audio frame in the training audio, 1 ≤ i ≤ M; q is the number of elements of the feature quantity, q > 1;
Then, the normalizing the audio features corresponding to each of the training audios specifically includes:
normalizing the audio features corresponding to each training audio according to the following formulas:
X′ = [A′1, A′2, …, A′M];
a′ik = (aik − ak-mean) / std(ak);
where X′ is the audio feature after standardization; A′i is the feature quantity of the ith audio frame in the training audio after standardization; ak-mean is the mean of the kth element over the feature quantities of the M audio frames of the training audio, 1 ≤ k ≤ q; and std(ak) is the variance of the kth element in the feature quantities of the M audio frames of the training audio.
10. An electronic device comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the audio feature extraction method of any one of claims 1 to 6 when executing the computer program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911409010.XA CN111105812A (en) | 2019-12-31 | 2019-12-31 | Audio feature extraction method and device, training method and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911409010.XA CN111105812A (en) | 2019-12-31 | 2019-12-31 | Audio feature extraction method and device, training method and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111105812A true CN111105812A (en) | 2020-05-05 |
Family
ID=70424543
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911409010.XA Pending CN111105812A (en) | 2019-12-31 | 2019-12-31 | Audio feature extraction method and device, training method and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111105812A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113889077A (en) * | 2021-09-22 | 2022-01-04 | 武汉普惠海洋光电技术有限公司 | Voice recognition method, voice recognition device, electronic equipment and storage medium |
CN113948108A (en) * | 2021-10-09 | 2022-01-18 | 广州蓝仕威克软件开发有限公司 | Method and system for automatically identifying physiological sound |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6567777B1 (en) * | 2000-08-02 | 2003-05-20 | Motorola, Inc. | Efficient magnitude spectrum approximation |
JP2005099509A (en) * | 2003-09-25 | 2005-04-14 | Yamaha Corp | Voice processing device, voice processing method, and voice processing program |
JP2006323008A (en) * | 2005-05-17 | 2006-11-30 | Sharp Corp | Musical piece search system and musical piece search method |
US20090299741A1 (en) * | 2006-04-03 | 2009-12-03 | Naren Chittar | Detection and Use of Acoustic Signal Quality Indicators |
CN106683687A (en) * | 2016-12-30 | 2017-05-17 | 杭州华为数字技术有限公司 | Abnormal voice classifying method and device |
WO2017135127A1 (en) * | 2016-02-01 | 2017-08-10 | 国立大学法人徳島大学 | Bioacoustic extraction device, bioacoustic analysis device, bioacoustic extraction program, and computer-readable storage medium and stored device |
CN109712641A (en) * | 2018-12-24 | 2019-05-03 | 重庆第二师范学院 | A kind of processing method of audio classification and segmentation based on support vector machines |
CN110390942A (en) * | 2019-06-28 | 2019-10-29 | 平安科技(深圳)有限公司 | Mood detection method and its device based on vagitus |
Non-Patent Citations (1)
Title |
---|
JEFFREY SCOTT等: "Analysis of Acoustic Features for Automated Multi-Track Mixing", 《12TH INTERNATIONAL SOCIETY FOR MUSIC INFORMATION RETRIEVAL CONFERENCE (ISMIR 2011)》 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Briggs et al. | The 9th annual MLSP competition: New methods for acoustic classification of multiple simultaneous bird species in a noisy environment | |
EP1662485B1 (en) | Signal separation method, signal separation device, signal separation program, and recording medium | |
CN109272016B (en) | Target detection method, device, terminal equipment and computer readable storage medium | |
US20120103166A1 (en) | Signal Processing Device, Signal Processing Method, and Program | |
CN110428399B (en) | Method, apparatus, device and storage medium for detecting image | |
CN111105812A (en) | Audio feature extraction method and device, training method and electronic equipment | |
Nanni et al. | Combining visual and acoustic features for bird species classification | |
US9218540B2 (en) | Apparatus and computer readable medium for signal classification using spectrogram and templates | |
CN112990313A (en) | Hyperspectral image anomaly detection method and device, computer equipment and storage medium | |
CN108037422A (en) | A kind of photovoltaic system direct current arc fault feature extraction and recognition methods and device | |
CN112969134B (en) | Microphone abnormality detection method, device, equipment and storage medium | |
CN111680642A (en) | Terrain classification method and device | |
CN106250809A (en) | Pretreatment includes the method for the image of bio information | |
CN111310571A (en) | Hyperspectral image classification method and device based on spatial-spectral-dimensional filtering | |
CN110189767B (en) | Recording mobile equipment detection method based on dual-channel audio | |
CN114155875B (en) | Method and device for identifying voice scene tampering, electronic equipment and storage medium | |
CN112382302A (en) | Baby cry identification method and terminal equipment | |
CN107481732B (en) | Noise reduction method and device in spoken language evaluation and terminal equipment | |
US20150356164A1 (en) | Method and device for clustering file | |
CN110534128B (en) | Noise processing method, device, equipment and storage medium | |
Zhang et al. | Discriminant WSRC for Large‐Scale Plant Species Recognition | |
CN114792374A (en) | Image recognition method based on texture classification, electronic device and storage medium | |
CN108227750B (en) | Ground target real-time tracking performance evaluation method and system | |
CN108964809B (en) | Signal detection method and electronic device | |
CN113743387B (en) | Video pedestrian re-identification method and device, electronic equipment and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20200505 |