CN107452399B - Audio feature extraction method and device - Google Patents


Info

Publication number: CN107452399B (application number CN201710839230.0A)
Authority: CN (China)
Prior art keywords: signal, value, energy, frame, frames
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Other versions: CN107452399A
Inventor: 赵伟峰
Current and original assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN201710839230.0A
Publication of application CN107452399A; application granted and published as CN107452399B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45 Speech or voice analysis techniques characterised by the type of analysis window
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/21 Speech or voice analysis techniques in which the extracted parameters are power information
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L2025/783 Detection of presence or absence of voice signals based on threshold decision

Abstract

The invention discloses an audio feature extraction method and device, belonging to the field of audio processing. The audio feature extraction method comprises the following steps: dividing the audio signal into a plurality of frames through a window function with a window length of M to obtain a sample signal; screening out, from the sample signal, the signal frames whose corresponding energy values are in a first energy interval; determining an upper limit value and a lower limit value of a second energy interval according to the energy value corresponding to the signal frame meeting the preset condition; and determining, among the screened signal frames, the signal frames whose corresponding energy values are in the second energy interval as the feature frames of the audio signal. The invention solves the problem in the related art that feature extraction is rarely applied to audio signals in audio-processing applications, which increases the burden of subsequent audio processing, and achieves efficient feature extraction from audio signals and improved efficiency of subsequent audio processing.

Description

Audio feature extraction method and device
Technical Field
The embodiment of the invention relates to the field of audio processing, in particular to an audio feature extraction method and device.
Background
Feature extraction is generally applied to image processing, and is less applied in the field of audio processing.
However, in application scenarios such as audio recognition services, for example the music identification service and the music recommendation service, audio features are required.
Therefore, how to efficiently extract the effective features of the audio signal becomes an urgent problem to be solved.
Disclosure of Invention
In order to solve the problems in the prior art, embodiments of the present invention provide an audio feature extraction method and apparatus. The technical scheme is as follows:
according to a first aspect of the embodiments of the present invention, there is provided an audio feature extraction method, including:
dividing the audio signal into a plurality of frames through a window function with a window length of M to obtain a sample signal, wherein M is a natural number;
selecting two signal frames from the sample signal according to a preset rule, respectively determining energy values corresponding to the two signal frames as an upper limit value and a lower limit value of a first energy interval, and screening out signal frames of which the corresponding energy values are in the first energy interval from the sample signal;
obtaining an absolute value of the sample signal to obtain a positive sample signal, determining a signal frame with the maximum energy value in the positive sample signal as a signal frame meeting a preset condition, and determining an upper limit value and a lower limit value of a second energy interval according to the energy value corresponding to the signal frame meeting the preset condition;
and determining the signal frame with the corresponding energy value in the second energy interval as the characteristic frame of the audio signal in the screened signal frames.
According to a second aspect of embodiments of the present invention, there is provided an audio feature extraction apparatus, the apparatus including:
the frame dividing module is used for dividing the audio signal into a plurality of frames through a window function with the window length of M to obtain a sample signal, wherein M is a natural number;
the screening module is used for selecting two signal frames from the sample signal according to a preset rule, respectively determining energy values corresponding to the two signal frames as an upper limit value and a lower limit value of a first energy interval, and screening the signal frames of which the corresponding energy values are in the first energy interval from the sample signal;
the first determining module is used for obtaining a positive sample signal by taking an absolute value of the sample signal, determining a signal frame with the largest energy value in the positive sample signal as a signal frame meeting a preset condition, and determining an upper limit value and a lower limit value of a second energy interval according to the energy value corresponding to the signal frame meeting the preset condition;
and a second determining module, configured to determine, as the feature frame of the audio signal, a signal frame with a corresponding energy value in the second energy interval, in the screened signal frames.
According to a third aspect of the embodiments of the present invention, there is provided a terminal, including a processor and a memory, where the memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the audio feature extraction method according to the first aspect.
According to a fourth aspect of the embodiments of the present invention, there is provided a computer-readable storage medium having at least one instruction stored therein, the instruction being loaded and executed by a processor to implement the audio feature extraction method according to the first aspect.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
the energy values of the signal frames in the windowed and framed audio signal are calculated, and the signal frames are screened twice according to their energy values; the twice-screened signal frames have neither very low nor very high energy values, that is, the signal frames that may be blank or invalid and the signal frames that may be noise are filtered out. This solves the problem in the related art that feature extraction is rarely applied to audio signals in audio-processing applications, which increases the burden of subsequent audio processing, and achieves efficient feature extraction from audio signals and improved efficiency of subsequent audio processing.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1A is a flow chart of an audio feature extraction method provided in one embodiment of the invention;
fig. 1B is a flowchart of a method for determining an upper limit value and a lower limit value of a second energy interval according to an energy value corresponding to a signal frame meeting a predetermined condition according to an embodiment of the present invention;
fig. 2 is a flowchart of an audio feature extraction method provided in another embodiment of the present invention;
fig. 3 is a block diagram illustrating an audio feature extraction apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Fig. 1A is a flowchart of an audio feature extraction method provided in an embodiment of the present invention, and as shown in fig. 1A, the audio feature extraction method includes the following steps.
Step 101, dividing an audio signal into multiple frames by a window function with a window length of M, to obtain a sample signal, where M is a natural number.
In the embodiment of the invention, the audio signal in an audio file is a short-time stationary process. By framing and windowing the audio signal, that is, processing it in segments, frame signals that can each be regarded as a stationary process are obtained; the audio signal obtained after framing and windowing is the sample signal.
The original input audio signal can be divided into multiple frames and processed by a window function of length M. In the framing step, a frame is generally 20-30 ms long; the frame shift is the part where a frame overlaps its adjacent frame, and to avoid excessively large feature changes between frames, the frame shift is generally taken as 1/3 or 1/2 of the frame length.
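The framing step described above can be sketched as follows; the 8 kHz sample rate, the 25 ms frame length, and the half-frame shift are illustrative choices for the sketch, not values fixed by the patent:

```python
def frame_signal(x, frame_len, hop):
    """Split signal x into overlapping frames of frame_len samples,
    advancing by hop samples (the frame shift)."""
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frames.append(x[start:start + frame_len])
    return frames

sr = 8000                      # illustrative sample rate
frame_len = int(0.025 * sr)    # 25 ms frame = 200 samples
hop = frame_len // 2           # frame shift of 1/2 the frame length
x = [0.0] * sr                 # one second of silence as a stand-in signal
frames = frame_signal(x, frame_len, hop)
```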
The windowing step is a process of multiplying the original audio signal by a window function, which can be expressed by equation (1):
xw(n) = w(n) * x(n)    formula (1)
In formula (1), xw(n) is the sample signal, x(n) is the audio signal, w(n) is the window function, n is a natural number, and 0 ≤ n ≤ N-1, where N is the number of samples of the audio signal.
In formula (1), each frame of xw(n) has length M, and M is preferably a power of 2; the number of frames of xw(n) is L, where L = N/M.
It should be noted that the commonly used window functions include a rectangular window, a hanning window and a hamming window, but the present embodiment does not limit the specific type of the window function at all.
Specifically, when the window functions are a rectangular window, a hanning window and a hamming window, the three window functions are sequentially as follows:
w(n) = 1, 0 ≤ n ≤ M-1 (rectangular window)
w(n) = 0.5 * (1 - cos(2πn / (M-1))), 0 ≤ n ≤ M-1 (Hanning window)
w(n) = 0.54 - 0.46 * cos(2πn / (M-1)), 0 ≤ n ≤ M-1 (Hamming window)
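These three window functions have standard textbook forms, which are assumed here and can be sketched directly:

```python
import math

def rectangular(M):
    # w(n) = 1 for 0 <= n <= M-1
    return [1.0] * M

def hanning(M):
    # w(n) = 0.5 * (1 - cos(2*pi*n / (M-1)))
    return [0.5 * (1 - math.cos(2 * math.pi * n / (M - 1))) for n in range(M)]

def hamming(M):
    # w(n) = 0.54 - 0.46 * cos(2*pi*n / (M-1))
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (M - 1)) for n in range(M)]

M = 512            # window length, preferably a power of 2
w = hamming(M)
```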
step 102, selecting two signal frames from the sample signal according to a predetermined rule, determining energy values corresponding to the two signal frames as an upper limit value and a lower limit value of a first energy interval respectively, and screening out the signal frames with the corresponding energy values in the first energy interval from the sample signal.
In one possible implementation, step 102 may be implemented by steps S1 through S4 described below.
In step S1, an energy value of each signal frame in the sample signal is calculated.
Specifically, the energy value of each signal frame in the sample signal is calculated according to an energy value calculation formula, so as to obtain an energy value e (l) corresponding to the sample signal. Wherein, the energy value calculation formula is as follows:
E(l) = ∑ xw_l(n)^2, where the sum runs over the samples n = 0, 1, …, M-1 of the l-th frame, and l = 1, 2, …, L.
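Taking the energy of a frame to be the sum of its squared samples (a standard short-time energy definition, assumed here), E(l) can be computed as:

```python
def frame_energy(frame):
    # E(l): sum of squared samples of one signal frame
    return sum(s * s for s in frame)

# Toy frames; the values are illustrative only.
frames = [[0.0, 0.0], [0.5, -0.5], [1.0, 1.0]]
E = [frame_energy(f) for f in frames]
```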
step S2, signal frames in the sample signal are sequenced according to a predetermined sequence, so as to obtain a sequence of signal frames.
Wherein, the predetermined sequence can be an increasing energy value sequence or a decreasing energy value sequence.
Step S3, selecting the [R*L/2]-th signal frame and the [3*R*L/4]-th signal frame from the signal frame sequence; of the energy value corresponding to the [R*L/2]-th signal frame and the energy value corresponding to the [3*R*L/4]-th signal frame, the larger one is determined as the upper limit value of the first energy interval and the smaller one is determined as the lower limit value of the first energy interval, where R is a positive decimal.
Wherein [ ] is the rounding symbol.
It should be noted that R is either set manually or preset by the system. When R is set manually, the first energy interval is set manually; when R is preset by the system, the first energy interval is set by the system.
For example, if the number of frames of xw(n) is 1000 and R is set to 0.5, the [R*L/2]-th, i.e. the 250th, signal frame and the [3*R*L/4]-th, i.e. the 375th, signal frame are selected from the signal frame sequence.
When the predetermined sequence is an increasing sequence of energy values, the energy value corresponding to the [3*R*L/4]-th signal frame is determined as the upper limit value of the first energy interval, and the energy value corresponding to the [R*L/2]-th signal frame is determined as the lower limit value of the first energy interval.
When the predetermined sequence is a decreasing sequence of energy values, the energy value corresponding to the [R*L/2]-th signal frame is determined as the upper limit value of the first energy interval, and the energy value corresponding to the [3*R*L/4]-th signal frame is determined as the lower limit value of the first energy interval.
For example, the frame number of xw (n) is 1000, and R is set to 0.5, then, when the predetermined sequence is an increasing sequence of energy values, the energy value corresponding to the 375 th signal frame is determined as the upper limit value of the first energy interval, and the energy value corresponding to the 250 th signal frame is determined as the lower limit value of the first energy interval; when the predetermined sequence is a descending sequence of energy values, the energy value corresponding to the 250 th signal frame is determined as the upper limit value of the first energy interval, and the energy value corresponding to the 375 th signal frame is determined as the lower limit value of the first energy interval.
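Under the assumption that [ ] rounds to the nearest integer and frame positions are counted from 1, step S3 and the worked example above can be sketched as:

```python
def first_energy_interval(energies, R, increasing=True):
    """Pick the [R*L/2]-th and [3*R*L/4]-th entries of the sorted energy
    sequence; the larger is the upper limit, the smaller the lower limit."""
    L = len(energies)
    order = sorted(energies, reverse=not increasing)
    i = round(R * L / 2)       # 250 when L = 1000 and R = 0.5
    j = round(3 * R * L / 4)   # 375 when L = 1000 and R = 0.5
    a, b = order[i - 1], order[j - 1]
    return min(a, b), max(a, b)

# Toy energies 1..1000 with R = 0.5, as in the example.
lo, hi = first_energy_interval(list(range(1, 1001)), 0.5)
```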
In step S4, a signal frame with a corresponding energy value in the first energy interval is screened out from the sample signal.
The energy values E corresponding to the sample signal are traversed, and the signal frames whose energy values lie in the first energy interval are screened out from the sample signal.
Step 103, taking an absolute value of the sample signal to obtain a positive sample signal, determining a signal frame with the largest energy value in the positive sample signal as a signal frame meeting a predetermined condition, and determining an upper limit value and a lower limit value of a second energy interval according to the energy value corresponding to the signal frame meeting the predetermined condition.
The energy values of all signal frames in the positive sample signal lie in the range between energy value 0 and the energy value of the frame with the largest energy in the positive sample signal, so using the energy value corresponding to the frame with the largest energy makes the determined second energy interval more accurate.
In a possible implementation manner, fig. 1B is a flowchart of a method for determining an upper limit value and a lower limit value of a second energy interval according to an energy value corresponding to a signal frame meeting a predetermined condition, as shown in fig. 1B, where step 103 may be implemented by steps 103a to 103B described below.
Step 103a, subtracting a first preset value from the energy value corresponding to the determined signal frame to obtain a first value, and multiplying the energy value corresponding to the determined signal frame by a second preset value to obtain a second value.
The first preset value is a constant and may be, for example, 100, 200, or 300; this embodiment does not limit its specific value. Preferably, the first preset value is 200.
The second preset value is a positive decimal and may be, for example, 0.78, 0.88, or 0.98; this embodiment does not limit its specific value. Preferably, the second preset value is 0.98.
Step 103b, of the first value and the second value, determining the larger one as the upper limit value of the second energy interval, and determining the negative of the larger one as the lower limit value of the second energy interval.
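Steps 103a and 103b can be sketched as follows, using the preferred values 200 and 0.98 for the first and second preset values:

```python
def second_energy_interval(e_max, c1=200.0, c2=0.98):
    """e_max is the energy value of the frame with the largest energy in the
    positive sample signal; c1 and c2 are the first and second preset values."""
    v1 = e_max - c1        # step 103a: first value
    v2 = e_max * c2        # step 103a: second value
    upper = max(v1, v2)    # step 103b: the larger value is the upper limit
    return -upper, upper   # its negative is the lower limit

# Illustrative e_max; here e_max - 200 exceeds 0.98 * e_max.
lo, hi = second_energy_interval(20000.0)
```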
And 104, determining the signal frame with the corresponding energy value in the second energy interval as the characteristic frame of the audio signal in the screened signal frames.
Among the screened signal frames, that is, the frames whose energy values lie in the first energy interval, the signal frames whose energy values also lie in the second energy interval are determined as the feature frames of the audio signal.
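Putting steps 102 through 104 together, the two-stage screening can be sketched as follows (the frames and interval bounds are illustrative values, not taken from the patent):

```python
def frame_energy(frame):
    # Short-time energy: sum of squared samples (assumed definition).
    return sum(s * s for s in frame)

def extract_feature_frames(frames, first_interval, second_interval):
    """First screening: keep frames whose energy is in the first interval.
    Second screening: of those, keep frames whose energy is also in the
    second interval; these are the feature frames."""
    lo1, hi1 = first_interval
    lo2, hi2 = second_interval
    screened = [f for f in frames if lo1 <= frame_energy(f) <= hi1]
    return [f for f in screened if lo2 <= frame_energy(f) <= hi2]

frames = [[0.0], [1.0], [2.0], [3.0]]   # energies 0, 1, 4, 9
feature = extract_feature_frames(frames, (1.0, 9.0), (-5.0, 5.0))
```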
In summary, in the audio feature extraction method provided in this embodiment, the energy value of each signal frame in the windowed and framed audio signal is calculated, and the signal frames are screened twice according to their energy values, so that the frames with very low and very high energy values, that is, the frames that may be blank or invalid and the frames that may be noise, are filtered out. This solves the problem in the related art that feature extraction is rarely applied to audio signals in audio-processing applications, which increases the burden of subsequent audio processing, and achieves efficient feature extraction from audio signals and improved efficiency of subsequent audio processing.
The audio signal may be normalized prior to being windowed by framing. Fig. 2 is a flowchart of an audio feature extraction method provided in another embodiment of the present invention, and as shown in fig. 2, the audio feature extraction method includes the following steps.
Step 201, performing fourier transform on the audio signal to obtain a time domain signal corresponding to the audio signal.
Through the Fourier transform, the audio signal can be transformed, and the time domain signal corresponding to the audio signal is obtained.
Optionally, the fourier transform comprises a fast fourier transform or a discrete fourier transform.
In one possible implementation, when the audio signal is a multi-channel signal, the audio signal is divided into several independent channels, and the operation of step 201 is performed on the several channels respectively.
In a possible implementation manner, when the audio signal is a multi-channel signal, a downmix algorithm is adopted to mix the audio signal into a single-channel signal, and the operation of step 201 is performed on the single-channel signal.
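The patent does not name a specific downmix algorithm; a minimal sketch under the assumption of plain channel averaging is:

```python
def downmix_to_mono(channels):
    # Average corresponding samples across all channels.
    n = len(channels)
    return [sum(samples) / n for samples in zip(*channels)]

left = [0.2, 0.4, -0.6]
right = [0.0, 0.4, 0.6]
mono = downmix_to_mono([left, right])
```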
Step 202, performing normalization processing on the time domain signal according to a normalization formula.
The process of normalizing the time domain signal is as follows: the maximum sampling value is taken as 1 and the other values are expressed as a proportion of it, which maps the value range to between 0 and 1, facilitates comparing the distribution of the values, and simplifies the calculation.
Wherein, the normalization formula is:
y(i) = x(i) / xmax
in the normalization formula, y(i) is the i-th normalized time domain signal frame, x(i) is the i-th time domain signal frame, and xmax is the sampling value, after the absolute value is taken, corresponding to the time domain signal frame with the largest sampling value in the time domain signal, namely xmax = abs(max(x(n))).
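The normalization formula y(i) = x(i) / xmax can be sketched directly; the sample values below are illustrative:

```python
def normalize(x):
    # xmax: largest absolute sample value in the signal
    xmax = max(abs(s) for s in x)
    return [s / xmax for s in x]

y = normalize([0.5, -2.0, 1.0])
```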
And 203, dividing the time domain signal after the normalization processing into multiple frames through a window function with the window length of M to obtain a sample signal.
Step 204, two signal frames are selected from the sample signal according to a predetermined rule, energy values corresponding to the two signal frames are respectively determined as an upper limit value and a lower limit value of the first energy interval, and the signal frames with the corresponding energy values in the first energy interval are screened from the sample signal.
Step 205, taking an absolute value of the sample signal to obtain a positive sample signal, determining a signal frame with the largest energy value in the positive sample signal as a signal frame meeting a predetermined condition, and determining an upper limit value and a lower limit value of the second energy interval according to the energy value corresponding to the signal frame meeting the predetermined condition.
And step 206, determining the signal frame with the corresponding energy value in the second energy interval as the characteristic frame of the audio signal in the screened signal frames.
It should be noted that, since steps 204 to 206 in this embodiment are similar to steps 102 to 104, the description of steps 204 to 206 is omitted in this embodiment.
In summary, in the audio feature extraction method provided in this embodiment, the energy value of each signal frame in the windowed and framed audio signal is calculated, and the signal frames are screened twice according to their energy values, so that the frames with very low and very high energy values, that is, the frames that may be blank or invalid and the frames that may be noise, are filtered out. This solves the problem in the related art that feature extraction is rarely applied to audio signals in audio-processing applications, which increases the burden of subsequent audio processing, and achieves efficient feature extraction from audio signals and improved efficiency of subsequent audio processing.
The following are embodiments of the apparatus of the present invention, and for details not described in detail in the embodiments of the apparatus, reference may be made to the above-mentioned one-to-one corresponding method embodiments.
Referring to fig. 3, a block diagram of an audio feature extraction apparatus according to an embodiment of the present invention is shown. The device includes: a framing module 301, a screening module 302, a first determining module 303 and a second determining module 304.
A framing module 301, configured to divide an audio signal into multiple frames by using a window function with a window length of M, where M is a natural number, to obtain a sample signal;
the screening module 302 is configured to select two signal frames from the sample signal according to a predetermined rule, determine energy values corresponding to the two signal frames as an upper limit value and a lower limit value of a first energy interval, and screen a signal frame of which the corresponding energy value is in the first energy interval from the sample signal;
a first determining module 303, configured to obtain an absolute value of the sample signal to obtain a positive sample signal, determine a signal frame with a largest energy value in the positive sample signal as a signal frame meeting a predetermined condition, and determine an upper limit value and a lower limit value of a second energy interval according to the energy value corresponding to the signal frame meeting the predetermined condition;
and a second determining module 304, configured to determine, as the feature frame of the audio signal, a signal frame of which the corresponding energy value is in the second energy interval, among the screened signal frames.
In summary, the audio feature extraction apparatus provided in this embodiment calculates the energy value of each signal frame in the windowed and framed audio signal and screens the signal frames twice according to their energy values, so that the frames with very low and very high energy values, that is, the frames that may be blank or invalid and the frames that may be noise, are filtered out. This solves the problem in the related art that feature extraction is rarely applied to audio signals in audio-processing applications, which increases the burden of subsequent audio processing, and achieves efficient feature extraction from audio signals and improved efficiency of subsequent audio processing.
Based on the audio feature extraction apparatus provided in the foregoing embodiment, optionally, the framing module 301 includes: the device comprises a first processing unit, a second processing unit and a framing unit.
The first processing unit is used for carrying out Fourier transform on the audio signal to obtain a time domain signal corresponding to the audio signal;
the second processing unit is used for carrying out normalization processing on the time domain signals according to a normalization formula;
the framing unit is used for dividing the time domain signal after the normalization processing into a plurality of frames through a window function with the window length of M to obtain a sample signal;
wherein, the normalization formula is:
y(i) = x(i) / xmax
wherein y (i) is the time domain signal after the ith normalization processing, x (i) is the time domain signal frame of the ith, and xmax is the sampling value corresponding to the time domain signal frame with the largest sampling value in the time domain signals after the absolute value is taken.
Optionally, the framing module 301 is further configured to divide the audio signal into multiple frames according to a windowing formula corresponding to a window function with a window length M, so as to obtain a sample signal;
wherein, the windowing formula is as follows:
xw(n)=w(n)*x(n),
wherein xw(n) is the sample signal, the number of frames of xw(n) is L, x(n) is the audio signal, w(n) is the window function, n is a natural number with 0 ≤ n ≤ N-1, and L = N/M.
Optionally, the screening module 302 includes: the device comprises a first calculation unit, a sorting unit, a first determination unit and a screening unit.
The first calculating unit is used for calculating the energy value of each signal frame in the sample signal;
the sequencing unit is used for sequencing the signal frames in the sample signal according to a preset sequence to obtain a signal frame sequence;
a first determining unit, configured to select the [R*L/2]-th signal frame and the [3*R*L/4]-th signal frame from the signal frame sequence, and to determine, of the energy values corresponding to the two signal frames, the larger one as the upper limit value of the first energy interval and the smaller one as the lower limit value of the first energy interval, where R is a positive decimal;
and the screening unit is used for screening out the signal frames of which the corresponding energy values are in the first energy interval from the sample signals.
Optionally, the first determining module 303 includes: a second calculation unit and a second determination unit.
The second calculation unit is used for subtracting the first preset value from the energy value corresponding to the determined signal frame to obtain a first numerical value, and multiplying the energy value corresponding to the determined signal frame by the second preset value to obtain a second numerical value;
and the second determining unit is used for determining the value with the larger energy value as the upper limit value of the second energy interval and determining the negative value corresponding to the value with the larger energy value as the lower limit value of the second energy interval in the first value and the second value.
It should be noted that: the audio feature extraction apparatus provided in the foregoing embodiment is only illustrated by dividing the functional modules, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the server is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the embodiments of the audio feature extraction device and the audio feature extraction method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the embodiments of the methods for details, and are not described herein again.
Embodiments of the present invention also provide a computer-readable storage medium, which may be a computer-readable storage medium contained in the memory, or a computer-readable storage medium that exists separately and is not assembled into the terminal. The computer-readable storage medium stores at least one instruction that is used by one or more processors to perform the audio feature extraction method.
Referring to fig. 4, a schematic structural diagram of a terminal according to an embodiment of the present invention is shown. Specifically:
the terminal 400 may include RF (Radio Frequency) circuitry 410, memory 420 including one or more computer-readable storage media, an input unit 430, a display unit 440, a sensor 450, audio circuitry 460, a near field communication module 470, a processor 480 including one or more processing cores, and a power supply 490. Those skilled in the art will appreciate that the terminal configuration shown in fig. 4 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
The RF circuit 410 may be used to receive and transmit signals during messaging or a call; in particular, it receives downlink information from a base station and passes it to one or more processors 480 for processing, and transmits uplink data to the base station. In general, the RF circuit 410 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, an LNA (Low Noise Amplifier), a duplexer, and the like. In addition, the RF circuit 410 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to GSM (Global System for Mobile communications), GPRS (General Packet Radio Service), CDMA (Code Division Multiple Access), WCDMA (Wideband Code Division Multiple Access), LTE (Long Term Evolution), email, SMS (Short Message Service), and the like.
The memory 420 may be used to store software programs and modules; the processor 480 executes various functional applications and performs data processing by running the software programs and modules stored in the memory 420. The memory 420 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function or an image playing function), and the like, and the data storage area may store data created according to the use of the terminal 400 (such as audio data or a phonebook). Further, the memory 420 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 420 may also include a memory controller to provide the processor 480 and the input unit 430 with access to the memory 420.
The input unit 430 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control. Specifically, the input unit 430 may include an image input device 431 and other input devices 432. The image input device 431 may be a camera or a photoelectric scanning device. The input unit 430 may include other input devices 432 in addition to the image input device 431. In particular, other input devices 432 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 440 may be used to display information input by or provided to a user and various graphical user interfaces of the terminal 400, which may be made up of graphics, text, icons, video, and any combination thereof. The Display unit 440 may include a Display panel 441, and optionally, the Display panel 441 may be configured in the form of an LCD (Liquid Crystal Display), an OLED (Organic Light-Emitting Diode), or the like.
The terminal 400 may also include at least one sensor 450, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor, which may adjust the brightness of the display panel 441 according to the brightness of ambient light, and a proximity sensor, which may turn off the display panel 441 and/or its backlight when the terminal 400 is moved to the ear. As one kind of motion sensor, a gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally three axes), can detect the magnitude and direction of gravity when stationary, and can be used in applications that recognize the posture of the mobile phone (such as landscape/portrait switching, related games, and magnetometer posture calibration) and in vibration-recognition functions (such as a pedometer or tap detection). Other sensors that may be configured in the terminal 400, such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, are not described in detail here.
The audio circuit 460, a speaker 461, and a microphone 462 may provide an audio interface between a user and the terminal 400. The audio circuit 460 may convert received audio data into an electrical signal and transmit it to the speaker 461, which converts the electrical signal into a sound signal for output; conversely, the microphone 462 converts a collected sound signal into an electrical signal, which the audio circuit 460 receives and converts into audio data; after being processed by the processor 480, the audio data may be sent through the RF circuit 410 to, for example, another electronic device, or output to the memory 420 for further processing. The audio circuit 460 may also include an earphone jack to allow a peripheral headset to communicate with the terminal 400.
The terminal 400 establishes a near field communication connection with an external device through the near field communication module 470 and performs data interaction through the near field communication connection. In this embodiment, the near field communication module 470 specifically includes a bluetooth module and/or a WiFi module.
The processor 480 is the control center of the terminal 400: it connects the various parts of the entire mobile phone using various interfaces and lines, and performs the various functions of the terminal 400 and processes data by running or executing the software programs and/or modules stored in the memory 420 and calling the data stored in the memory 420, thereby monitoring the mobile phone as a whole. Optionally, the processor 480 may include one or more processing cores; preferably, the processor 480 may integrate an application processor, which mainly handles the operating system, user interfaces, application programs, and the like, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor may also not be integrated into the processor 480.
The terminal 400 also includes a power supply 490 (such as a battery) for supplying power to the various components. Preferably, the power supply may be logically connected to the processor 480 through a power management system, so that charging, discharging, and power-consumption management are handled through the power management system. The power supply 490 may also include one or more DC or AC power sources, a recharging system, a power-failure detection circuit, a power converter or inverter, a power status indicator, and any other such components.
Although not shown, the terminal 400 may further include other modules, such as a Bluetooth module, which are not described in detail here.
In this embodiment, the terminal 400 further includes a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors to implement the audio feature extraction method.
It will be understood by those skilled in the art that all or part of the steps of the audio feature extraction method of the above embodiments may be implemented by a program instructing the associated hardware, where the program may be stored in a computer-readable storage medium, and the storage medium may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.
It should be understood that, as used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. It should also be understood that "and/or" as used herein includes any and all possible combinations of one or more of the associated listed items.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A method of audio feature extraction, the method comprising:
dividing the audio signal into a plurality of frames through a window function with a window length of M to obtain a sample signal, wherein M is a natural number;
selecting two signal frames from the sample signal according to a preset rule, respectively determining energy values corresponding to the two signal frames as an upper limit value and a lower limit value of a first energy interval, and screening out signal frames of which the corresponding energy values are in the first energy interval from the sample signal;
obtaining an absolute value of the sample signal to obtain a positive sample signal, determining a signal frame with the maximum energy value in the positive sample signal as a signal frame meeting a preset condition, and determining an upper limit value and a lower limit value of a second energy interval according to the energy value corresponding to the signal frame meeting the preset condition;
determining a signal frame with a corresponding energy value in the second energy interval as a characteristic frame of the audio signal in the screened signal frames;
the determining a second energy interval according to the energy value corresponding to the signal frame meeting the predetermined condition includes:
subtracting a first preset value from the energy value corresponding to the determined signal frame to obtain a first value, and multiplying the energy value corresponding to the determined signal frame by a second preset value to obtain a second value;
and determining, of the first value and the second value, the larger value as the upper limit value of the second energy interval, and determining the negative of that larger value as the lower limit value of the second energy interval.
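Read end to end, the method of claim 1 can be sketched as the following pipeline. This is a hedged illustration, not the patented implementation: the R-based frame-selection rule appears only as formula images in the source, so the two reference-frame indices are taken here as explicit arguments, plain rectangular framing stands in for the window function, and all names and preset values are illustrative:

```python
def extract_feature_frames(x, M, idx_a, idx_b, preset_sub, preset_mul):
    """Sketch of the claimed method: frame the signal, screen by a first
    energy interval, rectify, derive a second interval from the peak-energy
    frame, and keep the screened frames that fall inside it."""
    # 1. Split into frames of length M (trailing partial frame dropped).
    frames = [x[i:i + M] for i in range(0, len(x) - M + 1, M)]
    energy = [sum(s * s for s in f) for f in frames]

    # 2. First energy interval from two reference frames (idx_a, idx_b
    #    stand in for the R-based selection rule shown only as images).
    hi1 = max(energy[idx_a], energy[idx_b])
    lo1 = min(energy[idx_a], energy[idx_b])
    screened = [(f, e) for f, e in zip(frames, energy) if lo1 <= e <= hi1]

    # 3. Rectify (absolute value) and find the peak frame energy.
    pos = [[abs(s) for s in f] for f in frames]
    e_max = max(sum(s * s for s in f) for f in pos)

    # 4. Second energy interval: larger of (e_max - preset_sub) and
    #    (e_max * preset_mul) is the upper bound; its negative is the lower.
    upper = max(e_max - preset_sub, e_max * preset_mul)
    lower = -upper

    # 5. Feature frames: screened frames whose energy is in the interval.
    return [f for f, e in screened if lower <= e <= upper]
```

With x = [1, 1, 2, 2, 3, 3] and M = 2, the per-frame energies are 2, 8, and 18; choosing reference indices 0 and 2 and presets 10 and 0.5 gives a second interval of (-9, 9), so the first two frames are returned as feature frames.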
2. The method of claim 1, wherein the dividing the audio signal into a plurality of frames by a window function with a window length of M to obtain sample signals comprises:
carrying out Fourier transform on the audio signal to obtain a time domain signal corresponding to the audio signal;
according to a normalization formula, performing normalization processing on the time domain signal;
dividing the time domain signal after normalization processing into multiple frames through a window function with the window length of M to obtain a sample signal;
wherein the normalization formula is:
y(i) = x(i) / xmax,
wherein y(i) is the i-th normalized time domain signal frame, x(i) is the i-th time domain signal frame, and xmax is the largest sampling value in the time domain signal after taking absolute values.
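This normalization formula amounts to peak normalization, which can be sketched minimally in Python (the function name and list representation are illustrative, not from the patent):

```python
def peak_normalize(x):
    """Scale a time-domain signal into [-1, 1] by its peak absolute sample.

    Implements y(i) = x(i) / xmax, where xmax is the largest absolute
    sample value in the signal.
    """
    xmax = max(abs(v) for v in x)
    if xmax == 0:          # all-zero signal: nothing to scale
        return list(x)
    return [v / xmax for v in x]
```

For example, `peak_normalize([0.5, -2.0, 1.0])` yields `[0.25, -1.0, 0.5]`.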
3. The method of claim 1, wherein the dividing the audio signal into a plurality of frames by a window function with a window length of M to obtain sample signals comprises:
dividing the audio signal into a plurality of frames according to a windowing formula corresponding to a window function with a window length of M to obtain the sample signal;
wherein the windowing formula is:
xw(n)=w(n)*x(n),
wherein xw(n) is the sample signal, the number of frames of xw(n) is L, x(n) is the audio signal, w(n) is the window function, n is a natural number less than or equal to N, and L = N/M.
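The framing-plus-windowing step can be sketched as follows, assuming a Hann window as one possible w(n) (the claim only requires some window function of length M; names here are illustrative) and dropping any trailing partial frame:

```python
import math

def frame_and_window(x, M):
    """Split signal x into L = N/M frames of length M and apply a window.

    Implements xw(n) = w(n) * x(n) frame by frame, with w chosen as a
    Hann window for illustration (requires M >= 2).
    """
    # Symmetric Hann window of length M: one possible choice of w(n).
    w = [0.5 - 0.5 * math.cos(2 * math.pi * i / (M - 1)) for i in range(M)]
    L = len(x) // M
    frames = []
    for f in range(L):
        seg = x[f * M:(f + 1) * M]
        frames.append([wi * si for wi, si in zip(w, seg)])
    return frames
```

For example, an 8-sample signal with M = 4 yields L = 2 windowed frames, each tapering to zero at its endpoints because of the Hann window.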
4. The method of claim 1, wherein the selecting two signal frames from the sample signal according to the predetermined rule, determining energy values corresponding to the two signal frames as an upper limit value and a lower limit value of a first energy interval, respectively, and screening out the signal frames with corresponding energy values in the first energy interval from the sample signal comprises:
calculating an energy value of each signal frame in the sample signal;
sequencing the signal frames in the sample signal according to a preset sequence to obtain a signal frame sequence;
selecting two signal frames from the signal frame sequence at positions determined by the parameter R (the position expressions appear only as formula images in the source and are not reproduced here); among the energy values corresponding to the two selected signal frames, determining the larger energy value as the upper limit value of the first energy interval and the smaller energy value as the lower limit value of the first energy interval, wherein R is a positive decimal fraction;
and screening out a signal frame with a corresponding energy value in the first energy interval from the sample signal.
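The screening of this claim can be sketched as follows. Because the R-based frame-selection expressions appear only as formula images in the source, this sketch takes the two reference frames' indices as explicit arguments (a stand-in, not the patented selection rule); frame energy is computed as the sum of squared samples, one common definition the patent does not spell out here:

```python
def screen_frames(frames, idx_a, idx_b):
    """Keep frames whose energy lies in the first energy interval.

    The interval bounds are the energies of two reference frames
    (idx_a, idx_b are placeholders for the R-based selection).
    """
    energy = [sum(s * s for s in f) for f in frames]   # energy per frame
    upper = max(energy[idx_a], energy[idx_b])
    lower = min(energy[idx_a], energy[idx_b])
    return [f for f, e in zip(frames, energy) if lower <= e <= upper]
```

For frames with energies 2, 8, and 18, screening against the first two reference frames keeps exactly those two frames, since 18 falls outside the interval [2, 8].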
5. An audio feature extraction apparatus, characterized in that the apparatus comprises:
the frame dividing module is used for dividing the audio signal into a plurality of frames through a window function with the window length of M to obtain a sample signal, wherein M is a natural number;
the screening module is used for selecting two signal frames from the sample signal according to a preset rule, respectively determining energy values corresponding to the two signal frames as an upper limit value and a lower limit value of a first energy interval, and screening the signal frames of which the corresponding energy values are in the first energy interval from the sample signal;
the first determining module is used for obtaining a positive sample signal by taking an absolute value of the sample signal, determining a signal frame with the largest energy value in the positive sample signal as a signal frame meeting a preset condition, and determining an upper limit value and a lower limit value of a second energy interval according to the energy value corresponding to the signal frame meeting the preset condition;
a second determining module, configured to determine, in the screened signal frames, a signal frame with a corresponding energy value in the second energy interval as a feature frame of the audio signal;
the first determining module includes:
the second calculation unit is configured to subtract a first preset value from the energy value corresponding to the determined signal frame to obtain a first value, and to multiply the energy value corresponding to the determined signal frame by a second preset value to obtain a second value;
and the second determining unit is configured to determine, of the first value and the second value, the larger value as the upper limit value of the second energy interval, and to determine the negative of that larger value as the lower limit value of the second energy interval.
6. The apparatus of claim 5, wherein the framing module comprises:
the first processing unit is used for carrying out Fourier transform on the audio signal to obtain a time domain signal corresponding to the audio signal;
the second processing unit is used for carrying out normalization processing on the time domain signal according to a normalization formula;
the framing unit is used for dividing the time domain signal after the normalization processing into a plurality of frames through a window function with the window length of M to obtain a sample signal;
wherein the normalization formula is:
y(i) = x(i) / xmax,
wherein y(i) is the i-th normalized time domain signal frame, x(i) is the i-th time domain signal frame, and xmax is the largest sampling value in the time domain signal after taking absolute values.
7. The apparatus of claim 5, wherein the framing module is further configured to divide the audio signal into multiple frames according to a windowing formula corresponding to a window function with a window length M, so as to obtain the sample signal;
wherein the windowing formula is:
xw(n)=w(n)*x(n),
wherein xw(n) is the sample signal, the number of frames of xw(n) is L, x(n) is the audio signal, w(n) is the window function, n is a natural number less than or equal to N, and L = N/M.
8. The apparatus of claim 5, wherein the screening module comprises:
a first calculation unit, configured to calculate an energy value of each signal frame in the sample signal;
the sequencing unit is used for sequencing the signal frames in the sample signal according to a preset sequence to obtain a signal frame sequence;
a first determining unit, configured to select two signal frames from the signal frame sequence at positions determined by the parameter R (the position expressions appear only as formula images in the source and are not reproduced here), and, among the energy values corresponding to the two selected signal frames, to determine the larger energy value as the upper limit value of the first energy interval and the smaller energy value as the lower limit value of the first energy interval, wherein R is a positive decimal fraction;
and the screening unit is used for screening out a signal frame of which the corresponding energy value is in the first energy interval from the sample signal.
9. A terminal, characterized in that the terminal comprises a processor and a memory, wherein at least one instruction is stored in the memory, and the instruction is loaded and executed by the processor to realize the audio feature extraction method according to any one of claims 1 to 4.
10. A computer-readable storage medium having stored therein at least one instruction which is loaded and executed by a processor to implement the audio feature extraction method of any of claims 1 to 4.
CN201710839230.0A 2017-09-18 2017-09-18 Audio feature extraction method and device Active CN107452399B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710839230.0A CN107452399B (en) 2017-09-18 2017-09-18 Audio feature extraction method and device

Publications (2)

Publication Number Publication Date
CN107452399A CN107452399A (en) 2017-12-08
CN107452399B true CN107452399B (en) 2020-09-15

Family

ID=60496709

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710839230.0A Active CN107452399B (en) 2017-09-18 2017-09-18 Audio feature extraction method and device

Country Status (1)

Country Link
CN (1) CN107452399B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108470571B (en) * 2018-03-08 2020-09-08 腾讯音乐娱乐科技(深圳)有限公司 Audio detection method and device and storage medium
WO2021097666A1 (en) * 2019-11-19 2021-05-27 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for processing audio signals
CN112153533B (en) * 2020-09-25 2021-09-07 展讯通信(上海)有限公司 Method and device for eliminating sound breaking of audio signal, storage medium and terminal

Citations (2)

Publication number Priority date Publication date Assignee Title
US20040210436A1 (en) * 2000-04-19 2004-10-21 Microsoft Corporation Audio segmentation and classification
US20170084292A1 (en) * 2015-09-23 2017-03-23 Samsung Electronics Co., Ltd. Electronic device and method capable of voice recognition

Family Cites Families (10)

Publication number Priority date Publication date Assignee Title
WO2011018430A1 (en) * 2009-08-14 2011-02-17 Koninklijke Kpn N.V. Method and system for determining a perceived quality of an audio system
CN104143341B (en) * 2013-05-23 2015-10-21 腾讯科技(深圳)有限公司 Sonic boom detection method and device
CN103546853A (en) * 2013-09-18 2014-01-29 浙江中科电声研发中心 Speaker abnormal sound detecting method based on short-time Fourier transformation
CN103886871B (en) * 2014-01-28 2017-01-25 华为技术有限公司 Detection method of speech endpoint and device thereof
CN104103280B (en) * 2014-07-15 2017-06-06 无锡中感微电子股份有限公司 The method and apparatus of the offline speech terminals detection based on dynamic time consolidation algorithm
CN104167209B (en) * 2014-08-06 2017-06-13 华为软件技术有限公司 The detection method and device of a kind of audio distortion
CN106157951B (en) * 2016-08-31 2019-04-23 北京华科飞扬科技股份公司 Carry out the automatic method for splitting and system of audio punctuate
CN106486136A (en) * 2016-11-18 2017-03-08 腾讯科技(深圳)有限公司 A kind of sound identification method, device and voice interactive method
CN106847270B (en) * 2016-12-09 2020-08-18 华南理工大学 Double-threshold place name voice endpoint detection method
CN107045870B (en) * 2017-05-23 2020-06-26 南京理工大学 Speech signal endpoint detection method based on characteristic value coding


Non-Patent Citations (2)

Title
N. Nitanda et al., "Accurate audio-segment classification using feature extraction matrix," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), 2005. *
J.-R. J. Shieh et al., "Audio content based feature extraction on subband domain," IEEE 37th Annual International Carnahan Conference on Security Technology, 2003, Proceedings. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant