CN113223511B - Audio processing device for speech recognition - Google Patents

Audio processing device for speech recognition

Info

Publication number
CN113223511B
CN113223511B (application CN202010071503.3A)
Authority
CN
China
Prior art keywords
circuit
memory circuit
parameters
mel
frequency
Prior art date
Legal status
Active
Application number
CN202010071503.3A
Other languages
Chinese (zh)
Other versions
CN113223511A (en)
Inventor
冯梦豪
Current Assignee
Zhuhai Xuanyang Technology Co ltd
Original Assignee
Zhuhai Xuanyang Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Zhuhai Xuanyang Technology Co., Ltd.
Priority to CN202010071503.3A
Priority to US16/867,571
Publication of CN113223511A
Application granted
Publication of CN113223511B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/285 Memory allocation or algorithm optimisation to reduce hardware requirements
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/21 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being power information

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

An audio processing device for speech recognition includes a memory circuit, a power spectrum conversion circuit, and a feature extraction circuit. The power spectrum conversion circuit is coupled to the memory circuit, reads a plurality of spectral coefficients of time-domain audio sample data from the memory circuit, performs power spectrum conversion and compression processing on the spectral coefficients to generate a plurality of compressed power parameters, and writes the compressed power parameters into the memory circuit. The feature extraction circuit is coupled to the memory circuit, reads the compressed power parameters from the memory circuit, and performs Mel filtering and frequency-time transform processing on the compressed power parameters to generate an audio feature vector. The bit width of the compressed power parameters is smaller than the bit width of the spectral coefficients.

Description

Audio processing device for speech recognition
Technical Field
The present invention relates to an audio processing device, and more particularly, to an audio processing device for speech recognition.
Background
With advances in technology, more and more electronic devices are adopting voice control, and voice control is expected to become a common user interface for most electronic devices in the future. The recognition rate of speech recognition (Speech Recognition) directly affects the user experience of an electronic device. In the implementation of speech recognition, speech feature extraction is a critical step. For example, one of the most commonly used speech features today is the Mel-frequency cepstral coefficient (Mel-scale Frequency Cepstral Coefficients, MFCC). Mel-frequency cepstral coefficients model the auditory characteristics of the human ear, reflect human perception of speech, and achieve a high recognition rate in practical speech recognition applications. The various steps of speech feature extraction may be implemented by a plurality of hardware circuit blocks; for example, the mel filter used to generate mel-frequency cepstral coefficients may be implemented by a plurality of triangular band-pass filters. The manner in which these hardware circuits perform speech feature extraction directly affects manufacturing cost, circuit area, circuit performance, and the like. Therefore, as speech recognition becomes more widely applied, how to design a speech feature extraction circuit that meets these requirements is an important issue for those skilled in the art.
Disclosure of Invention
In view of the above, the present invention provides an audio processing device for speech recognition, which can save memory space and reduce memory bit width, thereby reducing hardware cost.
An embodiment of the invention provides an audio processing device for speech recognition, which includes a memory circuit, a power logarithmic circuit, a Mel filter circuit, and a frequency-time conversion circuit. The power logarithmic circuit is coupled to the memory circuit, reads a plurality of spectral coefficients of time-domain audio sample data from the memory circuit, and generates a plurality of power spectrum parameters according to the spectral coefficients. The power logarithmic circuit performs logarithmic conversion processing on the power spectrum parameters to generate a plurality of compressed power parameters and writes the compressed power parameters into the memory circuit. The Mel filter circuit is coupled to the memory circuit and reads the compressed power parameters from the memory circuit. The Mel filter circuit performs Mel filtering processing on the compressed power parameters to generate a plurality of Mel spectrum parameters and writes the Mel spectrum parameters into the memory circuit. The frequency-time conversion circuit is coupled to the memory circuit, reads the Mel spectrum parameters from the memory circuit, and performs frequency-time conversion processing on the Mel spectrum parameters to generate an audio feature vector.
An embodiment of the invention provides an audio processing device for speech recognition, which includes a memory circuit, a power spectrum conversion circuit, and a feature extraction circuit. The power spectrum conversion circuit is coupled to the memory circuit, reads a plurality of spectral coefficients of time-domain audio sample data from the memory circuit, performs power spectrum conversion and compression processing according to the spectral coefficients to generate a plurality of compressed power parameters, and writes the compressed power parameters into the memory circuit. The feature extraction circuit is coupled to the memory circuit, reads the compressed power parameters from the memory circuit, and performs Mel filtering processing according to the compressed power parameters to generate an audio feature vector. The bit width of the compressed power parameters is smaller than the bit width of the spectral coefficients.
Based on the above, in embodiments of the invention, the audio processing device for speech recognition may include a memory circuit and a plurality of circuit modules, where the circuit modules are used to extract speech features from the audio data and are sequentially placed in an operating state during different time periods. Therefore, the circuit modules can share the same memory circuit and reuse it in a time-division manner, saving the hardware cost of the memory circuit. In addition, because one of the circuit modules performs the power spectrum conversion and compression processing before writing the compressed power parameters into the memory circuit, the maximum required bit width of the memory circuit for speech feature extraction can be reduced.
In order to make the above features and advantages of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
Drawings
Fig. 1 is a schematic diagram of an audio processing device for speech recognition according to an embodiment of the invention.
Fig. 2 is a schematic diagram of an audio processing device for speech recognition according to an embodiment of the invention.
Fig. 3 is a schematic diagram of an audio processing device for speech recognition according to an embodiment of the invention.
Description of the reference numerals
10, 30: Audio processing device
110: memory circuit
120: power spectrum conversion circuit
130: feature extraction circuit
a1: spectral coefficients
a2: compressed power parameters
fv1: audio feature vector
ip1, ip2-1, ip2-2: input port
131: mel filter circuit
132: frequency-time conversion circuit
a3: mel spectral parameters
141: pretreatment circuit
142: time-frequency conversion circuit
122: power logarithmic circuit
s1: time-domain audio sampling data
a4: preprocessed data
Detailed Description
Some embodiments of the invention are described in detail below with reference to the drawings; reference numerals used in the following description refer to the same or similar elements appearing in different drawings. These embodiments are only a part of the invention and do not disclose all possible embodiments of the invention. Rather, these embodiments are merely examples of the devices claimed by the present invention.
Fig. 1 is a schematic diagram of an audio processing device for speech recognition according to an embodiment of the present invention. Referring to fig. 1, an audio processing apparatus 10 for speech recognition includes a memory circuit 110, a power spectrum conversion circuit 120, and a feature extraction circuit 130. In one embodiment, the audio processing device 10 may be implemented as an audio processing chip with voice recognition function.
The memory circuit 110 is used to buffer data during voice feature extraction, and may be, but not limited to, a static random-access memory (SRAM). The memory circuit 110 may be coupled to the power spectrum conversion circuit 120 and the feature extraction circuit 130 via an internal bus, and the power spectrum conversion circuit 120 and the feature extraction circuit 130 may transmit data to and from the memory circuit 110 via the internal bus.
The power spectrum conversion circuit 120 may read a plurality of spectral coefficients a1 of time-domain (time-domain) audio sample data from the memory circuit 110, and perform power spectrum conversion and compression processing according to the spectral coefficients a1 to generate a plurality of compressed power parameters a2. In detail, the time-domain audio sample data is generated by sampling an analog audio signal at a sampling frequency of, for example, 8 kHz or 16 kHz. The spectral coefficients a1 are generated by performing a time-frequency transform process, such as a fast Fourier transform (Fast Fourier Transformation, FFT), on the time-domain audio sample data within a sampling period (i.e., a frame), and the spectral coefficient a1 of each sampling point includes a real (Real) component and an imaginary (Imaginary) component.
The power spectrum conversion circuit 120 may perform power spectrum conversion on the spectral coefficients a1 to obtain spectral features, i.e., calculate, for each spectral coefficient a1, the sum of the square of its real coefficient and the square of its imaginary coefficient. The bit width (Bit Width) of the data generated after the power spectrum conversion increases greatly, since squaring roughly doubles the bit width. Therefore, in the present embodiment, the power spectrum conversion circuit 120 further performs compression processing to generate the plurality of compressed power parameters a2, so as to compress the bit width of the data to be written into the memory circuit 110. The compression processing is, for example, logarithmic processing. In other words, the bit width of the compressed power parameters a2 is smaller than the bit width of the spectral coefficients a1. The power spectrum conversion circuit 120 then writes the compressed power parameters a2 into the memory circuit 110.
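As an illustration of the bit-width issue, the following minimal NumPy sketch (not the circuit itself; the function name and the 24-bit coefficient width are assumptions taken from the example figures later in this description) computes the power spectrum of one frame of spectral coefficients and then log-compresses it:

```python
import numpy as np

def compress_power(re: np.ndarray, im: np.ndarray) -> np.ndarray:
    """Power spectrum conversion followed by logarithmic compression (a1 -> a2)."""
    power = re.astype(np.int64) ** 2 + im.astype(np.int64) ** 2  # squaring roughly doubles the bit width
    return np.log(np.maximum(power, 1))                          # the logarithm shrinks the dynamic range

# 24-bit signed coefficients: the summed squares can need ~48 bits,
# while their natural logarithm stays below ln(2**48), about 33.3.
rng = np.random.default_rng(0)
re = rng.integers(-(2 ** 23), 2 ** 23, size=257)
im = rng.integers(-(2 ** 23), 2 ** 23, size=257)
a2 = compress_power(re, im)
print(a2.min(), a2.max())
```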
The feature extraction circuit 130 may read the compressed power parameters a2 from the memory circuit 110 and perform mel filtering processing according to the compressed power parameters a2 to generate the audio feature vector fv1. In one embodiment, the feature extraction circuit 130 may obtain a plurality of audio feature parameters (also referred to as mel-frequency cepstral coefficients (Mel Frequency Cepstral Coefficients, MFCC)) through mel filtering processing and frequency-time transform processing, so as to obtain a multi-dimensional audio feature vector fv1. Alternatively, in another embodiment, the feature extraction circuit 130 may obtain a plurality of mel spectrum parameters through mel filtering processing and use the mel spectrum parameters as the audio feature vector fv1. The feature extraction circuit 130 may be implemented by a software module, a hardware module, or a combination thereof, which is not limited herein. The software module may be programming code or instructions stored in a recording medium. The hardware module may be a logic circuit implemented on an integrated circuit (integrated circuit). For example, the frequency-time transform process of the feature extraction circuit 130 may be implemented in software using a programming language. In addition, the mel filtering and/or frequency-time transform processes of the feature extraction circuit 130 may also be implemented as hardware modules using a hardware description language (hardware description languages) or other suitable programming languages, and may thus include one or more microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (Field Programmable Gate Array, FPGA), or other types of hardware circuits.
In one embodiment, the audio feature vector fv1 may be matched against a predetermined acoustic model or provided to a machine learning model for speech recognition. In another embodiment, the audio feature vector fv1 may be matched against the predetermined acoustic model or provided to the machine learning model after further operations. Here, the power spectrum conversion circuit 120 and the feature extraction circuit 130 are sequentially enabled into the operating state, and the power spectrum conversion circuit 120 and the feature extraction circuit 130 can share the same storage space of the memory circuit 110 in a time-division manner. In other words, in one embodiment, the plurality of circuits used for generating the audio feature vector fv1 access the memory circuit 110 sequentially during different time periods, i.e., the memory circuit 110 is accessed by only a single circuit module during any given time period. It should be noted that the maximum required bit width of the memory circuit 110 is determined according to the bit width of the audio feature vector fv1 output by the feature extraction circuit 130.
Here, the power spectrum conversion circuit 120 is connected to the memory circuit 110 via its input port ip1 and accesses the memory circuit 110 through the input port ip1. The feature extraction circuit 130 is connected to the memory circuit 110 via its input port ip2 and accesses the memory circuit 110 through the input port ip2. It should be noted that, in an embodiment, since the power spectrum conversion circuit 120 performs the compression processing, the feature extraction circuit 130 does not need to perform a logarithmic operation. In addition, in one embodiment, the power spectrum conversion circuit 120 reads the spectral coefficients a1 from the memory circuit 110 through the input port ip1, and the feature extraction circuit 130 sequentially reads the compressed power parameters a2 from the memory circuit 110 through the input port ip2. Accordingly, since the bit width of the compressed power parameters a2 is smaller than the bit width of the spectral coefficients a1, the maximum required bit width of the input port ip2 of the feature extraction circuit 130 is smaller than the maximum required bit width of the input port ip1 of the power spectrum conversion circuit 120.
Fig. 2 is a schematic diagram of an audio processing device for speech recognition according to an embodiment of the invention. Referring to fig. 2, in the present embodiment, the feature extraction circuit 130 may include a mel filter circuit 131 and a frequency-time conversion circuit 132. The mel filter circuit 131 and the frequency-time conversion circuit 132 may be respectively coupled to the memory circuit 110 via an internal bus.
The power spectrum conversion circuit 120 reads a plurality of spectral coefficients a1 of the time-domain audio sampling data from the memory circuit 110, performs power spectrum conversion and compression processing according to the spectral coefficients a1 to generate a plurality of compressed power parameters a2, and writes the compressed power parameters a2 into the memory circuit 110. In this embodiment, the compression process may be a logarithmic process. That is, the power spectrum conversion circuit 120 may generate a plurality of power spectrum parameters according to the spectrum coefficient a1, and perform a logarithmic conversion process on the power spectrum parameters to generate the compressed power parameter a2. For each sampling point in a frame, the power spectral parameter may be generated by calculating the sum of the square of the real coefficient of the spectral coefficient a1 and the square of the imaginary coefficient of the spectral coefficient a1.
In this embodiment, the mel filter circuit 131 may include, for example, a set of 19 nonlinearly distributed triangular band-pass filters (Triangular Bandpass Filters). The mel filter circuit 131 reads the compressed power parameters a2 from the memory circuit 110 and performs mel filtering processing on the compressed power parameters a2 to generate a plurality of mel spectrum parameters a3. Next, the mel filter circuit 131 writes the mel spectrum parameters a3 into the memory circuit 110. Specifically, the mel filter circuit 131 may obtain the logarithmic energy output by each triangular band-pass filter according to the compressed power parameters a2 and write the logarithmic energy into the memory circuit 110. Next, the frequency-time conversion circuit 132 reads the mel spectrum parameters a3 from the memory circuit 110 and performs a frequency-time transform process on the mel spectrum parameters a3 to generate the audio feature vector fv1, thereby obtaining the mel-frequency cepstral coefficients (MFCC) of a frame. The frequency-time transform process may be a discrete cosine transform (Discrete Cosine Transform, DCT) process.
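For reference, the sketch below shows, in floating-point NumPy/SciPy, one way the mel filtering and frequency-time transform steps could look. The 19 triangular filters, the 512-point FFT, and the 16 kHz sampling rate follow the figures in this description; applying the filterbank directly to the already log-compressed power parameters (rather than to linear power, as in textbook MFCC pipelines) follows the ordering described above. The helper names and the number of retained cepstral coefficients are assumptions for illustration only, not the circuit's fixed-point design.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=19, n_fft=512, sample_rate=16000):
    """Triangular band-pass filters spaced uniformly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)   # rising edge
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)  # falling edge
    return fbank

def mfcc_from_compressed_power(a2, n_ceps=13):
    """a2: log-compressed power parameters of one frame (257 values)."""
    a3 = mel_filterbank() @ a2                       # mel spectrum parameters (per-filter energies)
    return dct(a3, type=2, norm='ortho')[:n_ceps]    # frequency-time transform (DCT-II)
```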
Referring to fig. 2, the memory circuit 110 is read and written sequentially by the power spectrum conversion circuit 120, the mel filter circuit 131, and the frequency-time conversion circuit 132 during different periods, so the maximum required bit width of the memory circuit 110 is the maximum of the bit widths of the three kinds of data these circuits access (i.e., the spectral coefficients a1, the compressed power parameters a2, and the mel spectrum parameters a3). In other words, the maximum required bit width of the memory circuit 110 is the maximum of the bit width of the input port ip1 of the power spectrum conversion circuit 120, the bit width of the input port ip2-1 of the mel filter circuit 131, and the bit width of the input port ip2-2 of the frequency-time conversion circuit 132. Because the power spectrum conversion circuit 120 performs the logarithmic processing, the bit width of the input port ip1 of the power spectrum conversion circuit 120 is larger than the bit width of the input port ip2-1 of the mel filter circuit 131. In addition, in the present embodiment, in which the frequency-time transform process of the frequency-time conversion circuit 132 is implemented in software, the bit width of the mel spectrum parameters a3 is greater than or equal to the bit width of the spectral coefficients a1, so in one embodiment the maximum required bit width of the memory circuit 110 is determined according to the bit width of the mel spectrum parameters a3 output by the mel filter circuit 131. However, in other embodiments, in which the frequency-time transform process of the frequency-time conversion circuit 132 is implemented in hardware, the frequency-time conversion circuit 132 writes intermediate operation data into the memory circuit 110, so the maximum required bit width of the memory circuit 110 is determined according to the bit width of the mel spectrum parameters a3 output by the mel filter circuit 131 or the bit width of the data output by the frequency-time conversion circuit 132.
Fig. 3 is a schematic diagram of an audio processing device for speech recognition according to an embodiment of the invention. Referring to fig. 3, the audio processing apparatus 30 for speech recognition includes a memory circuit 110, a preprocessing circuit 141, a time-frequency conversion circuit 142, a power logarithmic circuit 122, a mel filter circuit 131, and a frequency-time conversion circuit 132. The preprocessing circuit 141, the time-frequency conversion circuit 142, the power logarithmic circuit 122, the mel-filter circuit 131, and the frequency-time conversion circuit 132 are respectively coupled to the memory circuit 110 via an internal bus for performing a read-write operation on the memory circuit 110.
The preprocessing circuit 141 receives the time-domain audio sample data s1 and performs audio preprocessing on the time-domain audio sample data s1 to generate preprocessed data a4. The audio preprocessing may include pre-emphasis (Pre-emphasis) processing, frame blocking (Frame blocking) processing, windowing, and the like. In detail, the preprocessing circuit 141 receives the time-domain audio sample data s1 obtained by sampling the analog audio signal and performs the pre-emphasis processing by passing the time-domain audio sample data s1 through a high-pass filter. Next, the preprocessing circuit 141 may perform frame blocking by grouping every N sample points into one frame (Frame), where adjacent frames have overlapping sample points, and may perform windowing by multiplying each frame by a Hamming window (Hamming window). After all audio preprocessing is completed, the preprocessing circuit 141 writes the preprocessed data a4 into the memory circuit 110.
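A minimal NumPy sketch of these preprocessing steps is shown below; the 512-sample frame length matches the frame size mentioned in this embodiment, while the 50% frame overlap and the 0.97 pre-emphasis coefficient are common choices assumed here rather than values taken from the text:

```python
import numpy as np

def preprocess(samples: np.ndarray, frame_len=512, hop=256, alpha=0.97):
    """Pre-emphasis, frame blocking with overlap, and Hamming windowing (s1 -> a4)."""
    emphasized = np.append(samples[0], samples[1:] - alpha * samples[:-1])  # simple high-pass
    window = np.hamming(frame_len)
    frames = []
    for start in range(0, len(emphasized) - frame_len + 1, hop):
        frames.append(emphasized[start:start + frame_len] * window)  # one windowed frame a4
    return np.array(frames)
```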
After the memory circuit 110 has buffered enough preprocessed data a4 (e.g., the preprocessed data a4 of the 512 sample points in a frame), the time-frequency transform circuit 142 reads the preprocessed data a4 from the memory circuit 110 and performs a time-frequency transform process on the preprocessed data a4 to generate the spectral coefficients a1. In the present embodiment, the time-frequency transform circuit 142 performs FFT processing on the preprocessed data a4 to generate the spectral coefficients a1, each including a real coefficient and an imaginary coefficient. For example, the time-frequency transform circuit 142 may perform a 512-point FFT operation to generate the spectral coefficients a1, but the invention is not limited thereto. The time-frequency transform circuit 142 then writes these spectral coefficients a1 into the memory circuit 110.
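Continuing the sketch, a frame of preprocessed data a4 can be turned into spectral coefficients a1 with a 512-point FFT. For a real-valued frame, only 257 of the 512 complex outputs are independent, which is the conjugate symmetry exploited later when sizing the memory circuit (the random frame below is only a stand-in for real audio data):

```python
import numpy as np

frame = np.hamming(512) * np.random.default_rng(1).standard_normal(512)  # stand-in for a4
spectrum = np.fft.rfft(frame)            # 512-point FFT, keeping the 257 non-redundant bins
re, im = spectrum.real, spectrum.imag    # real and imaginary spectral coefficients a1
print(spectrum.shape)                    # (257,): 514 real values, i.e. M + 2 for M = 512
```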
The power logarithmic circuit 122 reads a plurality of spectral coefficients a1 of the time-domain audio sampling data s1 from the memory circuit 110, and generates a plurality of power spectral parameters according to the spectral coefficients a1. For each sampling point in a frame, the power spectral parameter may be generated by calculating the sum of the square of the real coefficient of the spectral coefficient a1 and the square of the imaginary coefficient of the spectral coefficient a1. The power logarithmic circuit 122 performs logarithmic conversion processing on the power spectrum parameters to generate a plurality of compressed power parameters a2, and writes the compressed power parameters a2 into the memory circuit 110.
In one embodiment, based on the following derivation of formulas (1) to (4), the power logarithmic circuit 122 may perform logarithmic processing on the square of the real coefficient of a spectral coefficient a1 to generate a first logarithmic value and perform logarithmic processing on the square of the imaginary coefficient of the spectral coefficient a1 to generate a second logarithmic value. The power logarithmic circuit 122 then generates a compressed power parameter a2 by comparing the first logarithmic value with the second logarithmic value.
P(k) = Re² + Im²   (1)
ln(P(k)) = ln(Re² + Im²) = ln(x + y)   (2)
Wherein P(k) is a power spectrum parameter; Re is the real coefficient of the spectral coefficient a1; Im is the imaginary coefficient of the spectral coefficient a1; x is the square of the real coefficient; and y is the square of the imaginary coefficient.
As such, if ln(x) > ln(y), then with p = ln(x) - ln(y):
ln(x + y) = ln(x·(1 + y/x)) = ln(x) + ln(1 + e^(-p))   (3)
On the other hand, if ln(x) < ln(y), then with p = ln(y) - ln(x):
ln(x + y) = ln(y·(1 + x/y)) = ln(y) + ln(1 + e^(-p))   (4)
where ln(x) represents the first logarithmic value and ln(y) represents the second logarithmic value. Accordingly, by comparing the first logarithmic value with the second logarithmic value, the power logarithmic circuit 122 can calculate the compressed power parameter a2 according to formula (3) or formula (4). In formulas (3) and (4), ln(1 + e^(-p)) can be obtained by looking up a pre-established lookup table, so the power logarithmic circuit 122 only needs to actually calculate the values of ln(x) and ln(y) to obtain the compressed power parameter a2. Note that ln(x) = ln(Re²) = 2·ln(|Re|) and ln(y) = ln(Im²) = 2·ln(|Im|). Since the power logarithmic circuit 122 performs the logarithmic processing directly instead of first computing the power spectrum parameter, it can obtain ln(x) and ln(y) simply by multiplying the logarithm of the real coefficient of the spectral coefficient a1 by 2 and multiplying the logarithm of the imaginary coefficient of the spectral coefficient a1 by 2.
Therefore, compared with a conventional design in which the power spectrum parameters are first calculated and then written into the memory circuit, the present embodiment avoids writing the wider power spectrum parameters into the memory circuit, thereby reducing the maximum required bit width of the memory circuit. In other words, by performing the logarithmic processing before the mel filtering, the situation in which power spectrum parameters with a large bit width must be written into the memory circuit is avoided.
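The following sketch illustrates the idea of formulas (3) and (4) in floating point: the log power is obtained from the logarithms of the real and imaginary coefficients alone, with the residual term ln(1 + e^(-p)) taken from a small lookup table. The table resolution and helper names are illustrative assumptions, not the circuit's actual fixed-point design.

```python
import numpy as np

# Hypothetical lookup table for ln(1 + e^(-p)), indexed by coarsely quantized p >= 0.
_P_GRID = np.linspace(0.0, 16.0, 257)
_LUT = np.log1p(np.exp(-_P_GRID))

def log_power(re: float, im: float) -> float:
    """ln(Re^2 + Im^2) computed without ever forming the wide power value."""
    ln_x = 2.0 * np.log(abs(re)) if re != 0 else -np.inf   # ln(Re^2) = 2 ln|Re|
    ln_y = 2.0 * np.log(abs(im)) if im != 0 else -np.inf   # ln(Im^2) = 2 ln|Im|
    big, small = max(ln_x, ln_y), min(ln_x, ln_y)
    if small == -np.inf:
        return big                                          # one component is zero
    p = big - small
    correction = _LUT[min(int(round(p / 16.0 * 256)), 256)]  # table lookup for ln(1 + e^(-p))
    return big + correction

# Sanity check against the direct formula:
print(log_power(3.0, 4.0), np.log(3.0 ** 2 + 4.0 ** 2))  # both close to ln(25) ~ 3.219
```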
After that, the mel filter circuit 131 reads the compressed power parameter a2 from the memory circuit 110. The mel filter circuit 131 performs mel filtering processing on the compressed power parameter a2 to generate a plurality of mel spectrum parameters a3, and writes the mel spectrum parameters a3 into the memory circuit 110. The frequency-time conversion circuit 132 reads the mel-frequency spectrum parameter a3 from the memory circuit 110, and performs a frequency-time conversion process on the mel-frequency spectrum parameter a3 to generate an audio feature vector fv1. The operation of the mel-filter circuit 131 and the frequency-time conversion circuit 132 is similar to that of the embodiment of fig. 2, and will not be repeated here. The maximum required bit width of the memory circuit 110 is determined according to the mel spectrum parameter a3 outputted by the mel filter circuit 131.
It should be noted that, in the present embodiment, the preprocessing circuit 141, the time-frequency conversion circuit 142, the power logarithmic circuit 122, the mel filter circuit 131, and the frequency-time conversion circuit 132 operate sequentially in different periods. Thus, the preprocessing circuit 141, the time-frequency conversion circuit 142, the power logarithmic circuit 122, the mel-filter circuit 131 and the frequency-time conversion circuit 132 can share the memory circuit 110 in a time-sharing manner without arranging a memory circuit between the circuit modules, thereby greatly reducing the hardware cost of the memory circuit and reducing the circuit area.
For example, referring to fig. 3, assuming a sampling frequency of 16 kHz, the bit width of the time-domain audio sample data s1 may be 16 bits (bit). The bit width of the preprocessed data a4 may be 24 bits. The bit width of the spectral coefficients a1 may be 24 bits. The bit width of the compressed power parameters a2 may be 19 bits. The bit width of the mel spectrum parameters a3 may be 24 bits. The bit width of the audio feature vector fv1 may be 32 bits. In this case, the maximum required bit width of the memory circuit 110 is 24 bits.
In addition, in one embodiment, the memory size of the memory circuit 110 is the maximum required bit width multiplied by the number of data sets, and the number of data sets is the number of operation points of the time-frequency conversion circuit 142 plus two. Specifically, when the number of operation points of the time-frequency conversion circuit 142 is M, the time-frequency conversion circuit 142 outputs M complex results, each including an imaginary coefficient and a real coefficient, so the time-frequency conversion circuit 142 nominally generates M×2 sets of calculation data. However, because these complex results have conjugate symmetry, only (M+2) memory addresses are actually needed to store the M×2 sets of data. Correspondingly, the memory size of the memory circuit 110 is (M+2) times the maximum required bit width. For example, assuming that the time-frequency transform circuit 142 performs a 512-point FFT operation and the maximum required bit width is 24 bits, the memory size of the memory circuit 110 is 514 multiplied by 24 bits.
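As a quick check of this sizing rule with the numbers used above (a 512-point FFT and a 24-bit maximum required bit width):

```python
fft_points = 512                      # M, the number of FFT operation points
max_bit_width = 24                    # bits, from the example above
data_sets = fft_points + 2            # conjugate symmetry: only M + 2 addresses are needed
memory_bits = data_sets * max_bit_width
print(data_sets, memory_bits)         # 514 addresses, 12336 bits (514 x 24)
```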
In summary, in the embodiments of the invention, the memory circuit can be reused by a plurality of circuit modules during the audio feature extraction process, so memory space can be saved. In addition, by taking the logarithm before performing the mel filtering, writing power spectrum parameters with a large bit width into the memory circuit is avoided, the maximum required bit width of the memory circuit for speech feature extraction is reduced, and the circuit area and hardware cost are reduced accordingly.
Although the present invention has been described with reference to the above embodiments, it should be understood that the invention is not limited thereto, but rather is capable of modification and variation without departing from the spirit and scope of the present invention.

Claims (11)

1. An audio processing apparatus for speech recognition, comprising:
a memory circuit;
the power logarithmic circuit is coupled with the memory circuit, reads a plurality of frequency spectrum coefficients of time domain audio sampling data from the memory circuit, generates a plurality of power spectrum parameters according to the frequency spectrum coefficients, performs logarithmic conversion processing on the power spectrum parameters to generate a plurality of compressed power parameters, and writes the compressed power parameters into the memory circuit;
the Mel filter circuit is coupled with the memory circuit, reads the compressed power parameters from the memory circuit, performs Mel filtering on the compressed power parameters to generate a plurality of Mel spectrum parameters, and writes the Mel spectrum parameters into the memory circuit; and
a frequency-time conversion circuit coupled to the memory circuit, reading the Mel frequency spectrum parameters from the memory circuit, performing frequency-time conversion on the Mel frequency spectrum parameters to generate audio feature vectors,
the power logarithmic circuit performs logarithmic processing on the square of the real coefficient to generate a first logarithmic value, performs logarithmic processing on the square of the imaginary coefficient to generate a second logarithmic value, and generates the compressed power parameters by comparing the first logarithmic value with the second logarithmic value.
2. The audio processing apparatus for speech recognition of claim 1, further comprising:
a preprocessing circuit coupled to the memory circuit for receiving the time-domain audio sample data, performing audio preprocessing on the time-domain audio sample data to generate preprocessed data, and writing the preprocessed data into the memory circuit; and
the time-frequency conversion circuit is coupled with the memory circuit, reads the preprocessed data from the memory circuit, performs time-frequency conversion on the preprocessed data to generate the frequency spectrum coefficients, and writes the frequency spectrum coefficients into the memory circuit.
3. The audio processing device for speech recognition according to claim 2, wherein the preprocessing circuit, the time-to-frequency conversion circuit, the power logarithmic circuit, the mel-filter circuit, and the frequency-to-time conversion circuit operate sequentially for a plurality of different time periods, respectively, to access the memory circuit for the plurality of different time periods, respectively.
4. The audio processing device for speech recognition according to claim 2, wherein the maximum required bit width of the memory circuit is determined according to the mel-frequency spectrum parameters output by the mel-filter circuit or the data output by the frequency-time conversion circuit.
5. The audio processing device for speech recognition according to claim 4, wherein the memory circuit has a memory size of a maximum required bit width multiplied by a number of data sets, and the number of data sets is the number of operation points of the time-frequency conversion circuit plus two.
6. The audio processing apparatus for speech recognition according to claim 2, wherein the time-frequency transform process is a fast fourier transform process, and the frequency-time transform process is a discrete cosine transform process.
7. An audio processing apparatus for speech recognition, comprising:
a memory circuit;
the power spectrum conversion circuit is coupled with the memory circuit, reads a plurality of frequency spectrum coefficients of time domain audio sampling data from the memory circuit, performs power spectrum conversion and compression processing according to the frequency spectrum coefficients to generate a plurality of compressed power parameters, and writes the compressed power parameters into the memory circuit; and
a feature extraction circuit coupled to the memory circuit, reading the compressed power parameters from the memory circuit, performing Mel filtering according to the compressed power parameters to generate audio feature vectors,
wherein the bit widths of the compressed power parameters are smaller than the bit widths of the spectral coefficients,
the power spectrum conversion circuit generates a plurality of power spectrum parameters according to the spectrum coefficients, performs logarithmic conversion processing on the power spectrum parameters to generate the compressed power parameters,
wherein each of the spectral coefficients includes a real coefficient and an imaginary coefficient, and the power spectrum conversion circuit performs logarithmic processing on the square of the real coefficient to generate a first logarithmic value, performs logarithmic processing on the square of the imaginary coefficient to generate a second logarithmic value, and generates the compressed power parameters by comparing the first logarithmic value with the second logarithmic value.
8. The audio processing device for speech recognition of claim 7, wherein the feature extraction circuit comprises:
and the Mel filter circuit is coupled with the memory circuit, reads the compressed power parameters from the memory circuit, performs Mel filtering on the compressed power parameters to generate a plurality of Mel spectrum parameters, and writes the Mel spectrum parameters into the memory circuit as the audio feature vector.
9. The audio processing device for speech recognition of claim 7, wherein the feature extraction circuit comprises:
the Mel filter circuit is coupled with the memory circuit, reads the compressed power parameters from the memory circuit, performs Mel filtering on the compressed power parameters to generate a plurality of Mel spectrum parameters, and writes the Mel spectrum parameters into the memory circuit; and
the frequency-time conversion circuit is coupled with the memory circuit, reads the Mel frequency spectrum parameters from the memory circuit, and performs frequency-time conversion on the Mel frequency spectrum parameters to generate the audio feature vector.
10. The audio processing device for speech recognition of claim 7, wherein the feature extraction circuit does not perform a logarithmic operation.
11. The audio processing apparatus for speech recognition according to claim 7, wherein the maximum required bit width of the input port of the feature extraction circuit is smaller than the maximum required bit width of the input port of the power spectrum conversion circuit,
wherein the feature extraction circuit is connected to the memory circuit via an input port of the feature extraction circuit to access the memory circuit via the input port of the feature extraction circuit,
the power spectrum conversion circuit is connected to the memory circuit through an input port of the power spectrum conversion circuit to access the memory circuit through the input port of the power spectrum conversion circuit.
CN202010071503.3A 2020-01-21 2020-01-21 Audio processing device for speech recognition Active CN113223511B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010071503.3A CN113223511B (en) 2020-01-21 2020-01-21 Audio processing device for speech recognition
US16/867,571 US11404046B2 (en) 2020-01-21 2020-05-06 Audio processing device for speech recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010071503.3A CN113223511B (en) 2020-01-21 2020-01-21 Audio processing device for speech recognition

Publications (2)

Publication Number Publication Date
CN113223511A CN113223511A (en) 2021-08-06
CN113223511B (en) 2024-04-16

Family

ID=76857265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010071503.3A Active CN113223511B (en) 2020-01-21 2020-01-21 Audio processing device for speech recognition

Country Status (2)

Country Link
US (1) US11404046B2 (en)
CN (1) CN113223511B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11881904B2 (en) * 2022-03-31 2024-01-23 Dell Products, L.P. Power detection in the time domain on a periodic basis with statistical counters

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010191252A (en) * 2009-02-19 2010-09-02 Toyota Motor Corp Speech recognition device, speech recognition method
WO2014153800A1 (en) * 2013-03-29 2014-10-02 京东方科技集团股份有限公司 Voice recognition system
CN108899032A (en) * 2018-06-06 2018-11-27 平安科技(深圳)有限公司 Method for recognizing sound-groove, device, computer equipment and storage medium
CN109166591A (en) * 2018-08-29 2019-01-08 昆明理工大学 A kind of classification method based on audio frequency characteristics signal
WO2019232846A1 (en) * 2018-06-04 2019-12-12 平安科技(深圳)有限公司 Speech differentiation method and apparatus, and computer device and storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6845359B2 (en) * 2001-03-22 2005-01-18 Motorola, Inc. FFT based sine wave synthesis method for parametric vocoders
US6772315B1 (en) * 2001-05-24 2004-08-03 Rambus Inc Translation lookaside buffer extended to provide physical and main-memory addresses
SG140445A1 (en) * 2003-07-28 2008-03-28 Sony Corp Method and apparatus for automatically recognizing audio data
JP5223786B2 (en) * 2009-06-10 2013-06-26 富士通株式会社 Voice band extending apparatus, voice band extending method, voice band extending computer program, and telephone
CN202615783U (en) 2012-05-23 2012-12-19 西北师范大学 Mel cepstrum analysis synthesizer based on FPGA
EP2862169A4 (en) * 2012-06-15 2016-03-02 Jemardator Ab Cepstral separation difference
US10719115B2 (en) * 2014-12-30 2020-07-21 Avago Technologies International Sales Pte. Limited Isolated word training and detection using generated phoneme concatenation models of audio inputs
US11004461B2 (en) * 2017-09-01 2021-05-11 Newton Howard Real-time vocal features extraction for automated emotional or mental state assessment
CN111210806B (en) * 2020-01-10 2022-06-17 东南大学 Low-power-consumption MFCC voice feature extraction circuit based on serial FFT


Also Published As

Publication number Publication date
US20210225360A1 (en) 2021-07-22
CN113223511A (en) 2021-08-06
US11404046B2 (en) 2022-08-02

Similar Documents

Publication Publication Date Title
JP2021086154A (en) Method, device, apparatus, and computer-readable storage medium for speech recognition
US20210193149A1 (en) Method, apparatus and device for voiceprint recognition, and medium
CN109256138B (en) Identity verification method, terminal device and computer readable storage medium
US11183177B2 (en) Real-time voice recognition apparatus equipped with ASIC chip and smartphone
AU2017404565A1 (en) Electronic device, method and system of identity verification and computer readable storage medium
CN110970036B (en) Voiceprint recognition method and device, computer storage medium and electronic equipment
TW200306526A (en) Method for robust voice recognition by analyzing redundant features of source signal
CN111433847A (en) Speech conversion method and training method, intelligent device and storage medium
Vu et al. Implementation of the MFCC front-end for low-cost speech recognition systems
CN113223511B (en) Audio processing device for speech recognition
CN112530410A (en) Command word recognition method and device
KR102194194B1 (en) Method, apparatus for blind signal seperating and electronic device
Zhang et al. Temporal Transformer Networks for Acoustic Scene Classification.
Helali et al. Real time speech recognition based on PWP thresholding and MFCC using SVM
CN112397086A (en) Voice keyword detection method and device, terminal equipment and storage medium
Ren et al. Recalibrated bandpass filtering on temporal waveform for audio spoof detection
Joy et al. Deep scattering power spectrum features for robust speech recognition
CN110875037A (en) Voice data processing method and device and electronic equipment
CN111462770A (en) L STM-based late reverberation suppression method and system
Ernawan et al. Efficient discrete tchebichef on spectrum analysis of speech recognition
Pardede et al. Generalized-log spectral mean normalization for speech recognition
CN113555031B (en) Training method and device of voice enhancement model, and voice enhancement method and device
Li et al. Dual-stream speech dereverberation network using long-term and short-term cues
CN114067784A (en) Training method and device of fundamental frequency extraction model and fundamental frequency extraction method and device
CN111341321A (en) Matlab-based spectrogram generating and displaying method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant