CN113223511B - Audio processing device for speech recognition - Google Patents

Audio processing device for speech recognition

Info

Publication number
CN113223511B
CN113223511B (application CN202010071503.3A)
Authority
CN
China
Prior art keywords
circuit
memory circuit
parameters
mel
frequency
Prior art date
Legal status
Active
Application number
CN202010071503.3A
Other languages
Chinese (zh)
Other versions
CN113223511A (en)
Inventor
冯梦豪
Current Assignee
Zhuhai Xuanyang Technology Co ltd
Original Assignee
Zhuhai Xuanyang Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Zhuhai Xuanyang Technology Co., Ltd.
Priority to CN202010071503.3A
Priority to US16/867,571
Publication of CN113223511A
Application granted
Publication of CN113223511B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/285 Memory allocation or algorithm optimisation to reduce hardware requirements
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/21 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being power information

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

An audio processing device for speech recognition includes a memory circuit, a power spectrum conversion circuit, and a feature extraction circuit. The power spectrum conversion circuit is coupled to the memory circuit, reads a plurality of spectral coefficients of time-domain audio sample data from the memory circuit, performs power spectrum conversion and compression processing on the spectral coefficients to generate a plurality of compressed power parameters, and writes the compressed power parameters into the memory circuit. The feature extraction circuit is coupled to the memory circuit, reads the compressed power parameters from the memory circuit, and performs Mel filtering and frequency-time transform processing on the compressed power parameters to generate an audio feature vector. The bit width of the compressed power parameters is smaller than the bit width of the spectral coefficients.

Description

Audio processing device for speech recognition
Technical Field
The present invention relates to an audio processing device, and more particularly, to an audio processing device for speech recognition.
Background
With advances in technology, more and more electronic devices are adopting voice control, and voice control is expected to become a common user interface for most electronic devices in the future. The recognition rate of speech recognition (Speech Recognition) directly affects the user experience of an electronic device. In the implementation of speech recognition, speech feature extraction is a critical step. For example, one of the most commonly used speech features today is the Mel-frequency cepstral coefficient (Mel-scale Frequency Cepstral Coefficients, MFCC). Mel-frequency cepstral coefficients model the auditory characteristics of the human ear, reflect human perception of speech, and achieve a high recognition rate in practical speech recognition applications. The various steps of speech feature extraction may be implemented by a plurality of hardware circuit blocks; for example, the mel filter used to generate mel-frequency cepstral coefficients may be implemented by a plurality of triangular band-pass filters. The manner in which these hardware circuits perform speech feature extraction directly affects manufacturing cost, circuit area, circuit performance, and the like. Therefore, as speech recognition becomes more widely applied, how to design a speech feature extraction circuit that meets these requirements is an important issue for those skilled in the art.
Disclosure of Invention
In view of the above, the present invention provides an audio processing device for speech recognition, which can save memory space and reduce memory bit width, thereby reducing hardware cost.
An embodiment of the invention provides an audio processing device for speech recognition, which includes a memory circuit, a power logarithmic circuit, a Mel filter circuit, and a frequency-time conversion circuit. The power logarithmic circuit is coupled to the memory circuit, reads a plurality of spectral coefficients of time-domain audio sample data from the memory circuit, and generates a plurality of power spectrum parameters according to the spectral coefficients. The power logarithmic circuit performs logarithmic conversion processing on the power spectrum parameters to generate a plurality of compressed power parameters and writes the compressed power parameters into the memory circuit. The Mel filter circuit is coupled to the memory circuit and reads the compressed power parameters from the memory circuit. The Mel filter circuit performs Mel filtering processing on the compressed power parameters to generate a plurality of Mel spectrum parameters and writes the Mel spectrum parameters into the memory circuit. The frequency-time conversion circuit is coupled to the memory circuit, reads the Mel spectrum parameters from the memory circuit, and performs frequency-time conversion processing on the Mel spectrum parameters to generate an audio feature vector.
An embodiment of the invention provides an audio processing device for speech recognition, which includes a memory circuit, a power spectrum conversion circuit, and a feature extraction circuit. The power spectrum conversion circuit is coupled to the memory circuit, reads a plurality of spectral coefficients of time-domain audio sample data from the memory circuit, performs power spectrum conversion and compression processing according to the spectral coefficients to generate a plurality of compressed power parameters, and writes the compressed power parameters into the memory circuit. The feature extraction circuit is coupled to the memory circuit, reads the compressed power parameters from the memory circuit, and performs Mel filtering processing according to the compressed power parameters to generate an audio feature vector. The bit width of the compressed power parameters is smaller than the bit width of the spectral coefficients.
Based on the above, in embodiments of the invention, the audio processing device for speech recognition may include a memory circuit and a plurality of circuit modules, where the circuit modules are used to extract speech features from the audio data and are sequentially placed in an operating state during different time periods. Therefore, the circuit modules can share the same memory circuit and reuse it in a time-division manner, saving the hardware cost of the memory circuit. In addition, because one of the circuit modules performs the power spectrum conversion and compression processing before writing the compressed power parameters into the memory circuit, the maximum required bit width of the memory circuit for speech feature extraction can be reduced.
In order to make the above features and advantages of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
Drawings
Fig. 1 is a schematic diagram of an audio processing device for speech recognition according to an embodiment of the invention.
Fig. 2 is a schematic diagram of an audio processing device for speech recognition according to an embodiment of the invention.
Fig. 3 is a schematic diagram of an audio processing device for speech recognition according to an embodiment of the invention.
Description of the reference numerals
10, 30: Audio processing device
110: memory circuit
120: power spectrum conversion circuit
130: feature extraction circuit
a1: spectral coefficients
a2: compressed power parameters
fv1: audio feature vector
ip1, ip2-1, ip2-2: input port
131: mel filter circuit
132: frequency-time conversion circuit
a3: mel spectral parameters
141: pretreatment circuit
142: time-frequency conversion circuit
122: power logarithmic circuit
s1: time-domain audio sampling data
a4: preprocessed data
Detailed Description
Some embodiments of the invention are described in detail below with reference to the drawings; reference numerals used in the following description refer to the same or similar elements appearing in different drawings. These embodiments are only a part of the invention and do not disclose all possible embodiments of the invention. Rather, these embodiments are merely examples of the devices claimed by the present invention.
Fig. 1 is a schematic diagram of an audio processing device for speech recognition according to an embodiment of the present invention. Referring to fig. 1, an audio processing apparatus 10 for speech recognition includes a memory circuit 110, a power spectrum conversion circuit 120, and a feature extraction circuit 130. In one embodiment, the audio processing device 10 may be implemented as an audio processing chip with voice recognition function.
The memory circuit 110 is used to buffer data during voice feature extraction, and may be, but not limited to, a static random-access memory (SRAM). The memory circuit 110 may be coupled to the power spectrum conversion circuit 120 and the feature extraction circuit 130 via an internal bus, and the power spectrum conversion circuit 120 and the feature extraction circuit 130 may transmit data to and from the memory circuit 110 via the internal bus.
The power spectrum conversion circuit 120 may read a plurality of spectral coefficients a1 of time-domain (time-domain) audio sample data from the memory circuit 110, and perform power spectrum conversion and compression processing according to the spectral coefficients a1 to generate a plurality of compressed power parameters a2. In detail, the time-domain audio sample data is generated by sampling an analog audio signal at a sampling frequency of, for example, 8 kHz or 16 kHz. The spectral coefficients a1 are generated by performing a time-frequency transform process, such as a fast Fourier transform (Fast Fourier Transformation, FFT), on the time-domain audio sample data within a sampling period (i.e., a frame), and the spectral coefficient a1 of each sampling point includes a real (Real) component and an imaginary (Imaginary) component.
The power spectrum conversion circuit 120 may perform power spectrum conversion on the spectral coefficients a1 to obtain spectral features, i.e., calculate, for each spectral coefficient a1, the sum of the square of its real coefficient and the square of its imaginary coefficient. The bit width (Bit Width) of the data generated after the power spectrum conversion increases greatly, since squaring roughly doubles the bit width. Therefore, in the present embodiment, the power spectrum conversion circuit 120 further performs compression processing to generate the plurality of compressed power parameters a2, so as to compress the bit width of the data to be written into the memory circuit 110. The compression processing is, for example, logarithmic processing. In other words, the bit width of the compressed power parameters a2 is smaller than the bit width of the spectral coefficients a1. The power spectrum conversion circuit 120 then writes the compressed power parameters a2 into the memory circuit 110.
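As an illustration of the bit-width issue, the following minimal NumPy sketch (not the circuit itself; the function name and the 24-bit coefficient width are assumptions taken from the example figures later in this description) computes the power spectrum of one frame of spectral coefficients and then log-compresses it:

```python
import numpy as np

def compress_power(re: np.ndarray, im: np.ndarray) -> np.ndarray:
    """Power spectrum conversion followed by logarithmic compression (a1 -> a2)."""
    power = re.astype(np.int64) ** 2 + im.astype(np.int64) ** 2  # squaring roughly doubles the bit width
    return np.log(np.maximum(power, 1))                          # the logarithm shrinks the dynamic range

# 24-bit signed coefficients: the summed squares can need ~48 bits,
# while their natural logarithm stays below ln(2**48), about 33.3.
rng = np.random.default_rng(0)
re = rng.integers(-(2 ** 23), 2 ** 23, size=257)
im = rng.integers(-(2 ** 23), 2 ** 23, size=257)
a2 = compress_power(re, im)
print(a2.min(), a2.max())
```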
The feature extraction circuit 130 may read the compressed power parameters a2 from the memory circuit 110 and perform mel filtering processing according to the compressed power parameters a2 to generate the audio feature vector fv1. In one embodiment, the feature extraction circuit 130 may obtain a plurality of audio feature parameters (also referred to as mel-frequency cepstral coefficients (Mel Frequency Cepstral Coefficients, MFCC)) through mel filtering processing and frequency-time transform processing, so as to obtain a multi-dimensional audio feature vector fv1. Alternatively, in another embodiment, the feature extraction circuit 130 may obtain a plurality of mel spectrum parameters through mel filtering processing and use the mel spectrum parameters as the audio feature vector fv1. The feature extraction circuit 130 may be implemented by a software module, a hardware module, or a combination thereof, which is not limited herein. The software module may be programming code or instructions stored in a recording medium. The hardware module may be a logic circuit implemented on an integrated circuit (integrated circuit). For example, the frequency-time transform process of the feature extraction circuit 130 may be implemented in software using a programming language. In addition, the mel filtering and/or frequency-time transform processes of the feature extraction circuit 130 may also be implemented as hardware modules using a hardware description language (hardware description languages) or other suitable programming languages, and may thus include one or more microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (Field Programmable Gate Array, FPGA), or other types of hardware circuits.
In one embodiment, the audio feature vector fv1 may be matched against a predetermined acoustic model or provided to a machine learning model for speech recognition. In another embodiment, the audio feature vector fv1 may be matched against the predetermined acoustic model or provided to the machine learning model after further operations. Here, the power spectrum conversion circuit 120 and the feature extraction circuit 130 are sequentially enabled into the operating state, and the power spectrum conversion circuit 120 and the feature extraction circuit 130 can share the same storage space of the memory circuit 110 in a time-division manner. In other words, in one embodiment, the plurality of circuits used for generating the audio feature vector fv1 access the memory circuit 110 sequentially during different time periods, i.e., the memory circuit 110 is accessed by only a single circuit module during any given time period. It should be noted that the maximum required bit width of the memory circuit 110 is determined according to the bit width of the audio feature vector fv1 output by the feature extraction circuit 130.
Here, the power spectrum conversion circuit 120 is connected to the memory circuit 110 via its input port ip1 and accesses the memory circuit 110 through the input port ip1. The feature extraction circuit 130 is connected to the memory circuit 110 via its input port ip2 and accesses the memory circuit 110 through the input port ip2. It should be noted that, in an embodiment, since the power spectrum conversion circuit 120 performs the compression processing, the feature extraction circuit 130 does not need to perform a logarithmic operation. In addition, in one embodiment, the power spectrum conversion circuit 120 reads the spectral coefficients a1 from the memory circuit 110 through the input port ip1, and the feature extraction circuit 130 sequentially reads the compressed power parameters a2 from the memory circuit 110 through the input port ip2. Accordingly, since the bit width of the compressed power parameters a2 is smaller than the bit width of the spectral coefficients a1, the maximum required bit width of the input port ip2 of the feature extraction circuit 130 is smaller than the maximum required bit width of the input port ip1 of the power spectrum conversion circuit 120.
Fig. 2 is a schematic diagram of an audio processing device for speech recognition according to an embodiment of the invention. Referring to fig. 2, in the present embodiment, the feature extraction circuit 130 may include a mel filter circuit 131 and a frequency-time conversion circuit 132. The mel filter circuit 131 and the frequency-time conversion circuit 132 may be respectively coupled to the memory circuit 110 via an internal bus.
The power spectrum conversion circuit 120 reads a plurality of spectral coefficients a1 of the time-domain audio sampling data from the memory circuit 110, performs power spectrum conversion and compression processing according to the spectral coefficients a1 to generate a plurality of compressed power parameters a2, and writes the compressed power parameters a2 into the memory circuit 110. In this embodiment, the compression process may be a logarithmic process. That is, the power spectrum conversion circuit 120 may generate a plurality of power spectrum parameters according to the spectrum coefficient a1, and perform a logarithmic conversion process on the power spectrum parameters to generate the compressed power parameter a2. For each sampling point in a frame, the power spectral parameter may be generated by calculating the sum of the square of the real coefficient of the spectral coefficient a1 and the square of the imaginary coefficient of the spectral coefficient a1.
In this embodiment, the mel filter circuit 131 may include, for example, a set of 19 nonlinearly distributed triangular band-pass filters (Triangular Bandpass Filters). The mel filter circuit 131 reads the compressed power parameters a2 from the memory circuit 110 and performs mel filtering processing on the compressed power parameters a2 to generate a plurality of mel spectrum parameters a3. Next, the mel filter circuit 131 writes the mel spectrum parameters a3 into the memory circuit 110. Specifically, the mel filter circuit 131 may obtain the logarithmic energy output by each triangular band-pass filter according to the compressed power parameters a2 and write the logarithmic energy into the memory circuit 110. Next, the frequency-time conversion circuit 132 reads the mel spectrum parameters a3 from the memory circuit 110 and performs a frequency-time transform process on the mel spectrum parameters a3 to generate the audio feature vector fv1, thereby obtaining the mel-frequency cepstral coefficients (MFCC) of a frame. The frequency-time transform process may be a discrete cosine transform (Discrete Cosine Transform, DCT) process.
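For reference, the sketch below shows, in floating-point NumPy/SciPy, one way the mel filtering and frequency-time transform steps could look. The 19 triangular filters, the 512-point FFT, and the 16 kHz sampling rate follow the figures in this description; applying the filterbank directly to the already log-compressed power parameters (rather than to linear power, as in textbook MFCC pipelines) follows the ordering described above. The helper names and the number of retained cepstral coefficients are assumptions for illustration only, not the circuit's fixed-point design.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=19, n_fft=512, sample_rate=16000):
    """Triangular band-pass filters spaced uniformly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)   # rising edge
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)  # falling edge
    return fbank

def mfcc_from_compressed_power(a2, n_ceps=13):
    """a2: log-compressed power parameters of one frame (257 values)."""
    a3 = mel_filterbank() @ a2                       # mel spectrum parameters (per-filter energies)
    return dct(a3, type=2, norm='ortho')[:n_ceps]    # frequency-time transform (DCT-II)
```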
Referring to fig. 2, the memory circuit 110 is read and written sequentially by the power spectrum conversion circuit 120, the mel filter circuit 131, and the frequency-time conversion circuit 132 during different periods, so the maximum required bit width of the memory circuit 110 is the maximum of the bit widths of the three kinds of data these circuits access (i.e., the spectral coefficients a1, the compressed power parameters a2, and the mel spectrum parameters a3). In other words, the maximum required bit width of the memory circuit 110 is the maximum of the bit width of the input port ip1 of the power spectrum conversion circuit 120, the bit width of the input port ip2-1 of the mel filter circuit 131, and the bit width of the input port ip2-2 of the frequency-time conversion circuit 132. Because the power spectrum conversion circuit 120 performs the logarithmic processing, the bit width of the input port ip1 of the power spectrum conversion circuit 120 is larger than the bit width of the input port ip2-1 of the mel filter circuit 131. In addition, in the present embodiment, in which the frequency-time transform process of the frequency-time conversion circuit 132 is implemented in software, the bit width of the mel spectrum parameters a3 is greater than or equal to the bit width of the spectral coefficients a1, so in one embodiment the maximum required bit width of the memory circuit 110 is determined according to the bit width of the mel spectrum parameters a3 output by the mel filter circuit 131. However, in other embodiments, in which the frequency-time transform process of the frequency-time conversion circuit 132 is implemented in hardware, the frequency-time conversion circuit 132 writes intermediate operation data into the memory circuit 110, so the maximum required bit width of the memory circuit 110 is determined according to the bit width of the mel spectrum parameters a3 output by the mel filter circuit 131 or the bit width of the data output by the frequency-time conversion circuit 132.
Fig. 3 is a schematic diagram of an audio processing device for speech recognition according to an embodiment of the invention. Referring to fig. 3, the audio processing apparatus 30 for speech recognition includes a memory circuit 110, a preprocessing circuit 141, a time-frequency conversion circuit 142, a power logarithmic circuit 122, a mel filter circuit 131, and a frequency-time conversion circuit 132. The preprocessing circuit 141, the time-frequency conversion circuit 142, the power logarithmic circuit 122, the mel-filter circuit 131, and the frequency-time conversion circuit 132 are respectively coupled to the memory circuit 110 via an internal bus for performing a read-write operation on the memory circuit 110.
The preprocessing circuit 141 receives the time-domain audio sample data s1 and performs audio preprocessing on the time-domain audio sample data s1 to generate preprocessed data a4. The audio preprocessing may include pre-emphasis (Pre-emphasis) processing, frame blocking (Frame blocking) processing, windowing, and the like. In detail, the preprocessing circuit 141 receives the time-domain audio sample data s1 obtained by sampling the analog audio signal and performs the pre-emphasis processing by passing the time-domain audio sample data s1 through a high-pass filter. Next, the preprocessing circuit 141 may perform frame blocking by grouping every N sample points into one frame (Frame), where adjacent frames have overlapping sample points, and may perform windowing by multiplying each frame by a Hamming window (Hamming window). After all audio preprocessing is completed, the preprocessing circuit 141 writes the preprocessed data a4 into the memory circuit 110.
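A minimal NumPy sketch of these preprocessing steps is shown below; the 512-sample frame length matches the frame size mentioned in this embodiment, while the 50% frame overlap and the 0.97 pre-emphasis coefficient are common choices assumed here rather than values taken from the text:

```python
import numpy as np

def preprocess(samples: np.ndarray, frame_len=512, hop=256, alpha=0.97):
    """Pre-emphasis, frame blocking with overlap, and Hamming windowing (s1 -> a4)."""
    emphasized = np.append(samples[0], samples[1:] - alpha * samples[:-1])  # simple high-pass
    window = np.hamming(frame_len)
    frames = []
    for start in range(0, len(emphasized) - frame_len + 1, hop):
        frames.append(emphasized[start:start + frame_len] * window)  # one windowed frame a4
    return np.array(frames)
```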
After the memory circuit 110 has buffered enough preprocessed data a4 (e.g., the preprocessed data a4 of the 512 sample points in a frame), the time-frequency transform circuit 142 reads the preprocessed data a4 from the memory circuit 110 and performs a time-frequency transform process on the preprocessed data a4 to generate the spectral coefficients a1. In the present embodiment, the time-frequency transform circuit 142 performs FFT processing on the preprocessed data a4 to generate the spectral coefficients a1, each including a real coefficient and an imaginary coefficient. For example, the time-frequency transform circuit 142 may perform a 512-point FFT operation to generate the spectral coefficients a1, but the invention is not limited thereto. The time-frequency transform circuit 142 then writes these spectral coefficients a1 into the memory circuit 110.
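Continuing the sketch, a frame of preprocessed data a4 can be turned into spectral coefficients a1 with a 512-point FFT. For a real-valued frame, only 257 of the 512 complex outputs are independent, which is the conjugate symmetry exploited later when sizing the memory circuit (the random frame below is only a stand-in for real audio data):

```python
import numpy as np

frame = np.hamming(512) * np.random.default_rng(1).standard_normal(512)  # stand-in for a4
spectrum = np.fft.rfft(frame)            # 512-point FFT, keeping the 257 non-redundant bins
re, im = spectrum.real, spectrum.imag    # real and imaginary spectral coefficients a1
print(spectrum.shape)                    # (257,): 514 real values, i.e. M + 2 for M = 512
```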
The power logarithmic circuit 122 reads a plurality of spectral coefficients a1 of the time-domain audio sampling data s1 from the memory circuit 110, and generates a plurality of power spectral parameters according to the spectral coefficients a1. For each sampling point in a frame, the power spectral parameter may be generated by calculating the sum of the square of the real coefficient of the spectral coefficient a1 and the square of the imaginary coefficient of the spectral coefficient a1. The power logarithmic circuit 122 performs logarithmic conversion processing on the power spectrum parameters to generate a plurality of compressed power parameters a2, and writes the compressed power parameters a2 into the memory circuit 110.
In one embodiment, based on the following derivation of formulas (1) to (4), the power logarithmic circuit 122 may perform logarithmic processing on the square of the real coefficient of a spectral coefficient a1 to generate a first logarithmic value and perform logarithmic processing on the square of the imaginary coefficient of the spectral coefficient a1 to generate a second logarithmic value. The power logarithmic circuit 122 then generates a compressed power parameter a2 by comparing the first logarithmic value with the second logarithmic value.
P(k) = Re² + Im²   (1)
ln(P(k)) = ln(Re² + Im²) = ln(x + y)   (2)
Wherein P(k) is a power spectrum parameter; Re is the real coefficient of the spectral coefficient a1; Im is the imaginary coefficient of the spectral coefficient a1; x is the square of the real coefficient; and y is the square of the imaginary coefficient.
As such, if ln(x) > ln(y), then with p = ln(x) - ln(y):
ln(x + y) = ln(x·(1 + y/x)) = ln(x) + ln(1 + e^(-p))   (3)
On the other hand, if ln(x) < ln(y), then with p = ln(y) - ln(x):
ln(x + y) = ln(y·(1 + x/y)) = ln(y) + ln(1 + e^(-p))   (4)
where ln(x) represents the first logarithmic value and ln(y) represents the second logarithmic value. Accordingly, by comparing the first logarithmic value with the second logarithmic value, the power logarithmic circuit 122 can calculate the compressed power parameter a2 according to formula (3) or formula (4). In formulas (3) and (4), ln(1 + e^(-p)) can be obtained by looking up a pre-established lookup table, so the power logarithmic circuit 122 only needs to actually calculate the values of ln(x) and ln(y) to obtain the compressed power parameter a2. Note that ln(x) = ln(Re²) = 2·ln(|Re|) and ln(y) = ln(Im²) = 2·ln(|Im|). Since the power logarithmic circuit 122 performs the logarithmic processing directly instead of first computing the power spectrum parameter, it can obtain ln(x) and ln(y) simply by multiplying the logarithm of the real coefficient of the spectral coefficient a1 by 2 and multiplying the logarithm of the imaginary coefficient of the spectral coefficient a1 by 2.
Therefore, compared with a conventional design in which the power spectrum parameters are first calculated and then written into the memory circuit, the present embodiment avoids writing the wider power spectrum parameters into the memory circuit, thereby reducing the maximum required bit width of the memory circuit. In other words, by performing the logarithmic processing before the mel filtering, the situation in which power spectrum parameters with a large bit width must be written into the memory circuit is avoided.
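The following sketch illustrates the idea of formulas (3) and (4) in floating point: the log power is obtained from the logarithms of the real and imaginary coefficients alone, with the residual term ln(1 + e^(-p)) taken from a small lookup table. The table resolution and helper names are illustrative assumptions, not the circuit's actual fixed-point design.

```python
import numpy as np

# Hypothetical lookup table for ln(1 + e^(-p)), indexed by coarsely quantized p >= 0.
_P_GRID = np.linspace(0.0, 16.0, 257)
_LUT = np.log1p(np.exp(-_P_GRID))

def log_power(re: float, im: float) -> float:
    """ln(Re^2 + Im^2) computed without ever forming the wide power value."""
    ln_x = 2.0 * np.log(abs(re)) if re != 0 else -np.inf   # ln(Re^2) = 2 ln|Re|
    ln_y = 2.0 * np.log(abs(im)) if im != 0 else -np.inf   # ln(Im^2) = 2 ln|Im|
    big, small = max(ln_x, ln_y), min(ln_x, ln_y)
    if small == -np.inf:
        return big                                          # one component is zero
    p = big - small
    correction = _LUT[min(int(round(p / 16.0 * 256)), 256)]  # table lookup for ln(1 + e^(-p))
    return big + correction

# Sanity check against the direct formula:
print(log_power(3.0, 4.0), np.log(3.0 ** 2 + 4.0 ** 2))  # both close to ln(25) ~ 3.219
```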
After that, the mel filter circuit 131 reads the compressed power parameter a2 from the memory circuit 110. The mel filter circuit 131 performs mel filtering processing on the compressed power parameter a2 to generate a plurality of mel spectrum parameters a3, and writes the mel spectrum parameters a3 into the memory circuit 110. The frequency-time conversion circuit 132 reads the mel-frequency spectrum parameter a3 from the memory circuit 110, and performs a frequency-time conversion process on the mel-frequency spectrum parameter a3 to generate an audio feature vector fv1. The operation of the mel-filter circuit 131 and the frequency-time conversion circuit 132 is similar to that of the embodiment of fig. 2, and will not be repeated here. The maximum required bit width of the memory circuit 110 is determined according to the mel spectrum parameter a3 outputted by the mel filter circuit 131.
It should be noted that, in the present embodiment, the preprocessing circuit 141, the time-frequency conversion circuit 142, the power logarithmic circuit 122, the mel filter circuit 131, and the frequency-time conversion circuit 132 operate sequentially in different periods. Thus, the preprocessing circuit 141, the time-frequency conversion circuit 142, the power logarithmic circuit 122, the mel-filter circuit 131 and the frequency-time conversion circuit 132 can share the memory circuit 110 in a time-sharing manner without arranging a memory circuit between the circuit modules, thereby greatly reducing the hardware cost of the memory circuit and reducing the circuit area.
For example, referring to fig. 3, assuming a sampling frequency of 16 kHz, the bit width of the time-domain audio sample data s1 may be 16 bits (bit). The bit width of the preprocessed data a4 may be 24 bits. The bit width of the spectral coefficients a1 may be 24 bits. The bit width of the compressed power parameters a2 may be 19 bits. The bit width of the mel spectrum parameters a3 may be 24 bits. The bit width of the audio feature vector fv1 may be 32 bits. In this case, the maximum required bit width of the memory circuit 110 is 24 bits.
In addition, in one embodiment, the memory size of the memory circuit 110 is the maximum required bit width multiplied by the number of data sets, and the number of data sets is the number of operation points of the time-frequency conversion circuit 142 plus two. Specifically, when the number of operation points of the time-frequency conversion circuit 142 is M, the time-frequency conversion circuit 142 outputs M complex results, each including an imaginary coefficient and a real coefficient, so the time-frequency conversion circuit 142 nominally generates M×2 sets of calculation data. However, because these complex results have conjugate symmetry, only (M+2) memory addresses are actually needed to store the M×2 sets of data. Correspondingly, the memory size of the memory circuit 110 is (M+2) times the maximum required bit width. For example, assuming that the time-frequency transform circuit 142 performs a 512-point FFT operation and the maximum required bit width is 24 bits, the memory size of the memory circuit 110 is 514 multiplied by 24 bits.
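As a quick check of this sizing rule with the numbers used above (a 512-point FFT and a 24-bit maximum required bit width):

```python
fft_points = 512                      # M, the number of FFT operation points
max_bit_width = 24                    # bits, from the example above
data_sets = fft_points + 2            # conjugate symmetry: only M + 2 addresses are needed
memory_bits = data_sets * max_bit_width
print(data_sets, memory_bits)         # 514 addresses, 12336 bits (514 x 24)
```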
In summary, in the embodiments of the invention, the memory circuit can be reused by a plurality of circuit modules during the audio feature extraction process, so memory space can be saved. In addition, by taking the logarithm before performing the mel filtering, writing power spectrum parameters with a large bit width into the memory circuit is avoided, the maximum required bit width of the memory circuit for speech feature extraction is reduced, and the circuit area and hardware cost are reduced accordingly.
Although the present invention has been described with reference to the above embodiments, it should be understood that the invention is not limited thereto, but rather is capable of modification and variation without departing from the spirit and scope of the present invention.

Claims (11)

1. An audio processing apparatus for speech recognition, comprising:
a memory circuit;
the power logarithmic circuit is coupled with the memory circuit, reads a plurality of frequency spectrum coefficients of time domain audio sampling data from the memory circuit, generates a plurality of power spectrum parameters according to the frequency spectrum coefficients, performs logarithmic conversion processing on the power spectrum parameters to generate a plurality of compressed power parameters, and writes the compressed power parameters into the memory circuit;
the Mel filter circuit is coupled with the memory circuit, reads the compressed power parameters from the memory circuit, performs Mel filtering on the compressed power parameters to generate a plurality of Mel spectrum parameters, and writes the Mel spectrum parameters into the memory circuit; and
a frequency-time conversion circuit coupled to the memory circuit, reading the Mel frequency spectrum parameters from the memory circuit, performing frequency-time conversion on the Mel frequency spectrum parameters to generate audio feature vectors,
the power logarithmic circuit performs logarithmic processing on the square of the real coefficient to generate a first logarithmic value, performs logarithmic processing on the square of the imaginary coefficient to generate a second logarithmic value, and generates the compressed power parameters by comparing the first logarithmic value with the second logarithmic value.
2. The audio processing apparatus for speech recognition of claim 1, further comprising:
a preprocessing circuit coupled to the memory circuit for receiving the time-domain audio sample data, performing audio preprocessing on the time-domain audio sample data to generate preprocessed data, and writing the preprocessed data into the memory circuit; and
the time-frequency conversion circuit is coupled with the memory circuit, reads the preprocessed data from the memory circuit, performs time-frequency conversion on the preprocessed data to generate the frequency spectrum coefficients, and writes the frequency spectrum coefficients into the memory circuit.
3. The audio processing device for speech recognition according to claim 2, wherein the preprocessing circuit, the time-to-frequency conversion circuit, the power logarithmic circuit, the mel-filter circuit, and the frequency-to-time conversion circuit operate sequentially for a plurality of different time periods, respectively, to access the memory circuit for the plurality of different time periods, respectively.
4. The audio processing device for speech recognition according to claim 2, wherein the maximum required bit width of the memory circuit is determined according to the mel-frequency spectrum parameters output by the mel-filter circuit or the data output by the frequency-time conversion circuit.
5. The audio processing device for speech recognition according to claim 4, wherein the memory circuit has a memory size of a maximum required bit width multiplied by a number of data sets, and the number of data sets is the number of operation points of the time-frequency conversion circuit plus two.
6. The audio processing apparatus for speech recognition according to claim 2, wherein the time-frequency transform process is a fast fourier transform process, and the frequency-time transform process is a discrete cosine transform process.
7. An audio processing apparatus for speech recognition, comprising:
a memory circuit;
the power spectrum conversion circuit is coupled with the memory circuit, reads a plurality of frequency spectrum coefficients of time domain audio sampling data from the memory circuit, performs power spectrum conversion and compression processing according to the frequency spectrum coefficients to generate a plurality of compressed power parameters, and writes the compressed power parameters into the memory circuit; and
a feature extraction circuit coupled to the memory circuit, reading the compressed power parameters from the memory circuit, performing Mel filtering according to the compressed power parameters to generate audio feature vectors,
wherein the bit widths of the compressed power parameters are smaller than the bit widths of the spectral coefficients,
the power spectrum conversion circuit generates a plurality of power spectrum parameters according to the spectrum coefficients, performs logarithmic conversion processing on the power spectrum parameters to generate the compressed power parameters,
wherein each of the spectral coefficients includes a real coefficient and an imaginary coefficient, and the power spectrum conversion circuit performs logarithmic processing on the square of the real coefficient to generate a first logarithmic value, performs logarithmic processing on the square of the imaginary coefficient to generate a second logarithmic value, and generates the compressed power parameters by comparing the first logarithmic value with the second logarithmic value.
8. The audio processing device for speech recognition of claim 7, wherein the feature extraction circuit comprises:
and the Mel filter circuit is coupled with the memory circuit, reads the compressed power parameters from the memory circuit, performs Mel filtering on the compressed power parameters to generate a plurality of Mel spectrum parameters, and writes the Mel spectrum parameters into the memory circuit as the audio feature vector.
9. The audio processing device for speech recognition of claim 7, wherein the feature extraction circuit comprises:
the Mel filter circuit is coupled with the memory circuit, reads the compressed power parameters from the memory circuit, performs Mel filtering on the compressed power parameters to generate a plurality of Mel spectrum parameters, and writes the Mel spectrum parameters into the memory circuit; and
the frequency-time conversion circuit is coupled with the memory circuit, reads the Mel frequency spectrum parameters from the memory circuit, and performs frequency-time conversion on the Mel frequency spectrum parameters to generate the audio feature vector.
10. The audio processing device for speech recognition of claim 7, wherein the feature extraction circuit does not perform a logarithmic operation.
11. The audio processing apparatus for speech recognition according to claim 7, wherein the maximum required bit width of the input port of the feature extraction circuit is smaller than the maximum required bit width of the input port of the power spectrum conversion circuit,
wherein the feature extraction circuit is connected to the memory circuit via an input port of the feature extraction circuit to access the memory circuit via the input port of the feature extraction circuit,
the power spectrum conversion circuit is connected to the memory circuit through an input port of the power spectrum conversion circuit to access the memory circuit through the input port of the power spectrum conversion circuit.
CN202010071503.3A 2020-01-21 2020-01-21 Audio processing device for speech recognition Active CN113223511B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010071503.3A CN113223511B (en) 2020-01-21 2020-01-21 Audio processing device for speech recognition
US16/867,571 US11404046B2 (en) 2020-01-21 2020-05-06 Audio processing device for speech recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010071503.3A CN113223511B (en) 2020-01-21 2020-01-21 Audio processing device for speech recognition

Publications (2)

Publication Number Publication Date
CN113223511A CN113223511A (en) 2021-08-06
CN113223511B (en) 2024-04-16

Family

ID=76857265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010071503.3A Active CN113223511B (en) 2020-01-21 2020-01-21 Audio processing device for speech recognition

Country Status (2)

Country Link
US (1) US11404046B2 (en)
CN (1) CN113223511B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11881904B2 (en) * 2022-03-31 2024-01-23 Dell Products, L.P. Power detection in the time domain on a periodic basis with statistical counters

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010191252A (en) * 2009-02-19 2010-09-02 Toyota Motor Corp Speech recognition device, speech recognition method
WO2014153800A1 (en) * 2013-03-29 2014-10-02 京东方科技集团股份有限公司 Voice recognition system
CN108899032A (en) * 2018-06-06 2018-11-27 平安科技(深圳)有限公司 Method for recognizing sound-groove, device, computer equipment and storage medium
CN109166591A (en) * 2018-08-29 2019-01-08 昆明理工大学 A kind of classification method based on audio frequency characteristics signal
WO2019232846A1 (en) * 2018-06-04 2019-12-12 平安科技(深圳)有限公司 Speech differentiation method and apparatus, and computer device and storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6845359B2 (en) * 2001-03-22 2005-01-18 Motorola, Inc. FFT based sine wave synthesis method for parametric vocoders
US6772315B1 (en) * 2001-05-24 2004-08-03 Rambus Inc Translation lookaside buffer extended to provide physical and main-memory addresses
SG140445A1 (en) * 2003-07-28 2008-03-28 Sony Corp Method and apparatus for automatically recognizing audio data
JP5223786B2 (en) * 2009-06-10 2013-06-26 富士通株式会社 Voice band extending apparatus, voice band extending method, voice band extending computer program, and telephone
CN202615783U (en) 2012-05-23 2012-12-19 西北师范大学 Mel cepstrum analysis synthesizer based on FPGA
EP2862169A4 (en) * 2012-06-15 2016-03-02 Jemardator Ab Cepstral separation difference
US10719115B2 (en) * 2014-12-30 2020-07-21 Avago Technologies International Sales Pte. Limited Isolated word training and detection using generated phoneme concatenation models of audio inputs
US11004461B2 (en) * 2017-09-01 2021-05-11 Newton Howard Real-time vocal features extraction for automated emotional or mental state assessment
CN111210806B (en) * 2020-01-10 2022-06-17 东南大学 Low-power-consumption MFCC voice feature extraction circuit based on serial FFT


Also Published As

Publication number Publication date
US20210225360A1 (en) 2021-07-22
CN113223511A (en) 2021-08-06
US11404046B2 (en) 2022-08-02

Similar Documents

Publication Publication Date Title
JP2021086154A (en) Method, device, apparatus, and computer-readable storage medium for speech recognition
US20210193149A1 (en) Method, apparatus and device for voiceprint recognition, and medium
CN109256138B (en) Identity verification method, terminal device and computer readable storage medium
US11183177B2 (en) Real-time voice recognition apparatus equipped with ASIC chip and smartphone
AU2017404565A1 (en) Electronic device, method and system of identity verification and computer readable storage medium
CN110970036B (en) Voiceprint recognition method and device, computer storage medium and electronic equipment
TW200306526A (en) Method for robust voice recognition by analyzing redundant features of source signal
CN111433847A (en) Speech conversion method and training method, intelligent device and storage medium
Vu et al. Implementation of the MFCC front-end for low-cost speech recognition systems
CN113223511B (en) Audio processing device for speech recognition
CN112530410A (en) Command word recognition method and device
KR102194194B1 (en) Method, apparatus for blind signal seperating and electronic device
Zhang et al. Temporal Transformer Networks for Acoustic Scene Classification.
Helali et al. Real time speech recognition based on PWP thresholding and MFCC using SVM
CN112397086A (en) Voice keyword detection method and device, terminal equipment and storage medium
Ren et al. Recalibrated bandpass filtering on temporal waveform for audio spoof detection
Joy et al. Deep scattering power spectrum features for robust speech recognition
CN110875037A (en) Voice data processing method and device and electronic equipment
CN111462770A (en) L STM-based late reverberation suppression method and system
Ernawan et al. Efficient discrete tchebichef on spectrum analysis of speech recognition
Pardede et al. Generalized-log spectral mean normalization for speech recognition
CN113555031B (en) Training method and device of voice enhancement model, and voice enhancement method and device
Li et al. Dual-stream speech dereverberation network using long-term and short-term cues
CN114067784A (en) Training method and device of fundamental frequency extraction model and fundamental frequency extraction method and device
CN111341321A (en) Matlab-based spectrogram generating and displaying method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant