CN110610696A

CN110610696A - MFCC feature extraction method and device based on mixed signal domain

Info

Publication number: CN110610696A
Application number: CN201810615611.5A
Authority: CN
Inventors: 李钦; 乔飞; 魏琦; 朱慧峰; 刘辛军; 杨华中
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2018-06-14
Filing date: 2018-06-14
Publication date: 2019-12-24
Anticipated expiration: 2038-06-14
Also published as: CN110610696B

Abstract

An embodiment of the invention provides an MFCC feature extraction method and device based on a mixed signal domain, wherein the mixed signal domain comprises an analog signal domain and a digital signal domain. The method comprises the following steps: acquiring a preprocessed voice signal in the analog signal domain; performing Mel frequency analysis on the voice signal to extract time domain signals of the voice signal in different frequency bands; operating on the time domain signals in each frequency band according to a preset operation rule; performing low-pass filtering on the operation result, and taking the low-pass-filtered result as the energy value of the time domain signal in each frequency band; and converting the energy value into a digital signal, performing data processing on the converted energy value in the digital signal domain, and taking the result of the data processing as the extracted Mel-frequency cepstral coefficient (MFCC) feature. The device performs the above method. The method and device can effectively extract MFCC features, improve the extraction speed, and reduce the energy consumed in the extraction process.

Description

MFCC feature extraction method and device based on mixed signal domain

Technical Field

The embodiment of the invention relates to the technical field of voice feature extraction, in particular to a mixed signal domain-based MFCC feature extraction method and device.

Background

Voice interaction has become an important mode of human-computer interaction, so automatic speech recognition is very important. In energy-constrained application scenarios in particular, low-power and energy-efficient automatic speech recognition is of paramount importance.

Auditory feature extraction is key to automatic speech recognition. Mel-scale Frequency Cepstral Coefficients (hereinafter referred to as "MFCC") intuitively show the distribution of a speech signal in the frequency domain, so MFCC features are widely extracted as auditory features and are currently the most commonly used speech features. FIG. 1 is a flow chart of a prior art MFCC feature extraction method; as shown in fig. 1, the speech signal is converted from the analog domain to the digital domain, where data processing, including Fourier transform, Mel filtering, etc., is performed. In the course of carrying out the embodiments of the present invention, the inventors found that, in the MFCC feature extraction process of fig. 1, the Fourier transform consumes considerable computation time and computation resources, and the analog-to-digital conversion also consumes a certain amount of computation time and computation resources, causing excessive energy consumption in the prior art.
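For orientation, the framing and Fourier transform stage of the prior art flow described above can be sketched as follows. This is only a minimal illustration, not the implementation of fig. 1: the function name frame_power_spectra, the Hamming window, and the test tone are assumptions, while the 16 kHz sampling rate, 25 ms frame length and half-frame overlap are taken from the figures quoted later in this description.

```python
import numpy as np

def frame_power_spectra(signal, fs=16000, frame_len_s=0.025, hop_s=0.0125):
    """Split a digitised signal into overlapping frames and return the
    FFT power spectrum of each frame (one row per frame)."""
    frame_len = int(round(fs * frame_len_s))   # samples per frame
    hop = int(round(fs * hop_s))               # half-frame overlap
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)             # window choice is an assumption
    spectra = np.empty((n_frames, frame_len // 2 + 1))
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        spectra[i] = np.abs(np.fft.rfft(frame)) ** 2
    return spectra

# Hypothetical input: one second of a 1 kHz tone sampled at 16 kHz.
fs = 16000
signal = np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)
spectra = frame_power_spectra(signal, fs)
```

Every frame in this sketch requires a full-length FFT over all of its samples, which is the per-frame cost the inventors identify as the main burden of the digital-only flow.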

Therefore, how to avoid the above drawbacks, extract MFCC features effectively, and reduce the energy consumed in the extraction process has become a problem to be solved for low-power automatic speech recognition.

Disclosure of Invention

Aiming at the problems in the prior art, the embodiment of the invention provides a method and a device for extracting MFCC features based on a mixed signal domain.

In a first aspect, an embodiment of the present invention provides a method for MFCC feature extraction based on a mixed signal domain, where the mixed signal domain includes an analog signal domain and a digital signal domain, and the method includes:

acquiring a preprocessed voice signal in the analog signal domain; performing Mel frequency analysis on the voice signal to extract time domain signals of the voice signal in different frequency bands;

calculating the time domain signals in each frequency band according to a preset operation rule;

carrying out low-pass filtering processing on the operation result, and taking the operation result after the low-pass filtering processing as the energy value of the time domain signal in each frequency band;

and converting the energy value into a digital signal, performing data processing on the converted energy value in the digital signal domain, and taking the result of the data processing as the extracted Mel frequency cepstrum coefficient MFCC characteristic.

In a second aspect, an embodiment of the present invention provides an MFCC feature extraction apparatus based on a mixed signal domain, where the mixed signal domain includes an analog signal domain and a digital signal domain, and the apparatus includes:

an acquisition unit configured to acquire a preprocessed voice signal in the analog signal domain; performing Mel frequency analysis on the voice signal to extract time domain signals of the voice signal in different frequency bands;

the operation unit is used for operating the time domain signals in each frequency band according to a preset operation rule;

the filtering unit is used for performing low-pass filtering processing on the operation result and taking the operation result after the low-pass filtering processing as the energy value of the time domain signal in each frequency band;

and the extraction unit is used for converting the energy value into a digital signal, performing data processing on the converted energy value in the digital signal domain, and taking the result of the data processing as the extracted Mel frequency cepstrum coefficient MFCC characteristic.

In a third aspect, an embodiment of the present invention provides an electronic device, including: a processor, a memory, and a bus, wherein,

the processor and the memory are communicated with each other through the bus;

the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform a method comprising:

In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, including:

the non-transitory computer readable storage medium stores computer instructions that cause the computer to perform a method comprising:

According to the mixed signal domain-based MFCC feature extraction method and device provided by the embodiment of the invention, time domain signals of voice signals in different frequency bands are extracted from the analog signal domain, the time domain signals in each frequency band are subjected to operation and low-pass filtering, and the energy value obtained after the low-pass filtering is subjected to data processing in the digital signal domain, so that the MFCC feature can be effectively extracted, and the energy consumed in the extraction process is reduced.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a flow chart of a prior art MFCC feature extraction method;

FIG. 2 is a schematic flow chart of a mixed signal domain-based MFCC feature extraction method according to an embodiment of the present invention;

FIG. 3 is a flowchart of a MFCC feature extraction method according to another embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a mixed signal domain-based MFCC feature extraction device according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 2 is a schematic flow diagram of a mixed signal domain-based MFCC feature extraction method according to an embodiment of the present invention. As shown in fig. 2, the mixed signal domain includes an analog signal domain and a digital signal domain, and the method includes the following steps:

S201: Acquiring a preprocessed voice signal in the analog signal domain, and performing Mel frequency analysis on the voice signal to extract time domain signals of the voice signal in different frequency bands.

Specifically, the device acquires a preprocessed voice signal in the analog signal domain; and performing Mel frequency analysis on the voice signal to extract time domain signals of the voice signal in different frequency bands. FIG. 3 is a flowchart of a MFCC feature extraction method according to another embodiment of the present invention; as shown in fig. 3, the preprocessed voice signal may be a voice signal obtained by amplifying an original voice signal through a low noise amplifier.
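As an illustration of how the frequency bands used for Mel frequency analysis might be laid out, the following sketch computes band edges spaced evenly on the Mel scale. It is a sketch under stated assumptions: the conversion formula is the commonly used 2595·log10(1 + f/700) form, and the band count (20) and frequency range (100 Hz to 8 kHz) are hypothetical values not given in this description. In the embodiment the bands would be realised by analog filters rather than computed in software.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Convert Hz to mels using the common 2595*log10(1 + f/700) form
    (an assumed choice; the description does not fix a formula)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_band_edges(n_bands=20, f_min=100.0, f_max=8000.0):
    """Return (lower, center, upper) frequencies in Hz for n_bands
    overlapping bands whose centers are evenly spaced on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_bands + 2)
    hz_points = mel_to_hz(mel_points)
    return hz_points[:-2], hz_points[1:-1], hz_points[2:]

# Hypothetical configuration: 20 bands between 100 Hz and 8 kHz.
lower, center, upper = mel_band_edges()
```

With this layout the bands are narrow at low frequencies and widen toward high frequencies, which is the property that makes the Mel scale suitable for speech.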

S202: Operating on the time domain signals in each frequency band according to a preset operation rule.

Specifically, the device operates on the time domain signals in each frequency band according to a preset operation rule. Referring to fig. 3, the time domain signal in each frequency band may be squared according to the following formula:

$$|x(t)|^2$$

where x(t) is the time domain signal of the speech signal. According to Parseval's theorem:

$$E_i = \int |x_i(t)|^2 \, dt = \frac{1}{2\pi} \int |X_i(\omega)|^2 \, d\omega$$

where E_i is the energy of the i-th frame speech signal in each frequency band, x_i(t) is the time domain signal of the i-th frame speech signal in each frequency band, and X_i(ω) is the frequency domain signal of the i-th frame speech signal in each frequency band, with ω the angular frequency. That is, for a given frequency band, the integral of the squared time domain signal equals 1/(2π) times the integral of the squared magnitude of the frequency domain signal, so the band energy can be obtained in the time domain without a Fourier transform.
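The energy identity relied on here can be checked numerically in its discrete form. The sketch below is illustrative only: it uses NumPy's unnormalised FFT, for which the discrete counterpart of Parseval's theorem carries a 1/N factor in place of the constant factor of the continuous form, and the 400-sample random test signal is an assumption.

```python
import numpy as np

def energy_time_domain(x):
    """Sum of squared magnitudes of the time domain samples."""
    return float(np.sum(np.abs(x) ** 2))

def energy_freq_domain(x):
    """Same energy computed from the DFT; NumPy's forward FFT is
    unnormalised, so the sum of |X[k]|^2 is divided by N."""
    X = np.fft.fft(x)
    return float(np.sum(np.abs(X) ** 2) / len(x))

# Hypothetical frame: 400 random samples.
rng = np.random.default_rng(0)
x = rng.standard_normal(400)
e_t = energy_time_domain(x)
e_f = energy_freq_domain(x)
```

The two values agree to within floating point error, which is why squaring the band-limited time domain signal can stand in for the squared spectrum computed after an FFT.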

S203: Performing low-pass filtering processing on the operation result, and taking the operation result after the low-pass filtering processing as the energy value of the time domain signal in each frequency band.

Specifically, the device performs low-pass filtering on the operation result, and uses the operation result after the low-pass filtering as the energy value of the time domain signal in each frequency band. Referring to fig. 3, a preset analog low-pass filter may be used to perform low-pass filtering on the operation result.
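The effect of squaring followed by low-pass filtering can be illustrated with a discrete-time stand-in for the analog circuit. This is a sketch under stated assumptions and not the circuit of fig. 3: the one-pole filter, the 40 Hz cutoff, the 16 kHz simulation rate, and the 1 kHz test tone are all hypothetical choices made for the example.

```python
import numpy as np

def band_energy_envelope(x, fs, cutoff_hz):
    """Square a band signal and smooth it with a one-pole low-pass filter,
    a discrete-time stand-in for an analog RC low-pass stage."""
    alpha = 1.0 - np.exp(-2.0 * np.pi * cutoff_hz / fs)  # smoothing factor
    squared = np.asarray(x, dtype=float) ** 2
    out = np.empty_like(squared)
    state = 0.0
    for n, s in enumerate(squared):
        state += alpha * (s - state)   # unity gain at DC
        out[n] = state
    return out

# Hypothetical band signal: a 1 kHz tone of amplitude 0.5, one second long.
fs = 16000.0
t = np.arange(int(fs)) / fs
tone = 0.5 * np.sin(2 * np.pi * 1000 * t)
env = band_energy_envelope(tone, fs, cutoff_hz=40.0)
```

For a sinusoid of amplitude A the squared signal has a constant term of A²/2 plus a component at twice the tone frequency; the low-pass stage keeps the constant term, so the settled output here sits close to 0.125 and varies slowly enough to be sampled at a very low rate.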

S204: Converting the energy value into a digital signal, performing data processing on the converted energy value in the digital signal domain, and taking the result of the data processing as the extracted Mel-frequency cepstral coefficient (MFCC) feature.

Specifically, the device converts the energy value into a digital signal, performs data processing on the converted energy value in the digital signal domain, and takes the result of the data processing as the extracted Mel-frequency cepstral coefficient (MFCC) feature. Referring to fig. 3, an analog-to-digital converter with an ultra-low sampling rate (lower than a preset sampling rate threshold) may be used to convert the energy value into a digital signal. The converted energy value is then subjected to framing, logarithm processing and Discrete Cosine Transform (DCT). For each frame of the speech signal, the embodiment of the present invention generates and outputs the energy value in each frequency band in the analog signal domain, and this output changes slowly (for example, at 80 Hz). Furthermore, the framing step of the prior art implementation shown in fig. 1 is performed at the front end of the digital signal domain, whereas in the embodiment of the present invention the front end is in the analog signal domain, where signal values cannot be stored, so framing with overlap cannot be performed there. The embodiment of the invention therefore places the framing step in the digital signal domain, after the analog-to-digital converter. Because the overlap length is half of the frame length, the output of the analog-to-digital converter changes at 80 Hz, that is, once per half frame; the output value of one half frame is stored in the digital signal domain and averaged with the output value of the next half frame to obtain the average energy value of the frame.
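The digital-domain steps described above (averaging consecutive half-frame outputs, taking the logarithm, and applying the DCT) can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the function names, the unnormalised DCT-II, the 20 bands, the 13 retained coefficients, and the small floor applied before the logarithm are all assumptions.

```python
import numpy as np

def dct_ii(v):
    """Unnormalised DCT-II along the last axis:
    X[k] = sum_m v[m] * cos(pi * k * (2m + 1) / (2n))."""
    n = v.shape[-1]
    k = np.arange(n)[:, None]   # output coefficient index
    m = np.arange(n)[None, :]   # input band index
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return v @ basis.T

def mfcc_from_half_frame_energies(half_frames, n_coeffs=13, floor=1e-10):
    """half_frames: array of shape (n_half_frames, n_bands) holding the
    digitised band energies, one row per half frame.
    Returns an array of shape (n_half_frames - 1, n_coeffs)."""
    half_frames = np.asarray(half_frames, dtype=float)
    # Framing with half-frame overlap: average each stored half-frame
    # output with the next one to get the frame's mean energy.
    frame_energy = 0.5 * (half_frames[:-1] + half_frames[1:])
    log_energy = np.log(np.maximum(frame_energy, floor))
    return dct_ii(log_energy)[:, :n_coeffs]

# Hypothetical input: 5 half frames, 20 bands, constant energy e in each band.
half = np.full((5, 20), np.e)
coeffs = mfcc_from_half_frame_energies(half)
```

Only one stored row per band is needed at any time in this scheme, which matches the observation that the overlap can be handled after the converter with a half-frame buffer.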

The embodiment of the invention does not need the computationally complex FFT (fast Fourier transform) step shown in fig. 1, and exploits the high energy efficiency and high speed of analog circuits to extract and calculate the energy distribution of the input voice signal faster and more efficiently. The prior art method connects a 16-bit, 16 kHz analog-to-digital converter directly after the sensor, so a 25 ms voice frame contains 400 16-bit sampling points (16 kHz × 25 ms), which greatly increases the cost of the FFT and squaring operations and also raises the energy consumption of the ADC. In the embodiment of the invention, each frame has only 40 16-bit sampling points at the analog-to-digital conversion stage, which greatly reduces the energy consumption of the analog-to-digital converter, speeds up this stage, and also reduces the cost of the logarithm, multiplication and DCT operations.

The embodiment of the invention simulates the analog-signal-domain processing circuit on the Cadence platform using a 180 nm CMOS process. To evaluate the MFCC features extracted by the embodiment, automatic speech recognition accuracy was tested on the TensorFlow platform using the TI-DIGITS voice data set and an LSTM neural network. The test results are shown in table 1:

TABLE 1

Referring to table 1 above, the embodiment of the present invention shows a very significant advantage in energy consumption over the prior art. The energy consumed per frame of MFCC feature extraction is 97.2% lower than that of an FPGA (field programmable gate array) implementation and 95.1% lower than that of an ASIC (application specific integrated circuit) implementation. The embodiment also has an advantage in extraction speed: the MFCC extraction time of the FPGA, DSP and ASIC implementations is several times or even tens of times that of the embodiment. The GPU trades very high energy consumption for faster speed, but when energy consumption and extraction speed are considered together, the GPU has no advantage in low-power application scenarios. Because the data dimension is reduced in the analog-signal-domain front end, the requirements on the analog-to-digital conversion stage, in particular its sampling rate, are greatly reduced.

In summary, the embodiment of the present invention greatly reduces the energy and time consumed in the extraction process and eliminates the FFT, which accounts for a large share of the computation cost in existing methods. Compared with the prior art, energy consumption is reduced by at least 95.1% and operation speed is improved by more than 6.4 times. Simulation results also show that the MFCC feature extraction accuracy is as high as 99%. Compared with prior art MFCC feature extraction methods, the embodiment therefore has clear advantages in low-power application scenarios.

According to the mixed signal domain-based MFCC feature extraction method provided by the embodiment of the invention, time domain signals of voice signals in different frequency bands are extracted in the analog signal domain, the time domain signals in each frequency band are subjected to operation and low-pass filtering, and the energy value obtained after the low-pass filtering is subjected to data processing in the digital signal domain, so that the MFCC feature can be effectively extracted, and the energy consumed in the extraction process is reduced.

On the basis of the above embodiment, the calculating the time domain signal in each frequency band according to the preset calculation rule includes:

and performing square operation on the time domain signals in each frequency band.

Specifically, the device performs a squaring operation on the time domain signals in each frequency band. Reference may be made to the above embodiments, which are not described in detail.

According to the MFCC feature extraction method based on the mixed signal domain, provided by the embodiment of the invention, the time domain signals in each frequency band are subjected to square operation, so that the operation result is more reasonable, and the normal operation of the method is ensured.

On the basis of the foregoing embodiment, the low-pass filtering processing on the operation result includes:

and carrying out low-pass filtering processing on the operation result by adopting a preset analog low-pass filter.

Specifically, the device performs low-pass filtering processing on the operation result by using a preset analog low-pass filter. Reference may be made to the above embodiments, which are not described in detail.

According to the MFCC feature extraction method based on the mixed signal domain provided by the embodiment of the invention, the operation result can be effectively low-pass filtered by adopting the preset analog low-pass filter.

On the basis of the above embodiment, the data processing of the converted energy value in the digital domain includes:

and framing, logarithm processing and Discrete Cosine Transform (DCT) are carried out on the converted energy value.

Specifically, the device performs framing, logarithm processing and Discrete Cosine Transform (DCT) on the converted energy value. Reference may be made to the above embodiments, which are not described in detail.

The MFCC feature extraction method based on the mixed signal domain provided by the embodiment of the invention can effectively extract the MFCC feature by performing framing, logarithm taking processing and Discrete Cosine Transform (DCT) on the converted energy value.

On the basis of the above embodiment, before the step of performing operation on the time domain signal in each frequency band according to the preset operation rule, the method further includes:

and acquiring the frequency characteristics of the voice signals.

Specifically, the device acquires a frequency characteristic of the voice signal. For example, a male voice is concentrated in a lower frequency region than a female voice, so the frequency characteristic can be used to determine whether the voice is male or female.

And according to the frequency characteristics, determining the frequency distribution range in which the frequency characteristics are positioned, and closing the frequency bands which are not in the frequency distribution range.

Specifically, the device determines a frequency distribution range in which the frequency feature is located according to the frequency feature, and closes a frequency band that is not within the frequency distribution range. Referring to fig. 3, a band switching device (corresponding to the user switching device in fig. 3) may be previously provided after the low noise amplifier in fig. 3. Referring to the above example, if a male voice is confirmed, the frequency distribution range in which the male voice is located is determined, and the band switching device is adjusted to close the path of the frequency band not within the frequency distribution range, thereby ensuring that the low frequency part characteristic is not affected.
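The band selection logic described above can be sketched as a simple mask over the available frequency bands. This is a hypothetical illustration only: the Band type, the function select_active_bands, the five band edges, and the 100 Hz to 1500 Hz range are invented for the example and are not values from this description. In the embodiment the switching is done by the band switching device in the analog path.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Band:
    """One analysis band, described by its edges in Hz (hypothetical type)."""
    low_hz: float
    high_hz: float

def select_active_bands(bands: List[Band],
                        active_range: Tuple[float, float]) -> List[bool]:
    """Return one switch state per band: True keeps the band's path open,
    False closes it because the band lies outside the active range."""
    lo, hi = active_range
    return [band.high_hz > lo and band.low_hz < hi for band in bands]

# Hypothetical bands and a hypothetical range determined from the
# frequency characteristic of the speaker's voice.
bands = [Band(100, 300), Band(300, 700), Band(700, 1500),
         Band(1500, 3000), Band(3000, 6000)]
switches = select_active_bands(bands, (100.0, 1500.0))
# switches == [True, True, True, False, False]
```

Bands whose switch is False would produce no output to square, filter or convert, which is where the saving in analysis time and energy comes from.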

The MFCC feature extraction method based on the mixed signal domain further avoids information sampling in a useless frequency band, so that the analysis speed is improved, and the energy consumption is reduced.

On the basis of the above embodiment, the turning off the frequency bands not within the frequency distribution range includes:

and closing the path of the frequency band which is not in the frequency distribution range through a preset frequency band switching device.

Specifically, the device closes the path of the frequency band not within the frequency distribution range by a preset frequency band switching device. Reference may be made to the above embodiments, which are not described in detail.

According to the MFCC feature extraction method based on the mixed signal domain, the preset frequency band switch device is used for closing the access of the frequency band which is not in the frequency distribution range, information sampling in the useless frequency band is further effectively avoided, the analysis speed is improved, and the energy consumption is reduced.

Fig. 4 is a schematic structural diagram of a mixed signal domain-based MFCC feature extraction device according to an embodiment of the present invention, and as shown in fig. 4, an embodiment of the present invention provides a mixed signal domain-based MFCC feature extraction device, where the mixed signal domain includes an analog signal domain and a digital signal domain, the device includes an obtaining unit 401, an arithmetic unit 402, a filtering unit 403, and an extracting unit 404, where:

the obtaining unit 401 is configured to obtain a preprocessed voice signal in the analog signal domain; performing Mel frequency analysis on the voice signal to extract time domain signals of the voice signal in different frequency bands; the operation unit 402 is configured to perform operation on the time domain signals in each frequency band according to a preset operation rule; the filtering unit 403 is configured to perform low-pass filtering on the operation result, and use the operation result after the low-pass filtering as an energy value of the time domain signal in each frequency band; the extracting unit 404 is configured to convert the energy value into a digital signal, perform data processing on the converted energy value in the digital signal domain, and use a result of the data processing as the extracted mel-frequency cepstrum coefficient MFCC characteristic.

Specifically, the obtaining unit 401 is configured to obtain a preprocessed voice signal in the analog signal domain; performing Mel frequency analysis on the voice signal to extract time domain signals of the voice signal in different frequency bands; the operation unit 402 is configured to perform operation on the time domain signals in each frequency band according to a preset operation rule; the filtering unit 403 is configured to perform low-pass filtering on the operation result, and use the operation result after the low-pass filtering as an energy value of the time domain signal in each frequency band; the extracting unit 404 is configured to convert the energy value into a digital signal, perform data processing on the converted energy value in the digital signal domain, and use a result of the data processing as the extracted mel-frequency cepstrum coefficient MFCC characteristic.

According to the MFCC feature extraction device based on the mixed signal domain, provided by the embodiment of the invention, time domain signals of voice signals in different frequency bands are extracted in the analog signal domain, the time domain signals in each frequency band are subjected to operation and low-pass filtering, and the energy value obtained after the low-pass filtering is subjected to data processing in the digital signal domain, so that the MFCC feature can be effectively extracted, and the energy consumed in the extraction process is reduced.

The MFCC feature extraction apparatus based on a mixed signal domain provided in the embodiments of the present invention may be specifically configured to execute the processing flows of the above method embodiments, and the functions thereof are not described herein again, and refer to the detailed description of the above method embodiments.

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 5, the electronic device includes: a processor 501, a memory 502, and a bus 503;

the processor 501 and the memory 502 communicate with each other through the bus 503;

the processor 501 is configured to call program instructions in the memory 502 to perform the methods provided by the above-mentioned method embodiments, for example, including: acquiring a preprocessed voice signal in the analog signal domain; performing Mel frequency analysis on the voice signal to extract time domain signals of the voice signal in different frequency bands; calculating the time domain signals in each frequency band according to a preset operation rule; carrying out low-pass filtering processing on the operation result, and taking the operation result after the low-pass filtering processing as the energy value of the time domain signal in each frequency band; and converting the energy value into a digital signal, performing data processing on the converted energy value in the digital signal domain, and taking the result of the data processing as the extracted Mel frequency cepstrum coefficient MFCC characteristic.

The present embodiment discloses a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method provided by the above-mentioned method embodiments, for example, comprising: acquiring a preprocessed voice signal in the analog signal domain; performing Mel frequency analysis on the voice signal to extract time domain signals of the voice signal in different frequency bands; calculating the time domain signals in each frequency band according to a preset operation rule; carrying out low-pass filtering processing on the operation result, and taking the operation result after the low-pass filtering processing as the energy value of the time domain signal in each frequency band; and converting the energy value into a digital signal, performing data processing on the converted energy value in the digital signal domain, and taking the result of the data processing as the extracted Mel frequency cepstrum coefficient MFCC characteristic.

The present embodiments provide a non-transitory computer-readable storage medium storing computer instructions that cause the computer to perform the methods provided by the above method embodiments, for example, including: acquiring a preprocessed voice signal in the analog signal domain; performing Mel frequency analysis on the voice signal to extract time domain signals of the voice signal in different frequency bands; calculating the time domain signals in each frequency band according to a preset operation rule; carrying out low-pass filtering processing on the operation result, and taking the operation result after the low-pass filtering processing as the energy value of the time domain signal in each frequency band; and converting the energy value into a digital signal, performing data processing on the converted energy value in the digital signal domain, and taking the result of the data processing as the extracted Mel frequency cepstrum coefficient MFCC characteristic.

Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.

The above-described embodiments of the electronic device and the like are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may also be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the embodiments of the present invention, and are not limited thereto; although embodiments of the present invention have been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for MFCC feature extraction based on a mixed signal domain, wherein the mixed signal domain comprises an analog signal domain and a digital signal domain, the method comprising:

2. The method of claim 1, wherein the operating the time domain signals in each frequency band according to a preset operation rule comprises:

3. The method according to claim 1, wherein the low-pass filtering the operation result comprises:

4. The method of claim 1, wherein said data processing of the converted energy values in the digital domain comprises:

5. The method according to any one of claims 1 to 4, wherein before the step of operating the time domain signals in each frequency band according to the preset operation rule, the method further comprises:

acquiring the frequency characteristics of the voice signal;

6. The method of claim 5, wherein turning off frequency bands not within the frequency distribution comprises:

7. An MFCC feature extraction apparatus based on a mixed signal domain, wherein the mixed signal domain comprises an analog signal domain and a digital signal domain, the apparatus comprising:

8. An electronic device, comprising: a processor, a memory, and a bus, wherein,

the processor and the memory are communicated with each other through the bus;

the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1 to 6.

9. A non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method of any one of claims 1 to 6.