CN113823277A - Keyword recognition method, system, medium, and apparatus based on deep learning - Google Patents

Keyword recognition method, system, medium, and apparatus based on deep learning

Info

Publication number
CN113823277A
Authority
CN
China
Prior art keywords
audio
emphasis
decoding
discrete cosine
code stream
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111389758.5A
Other languages
Chinese (zh)
Inventor
李强
朱勇
王尧
叶东翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Barrot Wireless Co Ltd
Original Assignee
Barrot Wireless Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Barrot Wireless Co Ltd filed Critical Barrot Wireless Co Ltd
Priority to CN202111389758.5A priority Critical patent/CN113823277A/en
Publication of CN113823277A publication Critical patent/CN113823277A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The application discloses a keyword recognition method, system, medium, and device based on deep learning, belonging to the technical field of audio decoding. The method comprises the following steps: when an audio receiving end decodes an audio code stream, performing the standard decoding process only as far as the transform domain noise shaping decoding step to obtain the discrete cosine transform spectral coefficients corresponding to the audio code stream; performing feature extraction on the discrete cosine transform spectral coefficients to obtain the Mel frequency cepstrum coefficients; and processing the Mel frequency cepstrum coefficients with a pre-trained deep neural network model to obtain the keyword probabilities corresponding to the audio code stream. The audio code stream to be decoded is thus only partially decoded, yielding intermediate parameters, and the intermediate parameters are processed by the pre-trained deep neural network model to obtain the keyword probabilities corresponding to the audio code stream. The computation-heavy remainder of the audio decoding process and the audio time-frequency conversion step are omitted, which saves power and increases the speed of keyword recognition.

Description

Keyword recognition method, system, medium, and apparatus based on deep learning
Technical Field
The present application relates to the field of audio decoding technologies, and in particular, to a method, a system, a medium, and an apparatus for keyword recognition based on deep learning.
Background
In the prior art, wireless audio has many typical application scenarios. A Bluetooth-based remote controller, for example, is widely used in smart home products, and the general flow is as follows: a user issues a voice control command, such as 'turn on the air conditioner'; the command passes through microphone acquisition, analog-to-digital conversion, audio preprocessing, and an audio encoder to produce an audio compression packet, which is finally sent out through a wireless communication module. The wireless communication module at the receiving end receives the audio compression packet and calls an audio decoder to generate audio PCM, and the keyword recognition module then identifies keywords such as 'turn on the air conditioner' and converts them into the corresponding control signals for the household appliance. In this process of recognizing keywords in the user's voice command at the audio decoding end, the decoding process of the audio decoder involves a frequency-domain to time-domain conversion, and the keyword recognition module involves the inverse time-domain to frequency-domain conversion.
Disclosure of Invention
The application provides a keyword recognition method, system, medium, and device based on deep learning, aimed at the problem that, in the prior art, an audio receiving end recognizing keywords in a voice signal repeats a processing step with a large computation amount, which slows down keyword recognition and increases power consumption.
In one embodiment of the present application, a keyword recognition method based on deep learning is provided, including: when the audio receiving end decodes the audio code stream, performing the standard decoding process only as far as the transform domain noise shaping decoding step to obtain the discrete cosine transform spectral coefficients corresponding to the audio code stream; performing feature extraction on the discrete cosine transform spectral coefficients to obtain the Mel frequency cepstrum coefficients; and processing the Mel frequency cepstrum coefficients with a pre-trained deep neural network model to obtain the keyword probabilities corresponding to the audio code stream.
Optionally, performing only the decoding steps up to transform domain noise shaping in the standard decoding process to obtain the discrete cosine transform spectral coefficients corresponding to the audio code stream includes: decoding the audio code stream according to the standard decoding process, sequentially performing code stream parsing, arithmetic and residual decoding, noise filling and global gain, time domain noise decoding, and transform domain noise shaping decoding to obtain the discrete cosine transform spectral coefficients, where the actual decoding process includes neither the frequency-domain to time-domain conversion nor the long-term post-filter processing.
Optionally, performing feature extraction on the discrete cosine transform spectral coefficients in the frequency domain to obtain the Mel frequency cepstrum coefficients includes: performing pre-emphasis processing on the discrete cosine transform spectral coefficients and then proceeding directly to the energy spectrum computation, so that the time-domain to frequency-domain conversion between the pre-emphasis processing and the energy spectrum computation is omitted.
Optionally, the pre-emphasis processing includes: extracting the corresponding pre-emphasis coefficients from a pre-established pre-emphasis coefficient table; and pre-emphasizing the discrete cosine transform spectral coefficients according to the pre-emphasis coefficients, where the pre-emphasis coefficients correspond one-to-one to the discrete cosine transform spectral coefficients.
Optionally, before the audio receiving end decodes the audio code stream, the method further includes: acquiring the Mel frequency cepstrum coefficients corresponding to each of a plurality of audio files; and training the deep neural network model with the Mel frequency cepstrum coefficients and the keywords corresponding to the audio files to obtain the deep neural network model parameters, so that, with the model configured by these parameters, the accuracy of the keywords the model produces for input Mel frequency cepstrum coefficients is greater than or equal to a preset threshold.
In one aspect of the present application, a keyword recognition system based on deep learning is provided, including: an audio decoding module, which, when decoding an audio code stream, performs the standard decoding process only as far as transform domain noise shaping decoding to obtain the discrete cosine transform spectral coefficients corresponding to the audio code stream; a feature extraction module, which performs feature extraction on the discrete cosine transform spectral coefficients to obtain the Mel frequency cepstrum coefficients; and a neural network model processing module, which processes the Mel frequency cepstrum coefficients with a pre-trained deep neural network model to obtain the keyword probabilities corresponding to the audio code stream.
In one aspect of the present application, a computer-readable storage medium is provided, where the storage medium stores computer instructions, and the computer instructions are operated to execute the keyword recognition method based on deep learning in the first aspect.
In one aspect of the present application, a computer device is provided, which includes a processor and a memory, where the memory stores computer instructions, wherein: the processor operates the computer instructions to perform the deep learning based keyword recognition method of the first aspect.
The beneficial effects of this application are: the audio code stream to be decoded is only partially decoded to obtain intermediate parameters, and the intermediate parameters are processed by the pre-trained deep neural network model to obtain the keywords corresponding to the audio code stream. The computation-heavy remainder of the decoding process is thereby omitted, saving power and increasing the speed of keyword recognition.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present application; those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 shows a flow chart of audio code stream processing at a Bluetooth receiving end;
FIG. 2 is a flow chart illustrating an embodiment of the deep learning based keyword recognition method according to the present application;
FIG. 3 shows a standard decoding flow for an LC3 audio decoder;
FIG. 4 illustrates a standard recognition flow of the keyword recognition module;
fig. 5 shows a diagram of a pre-emphasis frequency response curve of the pre-emphasis process of the present application;
FIG. 6 is a flow chart illustrating an example of the deep learning based keyword recognition method of the present application;
FIG. 7 is a schematic diagram illustrating an embodiment of the deep learning based keyword recognition system of the present application.
The above drawings show specific embodiments of the present application, which are described in more detail below. These drawings and the written description are not intended to limit the scope of the inventive concept in any way, but rather to illustrate the inventive concept to those skilled in the art by reference to specific embodiments.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the above-described drawings (if any) are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of steps or elements is not necessarily limited to those elements explicitly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
Currently, the mainstream Bluetooth audio codecs are as follows:
    • SBC: mandatory in the A2DP protocol and the most widely used; every Bluetooth audio device must support it, but its sound quality is mediocre.
    • AAC-LC: good sound quality and wide application, supported by many mainstream mobile phones; compared with SBC, however, it occupies more memory and has higher computational complexity, while many Bluetooth devices are embedded platforms with limited battery capacity, weak processors, and limited memory, and the patent fees are high.
    • aptX series: good sound quality but a high code rate; aptX needs 384 kbps and aptX-HD 576 kbps. It is proprietary Qualcomm technology and relatively closed.
    • LDAC: better sound quality but also a very high code rate of 330 kbps, 660 kbps, or 990 kbps. Because the wireless environment around Bluetooth devices is especially complex, sustaining such high code rates is difficult. It is proprietary Sony technology and likewise closed.
    • LHDC: good sound quality but also a high code rate, typically 400 kbps, 600 kbps, or 900 kbps, which places high demands on the baseband/radio-frequency design of Bluetooth.
For the above reasons, the Bluetooth SIG, together with numerous manufacturers, introduced LC3, mainly for Bluetooth Low Energy but also usable with classic Bluetooth. Its advantages of low delay, high sound quality, high coding gain, and freedom from special fees in the Bluetooth field have drawn manufacturers' attention. The LC3 audio codec mainly targets Bluetooth Low Energy and therefore has strict power-consumption requirements, so in LC3 applications reducing power consumption is key.
At present, wireless audio is widely applied, particularly in voice control of smart homes. In the prior art, the flow of a Bluetooth remote controller performing voice control of an air conditioner, for example, is roughly as follows:
First, the user issues a voice control instruction, such as 'turn on the air conditioner'. The voice command is captured by a microphone and passes through analog-to-digital conversion, audio preprocessing, and audio encoding to generate an audio compression packet, which is finally sent out through the wireless communication module. At the receiving end, after the wireless communication module receives the audio compression packet, an audio decoder is called to decode it into audio PCM, and the keyword recognition module performs keyword recognition, finally yielding the instruction 'turn on the air conditioner', which controls the air conditioner to turn on.
Fig. 1 shows a flow chart of audio code stream processing at a Bluetooth receiving end. The audio decoder and the keyword recognition module are the two most critical modules at the Bluetooth receiving end. The processing in the audio decoder comprises: code stream parsing; arithmetic and residual decoding, noise filling, and global gain; time domain noise decoding; transform domain noise shaping decoding; frequency-domain to time-domain conversion, namely a low-delay modified inverse discrete cosine transform; and the filtering of a long-term post-filter. The processing flow of the keyword recognition module comprises feature extraction, deep neural network processing, and the corresponding post-processing. The keyword feature extraction part comprises: pre-emphasis processing; windowing; time-domain to frequency-domain conversion, typically a discrete Fourier transform; an energy spectrum; a Mel filter bank; and a logarithmic transformation followed by a discrete cosine transform, finally generating the Mel frequency cepstrum coefficients (MFCC). The keywords corresponding to the audio code stream are then obtained from the Mel frequency cepstrum coefficients through deep neural network processing and the corresponding post-processing.
As can be seen from the above, the decoding process of the audio decoder and the keyword feature extraction of the keyword recognition module contain mutually inverse operations: a frequency-domain to time-domain conversion and a time-domain to frequency-domain conversion. Either conversion consumes a large amount of computation, resulting in high power consumption.
To solve this problem, in the present application the audio decoder at the audio receiving end performs only part of the standard decoding process on the audio code stream: after the discrete cosine transform spectral coefficients corresponding to the audio code stream are obtained, the subsequent frequency-domain to time-domain conversion is no longer performed, and the keyword recognition module accordingly no longer needs a time-domain to frequency-domain conversion. The two most computation-intensive steps are thus omitted, reducing the amount of computation and accelerating keyword recognition.
To this end, the present application provides a keyword recognition method based on deep learning, including: when the audio receiving end decodes the audio code stream, performing the standard decoding process only as far as the transform domain noise shaping decoding step to obtain the discrete cosine transform spectral coefficients corresponding to the audio code stream; performing feature extraction on the discrete cosine transform spectral coefficients to obtain the Mel frequency cepstrum coefficients; and processing the Mel frequency cepstrum coefficients with a pre-trained deep neural network model to obtain the keyword probabilities corresponding to the audio code stream.
In the keyword recognition method of the present application, the frequency-domain to time-domain conversion step is removed from the audio decoder, so the time-domain to frequency-domain conversion in keyword recognition is omitted as well, which reduces the amount of computation and speeds up keyword recognition. The discrete cosine transform spectral coefficients produced by the audio decoder are processed by a deep neural network model to obtain the corresponding keywords directly. The deep neural network model must be trained in advance on discrete cosine transform spectral coefficients and their corresponding keywords, which improves the accuracy of the model's processing of the spectral coefficients and yields accurate keywords.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 2 is a flowchart illustrating an embodiment of the deep learning-based keyword recognition method according to the present application.
In the embodiment shown in fig. 2, the deep-learning-based keyword recognition method of the present application includes process S201: when the audio receiving end decodes the audio code stream, perform the standard decoding process only as far as transform domain noise shaping decoding, obtaining the discrete cosine transform spectral coefficients corresponding to the audio code stream.
In this embodiment, the audio code stream is decoded according to the standard decoding flow, but only the part of the flow needed to obtain the discrete cosine transform spectral coefficients corresponding to the audio code stream is performed. The remainder of the decoding process is omitted, saving the decoder's computation and power.
Optionally, performing only the decoding steps up to transform domain noise shaping in the standard decoding process to obtain the discrete cosine transform spectral coefficients corresponding to the audio code stream includes: decoding the audio code stream according to the standard decoding process, sequentially performing code stream parsing, arithmetic and residual decoding, noise filling and global gain, time domain noise decoding, and transform domain noise shaping decoding to obtain the discrete cosine transform spectral coefficients, where the actual decoding process includes neither the frequency-domain to time-domain conversion nor the long-term post-filter processing.
In this alternative embodiment, fig. 3 shows the standard decoding flow of the LC3 audio decoder. As shown in fig. 3, the decoding flow of a standard audio decoder includes: code stream parsing; arithmetic and residual decoding, noise filling, and global gain; time domain noise decoding and transform domain noise shaping decoding; and frequency-domain to time-domain conversion followed by the filtering of the long-term post-filter. The frequency-domain to time-domain conversion is a low-delay modified inverse discrete cosine transform, of which the time-domain to frequency-domain conversion in keyword recognition is the inverse operation. Therefore, in the method of the present application, the decoding of the audio code stream ends after transform domain noise shaping decoding, before the frequency-domain to time-domain conversion is performed, and the discrete cosine transform spectral coefficients corresponding to the audio code stream are obtained.
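As a minimal sketch of this truncated control flow (the stage functions below are identity stubs standing in for the real LC3 decoder stages, which the method reuses unchanged; only the orchestration and the stopping point are illustrated):

import numpy as np

# Identity stubs standing in for the real LC3 decoder stages; the method
# reuses those stages as-is, so only the orchestration matters here.
def parse_code_stream(frame):             return frame
def arithmetic_residual_decode(spec):     return spec
def noise_fill_and_global_gain(spec):     return spec
def time_domain_noise_decode(spec):       return spec
def transform_domain_noise_shaping(spec): return spec

def partial_decode(frame: np.ndarray) -> np.ndarray:
    """Run the standard decoding flow only as far as transform domain
    noise shaping decoding, then stop."""
    spec = parse_code_stream(frame)
    spec = arithmetic_residual_decode(spec)
    spec = noise_fill_and_global_gain(spec)
    spec = time_domain_noise_decode(spec)
    spec = transform_domain_noise_shaping(spec)
    # The standard flow would continue with the inverse LD-MDCT and the
    # long-term post-filter; both are skipped here, so `spec` stays in
    # the MDCT domain and feeds feature extraction directly.
    return spec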
In the deep-learning-based keyword recognition method of the present application, the computation-heavy frequency-domain to time-domain conversion and the long-term post-filtering are omitted from audio decoding; the intermediate decoding result, the discrete cosine transform spectral coefficients, is obtained directly, and the subsequent keyword recognition is performed on those coefficients, reducing power consumption and saving computation.
In the embodiment shown in fig. 2, the deep-learning-based keyword recognition method of the present application includes process S202: perform feature extraction on the discrete cosine transform spectral coefficients to obtain the Mel frequency cepstrum coefficients.
In this embodiment, the Mel frequency cepstrum coefficients are obtained by performing feature extraction on the discrete cosine transform spectral coefficients obtained from decoding.
In this embodiment, since the discrete cosine transform spectral coefficients are obtained directly, without the audio decoder performing the frequency-domain to time-domain conversion or the long-term post-filtering, the original standard recognition flow in the keyword recognition module must also be adjusted accordingly.
Optionally, performing feature extraction on the discrete cosine transform spectral coefficients in the frequency domain to obtain the Mel frequency cepstrum coefficients includes: performing pre-emphasis processing on the discrete cosine transform spectral coefficients and then proceeding directly to the energy spectrum computation, so that the time-domain to frequency-domain conversion between the pre-emphasis processing and the energy spectrum computation is omitted.
Fig. 4 shows the standard recognition flow of the keyword recognition module. As shown in fig. 4, the feature extraction part of keyword recognition includes: pre-emphasis; windowing; a time-domain to frequency-domain transform, typically a discrete Fourier transform; an energy spectrum; a Mel filter bank; a logarithmic transformation; and a discrete cosine transform, generating the Mel frequency cepstrum coefficients (MFCC). The recognition flow is adjusted to match the updated decoding flow at the audio decoding end: since the decoding process no longer performs the frequency-domain to time-domain conversion, the keyword recognition flow correspondingly drops the time-domain to frequency-domain conversion, saving computation and reducing power consumption. After the adjustment, the feature extraction flow in the keyword recognition module is: pre-emphasis; energy spectrum; Mel filter bank; logarithmic transformation; and discrete cosine transform, generating the Mel frequency cepstrum coefficients (MFCC).
Optionally, the pre-emphasis processing includes: extracting the corresponding pre-emphasis coefficients from a pre-established pre-emphasis coefficient table; and pre-emphasizing the discrete cosine transform spectral coefficients according to those coefficients, the pre-emphasis coefficients corresponding one-to-one to the spectral coefficients. Fig. 5 shows the pre-emphasis frequency response curve of the pre-emphasis processing of the present application; the horizontal axis is frequency and the vertical axis is gain. In a piece of audio, the main energy is concentrated at low frequencies, and the high-frequency part decays quickly. To flatten the low-frequency and high-frequency parts for keyword recognition, pre-emphasis is applied: low-frequency energy is attenuated and high-frequency energy is emphasized.
In this optional embodiment, the discrete cosine transform spectral coefficients obtained by decoding at the receiving end have not undergone the frequency-domain to time-domain conversion, so the pre-emphasis in the keyword recognition module must be adjusted accordingly: the discrete cosine transform spectral coefficients are pre-emphasized in the frequency domain. The corresponding pre-emphasis coefficients are first extracted from the pre-established pre-emphasis coefficient table, and the spectral coefficients are then pre-emphasized according to the one-to-one correspondence.
Specifically, taking a 16 kHz sampling rate as an example, the pre-emphasis frequency response is stored as a pre-emphasis coefficient table p(0), p(1), p(2), …, p(159) at 50 Hz intervals. The discrete cosine transform spectrum has N_F = 160 spectral coefficients X(0), X(1), …, X(159). Pre-emphasis is performed according to the pre-emphasis formula, scaling each spectral coefficient by its corresponding table entry:

X_pre(k) = p(k) · X(k), 0 ≤ k < N_F
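A minimal sketch of this frequency-domain pre-emphasis, assuming the 16 kHz, 160-coefficient configuration above. The concrete table values p(k) are not given in the text, so the table here is derived, purely for illustration, from the classic first-order pre-emphasis filter 1 - 0.97z^-1 sampled at the 50 Hz bin centres:

import numpy as np

FS = 16000      # sampling rate (Hz)
NF = 160        # MDCT spectral coefficients per frame
BIN_HZ = 50     # table spacing: 160 bins x 50 Hz = 8 kHz (Nyquist)

# Illustrative pre-emphasis table: magnitude response of 1 - 0.97 z^-1
# evaluated at each bin centre (an assumption, not the patent's table).
f = np.arange(NF) * BIN_HZ
p = np.abs(1.0 - 0.97 * np.exp(-2j * np.pi * f / FS))

def pre_emphasize(mdct_spec: np.ndarray) -> np.ndarray:
    """Scale each spectral coefficient by its table entry, one-to-one:
    X_pre(k) = p(k) * X(k)."""
    assert mdct_spec.shape == (NF,)
    return mdct_spec * p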
in the method, in the characteristic extraction process of the keyword recognition module, because the windowing step is already carried out when the sending end carries out audio coding, and meanwhile, because the conversion processes from the time domain to the frequency domain and from the frequency domain to the time domain are operated in an inverse mode, the windowing and the conversion process from the time domain to the frequency domain in the standard process are omitted, and therefore power consumption is reduced.
In one example of the present application, the remaining flow of the feature extraction process in the keyword recognition module is briefly introduced as follows:

In the energy spectrum step, pseudo-spectrum coefficients P(k) are first generated from the pre-emphasized spectral coefficients (the pseudo-spectrum formula, and its special-case values at the boundary bins k = 0 and k = N_F - 1, appear only as images in the original publication). Next, the energy spectrum E(k) is generated from the pseudo-spectrum coefficients. This step and the previous one may be combined in embodiments to further save computation; they are separated here for ease of description.
It should be noted that the invention does not restrict whether the pseudo-spectrum coefficients are used; the energy spectrum for keyword recognition can also be generated directly from the MDCT discrete cosine transform spectral coefficients. However, because the energy distribution of the MDCT pseudo-spectrum corresponds more closely to that of the Fourier transform spectrum, computing the energy spectrum from the pseudo-spectrum improves training and recognition performance.
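Because the exact pseudo-spectrum formula survives only as an image in the source text, the sketch below substitutes a commonly used MDCT-domain approximation, in which the missing sine-transform (MDST) component of bin k is estimated from the neighbouring MDCT bins; the formula is therefore an assumption about the shape of the computation, not the patent's own expression:

import numpy as np

def energy_spectrum(spec: np.ndarray, use_pseudo: bool = True) -> np.ndarray:
    """Energy spectrum E(k) from (pre-emphasized) MDCT coefficients.

    With use_pseudo=False this is simply X(k)^2; with use_pseudo=True an
    MDST estimate from neighbouring bins is added, which tracks the
    Fourier power spectrum more closely (assumed formula)."""
    e = spec ** 2
    if use_pseudo:
        mdst_est = np.zeros_like(spec)
        mdst_est[1:-1] = 0.5 * (spec[2:] - spec[:-2])  # boundary bins: 0 (assumed)
        e = e + mdst_est ** 2
    return e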
The processing at the Mel filter bank is as follows: the spectral energy is passed through the Mel filter bank to obtain the energy of each channel,

E_mel(m) = Σ_k H_m(k) · E(k)

where H_m(k) is the m-th Mel filter. The Mel filter bank consists of a series of triangular filters; this is mature technology and is not described in detail here.
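A sketch of such a triangular Mel filter bank over the 160 bins at 50 Hz spacing; the channel count n_mels = 40 is an assumed value, since the patent does not fix it:

import numpy as np

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels: int = 40, n_bins: int = 160, bin_hz: float = 50.0):
    """Return H with shape (n_mels, n_bins); row m is the m-th triangular
    Mel filter H_m(k), so channel energies are simply H @ E."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(n_bins * bin_hz), n_mels + 2)
    bins = mel_to_hz(mel_pts) / bin_hz          # fractional bin positions
    H = np.zeros((n_mels, n_bins))
    k = np.arange(n_bins)
    for m in range(1, n_mels + 1):
        lo, ctr, hi = bins[m - 1], bins[m], bins[m + 1]
        rising = (k - lo) / (ctr - lo)
        falling = (hi - k) / (hi - ctr)
        H[m - 1] = np.clip(np.minimum(rising, falling), 0.0, None)
    return H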
The logarithmic transformation is:

L(m) = log(E_mel(m))
discrete cosine transform process: generating a Mel frequency cepstrum coefficient (MFCC for short), wherein the calculation formula is as follows:
Figure 558058DEST_PATH_IMAGE011
and D is the dimension of the MFCC feature.
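The log and DCT stages reduce the channel energies to the MFCC vector. The sketch below is a direct transcription of the two formulas above, with the feature dimension D = 13 as an assumed value:

import numpy as np

def mfcc_from_energy(E: np.ndarray, H: np.ndarray, D: int = 13,
                     eps: float = 1e-10) -> np.ndarray:
    """Energy spectrum -> Mel channel energies -> log -> DCT, i.e.
    E_mel(m) = sum_k H_m(k)*E(k), L(m) = log(E_mel(m)),
    MFCC(d) = sum_m L(m)*cos(pi*d*(m + 0.5)/M)."""
    L = np.log(H @ E + eps)                     # eps guards against log(0)
    M = L.shape[0]
    m = np.arange(M)
    return np.array([np.sum(L * np.cos(np.pi * d * (m + 0.5) / M))
                     for d in range(D)])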
In the embodiment shown in fig. 2, the deep-learning-based keyword recognition method of the present application includes process S203: identify the Mel frequency cepstrum coefficients with the pre-trained deep neural network model, obtaining the keyword probabilities corresponding to the audio code stream.
In this embodiment, the Mel frequency cepstrum coefficients are processed by the pre-trained deep neural network model, and the keyword probabilities corresponding to the audio code stream are obtained from the correspondence between Mel frequency cepstrum coefficients and keywords. In the subsequent keyword processing, a keyword processing module at the audio receiving end determines the final keyword from these probabilities and then controls the corresponding device. For example, if a section of audio code stream processed by this keyword recognition method yields a probability of 90% for the keyword 'turn on' the air conditioner and 20% for the keyword 'raise the temperature', the subsequent keyword processing module determines from these probabilities that the keyword is 'turn on' and controls the air conditioner to turn on. The present application does not specifically limit the subsequent processing the keyword processing module performs on the keyword probabilities.
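The post-processing described here might look like the following sketch, where the 0.5 confidence threshold is an illustrative assumption rather than a value fixed by the patent:

def pick_keyword(probs: dict, threshold: float = 0.5):
    """Choose the most probable keyword if it clears the threshold,
    otherwise report no keyword."""
    word, p = max(probs.items(), key=lambda kv: kv[1])
    return word if p >= threshold else None

# e.g. pick_keyword({"turn on": 0.90, "raise the temperature": 0.20})
# returns "turn on", which the keyword processing module maps to a
# control signal for the appliance.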
Optionally, before the audio receiving end decodes the audio code stream, the method further includes: acquiring the Mel frequency cepstrum coefficients corresponding to each of a plurality of audio files; and training the deep neural network model with the Mel frequency cepstrum coefficients and the keywords corresponding to the audio files to obtain the deep neural network model parameters, so that, with the model configured by these parameters, the accuracy of the keywords the model produces for input Mel frequency cepstrum coefficients is greater than or equal to a preset threshold.
In this optional embodiment, the deep neural network model is trained in advance. During training, a large corpus of voice-material audio files is processed on an offline PC or server to obtain the Mel frequency cepstrum coefficients corresponding to the audio files, and these coefficients, together with the audio files' keywords, serve as training samples to establish the correspondence between Mel frequency cepstrum coefficients and keywords. The principle of extracting the Mel frequency cepstrum coefficients of an audio file on the PC or server follows the audio decoding process described above and can be adapted to the specific situation. Note that the processing device for the audio files, such as a PC or a suitable server, may run the deep neural network model directly: the audio files are then fed straight into the model for training, the correspondence between Mel frequency cepstrum coefficients and keywords is established, and the resulting deep neural network model parameters are obtained for later use in keyword inference.
When testing the training result, the model is configured with the deep neural network model parameters, a certain number of audio files are input into the model, and the correspondence between the audio files and their keywords is tallied. Specifically, the preset threshold may be chosen as 95%: once the model is configured with the trained parameters, the accuracy with which the deep neural network model maps Mel frequency cepstrum coefficients to the corresponding keywords in actual keyword recognition is greater than or equal to 95%, ensuring the accuracy of the keywords obtained from the model. The deep neural network model may be a convolutional neural network (CNN), a deep neural network (DNN), a recurrent neural network (RNN), or a long short-term memory network (LSTM); these are only examples of deep neural networks, and the present application does not specifically limit the choice of network.
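As an illustration, a minimal convolutional keyword classifier over MFCC inputs in PyTorch; the input shape (49 frames of 13 MFCCs) and the class count are assumptions, and the patent does not prescribe any particular architecture:

import torch
import torch.nn as nn

class KeywordNet(nn.Module):
    """Minimal CNN mapping an MFCC patch to per-keyword probabilities."""
    def __init__(self, n_keywords: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, n_keywords),
        )

    def forward(self, mfcc: torch.Tensor) -> torch.Tensor:
        # mfcc: (batch, n_frames, n_mfcc), e.g. (batch, 49, 13)
        logits = self.net(mfcc.unsqueeze(1))    # add channel dimension
        return torch.softmax(logits, dim=-1)    # per-keyword probabilities

Training then fits the model to (MFCC, keyword) pairs with a cross-entropy objective until the held-out accuracy reaches the preset threshold.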
Specifically, after the deep neural network model is trained and its parameters obtained, the model used in recognition is configured with those parameters for subsequent keyword inference, and the keywords corresponding to the audio code stream are then inferred, improving both the speed of the inference process and the accuracy of keyword recognition.
The application discloses a deep neural network model training method that is based on the discrete cosine transform of the audio file, with the Mel frequency cepstrum coefficients obtained on that basis, whereas the neural network models in the prior art are based on the fast Fourier transform, with the Mel frequency cepstrum coefficients obtained from it. Building on this training method, the keyword recognition method greatly reduces the computation of audio decoding and feature extraction during actual keyword recognition, avoiding the prior art's frequency-domain to time-domain and time-domain to frequency-domain conversions, which reduces power consumption and saves computation.
Fig. 6 shows a flowchart of an example of the deep-learning-based keyword recognition method of the present application. As shown in fig. 6, compared with the prior art, the method omits the frequency-domain to time-domain conversion in decoding and correspondingly omits the time-domain to frequency-domain conversion in feature extraction, eliminating the two steps that demand the most computation and power, thereby saving computation and speeding up keyword recognition.
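Chaining the earlier sketches (pre_emphasize, energy_spectrum, mel_filterbank, and mfcc_from_energy, assumed to be in scope) reproduces the trimmed per-frame pipeline of fig. 6:

import numpy as np

np.random.seed(0)
mdct_spec = np.random.randn(160)   # stand-in for one partially decoded frame
H = mel_filterbank()
feat = mfcc_from_energy(energy_spectrum(pre_emphasize(mdct_spec)), H)
print(feat.shape)                  # (13,): one MFCC vector per frame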
The method of the present application can be used for Bluetooth Low Energy audio as well as classic Bluetooth, and in the Bluetooth field and other wireless communication fields, particularly for keyword recognition. It makes full use of the existing information and algorithm modules of the audio decoder: omitting the low-delay modified inverse discrete cosine transform and the long-term post-filter from decoding reduces the decoder's computational complexity, and omitting windowing and time-frequency conversion reduces the complexity of the keyword-recognition feature extraction module. On this basis, power consumption is greatly reduced and device battery life is extended; the program space and code space required by the related modules are saved, reducing device cost. The method can be applied to the combination of a Bluetooth remote controller and a smart home, performing keyword recognition on the audio code stream encoded by the Bluetooth remote controller to control the smart home. All of the above, and applications thereof, fall within the protection scope of the present application.
In the deep-learning-based keyword recognition method of the present application, the computation-heavy frequency-domain to time-domain conversion of the standard decoding process is omitted, and the corresponding time-domain to frequency-domain conversion in keyword recognition is omitted as well: the Mel frequency cepstrum coefficients are obtained directly from the discrete cosine transform spectral coefficients, and a pre-trained deep neural network model processes those coefficients to obtain the keywords corresponding to the audio code stream. The method drops the most computation-intensive parts of keyword recognition, saving computation, reducing power consumption, and increasing recognition speed, which is especially significant for Bluetooth Low Energy with its strict power requirements. In addition, recognizing keywords through a deep neural network model guarantees recognition accuracy.
FIG. 7 is a schematic diagram illustrating an embodiment of the deep learning based keyword recognition system of the present application.
In the embodiment shown in fig. 7, the deep-learning-based keyword recognition system of the present application includes: an audio decoding module 701, which, when the audio receiving end decodes an audio code stream, performs the standard decoding process only as far as the transform domain noise shaping decoding step to obtain the discrete cosine transform spectral coefficients corresponding to the audio code stream; a feature extraction module 702, which performs feature extraction on the discrete cosine transform spectral coefficients to obtain the Mel frequency cepstrum coefficients; and a neural network model processing module 703, which processes the Mel frequency cepstrum coefficients with a pre-trained deep neural network model to obtain the keyword probabilities corresponding to the audio code stream.
Optionally, the feature extraction module performs pre-emphasis on the discrete cosine transform spectral coefficients in the frequency domain and proceeds directly to the energy spectrum computation afterwards, omitting the time-domain to frequency-domain conversion between the pre-emphasis processing and the energy spectrum computation.
Optionally, the pre-emphasis processing in the feature extraction module includes: extracting the corresponding pre-emphasis coefficients from a pre-established pre-emphasis coefficient table; and pre-emphasizing the discrete cosine transform spectral coefficients according to those coefficients, the pre-emphasis coefficients corresponding one-to-one to the spectral coefficients. The specific execution is consistent with the description in the keyword recognition method and is not repeated here. In this embodiment, removing the frequency-domain to time-domain conversion from the audio decoder allows the time-domain to frequency-domain conversion to be omitted from keyword recognition, reducing computation and speeding up recognition. Feature extraction uses the discrete cosine transform spectral coefficients produced by the audio decoder, and the deep neural network model processes the resulting MFCC features to obtain the corresponding keyword probabilities directly. When the deep neural network model performs keyword inference on the MFCC features, it can be configured with the predetermined deep neural network model parameters, improving the speed and accuracy of the inference process. The operating principle of the deep-learning-based keyword recognition system is similar to that of the keyword recognition method and is not repeated.
In a particular embodiment of the present application, a computer-readable storage medium stores computer instructions, and the computer instructions are operable to perform the deep-learning-based keyword recognition method described in any of the embodiments. The storage medium may reside directly in hardware, in a software module executed by a processor, or in a combination of the two.
A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium.
The Processor may be a Central Processing Unit (CPU), other general-purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), other Programmable logic devices, discrete Gate or transistor logic, discrete hardware components, or any combination thereof. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one embodiment of the present application, a computer device includes a processor and a memory, the memory storing computer instructions, wherein: the processor operates the computer instructions to perform the deep learning based keyword recognition method described in any of the embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above embodiments are merely examples, which are not intended to limit the scope of the present disclosure, and all equivalent structural changes made by using the contents of the specification and the drawings, or any other related technical fields, are also included in the scope of the present disclosure.

Claims (10)

1. A keyword recognition method based on deep learning is characterized by comprising the following steps:
when an audio receiving end decodes an audio code stream, only performing a transform domain noise shaping decoding step in a standard decoding process to obtain a discrete cosine transform spectrum coefficient corresponding to the audio code stream;
performing feature extraction on the discrete cosine transform spectrum coefficient to obtain a Mel frequency cepstrum coefficient;
and processing the mel frequency cepstrum coefficient according to a pre-trained deep neural network model to obtain the keyword probability corresponding to the audio code stream.
2. The method for recognizing keywords based on deep learning of claim 1, wherein the step of performing transform domain noise shaping decoding only in a standard decoding process to obtain discrete cosine transform spectral coefficients corresponding to the audio code stream comprises:
and decoding the audio code stream according to the standard decoding process, and obtaining the discrete cosine transform spectrum coefficient after sequentially performing code stream analysis, arithmetic and residual decoding, noise filling and global gain, time domain noise decoding and transform domain noise shaping decoding, wherein the actual decoding process does not comprise a frequency domain and time domain conversion process and a long-term post-filter processing process.
3. The method for recognizing keywords based on deep learning of claim 1, wherein the performing feature extraction on the discrete cosine transform spectral coefficients to obtain mel-frequency cepstrum coefficients comprises:
and pre-emphasis processing is carried out on the discrete cosine transform spectrum coefficient in a frequency domain, and energy spectrum operation processing is directly carried out after the pre-emphasis processing, so that the conversion process from a time domain to a frequency domain between the pre-emphasis processing and the energy spectrum operation processing is omitted.
4. The method of claim 3, wherein the pre-emphasis process comprises:
extracting corresponding pre-emphasis coefficients from a pre-established pre-emphasis coefficient table;
and carrying out pre-emphasis processing on the discrete cosine transform spectrum coefficients according to the pre-emphasis coefficients, wherein the pre-emphasis coefficients correspond to the discrete cosine transform spectrum coefficients one to one.
5. The method for recognizing keywords based on deep learning of claim 1, wherein before the audio receiving end decodes the audio code stream, the method further comprises:
acquiring the Mel frequency cepstrum coefficients corresponding to a plurality of audio files respectively;
and training a deep neural network model according to the Mel frequency cepstrum coefficient and the keywords corresponding to the audio file to obtain deep neural network model parameters, so that after the Mel frequency cepstrum coefficient is input into the deep neural network model, the accuracy rate of the keywords corresponding to the Mel frequency cepstrum coefficient is greater than or equal to a preset threshold value through setting the deep neural network model parameters.
6. A keyword recognition system based on deep learning, comprising:
the audio decoding module is used for only performing a transform domain noise shaping decoding step in a standard decoding process when decoding an audio code stream to obtain a discrete cosine transform spectrum coefficient corresponding to the audio code stream;
the characteristic extraction module is used for extracting the characteristics of the discrete cosine transform spectrum coefficient to obtain a Mel frequency cepstrum coefficient;
and the neural network model processing module is used for processing the Mel frequency cepstrum coefficient according to a pre-trained deep neural network model to obtain the keyword probability corresponding to the audio code stream.
7. The system of claim 6, wherein in the feature extraction module, the system performs pre-emphasis processing on the discrete cosine transform spectral coefficients in the frequency domain, and directly performs energy spectrum operation processing after the pre-emphasis processing, thereby omitting the time-domain to frequency-domain conversion process between the pre-emphasis processing and the energy spectrum operation processing.
8. The deep learning based keyword recognition system of claim 6, wherein the pre-emphasis process in the feature extraction module comprises:
extracting corresponding pre-emphasis coefficients from a pre-established pre-emphasis coefficient table;
and carrying out pre-emphasis processing on the discrete cosine transform spectrum coefficients according to the pre-emphasis coefficients, wherein the pre-emphasis coefficients correspond to the discrete cosine transform spectrum coefficients one to one.
9. A computer readable storage medium storing computer instructions, wherein the computer instructions are operative to perform the deep learning based keyword recognition method of any one of claims 1 to 5.
10. A computer device comprising a processor and a memory, the memory storing computer instructions, wherein: the processor operates the computer instructions to perform the deep learning based keyword recognition method of any one of claims 1-5.
CN202111389758.5A 2021-11-23 2021-11-23 Keyword recognition method, system, medium, and apparatus based on deep learning Pending CN113823277A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111389758.5A CN113823277A (en) 2021-11-23 2021-11-23 Keyword recognition method, system, medium, and apparatus based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111389758.5A CN113823277A (en) 2021-11-23 2021-11-23 Keyword recognition method, system, medium, and apparatus based on deep learning

Publications (1)

Publication Number Publication Date
CN113823277A 2021-12-21

Family

ID=78919685

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111389758.5A Pending CN113823277A (en) 2021-11-23 2021-11-23 Keyword recognition method, system, medium, and apparatus based on deep learning

Country Status (1)

Country Link
CN (1) CN113823277A (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090254352A1 (en) * 2005-12-14 2009-10-08 Matsushita Electric Industrial Co., Ltd. Method and system for extracting audio features from an encoded bitstream for audio classification
CN102982804A (en) * 2011-09-02 2013-03-20 杜比实验室特许公司 Method and system of voice frequency classification
CN104251934A (en) * 2013-06-26 2014-12-31 华为技术有限公司 Harmonic analysis method and apparatus, and method and apparatus for determining clutter in harmonic wave
CN108847217A (en) * 2018-05-31 2018-11-20 平安科技(深圳)有限公司 A kind of phonetic segmentation method, apparatus, computer equipment and storage medium
CN109599093A (en) * 2018-10-26 2019-04-09 北京中关村科金技术有限公司 Keyword detection method, apparatus, equipment and the readable storage medium storing program for executing of intelligent quality inspection
CN111462757A (en) * 2020-01-15 2020-07-28 北京远鉴信息技术有限公司 Data processing method and device based on voice signal, terminal and storage medium


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114286451A (en) * 2021-12-23 2022-04-05 咻享智能(深圳)有限公司 Wireless communication instruction compression method and related device
CN114286451B (en) * 2021-12-23 2023-09-22 咻享智能(深圳)有限公司 Wireless communication instruction compression method and related device

Similar Documents

Publication Publication Date Title
CN113724725B (en) Bluetooth audio squeal detection suppression method, device, medium and Bluetooth device
CN104575504A (en) Method for personalized television voice wake-up by voiceprint and voice identification
US20180233127A1 (en) Enhanced speech generation
WO2014114048A1 (en) Voice recognition method and apparatus
JP2014142627A (en) Voice identification method and device
CN104766608A (en) Voice control method and voice control device
CN111681663B (en) Method, system, storage medium and device for reducing audio coding computation amount
CN111145763A (en) GRU-based voice recognition method and system in audio
CN112289328A (en) Method and system for determining audio coding rate
CN111540342A (en) Energy threshold adjusting method, device, equipment and medium
CN114420140B (en) Frequency band expansion method, encoding and decoding method and system based on generation countermeasure network
CN207603881U (en) A kind of intelligent sound wireless sound box
CN113823277A (en) Keyword recognition method, system, medium, and apparatus based on deep learning
Beritelli et al. A pattern recognition system for environmental sound classification based on MFCCs and neural networks
CN109427336B (en) Voice object recognition method and device
CN104978960A (en) Photographing method and device based on speech recognition
CN109741761B (en) Sound processing method and device
Li et al. A high-performance auditory feature for robust speech recognition.
CN110797008B (en) Far-field voice recognition method, voice recognition model training method and server
CN114582361B (en) High-resolution audio coding and decoding method and system based on generation countermeasure network
CN114121004B (en) Voice recognition method, system, medium and equipment based on deep learning
CN114121004A (en) Speech recognition method, system, medium, and apparatus based on deep learning
Vicente-Peña et al. Band-pass filtering of the time sequences of spectral parameters for robust wireless speech recognition
CN112509556B (en) Voice awakening method and device
CN109273003A (en) Sound control method and system for automobile data recorder

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211221