CN113823277A - Keyword recognition method, system, medium, and apparatus based on deep learning - Google Patents

Keyword recognition method, system, medium, and apparatus based on deep learning

Info

Publication number
CN113823277A
Authority
CN
China
Prior art keywords
audio
emphasis
decoding
discrete cosine
code stream
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111389758.5A
Other languages
Chinese (zh)
Inventor
李强
朱勇
王尧
叶东翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Barrot Wireless Co Ltd
Original Assignee
Barrot Wireless Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Barrot Wireless Co Ltd filed Critical Barrot Wireless Co Ltd
Priority to CN202111389758.5A priority Critical patent/CN113823277A/en
Publication of CN113823277A publication Critical patent/CN113823277A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The application discloses a keyword recognition method, system, medium, and device based on deep learning, belonging to the technical field of audio decoding. The method comprises the following steps: when an audio receiving end decodes an audio code stream, performing the standard decoding process only as far as the transform domain noise shaping decoding step to obtain the discrete cosine transform spectral coefficients corresponding to the audio code stream; performing feature extraction on the discrete cosine transform spectral coefficients to obtain the Mel frequency cepstrum coefficients; and processing the Mel frequency cepstrum coefficients with a pre-trained deep neural network model to obtain the keyword probabilities corresponding to the audio code stream. The audio code stream to be decoded is thus only partially decoded, yielding intermediate parameters, and the intermediate parameters are processed by the pre-trained deep neural network model to obtain the keyword probabilities corresponding to the audio code stream. The computation-heavy remainder of the audio decoding process and the audio time-frequency conversion step are omitted, which saves power and increases the speed of keyword recognition.

Description

Keyword recognition method, system, medium, and apparatus based on deep learning
Technical Field
The present application relates to the field of audio decoding technologies, and in particular, to a method, a system, a medium, and an apparatus for keyword recognition based on deep learning.
Background
In the prior art, wireless audio has many typical application scenarios. A Bluetooth-based remote controller, for example, is widely used in smart home products, and the general flow is as follows: a user issues a voice control command, such as 'turn on the air conditioner'; the command passes through microphone acquisition, analog-to-digital conversion, audio preprocessing, and an audio encoder to produce an audio compression packet, which is finally sent out through a wireless communication module. The wireless communication module at the receiving end receives the audio compression packet and calls an audio decoder to generate audio PCM, and the keyword recognition module then identifies keywords such as 'turn on the air conditioner' and converts them into the corresponding control signals for the household appliance. In this process of recognizing keywords in the user's voice command at the audio decoding end, the decoding process of the audio decoder involves a frequency-domain to time-domain conversion, and the keyword recognition module involves the inverse time-domain to frequency-domain conversion.
Disclosure of Invention
The application provides a keyword recognition method, system, medium, and device based on deep learning, aimed at the problem that, in the prior art, an audio receiving end recognizing keywords in a voice signal repeats a processing step with a large computation amount, which slows down keyword recognition and increases power consumption.
In one embodiment of the present application, a keyword recognition method based on deep learning is provided, including: when the audio receiving end decodes the audio code stream, performing the standard decoding process only as far as the transform domain noise shaping decoding step to obtain the discrete cosine transform spectral coefficients corresponding to the audio code stream; performing feature extraction on the discrete cosine transform spectral coefficients to obtain the Mel frequency cepstrum coefficients; and processing the Mel frequency cepstrum coefficients with a pre-trained deep neural network model to obtain the keyword probabilities corresponding to the audio code stream.
Optionally, performing only the decoding steps up to transform domain noise shaping in the standard decoding process to obtain the discrete cosine transform spectral coefficients corresponding to the audio code stream includes: decoding the audio code stream according to the standard decoding process, sequentially performing code stream parsing, arithmetic and residual decoding, noise filling and global gain, time domain noise decoding, and transform domain noise shaping decoding to obtain the discrete cosine transform spectral coefficients, where the actual decoding process includes neither the frequency-domain to time-domain conversion nor the long-term post-filter processing.
Optionally, performing feature extraction on the discrete cosine transform spectral coefficients in the frequency domain to obtain the Mel frequency cepstrum coefficients includes: performing pre-emphasis processing on the discrete cosine transform spectral coefficients and then proceeding directly to the energy spectrum computation, so that the time-domain to frequency-domain conversion between the pre-emphasis processing and the energy spectrum computation is omitted.
Optionally, the pre-emphasis processing includes: extracting the corresponding pre-emphasis coefficients from a pre-established pre-emphasis coefficient table; and pre-emphasizing the discrete cosine transform spectral coefficients according to the pre-emphasis coefficients, where the pre-emphasis coefficients correspond one-to-one to the discrete cosine transform spectral coefficients.
Optionally, before the audio receiving end decodes the audio code stream, the method further includes: acquiring the Mel frequency cepstrum coefficients corresponding to each of a plurality of audio files; and training the deep neural network model with the Mel frequency cepstrum coefficients and the keywords corresponding to the audio files to obtain the deep neural network model parameters, so that, with the model configured by these parameters, the accuracy of the keywords the model produces for input Mel frequency cepstrum coefficients is greater than or equal to a preset threshold.
In one aspect of the present application, a keyword recognition system based on deep learning is provided, including: an audio decoding module, which, when decoding an audio code stream, performs the standard decoding process only as far as transform domain noise shaping decoding to obtain the discrete cosine transform spectral coefficients corresponding to the audio code stream; a feature extraction module, which performs feature extraction on the discrete cosine transform spectral coefficients to obtain the Mel frequency cepstrum coefficients; and a neural network model processing module, which processes the Mel frequency cepstrum coefficients with a pre-trained deep neural network model to obtain the keyword probabilities corresponding to the audio code stream.
In one aspect of the present application, a computer-readable storage medium is provided, where the storage medium stores computer instructions, and the computer instructions are operated to execute the keyword recognition method based on deep learning in the first aspect.
In one aspect of the present application, a computer device is provided, which includes a processor and a memory, where the memory stores computer instructions, wherein: the processor operates the computer instructions to perform the deep learning based keyword recognition method of the first aspect.
The beneficial effects of this application are: the audio code stream to be decoded is only partially decoded to obtain intermediate parameters, and the intermediate parameters are processed by the pre-trained deep neural network model to obtain the keywords corresponding to the audio code stream. The computation-heavy remainder of the decoding process is thereby omitted, saving power and increasing the speed of keyword recognition.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present application; those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 shows a flow chart of audio code stream processing at a Bluetooth receiving end;
FIG. 2 is a flow chart illustrating an embodiment of the deep learning based keyword recognition method according to the present application;
FIG. 3 shows a standard decoding flow for an LC3 audio decoder;
FIG. 4 illustrates a standard recognition flow of the keyword recognition module;
fig. 5 shows a diagram of a pre-emphasis frequency response curve of the pre-emphasis process of the present application;
FIG. 6 is a flow chart illustrating an example of the deep learning based keyword recognition method of the present application;
FIG. 7 is a schematic diagram illustrating an embodiment of the deep learning based keyword recognition system of the present application.
The above drawings show specific embodiments of the present application, which are described in more detail below. These drawings and the written description are not intended to limit the scope of the inventive concept in any way, but rather to illustrate the inventive concept to those skilled in the art by reference to specific embodiments.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the above-described drawings (if any) are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of steps or elements is not necessarily limited to those elements explicitly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
Currently, the mainstream Bluetooth audio codecs are as follows:
    • SBC: mandatory in the A2DP protocol and the most widely used; every Bluetooth audio device must support it, but its sound quality is mediocre.
    • AAC-LC: good sound quality and wide application, supported by many mainstream mobile phones; compared with SBC, however, it occupies more memory and has higher computational complexity, while many Bluetooth devices are embedded platforms with limited battery capacity, weak processors, and limited memory, and the patent fees are high.
    • aptX series: good sound quality but a high code rate; aptX needs 384 kbps and aptX-HD 576 kbps. It is proprietary Qualcomm technology and relatively closed.
    • LDAC: better sound quality but also a very high code rate of 330 kbps, 660 kbps, or 990 kbps. Because the wireless environment around Bluetooth devices is especially complex, sustaining such high code rates is difficult. It is proprietary Sony technology and likewise closed.
    • LHDC: good sound quality but also a high code rate, typically 400 kbps, 600 kbps, or 900 kbps, which places high demands on the baseband/radio-frequency design of Bluetooth.
For the above reasons, the Bluetooth SIG, together with numerous manufacturers, introduced LC3, mainly for Bluetooth Low Energy but also usable with classic Bluetooth. Its advantages of low delay, high sound quality, high coding gain, and freedom from special fees in the Bluetooth field have drawn manufacturers' attention. The LC3 audio codec mainly targets Bluetooth Low Energy and therefore has strict power-consumption requirements, so in LC3 applications reducing power consumption is key.
At present, wireless audio is widely applied, particularly in voice control of smart homes. In the prior art, the flow of a Bluetooth remote controller performing voice control of an air conditioner, for example, is roughly as follows:
First, the user issues a voice control instruction, such as 'turn on the air conditioner'. The voice command is captured by a microphone and passes through analog-to-digital conversion, audio preprocessing, and audio encoding to generate an audio compression packet, which is finally sent out through the wireless communication module. At the receiving end, after the wireless communication module receives the audio compression packet, an audio decoder is called to decode it into audio PCM, and the keyword recognition module performs keyword recognition, finally yielding the instruction 'turn on the air conditioner', which controls the air conditioner to turn on.
Fig. 1 shows a flow chart of audio code stream processing at a Bluetooth receiving end. The audio decoder and the keyword recognition module are the two most critical modules at the Bluetooth receiving end. The processing in the audio decoder comprises: code stream parsing; arithmetic and residual decoding, noise filling, and global gain; time domain noise decoding; transform domain noise shaping decoding; frequency-domain to time-domain conversion, namely a low-delay modified inverse discrete cosine transform; and the filtering of a long-term post-filter. The processing flow of the keyword recognition module comprises feature extraction, deep neural network processing, and the corresponding post-processing. The keyword feature extraction part comprises: pre-emphasis processing; windowing; time-domain to frequency-domain conversion, typically a discrete Fourier transform; an energy spectrum; a Mel filter bank; and a logarithmic transformation followed by a discrete cosine transform, finally generating the Mel frequency cepstrum coefficients (MFCC). The keywords corresponding to the audio code stream are then obtained from the Mel frequency cepstrum coefficients through deep neural network processing and the corresponding post-processing.
As can be seen from the above, the decoding process of the audio decoder and the keyword feature extraction of the keyword recognition module contain mutually inverse operations: a frequency-domain to time-domain conversion and a time-domain to frequency-domain conversion. Either conversion consumes a large amount of computation, resulting in high power consumption.
To solve this problem, in the present application the audio decoder at the audio receiving end performs only part of the standard decoding process on the audio code stream: after the discrete cosine transform spectral coefficients corresponding to the audio code stream are obtained, the subsequent frequency-domain to time-domain conversion is no longer performed, and the keyword recognition module accordingly no longer needs a time-domain to frequency-domain conversion. The two most computation-intensive steps are thus omitted, reducing the amount of computation and accelerating keyword recognition.
To this end, the present application provides a keyword recognition method based on deep learning, including: when the audio receiving end decodes the audio code stream, performing the standard decoding process only as far as the transform domain noise shaping decoding step to obtain the discrete cosine transform spectral coefficients corresponding to the audio code stream; performing feature extraction on the discrete cosine transform spectral coefficients to obtain the Mel frequency cepstrum coefficients; and processing the Mel frequency cepstrum coefficients with a pre-trained deep neural network model to obtain the keyword probabilities corresponding to the audio code stream.
In the keyword recognition method of the present application, the frequency-domain to time-domain conversion step is removed from the audio decoder, so the time-domain to frequency-domain conversion in keyword recognition is omitted as well, which reduces the amount of computation and speeds up keyword recognition. The discrete cosine transform spectral coefficients produced by the audio decoder are processed by a deep neural network model to obtain the corresponding keywords directly. The deep neural network model must be trained in advance on discrete cosine transform spectral coefficients and their corresponding keywords, which improves the accuracy of the model's processing of the spectral coefficients and yields accurate keywords.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 2 is a flowchart illustrating an embodiment of the deep learning-based keyword recognition method according to the present application.
In the embodiment shown in fig. 2, the deep-learning-based keyword recognition method of the present application includes process S201: when the audio receiving end decodes the audio code stream, perform the standard decoding process only as far as transform domain noise shaping decoding, obtaining the discrete cosine transform spectral coefficients corresponding to the audio code stream.
In this embodiment, the audio code stream is decoded according to the standard decoding flow, but only the part of the flow needed to obtain the discrete cosine transform spectral coefficients corresponding to the audio code stream is performed. The remainder of the decoding process is omitted, saving the decoder's computation and power.
Optionally, performing only the decoding steps up to transform domain noise shaping in the standard decoding process to obtain the discrete cosine transform spectral coefficients corresponding to the audio code stream includes: decoding the audio code stream according to the standard decoding process, sequentially performing code stream parsing, arithmetic and residual decoding, noise filling and global gain, time domain noise decoding, and transform domain noise shaping decoding to obtain the discrete cosine transform spectral coefficients, where the actual decoding process includes neither the frequency-domain to time-domain conversion nor the long-term post-filter processing.
In this alternative embodiment, fig. 3 shows the standard decoding flow of the LC3 audio decoder. As shown in fig. 3, the decoding flow of a standard audio decoder includes: code stream parsing; arithmetic and residual decoding, noise filling, and global gain; time domain noise decoding and transform domain noise shaping decoding; and frequency-domain to time-domain conversion followed by the filtering of the long-term post-filter. The frequency-domain to time-domain conversion is a low-delay modified inverse discrete cosine transform, of which the time-domain to frequency-domain conversion in keyword recognition is the inverse operation. Therefore, in the method of the present application, the decoding of the audio code stream ends after transform domain noise shaping decoding, before the frequency-domain to time-domain conversion is performed, and the discrete cosine transform spectral coefficients corresponding to the audio code stream are obtained.
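As a minimal sketch of this truncated control flow (the stage functions below are identity stubs standing in for the real LC3 decoder stages, which the method reuses unchanged; only the orchestration and the stopping point are illustrated):

import numpy as np

# Identity stubs standing in for the real LC3 decoder stages; the method
# reuses those stages as-is, so only the orchestration matters here.
def parse_code_stream(frame):             return frame
def arithmetic_residual_decode(spec):     return spec
def noise_fill_and_global_gain(spec):     return spec
def time_domain_noise_decode(spec):       return spec
def transform_domain_noise_shaping(spec): return spec

def partial_decode(frame: np.ndarray) -> np.ndarray:
    """Run the standard decoding flow only as far as transform domain
    noise shaping decoding, then stop."""
    spec = parse_code_stream(frame)
    spec = arithmetic_residual_decode(spec)
    spec = noise_fill_and_global_gain(spec)
    spec = time_domain_noise_decode(spec)
    spec = transform_domain_noise_shaping(spec)
    # The standard flow would continue with the inverse LD-MDCT and the
    # long-term post-filter; both are skipped here, so `spec` stays in
    # the MDCT domain and feeds feature extraction directly.
    return spec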
In the deep-learning-based keyword recognition method of the present application, the computation-heavy frequency-domain to time-domain conversion and the long-term post-filtering are omitted from audio decoding; the intermediate decoding result, the discrete cosine transform spectral coefficients, is obtained directly, and the subsequent keyword recognition is performed on those coefficients, reducing power consumption and saving computation.
In the embodiment shown in fig. 2, the deep-learning-based keyword recognition method of the present application includes process S202: perform feature extraction on the discrete cosine transform spectral coefficients to obtain the Mel frequency cepstrum coefficients.
In this embodiment, the Mel frequency cepstrum coefficients are obtained by performing feature extraction on the discrete cosine transform spectral coefficients obtained from decoding.
In this embodiment, since the discrete cosine transform spectral coefficients are obtained directly, without the audio decoder performing the frequency-domain to time-domain conversion or the long-term post-filtering, the original standard recognition flow in the keyword recognition module must also be adjusted accordingly.
Optionally, performing feature extraction on the discrete cosine transform spectral coefficients in the frequency domain to obtain the Mel frequency cepstrum coefficients includes: performing pre-emphasis processing on the discrete cosine transform spectral coefficients and then proceeding directly to the energy spectrum computation, so that the time-domain to frequency-domain conversion between the pre-emphasis processing and the energy spectrum computation is omitted.
Fig. 4 shows the standard recognition flow of the keyword recognition module. As shown in fig. 4, the feature extraction part of keyword recognition includes: pre-emphasis; windowing; a time-domain to frequency-domain transform, typically a discrete Fourier transform; an energy spectrum; a Mel filter bank; a logarithmic transformation; and a discrete cosine transform, generating the Mel frequency cepstrum coefficients (MFCC). The recognition flow is adjusted to match the updated decoding flow at the audio decoding end: since the decoding process no longer performs the frequency-domain to time-domain conversion, the keyword recognition flow correspondingly drops the time-domain to frequency-domain conversion, saving computation and reducing power consumption. After the adjustment, the feature extraction flow in the keyword recognition module is: pre-emphasis; energy spectrum; Mel filter bank; logarithmic transformation; and discrete cosine transform, generating the Mel frequency cepstrum coefficients (MFCC).
Optionally, the pre-emphasis processing includes: extracting the corresponding pre-emphasis coefficients from a pre-established pre-emphasis coefficient table; and pre-emphasizing the discrete cosine transform spectral coefficients according to those coefficients, the pre-emphasis coefficients corresponding one-to-one to the spectral coefficients. Fig. 5 shows the pre-emphasis frequency response curve of the pre-emphasis processing of the present application; the horizontal axis is frequency and the vertical axis is gain. In a piece of audio, the main energy is concentrated at low frequencies, and the high-frequency part decays quickly. To flatten the low-frequency and high-frequency parts for keyword recognition, pre-emphasis is applied: low-frequency energy is attenuated and high-frequency energy is emphasized.
In this optional embodiment, the discrete cosine transform spectral coefficients obtained by decoding at the receiving end have not undergone the frequency-domain to time-domain conversion, so the pre-emphasis in the keyword recognition module must be adjusted accordingly: the discrete cosine transform spectral coefficients are pre-emphasized in the frequency domain. The corresponding pre-emphasis coefficients are first extracted from the pre-established pre-emphasis coefficient table, and the spectral coefficients are then pre-emphasized according to the one-to-one correspondence.
Specifically, taking a 16 kHz sampling rate as an example, the pre-emphasis frequency response is stored as a pre-emphasis coefficient table p(0), p(1), p(2), …, p(159) at 50 Hz intervals. The discrete cosine transform spectrum has N_F = 160 spectral coefficients X(0), X(1), …, X(159). Pre-emphasis is performed according to the pre-emphasis formula, scaling each spectral coefficient by its corresponding table entry:

X_pre(k) = p(k) · X(k), 0 ≤ k < N_F
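A minimal sketch of this frequency-domain pre-emphasis, assuming the 16 kHz, 160-coefficient configuration above. The concrete table values p(k) are not given in the text, so the table here is derived, purely for illustration, from the classic first-order pre-emphasis filter 1 - 0.97z^-1 sampled at the 50 Hz bin centres:

import numpy as np

FS = 16000      # sampling rate (Hz)
NF = 160        # MDCT spectral coefficients per frame
BIN_HZ = 50     # table spacing: 160 bins x 50 Hz = 8 kHz (Nyquist)

# Illustrative pre-emphasis table: magnitude response of 1 - 0.97 z^-1
# evaluated at each bin centre (an assumption, not the patent's table).
f = np.arange(NF) * BIN_HZ
p = np.abs(1.0 - 0.97 * np.exp(-2j * np.pi * f / FS))

def pre_emphasize(mdct_spec: np.ndarray) -> np.ndarray:
    """Scale each spectral coefficient by its table entry, one-to-one:
    X_pre(k) = p(k) * X(k)."""
    assert mdct_spec.shape == (NF,)
    return mdct_spec * p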
in the method, in the characteristic extraction process of the keyword recognition module, because the windowing step is already carried out when the sending end carries out audio coding, and meanwhile, because the conversion processes from the time domain to the frequency domain and from the frequency domain to the time domain are operated in an inverse mode, the windowing and the conversion process from the time domain to the frequency domain in the standard process are omitted, and therefore power consumption is reduced.
In one example of the present application, the remaining flow of the feature extraction process in the keyword recognition module is briefly introduced as follows:

In the energy spectrum step, pseudo-spectrum coefficients P(k) are first generated from the pre-emphasized spectral coefficients (the pseudo-spectrum formula, and its special-case values at the boundary bins k = 0 and k = N_F - 1, appear only as images in the original publication). Next, the energy spectrum E(k) is generated from the pseudo-spectrum coefficients. This step and the previous one may be combined in embodiments to further save computation; they are separated here for ease of description.
It should be noted that the invention does not restrict whether the pseudo-spectrum coefficients are used; the energy spectrum for keyword recognition can also be generated directly from the MDCT discrete cosine transform spectral coefficients. However, because the energy distribution of the MDCT pseudo-spectrum corresponds more closely to that of the Fourier transform spectrum, computing the energy spectrum from the pseudo-spectrum improves training and recognition performance.
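Because the exact pseudo-spectrum formula survives only as an image in the source text, the sketch below substitutes a commonly used MDCT-domain approximation, in which the missing sine-transform (MDST) component of bin k is estimated from the neighbouring MDCT bins; the formula is therefore an assumption about the shape of the computation, not the patent's own expression:

import numpy as np

def energy_spectrum(spec: np.ndarray, use_pseudo: bool = True) -> np.ndarray:
    """Energy spectrum E(k) from (pre-emphasized) MDCT coefficients.

    With use_pseudo=False this is simply X(k)^2; with use_pseudo=True an
    MDST estimate from neighbouring bins is added, which tracks the
    Fourier power spectrum more closely (assumed formula)."""
    e = spec ** 2
    if use_pseudo:
        mdst_est = np.zeros_like(spec)
        mdst_est[1:-1] = 0.5 * (spec[2:] - spec[:-2])  # boundary bins: 0 (assumed)
        e = e + mdst_est ** 2
    return e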
The processing at the Mel filter bank is as follows: the spectral energy is passed through the Mel filter bank to obtain the energy of each channel,

E_mel(m) = Σ_k H_m(k) · E(k)

where H_m(k) is the m-th Mel filter. The Mel filter bank consists of a series of triangular filters; this is mature technology and is not described in detail here.
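A sketch of such a triangular Mel filter bank over the 160 bins at 50 Hz spacing; the channel count n_mels = 40 is an assumed value, since the patent does not fix it:

import numpy as np

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels: int = 40, n_bins: int = 160, bin_hz: float = 50.0):
    """Return H with shape (n_mels, n_bins); row m is the m-th triangular
    Mel filter H_m(k), so channel energies are simply H @ E."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(n_bins * bin_hz), n_mels + 2)
    bins = mel_to_hz(mel_pts) / bin_hz          # fractional bin positions
    H = np.zeros((n_mels, n_bins))
    k = np.arange(n_bins)
    for m in range(1, n_mels + 1):
        lo, ctr, hi = bins[m - 1], bins[m], bins[m + 1]
        rising = (k - lo) / (ctr - lo)
        falling = (hi - k) / (hi - ctr)
        H[m - 1] = np.clip(np.minimum(rising, falling), 0.0, None)
    return H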
The logarithmic transformation is:

L(m) = log(E_mel(m))
discrete cosine transform process: generating a Mel frequency cepstrum coefficient (MFCC for short), wherein the calculation formula is as follows:
Figure 558058DEST_PATH_IMAGE011
and D is the dimension of the MFCC feature.
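The log and DCT stages reduce the channel energies to the MFCC vector. The sketch below is a direct transcription of the two formulas above, with the feature dimension D = 13 as an assumed value:

import numpy as np

def mfcc_from_energy(E: np.ndarray, H: np.ndarray, D: int = 13,
                     eps: float = 1e-10) -> np.ndarray:
    """Energy spectrum -> Mel channel energies -> log -> DCT, i.e.
    E_mel(m) = sum_k H_m(k)*E(k), L(m) = log(E_mel(m)),
    MFCC(d) = sum_m L(m)*cos(pi*d*(m + 0.5)/M)."""
    L = np.log(H @ E + eps)                     # eps guards against log(0)
    M = L.shape[0]
    m = np.arange(M)
    return np.array([np.sum(L * np.cos(np.pi * d * (m + 0.5) / M))
                     for d in range(D)])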
In the embodiment shown in fig. 2, the deep-learning-based keyword recognition method of the present application includes process S203: identify the Mel frequency cepstrum coefficients with the pre-trained deep neural network model, obtaining the keyword probabilities corresponding to the audio code stream.
In this embodiment, the Mel frequency cepstrum coefficients are processed by the pre-trained deep neural network model, and the keyword probabilities corresponding to the audio code stream are obtained from the correspondence between Mel frequency cepstrum coefficients and keywords. In the subsequent keyword processing, a keyword processing module at the audio receiving end determines the final keyword from these probabilities and then controls the corresponding device. For example, if a section of audio code stream processed by this keyword recognition method yields a probability of 90% for the keyword 'turn on' the air conditioner and 20% for the keyword 'raise the temperature', the subsequent keyword processing module determines from these probabilities that the keyword is 'turn on' and controls the air conditioner to turn on. The present application does not specifically limit the subsequent processing the keyword processing module performs on the keyword probabilities.
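The post-processing described here might look like the following sketch, where the 0.5 confidence threshold is an illustrative assumption rather than a value fixed by the patent:

def pick_keyword(probs: dict, threshold: float = 0.5):
    """Choose the most probable keyword if it clears the threshold,
    otherwise report no keyword."""
    word, p = max(probs.items(), key=lambda kv: kv[1])
    return word if p >= threshold else None

# e.g. pick_keyword({"turn on": 0.90, "raise the temperature": 0.20})
# returns "turn on", which the keyword processing module maps to a
# control signal for the appliance.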
Optionally, before the audio receiving end decodes the audio code stream, the method further includes: acquiring the Mel frequency cepstrum coefficients corresponding to each of a plurality of audio files; and training the deep neural network model with the Mel frequency cepstrum coefficients and the keywords corresponding to the audio files to obtain the deep neural network model parameters, so that, with the model configured by these parameters, the accuracy of the keywords the model produces for input Mel frequency cepstrum coefficients is greater than or equal to a preset threshold.
In this optional embodiment, the deep neural network model is trained in advance. During training, a large corpus of voice-material audio files is processed on an offline PC or server to obtain the Mel frequency cepstrum coefficients corresponding to the audio files, and these coefficients, together with the audio files' keywords, serve as training samples to establish the correspondence between Mel frequency cepstrum coefficients and keywords. The principle of extracting the Mel frequency cepstrum coefficients of an audio file on the PC or server follows the audio decoding process described above and can be adapted to the specific situation. Note that the processing device for the audio files, such as a PC or a suitable server, may run the deep neural network model directly: the audio files are then fed straight into the model for training, the correspondence between Mel frequency cepstrum coefficients and keywords is established, and the resulting deep neural network model parameters are obtained for later use in keyword inference.
When testing the training result, the model is configured with the deep neural network model parameters, a certain number of audio files are input into the model, and the correspondence between the audio files and their keywords is tallied. Specifically, the preset threshold may be chosen as 95%: once the model is configured with the trained parameters, the accuracy with which the deep neural network model maps Mel frequency cepstrum coefficients to the corresponding keywords in actual keyword recognition is greater than or equal to 95%, ensuring the accuracy of the keywords obtained from the model. The deep neural network model may be a convolutional neural network (CNN), a deep neural network (DNN), a recurrent neural network (RNN), or a long short-term memory network (LSTM); these are only examples of deep neural networks, and the present application does not specifically limit the choice of network.
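As an illustration, a minimal convolutional keyword classifier over MFCC inputs in PyTorch; the input shape (49 frames of 13 MFCCs) and the class count are assumptions, and the patent does not prescribe any particular architecture:

import torch
import torch.nn as nn

class KeywordNet(nn.Module):
    """Minimal CNN mapping an MFCC patch to per-keyword probabilities."""
    def __init__(self, n_keywords: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, n_keywords),
        )

    def forward(self, mfcc: torch.Tensor) -> torch.Tensor:
        # mfcc: (batch, n_frames, n_mfcc), e.g. (batch, 49, 13)
        logits = self.net(mfcc.unsqueeze(1))    # add channel dimension
        return torch.softmax(logits, dim=-1)    # per-keyword probabilities

Training then fits the model to (MFCC, keyword) pairs with a cross-entropy objective until the held-out accuracy reaches the preset threshold.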
Specifically, after the deep neural network model is trained and its parameters obtained, the model used in recognition is configured with those parameters for subsequent keyword inference, and the keywords corresponding to the audio code stream are then inferred, improving both the speed of the inference process and the accuracy of keyword recognition.
The application discloses a deep neural network model training method that is based on the discrete cosine transform of the audio file, with the Mel frequency cepstrum coefficients obtained on that basis, whereas the neural network models in the prior art are based on the fast Fourier transform, with the Mel frequency cepstrum coefficients obtained from it. Building on this training method, the keyword recognition method greatly reduces the computation of audio decoding and feature extraction during actual keyword recognition, avoiding the prior art's frequency-domain to time-domain and time-domain to frequency-domain conversions, which reduces power consumption and saves computation.
Fig. 6 shows a flowchart of an example of the deep-learning-based keyword recognition method of the present application. As shown in fig. 6, compared with the prior art, the method omits the frequency-domain to time-domain conversion in decoding and correspondingly omits the time-domain to frequency-domain conversion in feature extraction, eliminating the two steps that demand the most computation and power, thereby saving computation and speeding up keyword recognition.
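Chaining the earlier sketches (pre_emphasize, energy_spectrum, mel_filterbank, and mfcc_from_energy, assumed to be in scope) reproduces the trimmed per-frame pipeline of fig. 6:

import numpy as np

np.random.seed(0)
mdct_spec = np.random.randn(160)   # stand-in for one partially decoded frame
H = mel_filterbank()
feat = mfcc_from_energy(energy_spectrum(pre_emphasize(mdct_spec)), H)
print(feat.shape)                  # (13,): one MFCC vector per frame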
The method of the present application can be used for Bluetooth Low Energy audio as well as classic Bluetooth, and in the Bluetooth field and other wireless communication fields, particularly for keyword recognition. It makes full use of the existing information and algorithm modules of the audio decoder: omitting the low-delay modified inverse discrete cosine transform and the long-term post-filter from decoding reduces the decoder's computational complexity, and omitting windowing and time-frequency conversion reduces the complexity of the keyword-recognition feature extraction module. On this basis, power consumption is greatly reduced and device battery life is extended; the program space and code space required by the related modules are saved, reducing device cost. The method can be applied to the combination of a Bluetooth remote controller and a smart home, performing keyword recognition on the audio code stream encoded by the Bluetooth remote controller to control the smart home. All of the above, and applications thereof, fall within the protection scope of the present application.
In the deep-learning-based keyword recognition method of the present application, the computation-heavy frequency-domain to time-domain conversion of the standard decoding process is omitted, and the corresponding time-domain to frequency-domain conversion in keyword recognition is omitted as well: the Mel frequency cepstrum coefficients are obtained directly from the discrete cosine transform spectral coefficients, and a pre-trained deep neural network model processes those coefficients to obtain the keywords corresponding to the audio code stream. The method drops the most computation-intensive parts of keyword recognition, saving computation, reducing power consumption, and increasing recognition speed, which is especially significant for Bluetooth Low Energy with its strict power requirements. In addition, recognizing keywords through a deep neural network model guarantees recognition accuracy.
FIG. 7 is a schematic diagram illustrating an embodiment of the deep learning based keyword recognition system of the present application.
In the embodiment shown in fig. 7, the deep-learning-based keyword recognition system of the present application includes: an audio decoding module 701, which, when the audio receiving end decodes an audio code stream, performs the standard decoding process only as far as the transform domain noise shaping decoding step to obtain the discrete cosine transform spectral coefficients corresponding to the audio code stream; a feature extraction module 702, which performs feature extraction on the discrete cosine transform spectral coefficients to obtain the Mel frequency cepstrum coefficients; and a neural network model processing module 703, which processes the Mel frequency cepstrum coefficients with a pre-trained deep neural network model to obtain the keyword probabilities corresponding to the audio code stream.
Optionally, the feature extraction module performs pre-emphasis on the discrete cosine transform spectral coefficients in the frequency domain and proceeds directly to the energy spectrum computation afterwards, omitting the time-domain to frequency-domain conversion between the pre-emphasis processing and the energy spectrum computation.
Optionally, the pre-emphasis processing in the feature extraction module includes: extracting the corresponding pre-emphasis coefficients from a pre-established pre-emphasis coefficient table; and pre-emphasizing the discrete cosine transform spectral coefficients according to those coefficients, the pre-emphasis coefficients corresponding one-to-one to the spectral coefficients. The specific execution is consistent with the description in the keyword recognition method and is not repeated here. In this embodiment, removing the frequency-domain to time-domain conversion from the audio decoder allows the time-domain to frequency-domain conversion to be omitted from keyword recognition, reducing computation and speeding up recognition. Feature extraction uses the discrete cosine transform spectral coefficients produced by the audio decoder, and the deep neural network model processes the resulting MFCC features to obtain the corresponding keyword probabilities directly. When the deep neural network model performs keyword inference on the MFCC features, it can be configured with the predetermined deep neural network model parameters, improving the speed and accuracy of the inference process. The operating principle of the deep-learning-based keyword recognition system is similar to that of the keyword recognition method and is not repeated.
In a particular embodiment of the present application, a computer-readable storage medium stores computer instructions, and the computer instructions are operable to perform the deep-learning-based keyword recognition method described in any of the embodiments. The storage medium may reside directly in hardware, in a software module executed by a processor, or in a combination of the two.
A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium.
The Processor may be a Central Processing Unit (CPU), other general-purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), other Programmable logic devices, discrete Gate or transistor logic, discrete hardware components, or any combination thereof. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one embodiment of the present application, a computer device includes a processor and a memory, the memory storing computer instructions, wherein: the processor operates the computer instructions to perform the deep learning based keyword recognition method described in any of the embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above embodiments are merely examples, which are not intended to limit the scope of the present disclosure, and all equivalent structural changes made by using the contents of the specification and the drawings, or any other related technical fields, are also included in the scope of the present disclosure.

Claims (10)

1. A keyword recognition method based on deep learning is characterized by comprising the following steps:
when an audio receiving end decodes an audio code stream, only performing a transform domain noise shaping decoding step in a standard decoding process to obtain a discrete cosine transform spectrum coefficient corresponding to the audio code stream;
performing feature extraction on the discrete cosine transform spectrum coefficient to obtain a Mel frequency cepstrum coefficient;
and processing the mel frequency cepstrum coefficient according to a pre-trained deep neural network model to obtain the keyword probability corresponding to the audio code stream.
2. The method for recognizing keywords based on deep learning of claim 1, wherein the step of performing transform domain noise shaping decoding only in a standard decoding process to obtain discrete cosine transform spectral coefficients corresponding to the audio code stream comprises:
and decoding the audio code stream according to the standard decoding process, and obtaining the discrete cosine transform spectrum coefficient after sequentially performing code stream analysis, arithmetic and residual decoding, noise filling and global gain, time domain noise decoding and transform domain noise shaping decoding, wherein the actual decoding process does not comprise a frequency domain and time domain conversion process and a long-term post-filter processing process.
3. The method for recognizing keywords based on deep learning of claim 1, wherein the performing feature extraction on the discrete cosine transform spectral coefficients to obtain mel-frequency cepstrum coefficients comprises:
and pre-emphasis processing is carried out on the discrete cosine transform spectrum coefficient in a frequency domain, and energy spectrum operation processing is directly carried out after the pre-emphasis processing, so that the conversion process from a time domain to a frequency domain between the pre-emphasis processing and the energy spectrum operation processing is omitted.
4. The method of claim 3, wherein the pre-emphasis process comprises:
extracting corresponding pre-emphasis coefficients from a pre-established pre-emphasis coefficient table;
and carrying out pre-emphasis processing on the discrete cosine transform spectrum coefficients according to the pre-emphasis coefficients, wherein the pre-emphasis coefficients correspond to the discrete cosine transform spectrum coefficients one to one.
5. The method for recognizing keywords based on deep learning of claim 1, wherein before the audio receiving end decodes the audio code stream, the method further comprises:
acquiring the Mel frequency cepstrum coefficients corresponding to a plurality of audio files respectively;
and training a deep neural network model according to the Mel frequency cepstrum coefficient and the keywords corresponding to the audio file to obtain deep neural network model parameters, so that after the Mel frequency cepstrum coefficient is input into the deep neural network model, the accuracy rate of the keywords corresponding to the Mel frequency cepstrum coefficient is greater than or equal to a preset threshold value through setting the deep neural network model parameters.
6. A keyword recognition system based on deep learning, comprising:
the audio decoding module is used for only performing a transform domain noise shaping decoding step in a standard decoding process when decoding an audio code stream to obtain a discrete cosine transform spectrum coefficient corresponding to the audio code stream;
the characteristic extraction module is used for extracting the characteristics of the discrete cosine transform spectrum coefficient to obtain a Mel frequency cepstrum coefficient;
and the neural network model processing module is used for processing the Mel frequency cepstrum coefficient according to a pre-trained deep neural network model to obtain the keyword probability corresponding to the audio code stream.
7. The system of claim 6, wherein in the feature extraction module, the system performs pre-emphasis processing on the discrete cosine transform spectral coefficients in the frequency domain, and directly performs energy spectrum operation processing after the pre-emphasis processing, thereby omitting the time-domain to frequency-domain conversion process between the pre-emphasis processing and the energy spectrum operation processing.
8. The deep learning based keyword recognition system of claim 6, wherein the pre-emphasis process in the feature extraction module comprises:
extracting corresponding pre-emphasis coefficients from a pre-established pre-emphasis coefficient table;
and carrying out pre-emphasis processing on the discrete cosine transform spectrum coefficients according to the pre-emphasis coefficients, wherein the pre-emphasis coefficients correspond to the discrete cosine transform spectrum coefficients one to one.
9. A computer readable storage medium storing computer instructions, wherein the computer instructions are operative to perform the deep learning based keyword recognition method of any one of claims 1 to 5.
10. A computer device comprising a processor and a memory, the memory storing computer instructions, wherein: the processor operates the computer instructions to perform the deep learning based keyword recognition method of any one of claims 1-5.
CN202111389758.5A 2021-11-23 2021-11-23 Keyword recognition method, system, medium, and apparatus based on deep learning Pending CN113823277A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111389758.5A CN113823277A (en) 2021-11-23 2021-11-23 Keyword recognition method, system, medium, and apparatus based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111389758.5A CN113823277A (en) 2021-11-23 2021-11-23 Keyword recognition method, system, medium, and apparatus based on deep learning

Publications (1)

Publication Number Publication Date
CN113823277A 2021-12-21

Family

ID=78919685

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111389758.5A Pending CN113823277A (en) 2021-11-23 2021-11-23 Keyword recognition method, system, medium, and apparatus based on deep learning

Country Status (1)

Country Link
CN (1) CN113823277A (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090254352A1 (en) * 2005-12-14 2009-10-08 Matsushita Electric Industrial Co., Ltd. Method and system for extracting audio features from an encoded bitstream for audio classification
CN102982804A (en) * 2011-09-02 2013-03-20 杜比实验室特许公司 Method and system of voice frequency classification
CN104251934A (en) * 2013-06-26 2014-12-31 华为技术有限公司 Harmonic analysis method and apparatus, and method and apparatus for determining clutter in harmonic wave
CN108847217A (en) * 2018-05-31 2018-11-20 平安科技(深圳)有限公司 A kind of phonetic segmentation method, apparatus, computer equipment and storage medium
CN109599093A (en) * 2018-10-26 2019-04-09 北京中关村科金技术有限公司 Keyword detection method, apparatus, equipment and the readable storage medium storing program for executing of intelligent quality inspection
CN111462757A (en) * 2020-01-15 2020-07-28 北京远鉴信息技术有限公司 Data processing method and device based on voice signal, terminal and storage medium


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114286451A (en) * 2021-12-23 2022-04-05 咻享智能(深圳)有限公司 Wireless communication instruction compression method and related device
CN114286451B (en) * 2021-12-23 2023-09-22 咻享智能(深圳)有限公司 Wireless communication instruction compression method and related device

Similar Documents

Publication Publication Date Title
CN113724725B (en) Bluetooth audio squeal detection suppression method, device, medium and Bluetooth device
CN104575504A (en) Method for personalized television voice wake-up by voiceprint and voice identification
US20180233127A1 (en) Enhanced speech generation
WO2014114048A1 (en) Voice recognition method and apparatus
JP2014142627A (en) Voice identification method and device
CN104766608A (en) Voice control method and voice control device
CN111681663B (en) Method, system, storage medium and device for reducing audio coding computation amount
CN111145763A (en) GRU-based voice recognition method and system in audio
CN112289328A (en) Method and system for determining audio coding rate
CN111540342A (en) Energy threshold adjusting method, device, equipment and medium
CN114420140B (en) Frequency band expansion method, encoding and decoding method and system based on generation countermeasure network
CN207603881U (en) A kind of intelligent sound wireless sound box
CN113823277A (en) Keyword recognition method, system, medium, and apparatus based on deep learning
Beritelli et al. A pattern recognition system for environmental sound classification based on MFCCs and neural networks
CN109427336B (en) Voice object recognition method and device
CN104978960A (en) Photographing method and device based on speech recognition
CN109741761B (en) Sound processing method and device
Li et al. A high-performance auditory feature for robust speech recognition.
CN110797008B (en) Far-field voice recognition method, voice recognition model training method and server
CN114582361B (en) High-resolution audio coding and decoding method and system based on generation countermeasure network
CN114121004B (en) Voice recognition method, system, medium and equipment based on deep learning
CN114121004A (en) Speech recognition method, system, medium, and apparatus based on deep learning
Vicente-Peña et al. Band-pass filtering of the time sequences of spectral parameters for robust wireless speech recognition
CN112509556B (en) Voice awakening method and device
CN109273003A (en) Sound control method and system for automobile data recorder

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211221