CN115101063A

CN115101063A - Low-computation-power voice recognition method, device, equipment and medium

Info

Publication number: CN115101063A
Application number: CN202211014435.2A
Authority: CN
Inventors: 李�杰; 王广新; 杨汉丹
Original assignee: Shenzhen Youjie Zhixin Technology Co ltd
Current assignee: Shenzhen Youjie Zhixin Technology Co ltd
Priority date: 2022-08-23
Filing date: 2022-08-23
Publication date: 2022-09-23
Anticipated expiration: 2042-08-23
Also published as: CN115101063B

Abstract

The low-computation-power voice recognition method provided by the application obtains a plurality of voice signals by performing operations such as framing and windowing on input voice, combining a plurality of voice signals to form a voice window, then placing a plurality of voice windows into a trained deep neural network awakening model for processing to obtain a posterior probability of a column of phoneme mark types corresponding to each voice window, then combining the posterior probabilities of the plurality of voice windows to form a decoding matrix, and then calculating the score of a specified keyword (awakening word) in the decoding matrix to judge whether the input voice has the awakening word, which is different from the traditional convolution neural network model which adopts a mode of processing and calculating large-scale high-computation-power operation by one frame, and can solve the problems of chips used by some artificial intelligent Internet of things devices in the prior art, the architecture design of the system does not support parallelism, so that the voice wake-up algorithm cannot be deployed on the chip.

Description

Low-computational-power voice recognition method, device, equipment and medium

Technical Field

The present application relates to the field of speech recognition interaction, and in particular, to a low-computation-effort speech recognition method, apparatus, device, and medium.

Background

With the popularization of computer technology, people's lives have gradually entered the intelligent era nowadays. Not only computers, mobile phones, PADs, but also people's clothes and food residents all begin to apply emerging intelligent technologies, such as smart televisions, smart navigation, smart homes, and the like; more and more intelligent devices introduce a voice interaction control system, people do not need to control the intelligent devices by using hands or a remote controller, and the like, but send a specific voice awakening instruction to the intelligent devices to awaken the intelligent devices to enter a voice control mode, and the operation of the intelligent devices can be controlled by voice;

voice wakeup is usually based on convolution calculation to recognize and acquire wakeup words, therefore, most voice wakeup algorithms contain convolution calculation, and convolution acceleration depends on parallelism, however, some AIOT devices (artificial intelligence internet of things devices) use chips whose architecture design does not support parallelism and cannot perform real-time convolution calculation, which poses a challenge on how to deploy voice wakeup algorithms on such chips.

Disclosure of Invention

The application provides a low-computation-power voice recognition method, device, equipment and medium, and aims to solve the problem that some chips used by artificial intelligence Internet of things equipment in the prior art cannot be deployed with a voice wake-up algorithm due to the fact that the structural design of the chips does not support parallelism.

In order to solve the above technical problem, in a first aspect, the present application provides a low-computation-effort speech recognition method, including:

performing frame windowing on input voice to obtain a plurality of voice signals;

combining the plurality of voice signals to form a plurality of voice windows;

inputting the plurality of voice windows into a trained deep neural network awakening model for processing to obtain a plurality of posterior probabilities of a list of phoneme mark types matched with the voice windows;

combining the posterior probabilities of the plurality of speech windows to form a decoding matrix;

and calculating the score of the appointed keyword in the decoding matrix, and if the score exceeds a preset threshold value, judging that the decoding matrix contains the appointed keyword.

Preferably, the step of performing frame-division windowing on the input speech to obtain a plurality of speech signals includes:

acquiring input voice;

and carrying out windowing and framing processing on the voice at a voice window with a preset window length and a preset time interval to obtain a plurality of voice signals.

Preferably, the step of combining the plurality of speech signals to form a plurality of speech windows includes:

and combining the plurality of voice signals in sequence according to a preset window step length by taking the first voice signal as an initial voice signal and the preset window length as a window length to form a plurality of voice windows.

Preferably, the deep neural network wake-up model includes a feature input layer, a hidden layer, an output layer and an attention layer, wherein the hidden layer includes at least one fully-connected layer, and a nonlinear activation function is provided behind the fully-connected layer of each layer.

Preferably, the step of inputting the plurality of speech windows into a trained deep neural network wake-up model for processing to obtain a plurality of posterior probabilities of a list of phoneme symbol types matching the speech windows includes:

inputting the multiple voice windows into a trained deep neural network awakening model according to a preset sequence for processing, wherein the processing method comprises the following steps:

acquiring a corresponding voice window during processing, and generating a current voice window;

acquiring a preset number of voice windows closest to the current voice window, and generating historical voice windows;

splicing the current voice window and the historical voice window on the full connection layer to generate a spliced voice window;

multiplying the spliced voice window by a preset weight matrix, and summing in the window direction of the spliced voice window to obtain a summation result;

and inputting the summation result into the full-connection layer to obtain the posterior probability of a list of phoneme mark types of the current voice window.

Preferably, the training step of the deep neural network wake model includes:

training an initial model based on a deep neural network based on a time sequence classification training criterion by using a universal corpus until the initial model network converges to obtain a preliminary convergence model based on the deep neural network;

and using the general corpus and the specific awakening word corpus to train the primary convergence model for the second time until the primary convergence model converges for the second time, so as to obtain the deep neural network awakening model.

Preferably, the step of calculating the score of the specified keyword in the decoding matrix includes:

acquiring the length of the decoding matrix;

and when the length of the decoding matrix exceeds a preset length, calculating the score of the specified keyword in the decoding matrix according to a preset step length.

In a second aspect, the present application further provides a low computational power speech recognition apparatus comprising:

the frame windowing module is used for performing frame windowing processing on input voice to obtain a plurality of voice signals;

the voice window generating module is used for combining the voice signals to form a plurality of voice windows;

the posterior probability calculation module is used for inputting the plurality of voice windows into a trained deep neural network awakening model for processing to obtain a plurality of posterior probabilities of a list of phoneme mark types matched with the voice windows;

a decoding matrix generating module for combining the posterior probabilities of the plurality of speech windows to form a decoding matrix;

and the judging module is used for calculating the score of the specified keyword in the decoding matrix, and if the score exceeds a preset threshold value, judging that the specified keyword is contained in the decoding matrix.

In a third aspect, the present application further provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the low-computation speech recognition method described in any one of the above when executing the computer program.

In a fourth aspect, the present application further provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the low-computation speech recognition method of any of the above.

The application provides a low-computation speech recognition method, which comprises the steps of obtaining a plurality of speech signals by performing operations such as framing and windowing on input speech, combining the speech signals to form speech, inputting the speech into a trained deep neural network wake-up model in a plurality of speech windows for processing to obtain a posterior probability of a column of phoneme mark types corresponding to each speech window, combining the posterior probabilities of the speech windows to form a decoding matrix, and calculating scores of specified keywords (namely wake-up words) in the decoding matrix to judge whether wake-up words exist in the input speech, wherein the method is different from the traditional wake-up algorithm which depends on a convolutional neural network and needs a chip to support an operator with parallelization acceleration, but converts the speech signals of one frame into the form of the speech windows, reduces the computation requirement on the chip, and can solve the problems of chips used by some AIOT equipment in the prior art, the architecture design does not support parallelism, so that the voice wake-up algorithm cannot be deployed on the chip.

Drawings

FIG. 1 is a flow diagram of a low-computation speech recognition method according to an embodiment;

FIG. 2 is a schematic diagram of a low computational power speech recognition apparatus according to an embodiment;

FIG. 3 is a block diagram illustrating a computer device according to an embodiment.

The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of and not restrictive on the broad application.

As used herein, the singular forms "a", "an", "the" and "the" include plural referents unless the content clearly dictates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, units, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, units, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.

It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

Referring to fig. 1, a low-computation-power speech recognition method provided in an embodiment of the present application includes:

s1: performing framing and windowing processing on input voice to obtain a plurality of voice signals;

s2: combining the plurality of voice signals to form a plurality of voice windows;

s3: inputting the multiple voice windows into a trained deep neural network awakening model for processing to obtain a plurality of posterior probabilities of a column of phoneme mark types matched with the voice windows;

s4: combining the posterior probabilities of the plurality of speech windows to form a decoding matrix;

s5: and calculating the score of the appointed keyword in the decoding matrix, and if the score exceeds a preset threshold value, judging that the decoding matrix contains the appointed keyword.

As described in step S1, before feature extraction is performed on the speech input into the chip, windowing and framing operations need to be performed on the speech. Specifically, a speech with a window length of 25 milliseconds may be preset, and a frame shift between adjacent frames is 10 milliseconds, that is, each frame is divided into 10 milliseconds every 10 milliseconds, and each 10 millisecond speech is divided into a speech signal, further, since the phoneme duration of each language phoneme unit is different, the frame shift length may be set according to the language of the input speech to include the duration of the phoneme of the pronunciation unit;

as described in step S2, the input speech is already divided into a plurality of speech signals in step S1, and therefore, a plurality of speech signals are combined to form a plurality of speech windows, the window length of a speech window can be set by itself, the moving step length of a speech window can also be set by itself, the moving step length of a speech window can be half of the window length, in some embodiments, the moving step length can be one third of the window length or another proportion, generally, the moving step length of a speech window is smaller than the window length, and in this embodiment, the moving step length of a speech window, the window length and the length of a speech signal are not limited. Specifically, the windowing may be performed according to a time frame of the voice signal, that is, starting from a first frame obtained by dividing, if each voice signal is 10 milliseconds, the window length of the voice window is 30 milliseconds, and the step length of the voice window is 10 milliseconds, the first voice window includes the first voice signal, the second voice signal, and the third voice signal, and after the voice window moves rightward for 10 milliseconds, the second voice window formed includes the second voice signal, the third voice signal, and the fourth voice signal;

specifically, a voice window formed after performing windowing and framing operation on online recorded voice is extracted voice features, and the voice features may be filter bank (hereinafter, referred to as fbank) features, or other voice features, such as Mel frequency cepstrum Coefficient (hereinafter, referred to as MFCC) features, which are not limited in this embodiment, and in the embodiment of the present invention, the voice features are fbank features, and 40-dimensional fbank is used, but the dimension of fbank may be set according to an actual usage scenario;

as described in step S3, after step S2, a plurality of speech windows including speech features may be obtained, and all the speech windows are sequentially input into the trained deep neural network wake-up model, the deep neural network wake-up model may correspond to the phoneme label type preset in the deep neural network wake-up model according to the phonemes in the speech window, and after weighting and averaging of the deep neural network wake-up model, the posterior probability of the phoneme mark type corresponding to each speech window may be calculated, and after one speech window is calculated by the deep neural network wake-up model, the posterior probability of a list of phoneme mark types may be obtained; the posterior probability of the phoneme mark types in a column can be obtained after the processing of the deep neural network awakening model;

as described in the above step S4, after the step S3, a plurality of a row of the posterior probabilities of the phoneme symbol types are obtained, and the posterior probabilities of the phoneme symbol types in the row are combined to form a decoding matrix;

as described in the foregoing step S5, after the foregoing step S4, the score of the specified keyword may be calculated in the decoding matrix, and if the score of the specified keyword in the decoding matrix exceeds a certain threshold, it is determined that the decoding matrix contains the specified keyword, that is, it indicates that the speech contains the wakeup word;

therefore, in the scheme, a plurality of speech windows are formed after the input speech is subjected to framing and windowing processing, then the plurality of speech windows are placed into a trained deep neural network wake-up model for processing, a column of posterior probabilities of phoneme mark types corresponding to each speech window is obtained, then the posterior probabilities of the plurality of speech windows are combined, a decoding matrix is formed, and then scores of specified keywords (wake-up words) in the decoding matrix are calculated to judge whether the input speech has the wake-up words.

In one embodiment, the step S1 of performing frame windowing on the input speech, and acquiring a plurality of speech signals includes:

acquiring input voice;

As described above, when performing feature extraction on speech input into a chip, it is necessary to perform windowing and framing on the speech. Specifically, a speech with a window length of 25 ms may be preset, and a frame shift between adjacent frames is 10 ms, that is, every 10 ms is divided into one frame, and every 10 ms is divided into one speech signal, and further, because the phoneme duration of each language phoneme unit is different, the frame shift length may be set according to the language of the input speech to include the phoneme duration of the pronunciation unit.

In one embodiment, the step S2 of combining the plurality of speech signals to form a plurality of speech windows includes:

As described above, after the input speech is framed, a plurality of speech signals are combined to form a plurality of speech windows, each speech window is the speech feature of the extracted speech signal, the window length of each speech window can be set by itself, the moving step length of each speech window can also be set by itself, the moving step length of each speech window can be half the window length, in some embodiments, the moving step length can also be one third of the window length or other occupation ratio, generally, the moving step length of each speech window can be smaller than the window length, and in this embodiment, the moving step length of each speech window, the window length and the length of the speech signal are not limited. Specifically, the windowing may be performed in accordance with the time frame of the speech signal, that is, from the first frame of the division, if each speech signal is 10 ms, the window length of the speech window is 30 ms, and the step length of the speech window is 10 ms, the first speech window includes the first speech signal, the second speech signal, and the third speech signal, and after the speech window is moved rightward for 10 ms, the second speech window formed includes the second speech signal, the third speech signal, and the fourth speech signal.

In one embodiment, the deep neural network wake-up model comprises a characteristic input layer, a hidden layer, an output layer and an attention layer, wherein the hidden layer comprises at least one fully-connected layer, and a nonlinear activation function is arranged behind the fully-connected layer of each layer.

As described above, the deep neural network wake model includes a feature input layer, a hidden layer, an output layer, and an attention layer, the feature input layer is used for inputting a voice window containing voice features, the hidden layer is a full connection layer, the attention layer can enable the deep neural network wake-up model to pay more attention to a specific part of the input model so as to more effectively complete the task at hand, i.e., the model is guided to extract phonemes of a specific word in the input speech, the number of the hidden layers may be one or more, a relu nonlinear activation function is connected to each fc (fully connected layer), and the specific number of layers can depend on the computational power of a chip and the limitation of a Deep Neural Network (DNN) model; the backward propagation of fc (fully-connected layer) network with the number of layers within 5 is effective, so in the present embodiment, the number of hidden layers is preferably 3, and too many layers may cause the gradient vanishing accuracy to decrease instead.

In one embodiment, the step of inputting the plurality of speech windows into a trained deep neural network wake model for processing to obtain a plurality of posterior probabilities S5 of a list of phoneme symbol types matching the speech windows includes:

inputting the plurality of voice windows into a trained deep neural network awakening model according to a preset sequence for processing, wherein the processing method comprises the following steps:

acquiring a voice window corresponding to the voice window during processing, and generating a current voice window;

As mentioned above, the plurality of voice windows are put into the trained deep neural network wake-up model according to the preset sequence for processing, that is, the generated voice windows are put into the trained deep neural network wake-up model in turn according to the sequence of the generated voice windows in the moving direction of the windows for processing, the first voice window and the Nth voice window … … are generated in turn according to the moving direction of the windows, the first voice window is put into the deep neural network for processing first, then the second voice window, … … and the Nth voice window are put into the deep neural network wake-up model for processing in turn, meanwhile, the server marks the voice window corresponding to the voice window being processed by the deep neural network wake-up model as the current voice window, and marks the preset number of voice windows processed before the current voice window as the historical voice window, wherein the history voice window is the feature of an active layer (full connection layer) before outputting the phoneme classification probability after being processed by the neural network, specifically, if the deep neural network wake-up model is currently processing a fifth voice window, the server marks the preset number of voice windows before the fifth voice window, such as the second, third and fourth voice windows as history voice windows (the preset number is 3, and the specific preset number can be set by itself without limitation), when the deep neural network wake-up model processes the fifth voice window, the voices in the second, third and fourth voice windows and the fifth voice window are spliced and combined in the full connection layer close to the output layer before outputting the phoneme classification probability after being processed by the neural network to generate a spliced voice window, the features extracted by the 3 historical windows are used as auxiliary information of the current voice window to be processed together; therefore, the size of the receptive field is improved, and the accuracy is further improved; meanwhile, due to the characteristics after compression, the calculation amount and the storage occupation are greatly reduced; specifically, in this embodiment, 10 frames are selected for one-time prediction, where the 10 frames are 100ms, that is, 0.1s, and are sufficient to cover the duration of one phoneme, the fbank features of the 10 frames are spliced, and the dimension of each frame is 40 dimensions, then the input speech window is 10 × 40 as an input feature, where the selection of the 10 frames may be set according to different languages to include the duration of the phoneme of its pronunciation unit, and during streaming processing, the step size may be set to stride =5, so as to reduce the amount of calculation and ensure the accuracy of recognition. The general step length is half of the window length, after the historical speech window and the current speech window are spliced, the weight is learned, namely the historical speech window and the current speech window are multiplied by a weight matrix of 4 x 64, the weight matrix is a learnable matrix, then summation is carried out in the window direction, then the summation result is input into a full connection layer for feature processing, the features are compressed into 64 dimensions, the posterior probability of a column of phoneme mark types of the current speech window can be obtained through calculation, and the calculated amount can be reduced.

In one embodiment, the training step of the deep neural network wake model includes:

As described above, when the deep neural network wake-up model is trained, an initial model based on the deep neural network is trained by using a gradient descent method based on a CTC timing Classification training (connected Temporal Classification) criterion until the network converges, where the network converges refers to a gradual decrease in loss value, and the conditions for stopping training by the network may be as follows: for example, the word error rate of the target wer (word error rate) of the speech recognition training is not decreased any more, or a certain number of epochs (iteration times) are trained; the preliminary convergence model based on the deep neural network can be obtained, then, the training corpus is replaced to be the general corpus and the specific awakening word corpus, in each batch (the number of training samples), the training corpus is combined according to a certain proportion, for example, the proportion of the general corpus is 30% and the proportion of the specific corpus is 70% to the preliminary convergence model, the final deep neural network awakening model is obtained after the preliminary convergence model reaches the second convergence, so that the recognition capability of the deep neural network awakening model to the general corpus can be kept, the recognition capability of the specific awakening word is improved, and meanwhile, overfitting of the deep neural network awakening model can be prevented.

In one embodiment, the step of calculating the score of the specified keyword in the decoding matrix comprises:

acquiring the length of the decoding matrix;

As described above, the posterior probabilities of a column of phoneme symbol types predicted by each speech window are combined to form a decoding matrix, a 2-second decoding matrix can be processed once, and the step size of the decoding matrix processing can be set to stride = 100 ms; and calculating the score of the appointed keyword in the decoding matrix for the decoding matrix of each window, and if the score exceeds a certain threshold value, determining that the decoding matrix contains the awakening word (namely the appointed keyword).

Referring to fig. 2, in a second aspect, the present application further provides a low-computation speech recognition apparatus comprising:

a framing and windowing module 100, configured to perform framing and windowing on input speech to obtain multiple speech signals;

a speech window generating module 200, configured to combine the multiple speech signals to form multiple speech windows;

a posterior probability calculation module 300, configured to input the multiple speech windows into a trained deep neural network wake-up model for processing, so as to obtain posterior probabilities of a list of phoneme symbol types matching the speech windows;

a decoding matrix generating module 400, configured to combine the a posteriori probabilities of the multiple speech windows to form a decoding matrix;

the determining module 500 is configured to calculate a score of the specified keyword in the decoding matrix, and if the score exceeds a preset threshold, determine that the decoding matrix contains the specified keyword.

Referring to fig. 3, a computer device, which may be a server and whose internal structure may be as shown in fig. 3, is also provided in the embodiment of the present application. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the computer designed processor is used to provide computational and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing low-computational speech recognition data and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a low-computational speech recognition method.

Those skilled in the art will appreciate that the architecture shown in fig. 3 is only a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects may be applied.

An embodiment of the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a low-computational speech recognition method. It is to be understood that the computer-readable storage medium in the present embodiment may be a volatile-readable storage medium or a non-volatile-readable storage medium.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by hardware instructions of a computer program, which may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium provided herein and used in the examples may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), dual data rate SDRAM (SSRSDRAM), Enhanced SDRAM (ESDRAM), synchronous Link (Synchlink) DRAM (SLDRAM), Rambus (Rambus) direct RAM (RDRAM), direct bused dynamic RAM (DRDRAM), and bused dynamic RAM (RDRAM).

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, apparatus, article or method that comprises the element.

The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all the equivalent structures or equivalent processes that can be directly or indirectly applied to other related technical fields by using the contents of the specification and the drawings of the present application are also included in the scope of the present application.

Claims

1. A low-computation speech recognition method, comprising:

combining the plurality of voice signals to form a plurality of voice windows;

inputting the multiple voice windows into a trained deep neural network awakening model for processing to obtain a plurality of posterior probabilities of a column of phoneme mark types matched with the voice windows;

2. The low-computational-power speech recognition method of claim 1, wherein the step of performing frame windowing on the input speech to obtain the plurality of speech signals comprises:

acquiring input voice;

3. The low-computational speech recognition method of claim 1 wherein the step of combining the plurality of speech signals to form a plurality of speech windows comprises:

4. The low-computation voice recognition method of claim 1, wherein the deep neural network wake model comprises a feature input layer, an implied layer, an output layer and an attention layer, wherein the implied layer comprises at least one fully-connected layer, and a non-linear activation function is provided after the fully-connected layer of each layer.

5. The method of claim 4, wherein the step of inputting the plurality of speech windows into a trained deep neural network wake model for processing to obtain a plurality of a posteriori probabilities of a list of phoneme token types matching the speech windows comprises:

6. The low-computational speech recognition method of claim 1 wherein the training step of the deep neural network wake model comprises:

and using the general linguistic data and the specific awakening word linguistic data to carry out secondary training on the primary convergence model until the primary convergence model is converged for the second time, and obtaining the deep neural network awakening model.

7. The low-computational speech recognition method of claim 1 wherein the step of calculating a score for a given keyword in the decoding matrix comprises:

acquiring the length of the decoding matrix;

8. A low-computation speech recognition apparatus, comprising:

a decoding matrix generating module, configured to combine the posterior probabilities of the multiple speech windows to form a decoding matrix;

9. A computer device, characterized in that it comprises a memory in which a computer program is stored and a processor which, when executing said computer program, implements the steps of the low-computation speech recognition method of any one of claims 1 to 7.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the low-computation speech recognition method of any one of claims 1 to 7.