CN115101063A - Low-computation-power voice recognition method, device, equipment and medium


Info

Publication number
CN115101063A
Authority
CN
China
Prior art keywords
voice
window
windows
speech
neural network
Prior art date
Legal status
Granted
Application number
CN202211014435.2A
Other languages
Chinese (zh)
Other versions
CN115101063B (en)
Inventor
李�杰
王广新
杨汉丹
Current Assignee
Shenzhen Youjie Zhixin Technology Co., Ltd.
Original Assignee
Shenzhen Youjie Zhixin Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Shenzhen Youjie Zhixin Technology Co., Ltd.
Priority to CN202211014435.2A
Publication of CN115101063A
Application granted
Publication of CN115101063B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/16: Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • User Interface Of Digital Computer (AREA)
  • Image Analysis (AREA)

Abstract

The low-computation-power voice recognition method provided by the application performs framing and windowing on input speech to obtain a plurality of speech signals, combines those signals into speech windows, and feeds the speech windows into a trained deep neural network wake-up model, which outputs a column of posterior probabilities over phoneme label types for each window. The posterior probabilities of the windows are combined into a decoding matrix, and the score of a specified keyword (the wake-up word) is calculated in that matrix to decide whether the input speech contains the wake-up word. Unlike a conventional convolutional neural network model, which processes frame by frame and requires large-scale, high-computation-power parallel operations, this method can run on chips used by some artificial intelligence Internet of Things devices whose architecture does not support parallelism and which therefore could not previously host a voice wake-up algorithm.

Description

Low-computation-power voice recognition method, device, equipment and medium
Technical Field
The present application relates to the field of speech recognition interaction, and in particular to a low-computation-power voice recognition method, apparatus, device, and medium.
Background
With the spread of computer technology, daily life has entered an intelligent era. Beyond computers, mobile phones, and tablets, emerging intelligent technologies now reach into everyday clothing, food, housing, and transportation: smart televisions, smart navigation, smart homes, and the like. More and more intelligent devices include a voice interaction control system: instead of operating the device by hand or with a remote controller, the user issues a specific voice wake-up instruction that wakes the device into a voice control mode, after which its operation can be controlled by voice.
Voice wake-up usually relies on convolution to recognize and extract wake-up words, so most voice wake-up algorithms contain convolution computations, and accelerating convolution depends on parallelism. However, some AIoT devices (artificial intelligence Internet of Things devices) use chips whose architecture does not support parallelism and cannot perform convolution in real time, which makes deploying a voice wake-up algorithm on such chips a challenge.
Disclosure of Invention
The application provides a low-computation-power voice recognition method, device, equipment and medium, aiming to solve the problem that a voice wake-up algorithm cannot be deployed on chips used by some artificial intelligence Internet of Things devices because the chip architecture does not support parallelism.
In order to solve the above technical problem, in a first aspect, the present application provides a low-computation-power voice recognition method, including:
performing framing and windowing on input speech to obtain a plurality of speech signals;
combining the plurality of speech signals to form a plurality of speech windows;
inputting the plurality of speech windows into a trained deep neural network wake-up model for processing to obtain, for each speech window, a column of posterior probabilities over phoneme label types;
combining the posterior probabilities of the plurality of speech windows to form a decoding matrix;
and calculating the score of a specified keyword in the decoding matrix, and if the score exceeds a preset threshold, determining that the decoding matrix contains the specified keyword.
Preferably, the step of performing framing and windowing on the input speech to obtain a plurality of speech signals includes:
acquiring input speech;
and windowing and framing the speech with a preset window length and a preset frame shift to obtain a plurality of speech signals.
Preferably, the step of combining the plurality of speech signals to form a plurality of speech windows includes:
and combining the plurality of speech signals in sequence, taking the first speech signal as the initial signal and the preset window length as the window length, advancing by a preset window step, to form a plurality of speech windows.
Preferably, the deep neural network wake-up model includes a feature input layer, a hidden layer, an output layer and an attention layer, wherein the hidden layer includes at least one fully-connected layer, and each fully-connected layer is followed by a nonlinear activation function.
Preferably, the step of inputting the plurality of speech windows into a trained deep neural network wake-up model for processing to obtain, for each speech window, a column of posterior probabilities over phoneme label types includes:
inputting the plurality of speech windows into the trained deep neural network wake-up model in a preset order for processing, wherein the processing includes:
taking the speech window currently being processed as the current speech window;
taking a preset number of the speech windows closest to the current speech window as historical speech windows;
splicing the current speech window and the historical speech windows at the fully-connected layer to generate a spliced speech window;
multiplying the spliced speech window by a preset weight matrix and summing along the window direction of the spliced speech window to obtain a summation result;
and inputting the summation result into the fully-connected layer to obtain the column of posterior probabilities over phoneme label types for the current speech window.
Preferably, the training step of the deep neural network wake-up model includes:
training an initial deep-neural-network-based model on a general corpus under the connectionist temporal classification (CTC) training criterion until the network converges, to obtain a preliminary convergence model;
and training the preliminary convergence model a second time on the general corpus together with a specific wake-up word corpus until it converges again, to obtain the deep neural network wake-up model.
Preferably, the step of calculating the score of the specified keyword in the decoding matrix includes:
acquiring the length of the decoding matrix;
and when the length of the decoding matrix exceeds a preset length, calculating the score of the specified keyword in the decoding matrix at a preset stride.
In a second aspect, the present application further provides a low-computation-power voice recognition apparatus, comprising:
a framing and windowing module, configured to perform framing and windowing on input speech to obtain a plurality of speech signals;
a speech window generating module, configured to combine the plurality of speech signals to form a plurality of speech windows;
a posterior probability calculation module, configured to input the plurality of speech windows into a trained deep neural network wake-up model for processing to obtain, for each speech window, a column of posterior probabilities over phoneme label types;
a decoding matrix generating module, configured to combine the posterior probabilities of the plurality of speech windows to form a decoding matrix;
and a determination module, configured to calculate the score of the specified keyword in the decoding matrix, and if the score exceeds a preset threshold, determine that the decoding matrix contains the specified keyword.
In a third aspect, the present application further provides a computer device, which includes a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, implements the steps of any of the low-computation-power voice recognition methods described above.
In a fourth aspect, the present application further provides a computer-readable storage medium on which a computer program is stored which, when executed by a processor, performs the steps of any of the low-computation-power voice recognition methods described above.
The application provides a low-computation-power voice recognition method: input speech is framed and windowed to obtain a plurality of speech signals; the signals are combined into speech windows; the speech windows are fed into a trained deep neural network wake-up model, which outputs a column of posterior probabilities over phoneme label types for each window; the posterior probabilities of the windows are combined into a decoding matrix; and the score of a specified keyword (i.e., the wake-up word) is calculated in the decoding matrix to determine whether the wake-up word is present in the input speech. Unlike a conventional wake-up algorithm that relies on a convolutional neural network and requires a chip supporting parallel-accelerated operators, this method converts per-frame speech signals into speech windows, lowering the computational demands on the chip, and thus solves the problem that a voice wake-up algorithm cannot be deployed on chips used by some AIoT devices whose architecture does not support parallelism.
Drawings
FIG. 1 is a flow diagram of a low-computation-power voice recognition method according to an embodiment;
FIG. 2 is a schematic diagram of a low-computation-power voice recognition apparatus according to an embodiment;
FIG. 3 is a block diagram of a computer device according to an embodiment.
The implementation, functional features and advantages of the present application will be further described with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are intended only to illustrate the present application, not to limit it.
As used herein, the singular forms "a", "an" and "the" include plural referents unless the content clearly dictates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, units, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, units, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Referring to fig. 1, a low-computation-power voice recognition method provided in an embodiment of the present application includes:
S1: performing framing and windowing on input speech to obtain a plurality of speech signals;
S2: combining the plurality of speech signals to form a plurality of speech windows;
S3: inputting the plurality of speech windows into a trained deep neural network wake-up model for processing to obtain, for each speech window, a column of posterior probabilities over phoneme label types;
S4: combining the posterior probabilities of the plurality of speech windows to form a decoding matrix;
S5: and calculating the score of the specified keyword in the decoding matrix, and if the score exceeds a preset threshold, determining that the decoding matrix contains the specified keyword.
As described in step S1, before features are extracted from the speech input to the chip, the speech must be windowed and framed. Specifically, a window length of 25 milliseconds may be preset with a frame shift of 10 milliseconds between adjacent frames, i.e., a new frame starts every 10 milliseconds, and each frame yields one speech signal. Further, since phoneme durations differ between languages, the frame shift may be set according to the language of the input speech so as to cover the duration of a phoneme of its pronunciation unit.
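As a minimal sketch of this framing step (the 16 kHz sample rate and Hamming taper are assumptions, not fixed by this embodiment):

```python
import numpy as np

def frame_signal(samples: np.ndarray, sample_rate: int = 16000,
                 win_ms: float = 25.0, shift_ms: float = 10.0) -> np.ndarray:
    """Split a 1-D waveform into overlapping 25 ms frames taken every 10 ms."""
    win = int(sample_rate * win_ms / 1000)      # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)  # 160 samples at 16 kHz
    assert len(samples) >= win, "input shorter than one frame"
    n = 1 + (len(samples) - win) // shift       # number of full frames
    hamming = np.hamming(win)                   # taper to soften frame edges
    return np.stack([samples[i * shift : i * shift + win] * hamming
                     for i in range(n)])        # shape: (n, win)
```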
as described in step S2, the input speech is already divided into a plurality of speech signals in step S1, and therefore, a plurality of speech signals are combined to form a plurality of speech windows, the window length of a speech window can be set by itself, the moving step length of a speech window can also be set by itself, the moving step length of a speech window can be half of the window length, in some embodiments, the moving step length can be one third of the window length or another proportion, generally, the moving step length of a speech window is smaller than the window length, and in this embodiment, the moving step length of a speech window, the window length and the length of a speech signal are not limited. Specifically, the windowing may be performed according to a time frame of the voice signal, that is, starting from a first frame obtained by dividing, if each voice signal is 10 milliseconds, the window length of the voice window is 30 milliseconds, and the step length of the voice window is 10 milliseconds, the first voice window includes the first voice signal, the second voice signal, and the third voice signal, and after the voice window moves rightward for 10 milliseconds, the second voice window formed includes the second voice signal, the third voice signal, and the fourth voice signal;
Specifically, speech features are extracted from the speech windows formed by windowing and framing the online-recorded speech. The speech features may be filter bank (hereinafter fbank) features, or other speech features such as Mel-frequency cepstral coefficient (hereinafter MFCC) features; this embodiment does not limit the choice. In the embodiment of the present invention the features are 40-dimensional fbank features, though the fbank dimension may be set according to the actual usage scenario.
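A sketch of the 40-dimensional fbank extraction; the python_speech_features library is an assumed choice (the embodiment does not name one):

```python
import numpy as np
from python_speech_features import logfbank  # assumed library choice

def extract_fbank(samples: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    """40-dimensional log filter-bank (fbank) features, one row per frame,
    using the 25 ms window / 10 ms frame shift from the embodiment."""
    return logfbank(samples, samplerate=sample_rate,
                    winlen=0.025, winstep=0.01, nfilt=40)  # (num_frames, 40)
```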
as described in step S3, after step S2, a plurality of speech windows including speech features may be obtained, and all the speech windows are sequentially input into the trained deep neural network wake-up model, the deep neural network wake-up model may correspond to the phoneme label type preset in the deep neural network wake-up model according to the phonemes in the speech window, and after weighting and averaging of the deep neural network wake-up model, the posterior probability of the phoneme mark type corresponding to each speech window may be calculated, and after one speech window is calculated by the deep neural network wake-up model, the posterior probability of a list of phoneme mark types may be obtained; the posterior probability of the phoneme mark types in a column can be obtained after the processing of the deep neural network awakening model;
as described in the above step S4, after the step S3, a plurality of a row of the posterior probabilities of the phoneme symbol types are obtained, and the posterior probabilities of the phoneme symbol types in the row are combined to form a decoding matrix;
as described in the foregoing step S5, after the foregoing step S4, the score of the specified keyword may be calculated in the decoding matrix, and if the score of the specified keyword in the decoding matrix exceeds a certain threshold, it is determined that the decoding matrix contains the specified keyword, that is, it indicates that the speech contains the wakeup word;
therefore, in the scheme, a plurality of speech windows are formed after the input speech is subjected to framing and windowing processing, then the plurality of speech windows are placed into a trained deep neural network wake-up model for processing, a column of posterior probabilities of phoneme mark types corresponding to each speech window is obtained, then the posterior probabilities of the plurality of speech windows are combined, a decoding matrix is formed, and then scores of specified keywords (wake-up words) in the decoding matrix are calculated to judge whether the input speech has the wake-up words.
In one embodiment, the step S1 of performing framing and windowing on the input speech to obtain a plurality of speech signals includes:
acquiring input speech;
and windowing and framing the speech with a preset window length and a preset frame shift to obtain a plurality of speech signals.
As described above, when features are extracted from the speech input to the chip, the speech must be windowed and framed. Specifically, a window length of 25 ms may be preset with a frame shift of 10 ms between adjacent frames, i.e., a new frame starts every 10 ms, and each frame yields one speech signal; further, since phoneme durations differ between languages, the frame shift may be set according to the language of the input speech so as to cover the duration of a phoneme of its pronunciation unit.
In one embodiment, the step S2 of combining the plurality of speech signals to form a plurality of speech windows includes:
and combining the plurality of speech signals in sequence, taking the first speech signal as the initial signal and the preset window length as the window length, advancing by a preset window step, to form a plurality of speech windows.
As described above, after the input speech is framed, the speech signals are combined to form a plurality of speech windows, each holding the speech features extracted from its speech signals. The window length and the moving step of a speech window can both be set freely; the moving step may be half the window length, in some embodiments one third or another proportion, and in general smaller than the window length. This embodiment does not limit the moving step, the window length, or the length of a speech signal. Specifically, windowing may follow the time frames of the speech signals, starting from the first frame: if each speech signal is 10 ms, the window length is 30 ms, and the window step is 10 ms, then the first speech window contains the first, second, and third speech signals, and after the window moves right by 10 ms, the second speech window contains the second, third, and fourth speech signals.
In one embodiment, the deep neural network wake-up model comprises a feature input layer, a hidden layer, an output layer and an attention layer, wherein the hidden layer comprises at least one fully-connected layer, and each fully-connected layer is followed by a nonlinear activation function.
As described above, the deep neural network wake-up model includes a feature input layer, a hidden layer, an output layer, and an attention layer. The feature input layer receives a speech window containing speech features; the hidden layer is a fully-connected layer; and the attention layer lets the model attend to the part of the input most relevant to the task at hand, i.e., it guides the model to extract the phonemes of a specific word in the input speech. There may be one or more hidden layers, each fc (fully-connected) layer followed by a ReLU nonlinear activation function; the exact number of layers depends on the computational power of the chip and the limits of the deep neural network (DNN) model. Back-propagation through an fc network is effective up to about 5 layers, so in this embodiment 3 hidden layers are preferred; with too many layers, vanishing gradients would instead reduce accuracy.
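A sketch of this topology in PyTorch; the hidden width of 64 matches the compressed feature mentioned later, while the number of phoneme labels is an assumption, and the attention step over historical windows is shown separately further below:

```python
import torch
import torch.nn as nn

class DnnWakeModel(nn.Module):
    """Feature input -> 3 fully-connected hidden layers, each followed by a
    ReLU -> output layer over phoneme label types."""
    def __init__(self, in_dim: int = 10 * 40, hidden: int = 64, num_labels: int = 40):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.out = nn.Linear(hidden, num_labels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 400), a flattened 10-frame x 40-dim fbank speech window
        h = self.hidden(x)                  # 64-dim hidden feature per window
        return self.out(h).log_softmax(-1)  # column of phoneme log-posteriors
```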
In one embodiment, the step of inputting the plurality of speech windows into a trained deep neural network wake-up model for processing to obtain, for each speech window, a column of posterior probabilities over phoneme label types includes:
inputting the plurality of speech windows into the trained deep neural network wake-up model in a preset order for processing, wherein the processing includes:
taking the speech window currently being processed as the current speech window;
taking a preset number of the speech windows closest to the current speech window as historical speech windows;
splicing the current speech window and the historical speech windows at the fully-connected layer to generate a spliced speech window;
multiplying the spliced speech window by a preset weight matrix and summing along the window direction of the spliced speech window to obtain a summation result;
and inputting the summation result into the fully-connected layer to obtain the column of posterior probabilities over phoneme label types for the current speech window.
As mentioned above, the speech windows are put into the trained deep neural network wake-up model in a preset order, i.e., in the order in which the windows are generated along the window's direction of movement: the first speech window is processed first, then the second, and so on up to the Nth. Meanwhile, the server marks the window currently being processed as the current speech window, and marks the preset number of windows processed before it as historical speech windows; a historical window contributes the features of the activation layer (fully-connected layer) just before the phoneme classification probabilities are output. Specifically, if the model is currently processing the fifth speech window, the server marks the preset number of preceding windows, e.g. the second, third and fourth, as historical windows (here the preset number is 3; it can be set freely and is not limited). When the model processes the fifth window, the features of the second, third, fourth and fifth windows are spliced at the fully-connected layer next to the output layer, before the phoneme classification probabilities are output, to generate a spliced speech window; the features extracted from the 3 historical windows serve as auxiliary information for the current window. This enlarges the receptive field and thus improves accuracy, while the compressed features greatly reduce computation and memory usage. Specifically, in this embodiment 10 frames are used per prediction; 10 frames span 100 ms, i.e. 0.1 s, enough to cover the duration of one phoneme. The fbank features of the 10 frames are spliced, each frame being 40-dimensional, so the input speech window is a 10 x 40 input feature. The choice of 10 frames may be adjusted per language to cover the duration of a phoneme of its pronunciation unit, and during streaming processing the step may be set to stride = 5 to reduce computation while preserving recognition accuracy.
The step is generally half the window length. After the historical speech windows and the current speech window are spliced, a weighting is learned: the spliced block is multiplied by a 4 x 64 weight matrix, which is a learnable matrix, and then summed along the window direction; the summation result is fed into a fully-connected layer for feature processing, compressing the features to 64 dimensions, from which the column of posterior probabilities over phoneme label types for the current speech window is computed, reducing the amount of calculation.
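A sketch of this splice-and-weight step; treating the multiplication as element-wise is an assumption consistent with "multiplying ... and summing in the window direction":

```python
import torch
import torch.nn as nn

class HistoryAttention(nn.Module):
    """The 64-dim hidden features of the current window and its 3 most recent
    historical windows are stacked into a 4 x 64 block, multiplied by the
    learnable 4 x 64 weight matrix, and summed along the window direction."""
    def __init__(self, num_windows: int = 4, feat_dim: int = 64):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_windows, feat_dim))

    def forward(self, spliced: torch.Tensor) -> torch.Tensor:
        # spliced: (batch, 4, 64) -- historical features plus the current one
        return (spliced * self.weight).sum(dim=1)  # (batch, 64)
```

The 64-dim summation result then passes through the final fully-connected layer (self.out in the model sketch above) to yield the column of phoneme posteriors.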
In one embodiment, the training step of the deep neural network wake-up model includes:
training an initial deep-neural-network-based model on a general corpus under the connectionist temporal classification (CTC) training criterion until the network converges, to obtain a preliminary convergence model;
and training the preliminary convergence model a second time on the general corpus together with a specific wake-up word corpus until it converges again, to obtain the deep neural network wake-up model.
As described above, when the deep neural network wake-up model is trained, an initial deep-neural-network-based model is trained by gradient descent under the CTC (Connectionist Temporal Classification) criterion until the network converges, where convergence means the loss value gradually decreasing. The conditions for stopping training may be, for example, that the target WER (word error rate) of the speech recognition training no longer decreases, or that a certain number of epochs (iterations) has been trained. This yields the preliminary convergence model. The training corpus is then replaced by the general corpus plus the specific wake-up word corpus, combined in each batch in a certain proportion, e.g. 30% general corpus and 70% specific corpus, and the preliminary convergence model is trained until it converges a second time, giving the final deep neural network wake-up model. This preserves the model's ability to recognize the general corpus while improving recognition of the specific wake-up word, and also prevents the model from overfitting.
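A sketch of one training stage under the CTC criterion; the optimizer choice, blank index, and batch layout are assumptions:

```python
import torch
import torch.nn as nn

def train_stage(model, batches, optimizer, ctc=nn.CTCLoss(blank=0)):
    """One training stage. `feats` is a (T, batch, 400) sequence of flattened
    speech windows; labels follow nn.CTCLoss's expected layout."""
    for feats, labels, feat_lens, label_lens in batches:
        log_probs = torch.stack([model(x) for x in feats])  # (T, batch, labels)
        loss = ctc(log_probs, labels, feat_lens, label_lens)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Stage 1: general corpus until the loss stops improving (first convergence).
# Stage 2: per batch, mix e.g. 30% general corpus with 70% wake-word corpus
# and train until the second convergence, as described above.
```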
In one embodiment, the step of calculating the score of the specified keyword in the decoding matrix comprises:
acquiring the length of the decoding matrix;
and when the length of the decoding matrix exceeds a preset length, calculating the score of the specified keyword in the decoding matrix at a preset stride.
As described above, the columns of phoneme-label posterior probabilities predicted for the speech windows are combined to form a decoding matrix. A 2-second decoding matrix can be processed at a time, and the processing stride of the decoding matrix may be set to stride = 100 ms; for the decoding matrix of each window, the score of the specified keyword is calculated, and if the score exceeds a certain threshold, the decoding matrix is determined to contain the wake-up word (i.e., the specified keyword).
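A sketch of one plausible scoring rule; the monotonic-alignment dynamic program below is an assumption, since the embodiment only specifies that a score is computed and compared with a threshold:

```python
import numpy as np

def keyword_score(decode_matrix: np.ndarray, phoneme_ids: list) -> float:
    """Best monotonic-alignment score of a keyword in a decoding matrix of
    shape (T, num_labels) holding per-window log-posteriors."""
    T, _ = decode_matrix.shape
    K = len(phoneme_ids)
    dp = np.full((T + 1, K + 1), -np.inf)
    dp[:, 0] = 0.0                            # zero phonemes matched so far
    for t in range(1, T + 1):
        for k in range(1, min(t, K) + 1):
            emit = decode_matrix[t - 1, phoneme_ids[k - 1]]
            dp[t, k] = max(dp[t - 1, k],                  # skip this window
                           dp[t - 1, k - 1] + emit)       # emit next phoneme
    return dp[T, K] / K                       # length-normalised score

# Streaming use per the embodiment: once the matrix spans ~2 s, score it,
# then slide forward by stride = 100 ms (10 windows at a 10 ms window step).
```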
Referring to fig. 2, in a second aspect, the present application further provides a low-computation-power voice recognition apparatus, comprising:
a framing and windowing module 100, configured to perform framing and windowing on input speech to obtain a plurality of speech signals;
a speech window generating module 200, configured to combine the plurality of speech signals to form a plurality of speech windows;
a posterior probability calculation module 300, configured to input the plurality of speech windows into a trained deep neural network wake-up model for processing to obtain, for each speech window, a column of posterior probabilities over phoneme label types;
a decoding matrix generating module 400, configured to combine the posterior probabilities of the plurality of speech windows to form a decoding matrix;
a determination module 500, configured to calculate the score of the specified keyword in the decoding matrix, and if the score exceeds a preset threshold, determine that the decoding matrix contains the specified keyword.
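End-to-end glue for the five modules above, reusing the sketches earlier in this description; the function names and the 10-frame / stride-5 windowing are assumptions drawn from the embodiment:

```python
import torch

def recognize(samples, model, wake_word_phonemes, threshold):
    feats = extract_fbank(samples)                     # framing and windowing module
    windows = combine_into_windows(feats, 10, 5)       # speech window generating module
    cols = [model(torch.as_tensor(w, dtype=torch.float32).reshape(1, -1))
            for w in windows]                          # posterior probability module
    matrix = torch.cat(cols).detach().numpy()          # decoding matrix module
    score = keyword_score(matrix, wake_word_phonemes)  # determination module
    return score > threshold                           # wake up if above threshold
```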
Referring to fig. 3, an embodiment of the present application also provides a computer device, which may be a server and whose internal structure may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus, wherein the processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The nonvolatile storage medium stores an operating system, a computer program, and a database; the internal memory provides an environment for running the operating system and the computer program in the nonvolatile storage medium. The database of the computer device stores data such as low-computation-power voice recognition data. The network interface of the computer device communicates with an external terminal through a network connection. The computer program, when executed by the processor, implements a low-computation-power voice recognition method.
Those skilled in the art will appreciate that the architecture shown in fig. 3 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer devices to which the solution may be applied.
An embodiment of the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a low-computation-power voice recognition method. It is to be understood that the computer-readable storage medium in this embodiment may be a volatile or a non-volatile readable storage medium.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program instructing the relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium provided herein and used in the examples may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element introduced by the phrase "comprising a(n) ..." does not exclude the presence of other identical elements in the process, apparatus, article, or method that comprises the element.
The above description is only a preferred embodiment of the present application and is not intended to limit its scope; all equivalent structures or equivalent processes derived from the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, are likewise included within the scope of protection of the present application.

Claims (10)

1. A low-computation-power voice recognition method, comprising:
performing framing and windowing on input speech to obtain a plurality of speech signals;
combining the plurality of speech signals to form a plurality of speech windows;
inputting the plurality of speech windows into a trained deep neural network wake-up model for processing to obtain, for each speech window, a column of posterior probabilities over phoneme label types;
combining the posterior probabilities of the plurality of speech windows to form a decoding matrix;
and calculating the score of a specified keyword in the decoding matrix, and if the score exceeds a preset threshold, determining that the decoding matrix contains the specified keyword.
2. The low-computation-power voice recognition method of claim 1, wherein the step of performing framing and windowing on the input speech to obtain the plurality of speech signals comprises:
acquiring input speech;
and windowing and framing the speech with a preset window length and a preset frame shift to obtain a plurality of speech signals.
3. The low-computation-power voice recognition method of claim 1, wherein the step of combining the plurality of speech signals to form a plurality of speech windows comprises:
and combining the plurality of speech signals in sequence, taking the first speech signal as the initial signal and the preset window length as the window length, advancing by a preset window step, to form a plurality of speech windows.
4. The low-computation-power voice recognition method of claim 1, wherein the deep neural network wake-up model comprises a feature input layer, a hidden layer, an output layer and an attention layer, wherein the hidden layer comprises at least one fully-connected layer, and each fully-connected layer is followed by a nonlinear activation function.
5. The method of claim 4, wherein the step of inputting the plurality of speech windows into a trained deep neural network wake-up model for processing to obtain, for each speech window, a column of posterior probabilities over phoneme label types comprises:
inputting the plurality of speech windows into the trained deep neural network wake-up model in a preset order for processing, wherein the processing includes:
taking the speech window currently being processed as the current speech window;
taking a preset number of the speech windows closest to the current speech window as historical speech windows;
splicing the current speech window and the historical speech windows at the fully-connected layer to generate a spliced speech window;
multiplying the spliced speech window by a preset weight matrix and summing along the window direction of the spliced speech window to obtain a summation result;
and inputting the summation result into the fully-connected layer to obtain the column of posterior probabilities over phoneme label types for the current speech window.
6. The low-computation-power voice recognition method of claim 1, wherein the training step of the deep neural network wake-up model comprises:
training an initial deep-neural-network-based model on a general corpus under the connectionist temporal classification (CTC) training criterion until the network converges, to obtain a preliminary convergence model;
and training the preliminary convergence model a second time on the general corpus together with a specific wake-up word corpus until it converges again, to obtain the deep neural network wake-up model.
7. The low-computation-power voice recognition method of claim 1, wherein the step of calculating the score of the specified keyword in the decoding matrix comprises:
acquiring the length of the decoding matrix;
and when the length of the decoding matrix exceeds a preset length, calculating the score of the specified keyword in the decoding matrix at a preset stride.
8. A low-computation-power voice recognition apparatus, comprising:
a framing and windowing module, configured to perform framing and windowing on input speech to obtain a plurality of speech signals;
a speech window generating module, configured to combine the plurality of speech signals to form a plurality of speech windows;
a posterior probability calculation module, configured to input the plurality of speech windows into a trained deep neural network wake-up model for processing to obtain, for each speech window, a column of posterior probabilities over phoneme label types;
a decoding matrix generating module, configured to combine the posterior probabilities of the plurality of speech windows to form a decoding matrix;
and a determination module, configured to calculate the score of the specified keyword in the decoding matrix, and if the score exceeds a preset threshold, determine that the decoding matrix contains the specified keyword.
9. A computer device, comprising a memory and a processor, wherein the memory stores a computer program and the processor, when executing the computer program, implements the steps of the low-computation-power voice recognition method of any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of the low-computation-power voice recognition method of any one of claims 1 to 7.
CN202211014435.2A 2022-08-23 2022-08-23 Low-computation-power voice recognition method, device, equipment and medium Active CN115101063B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211014435.2A CN115101063B (en) 2022-08-23 2022-08-23 Low-computation-power voice recognition method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211014435.2A CN115101063B (en) 2022-08-23 2022-08-23 Low-computation-power voice recognition method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN115101063A 2022-09-23
CN115101063B (en) 2023-01-06

Family

ID=83301766

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211014435.2A Active CN115101063B (en) 2022-08-23 2022-08-23 Low-computation-power voice recognition method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN115101063B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117275484A (en) * 2023-11-17 2023-12-22 深圳市友杰智新科技有限公司 Command word recognition method, device, equipment and medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5608841A (en) * 1992-06-03 1997-03-04 Matsushita Electric Industrial Co., Ltd. Method and apparatus for pattern recognition employing the hidden Markov model
US20020120447A1 (en) * 2000-11-07 2002-08-29 Charlesworth Jason Peter Andrew Speech processing system
WO2017114201A1 (en) * 2015-12-31 2017-07-06 阿里巴巴集团控股有限公司 Method and device for executing setting operation
CN110619871A (en) * 2018-06-20 2019-12-27 阿里巴巴集团控股有限公司 Voice wake-up detection method, device, equipment and storage medium
CN110767231A (en) * 2019-09-19 2020-02-07 平安科技(深圳)有限公司 Voice control equipment awakening word identification method and device based on time delay neural network
CN110782882A (en) * 2019-11-04 2020-02-11 科大讯飞股份有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113823273A (en) * 2021-07-23 2021-12-21 腾讯科技(深圳)有限公司 Audio signal processing method, audio signal processing device, electronic equipment and storage medium
CN113963688A (en) * 2021-12-23 2022-01-21 深圳市友杰智新科技有限公司 Training method of voice awakening model, awakening word detection method and related equipment
CN114120979A (en) * 2022-01-25 2022-03-01 荣耀终端有限公司 Optimization method, training method, device and medium of voice recognition model
CN114360500A (en) * 2021-09-14 2022-04-15 腾讯科技(深圳)有限公司 Speech recognition method and device, electronic equipment and storage medium
CN114783438A (en) * 2022-06-17 2022-07-22 深圳市友杰智新科技有限公司 Adaptive decoding method, apparatus, computer device and storage medium

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5608841A (en) * 1992-06-03 1997-03-04 Matsushita Electric Industrial Co., Ltd. Method and apparatus for pattern recognition employing the hidden Markov model
US20020120447A1 (en) * 2000-11-07 2002-08-29 Charlesworth Jason Peter Andrew Speech processing system
WO2017114201A1 (en) * 2015-12-31 2017-07-06 阿里巴巴集团控股有限公司 Method and device for executing setting operation
CN110619871A (en) * 2018-06-20 2019-12-27 阿里巴巴集团控股有限公司 Voice wake-up detection method, device, equipment and storage medium
CN110767231A (en) * 2019-09-19 2020-02-07 平安科技(深圳)有限公司 Voice control equipment awakening word identification method and device based on time delay neural network
CN110782882A (en) * 2019-11-04 2020-02-11 科大讯飞股份有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113823273A (en) * 2021-07-23 2021-12-21 腾讯科技(深圳)有限公司 Audio signal processing method, audio signal processing device, electronic equipment and storage medium
CN114360500A (en) * 2021-09-14 2022-04-15 腾讯科技(深圳)有限公司 Speech recognition method and device, electronic equipment and storage medium
CN113963688A (en) * 2021-12-23 2022-01-21 深圳市友杰智新科技有限公司 Training method of voice awakening model, awakening word detection method and related equipment
CN114120979A (en) * 2022-01-25 2022-03-01 荣耀终端有限公司 Optimization method, training method, device and medium of voice recognition model
CN114783438A (en) * 2022-06-17 2022-07-22 深圳市友杰智新科技有限公司 Adaptive decoding method, apparatus, computer device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘柏基: "Research on the Application of End-to-End Speech Recognition Based on an Attention Mechanism", China Master's Theses Full-text Database (Signal and Information Processing)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117275484A (en) * 2023-11-17 2023-12-22 深圳市友杰智新科技有限公司 Command word recognition method, device, equipment and medium
CN117275484B (en) * 2023-11-17 2024-02-20 深圳市友杰智新科技有限公司 Command word recognition method, device, equipment and medium

Also Published As

Publication number Publication date
CN115101063B (en) 2023-01-06

Similar Documents

Publication Publication Date Title
US10902845B2 (en) System and methods for adapting neural network acoustic models
US11217225B2 (en) Multi-type acoustic feature integration method and system based on deep neural networks
US20240346285A1 (en) Feedforward generative neural networks
US11080589B2 (en) Sequence processing using online attention
CN111402891B (en) Speech recognition method, device, equipment and storage medium
WO2017135334A1 (en) Method and system for training language models to reduce recognition errors
CN111833845A (en) Multi-language speech recognition model training method, device, equipment and storage medium
CN111462756B (en) Voiceprint recognition method and device, electronic equipment and storage medium
CN106940998A (en) A kind of execution method and device of setting operation
US10789942B2 (en) Word embedding system
CN110767231A (en) Voice control equipment awakening word identification method and device based on time delay neural network
EP3979098A1 (en) Data processing method and apparatus, storage medium, and electronic apparatus
US11961515B2 (en) Contrastive Siamese network for semi-supervised speech recognition
KR20220164559A (en) Attention Neural Networks with Sparse Attention Mechanisms
CN113254613B (en) Dialogue question-answering method, device, equipment and storage medium
CN113646835A (en) Joint automatic speech recognition and speaker binarization
CN113506575A (en) Processing method and device for streaming voice recognition and computer equipment
CN111209380A (en) Control method and device for conversation robot, computer device and storage medium
US20210073645A1 (en) Learning apparatus and method, and program
CN115101063B (en) Low-computation-power voice recognition method, device, equipment and medium
CN114492758A (en) Training neural networks using layer-by-layer losses
CN113496282B (en) Model training method and device
WO2022121188A1 (en) Keyword detection method and apparatus, device and storage medium
CN110164431A (en) A kind of audio data processing method and device, storage medium
CN113870844B (en) Training method and device for voice recognition model and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Low computational power speech recognition methods, devices, equipment, and media

Granted publication date: 20230106

Pledgee: Shenzhen Shunshui Incubation Management Co.,Ltd.

Pledgor: SHENZHEN YOUJIE ZHIXIN TECHNOLOGY Co.,Ltd.

Registration number: Y2024980029366