CN111667819A - CRNN-based speech recognition method, system, storage medium and electronic equipment - Google Patents

CRNN-based speech recognition method, system, storage medium and electronic equipment

Info

Publication number
CN111667819A
Authority
CN
China
Prior art keywords
layer
convolution
data
state value
rnn
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910177117.XA
Other languages
Chinese (zh)
Other versions
CN111667819B (en)
Inventor
仇璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201910177117.XA priority Critical patent/CN111667819B/en
Publication of CN111667819A publication Critical patent/CN111667819A/en
Application granted granted Critical
Publication of CN111667819B publication Critical patent/CN111667819B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/26 - Speech to text systems


Abstract

The invention discloses a CRNN-based speech recognition method, system, storage medium and electronic device. The recognition method comprises the following steps: acquiring processed voice data, namely the voice data within a preset filter-width range starting from the current-frame position of the preprocessed voice data to which the filter bank points; inputting the processed voice data into a convolutional layer, which outputs one frame of convolution output frame data; updating the orientation of the filter bank; judging whether voice data of the preset feature length have been fully acquired by the filter bank, returning to the acquisition step if not, and otherwise inputting all the obtained convolution output frames into an RNN layer to obtain an output state value; and inputting the output state value into the fully-connected layer to obtain the speech recognition result. Compared with the conventional method, the amount of data processed is greatly reduced, so the computation speed is improved, the memory footprint is reduced, and real-time speech recognition is achieved.

Description

CRNN-based speech recognition method, system, storage medium and electronic equipment
Technical Field
The present invention relates to the field of speech recognition, and in particular to a speech recognition method and system based on a CRNN (convolutional recurrent neural network), a storage medium, and an electronic device.
Background
With the development of computer technology, electronic products are becoming increasingly intelligent. Since speech is the most common mode of human interaction, intelligent speech recognition and voice wake-up on end devices have become a hot spot in this trend. Compared with traditional speech recognition algorithms, deep learning algorithms offer higher accuracy, stronger adaptability and better generality, and have become the mainstream of speech recognition technology in speech recognition and voice wake-up systems. The CRNN, with its excellent recognition rate, is one of the deep neural networks commonly used for speech recognition.
A CRNN comprises a convolutional layer, an RNN (recurrent neural network) layer and a fully-connected layer. During speech recognition it receives preprocessed voice data of a fixed length, which pass through the convolutional layer, the RNN layer and the fully-connected layer in sequence before a recognition result is output, so the amount of computation is large. In a voice wake-up environment in particular, a microphone array is generally used to capture and preprocess the voice data in order to suppress interference (echo, reverberation and interfering sound sources) introduced from all directions by the surrounding environment and the propagation medium; the preprocessed voice data are then passed to the CRNN for speech recognition.
Disclosure of Invention
The technical problem to be solved by the present invention is to overcome the defect in the prior art that speech processing with a CRNN on end devices requires a large amount of computation, making real-time speech recognition difficult, by providing a CRNN-based speech recognition method, system, storage medium and electronic device.
The invention solves this technical problem through the following technical solution:
the embodiment of the invention provides a speech recognition method based on CRNN, wherein the CRNN comprises a convolution layer, an RNN layer and a full connection layer, the convolution layer comprises a filter bank, and the speech recognition method is characterized by comprising the following steps:
acquiring processed voice data, wherein the processed voice data is voice data in a preset filtering width value range from the current frame position of the voice data after the filter bank points to the pre-processing;
inputting the processed voice data into the convolution layer, and outputting the convolution layer to obtain a frame of convolution output frame data;
updating the orientation of the filter bank;
judging whether the voice data with the preset characteristic length of the filter bank is acquired completely, if not, returning to the step of acquiring and processing the voice data, if so, inputting the obtained convolution output frame data of all the frames to the RNN layer, and outputting the convolution output frame data of all the frames to obtain an output state value by the RNN layer;
and inputting the output state value into the full-connection layer, and outputting the voice recognition result corresponding to the voice data with the preset characteristic length by the full-connection layer.
Preferably, the step of inputting all the obtained convolution output frames into the RNN layer to obtain the output state value comprises:
inputting the convolution output frame data of the current frame, together with the previously updated intermediate state value of the RNN layer, into the RNN layer, which outputs an updated intermediate state value;
judging whether all the convolution output frames have been input; if not, taking the convolution output frame data of the next frame as that of the current frame, setting the updated intermediate state value as the previously updated intermediate state value of the RNN layer, and returning to the previous step; if so, taking the final last-layer state value of the RNN layer as the output state value.
Preferably, the step of inputting the convolution output frame data of the current frame and the previously updated intermediate state value of the RNN layer into the RNN layer is preceded by:
initializing the state of the RNN layer to an initial state value.
Another embodiment of the present invention provides a CRNN-based speech recognition system, wherein the CRNN comprises a convolutional layer, an RNN layer and a fully-connected layer, the convolutional layer comprises a filter bank, and the speech recognition system comprises an acquisition module, a convolution module, an update module, a recognition module and a fully-connected module;
the acquisition module is configured to acquire processed voice data, wherein the processed voice data are the voice data within a preset filter-width range starting from the current-frame position of the preprocessed voice data to which the filter bank points;
the convolution module is configured to input the processed voice data into the convolutional layer, which outputs one frame of convolution output frame data;
the update module is configured to update the orientation of the filter bank;
the recognition module is configured to judge whether voice data of the preset feature length have been fully acquired by the filter bank; if not, the acquisition module is invoked; if so, all the obtained convolution output frames are input to the RNN layer, which outputs an output state value;
and the fully-connected module is configured to input the output state value to the fully-connected layer, which outputs the speech recognition result corresponding to the voice data of the preset feature length.
Preferably, the recognition module is further configured to input the convolution output frame data of the current frame, together with the previously updated intermediate state value of the RNN layer, into the RNN layer, which outputs an updated intermediate state value;
to judge whether all the convolution output frames have been input; if not, to take the convolution output frame data of the next frame as that of the current frame, set the updated intermediate state value as the previously updated intermediate state value of the RNN layer, and return to inputting the convolution output frame data of the current frame and the previously updated intermediate state value into the RNN layer; and if so, to take the final last-layer state value of the RNN layer as the output state value.
Preferably, the recognition module comprises an initialization unit configured to initialize the state of the RNN layer to an initial state value.
Another embodiment of the present invention provides an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the CRNN-based speech recognition method described above when executing the computer program.
Another embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, carries out the steps of the CRNN-based speech recognition method described above.
The positive effects of the invention are as follows:
each cycle, only the voice data within the preset filter-width range starting from the current-frame position of the preprocessed voice data to which the filter bank points are acquired as the processed voice data and input into the convolutional layer to produce one frame of convolution output. Compared with the conventional approach of reading all current frames of the preprocessed voice data into the convolutional layer in every cycle, the amount of data processed is greatly reduced, so the computation speed is improved, the memory footprint is reduced, and real-time speech recognition on end devices becomes feasible.
Drawings
Fig. 1 is a flowchart of a CRNN-based speech recognition method according to embodiment 1 of the present invention.
Fig. 2 is a flow chart of a conventional CRNN according to embodiment 1 of the present invention.
Fig. 3 is a schematic flowchart of a cycle of the incremental CRNN according to embodiment 1 of the present invention.
Fig. 4 is a flowchart of step 105 of the CRNN-based speech recognition method according to embodiment 2 of the present invention.
Fig. 5 is a schematic block diagram of a CRNN-based speech recognition system according to embodiment 3 of the present invention.
Fig. 6 is a schematic block diagram of a CRNN-based speech recognition system according to embodiment 4 of the present invention.
Fig. 7 is a schematic structural diagram of an electronic device according to embodiment 5 of the present invention.
Detailed Description
The invention is further illustrated by the following examples, which are not intended to limit the scope of the invention.
Example 1
This embodiment provides a CRNN-based speech recognition method, where the CRNN comprises a convolutional layer, an RNN layer and a fully-connected layer, and the convolutional layer comprises a filter bank. As shown in Fig. 1, the speech recognition method comprises the following steps:
Step 101: acquire processed voice data, namely the voice data within a preset filter-width range starting from the current-frame position of the preprocessed voice data to which the filter bank points.
Step 102: input the processed voice data into the convolutional layer, which outputs one frame of convolution output frame data.
Step 103: update the orientation of the filter bank.
Step 104: judge whether voice data of the preset feature length have been fully acquired by the filter bank; if not, return to step 101; if so, execute step 105.
Step 105: input all the obtained convolution output frames into the RNN layer, which outputs an output state value.
Step 106: input the output state value into the fully-connected layer, which outputs the speech recognition result corresponding to the voice data of the preset feature length.
In the scenario chosen for this embodiment, on an end device a microphone array is typically used to capture and preprocess voice data in order to suppress interference (echo, reverberation and interfering sound sources) introduced from all directions by the surrounding environment and the propagation medium; the preprocessed voice data are then passed to the speech recognition module, i.e. the CRNN recognition network, for speech recognition of the preselected beam. In practice, the method of this embodiment can be used to recognize dynamically updated real-time speech data and is also suitable for speech recognition of historical data; no specific limitation is imposed here.
The CRNN recognition network is now described. In general, the input to a CRNN is an MFCC (Mel-frequency cepstral coefficient) feature data stream with a fixed number of frames, i.e. the preprocessed voice data, and the recognition result is obtained through multiple cycles until the iteration ends. During real-time recognition, the speech feature streams input in two successive cycles usually contain partially overlapping data. In each cycle, speech feature data are input and convolved by the convolutional layer into a fixed number of output frames; if the fixed number is N frames, this is called full convolution. The N frames are then input in sequence to the RNN layer to update its state value, which is called full RNN, and the last-layer state value of the RNN layer is taken as the output state value and input to the fully-connected layer to compute the output result.
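For contrast, this conventional full-convolution and full-RNN cycle can be sketched as follows (same assumed layer callables as above; note that every cycle re-processes the entire feature window):

    def full_crnn_cycle(features, conv_layer, rnn_step, fc_layer,
                        filter_width=3, stride=2):
        """Conventional CRNN cycle sketch: full convolution over ALL frames of this
        cycle's window, then a full RNN pass over every convolution output frame."""
        conv_frames = [conv_layer(features[p:p + filter_width])
                       for p in range(0, len(features) - filter_width + 1, stride)]
        state = None
        for frame in conv_frames:      # full RNN: update the state frame by frame
            state = rnn_step(frame, state)
        return fc_layer(state)         # last-layer state value -> fully-connected layer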
Assume here that the CRNN input is 7 frames of feature data, the width of the convolutional layer's filter bank is 3 frames, the convolution stride in the time direction is 2 frames, and there is a single convolutional layer.
In general, as shown in Fig. 2, consider the first cycle: the speech feature data 110 of this cycle, frames i1-i7, are input to the convolutional layer 111. With the number of convolution frames, i.e. the filter-bank width, set to 3 frames, the filter bank slides as indicated by arrow 112 and stops at frame i7 after the convolution, producing 3 frames of convolution output data o1-o3. These are input frame by frame to the RNN layer 113 to update its state (state value); after o3 has updated the state of the RNN layer 113, the final state of the RNN layer 113 is output as the output state value to the fully-connected layer 114 for computation.
Assume the speech feature data 110 advance by two frames after each cycle, so in the second cycle they are updated to i3-i9. In the processing flow of a conventional CRNN, the second cycle would process i3-i9 of the speech feature data 110; comparing the second cycle with the first, frames i3-i7 are processed twice, so a large number of repeated operations are performed. The speech recognition method in this embodiment instead starts from the current position i7 of the filter bank and takes only the next three frames i7-i9 rather than recomputing all 7 frames i3-i9. As shown in Fig. 3, the filter bank advances in the direction of arrow 115 and stops at frame i9 after the convolution: the three frames i7-i9 are input to the convolutional layer 116, which produces one frame o4. The orientation of the filter bank is then updated, a new three frames of data are read from the updated orientation and input to the convolutional layer 116, which outputs one new frame of convolution output data, and so on, until the voice data of the preset feature length have been acquired. This is called incremental convolution, and the corresponding CRNN is defined as the incremental CRNN.
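Using the IncrementalCRNN sketch above with toy stand-in layers (all shapes, seeds and layer definitions here are illustrative assumptions), the two cycles of Figs. 2 and 3 play out as follows:

    import numpy as np

    rng = np.random.default_rng(0)
    kernel = rng.standard_normal((3, 4))       # filter bank: width 3 frames, feature dim 4

    def conv_layer(window):                    # one window -> one convolution output frame
        return float(np.tanh((kernel * window).sum()))

    def rnn_step(frame, state):                # toy recurrent update
        return float(np.tanh(frame + 0.5 * (0.0 if state is None else state)))

    fc_layer = lambda state: state             # stand-in for the real fully-connected layer

    crnn = IncrementalCRNN(conv_layer, rnn_step, fc_layer)

    feats = rng.standard_normal((7, 4))        # first cycle: frames i1-i7
    crnn.process(feats)                        # convolves i1-i3, i3-i5, i5-i7 -> o1, o2, o3
    feats = np.vstack([feats, rng.standard_normal((2, 4))])   # two new frames i8, i9 arrive
    crnn.process(feats)                        # convolves ONLY i7-i9 -> o4
    print(len(crnn.conv_frames))               # -> 4: no window was recomputed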
In this speech recognition method, each cycle acquires as processed voice data only the voice data within the preset filter-width range starting from the current-frame position of the preprocessed voice data to which the filter bank points. Compared with the conventional approach of reading all current frames of the preprocessed voice data into the convolutional layer in every cycle, the amount of data processed is greatly reduced, so the computation speed is improved, the memory footprint is reduced, and real-time speech recognition on end devices becomes feasible.
Example 2
This embodiment provides a CRNN-based speech recognition method. Compared with embodiment 1, the difference is that, as shown in Fig. 4, step 105 comprises:
Step 1051: initialize the state of the RNN layer to an initial state value.
Step 1052: input the convolution output frame data of the current frame, together with the previously updated intermediate state value of the RNN layer, into the RNN layer, which outputs an updated intermediate state value.
Step 1053: judge whether all the convolution output frames have been input; if not, execute step 1054; if so, execute step 1055.
Here, the convolution output frames of all frame numbers are the frames output by the convolutional layer for the voice data of the preset feature length of the filter bank.
Step 1054: take the convolution output frame data of the next frame as that of the current frame, set the updated intermediate state value as the previously updated intermediate state value of the RNN layer, and return to step 1052.
Step 1055: take the intermediate state value of the last layer as the output state value.
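In terms of the sketch from embodiment 1, steps 1051-1055 are the state-threading loop of the RNN layer (rnn_step and the list of convolution output frames are the assumed names from above):

    state = None                         # step 1051: initialize to the initial state value
    for frame in conv_frames:            # step 1052: current frame + previous intermediate state
        state = rnn_step(frame, state)   # updated intermediate state value
        # steps 1053-1054: the loop itself advances to the next convolution output frame
    output_state = state                 # step 1055: last-layer state value -> output state value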
In this way, the voice data of the preset feature length in the preprocessed voice data are input to the convolutional layer in sequence, in units of frames determined by the preset convolution stride and preset filter width, to obtain the corresponding multi-frame convolution output data, and these frames are input to the RNN layer in sequence to finally obtain the output state value.
Typically, the preprocessed voice data contain multiple groups of voice data of the preset feature length. Whether all the voice data have been processed is judged; if not, the flow returns to step 101 and recognition of the subsequent voice data continues in a loop until all the voice data have been recognized.
In the first cycle, the loop steps of the conventional CRNN can be adopted, because the first cycle processes entirely fresh data from the initial state and involves no large amount of repeated operations; the incremental CRNN can then be applied from the second cycle onward.
A further optimization is to use the incremental CRNN in the first cycle as well: only the last two frames (i6, i7) of the preprocessed feature data in the first cycle are real preprocessed feature data, while i1-i5 are initialized to 0. With a filter-bank width of 3 frames, the two frames i6 and i7 do not satisfy the required number of convolution frames, so no convolution is performed in this cycle. In the second cycle, under the assumption that the speech feature data advance by two frames per cycle, the data are updated to (i4-i7); these 4 frames satisfy the convolution-frame condition and one convolution can be computed. Thus computation is only incurred from the second cycle, the computation of the whole neural network in the first cycle drops to 0, and the total amount of computation is further reduced.
There are two ways to deliver the convolutional layer output to the RNN layer:
1. Continuing the previous example, if the conventional approach is followed in the second cycle of the incremental CRNN, o2, o3 and o4 can be output frame by frame to the RNN layer. This is consistent with the conventional CRNN logic and, at the same time, demonstrates that the incremental convolution results agree with the full convolution results, i.e. the incremental convolution is correct.
2. Only the new frame o4 is output to the RNN layer.
The second way is possible because the RNN has the following property:
the value of the state at each time is updated by the convolution output frame data of the state at the previous time and the current time, and the updating at each time is the same, so that the recalculation of o2 and o3 can be discarded, and an updating calculation is directly carried out by using the latest state value of the last cycle and the input o4 at this time. This part we call the incremental RNN, then outputs the last state of the RNN layer to the fully-connected layer computation.
With the second way, the computation of the convolutional layer in the current and subsequent cycles is 1/3 of that of the conventional method, and the computation of the RNN layer is likewise 1/3 of the original. The amount of computation is further reduced and the computational performance is further optimized.
For different dimensionalities m of the preprocessed voice data, numbers of updated frames u, filter widths w, convolution strides s and numbers of convolutional layers p, the computational savings differ. In the case of s < u (ensuring that at least one convolution can be performed in each cycle) and a single convolutional layer, the computation of the convolution and RNN parts of the incremental CRNN does not exceed (u + s)/(m - w) of that of the conventional CRNN, and when s >= u the computation is smaller still. In general, the number of input feature frames of a conventional CRNN, i.e. the dimensionality of the preprocessed voice data, is in the hundreds or more, and the number of feature frames updated per cycle is very small relative to that dimensionality, so the computational savings of the incremental CRNN are considerable. When the number of convolutional layers is greater than 1, each layer can likewise convolve only the updated portion of its input, so the savings apply to every layer.
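As a worked example with illustrative numbers of our own choosing: take m = 200 preprocessed frames, u = 4 new frames per cycle, filter width w = 3 and stride s = 2, so that s < u holds. The bound gives (u + s)/(m - w) = 6/197, i.e. roughly 3% of the conventional per-cycle convolution and RNN cost; when s >= u the cost per cycle is lower still.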
This speech recognition method reduces the per-cycle computation by improving the computation logic of the CRNN in real-time recognition. The convolutional layer and the RNN layer adopt incremental convolution and incremental RNN respectively to reduce their computation, and the result is called the incremental CRNN. The small computation of the incremental CRNN relaxes the limits on the parameter size of the CRNN model, so a larger model with a higher recognition rate can be used in real-time recognition; the small computational footprint also relaxes the limits on the number of assignable and selectable beams of the microphone array, reducing the impact on localization accuracy. Likewise, the incremental CRNN lightens the load on the end device, so the whole voice wake-up system can run smoothly.
Example 3
This embodiment provides a CRNN-based speech recognition system, where the CRNN comprises a convolutional layer, an RNN layer and a fully-connected layer, and the convolutional layer comprises a filter bank. As shown in Fig. 5, the speech recognition system comprises an acquisition module 201, a convolution module 202, an update module 203, a recognition module 204 and a fully-connected module 205.
The acquisition module 201 is configured to acquire processed voice data, namely the voice data within a preset filter-width range starting from the current-frame position of the preprocessed voice data to which the filter bank points.
The convolution module 202 is configured to input the processed voice data into the convolutional layer, which outputs one frame of convolution output frame data.
The update module 203 is configured to update the orientation of the filter bank.
The recognition module 204 is configured to judge whether voice data of the preset feature length have been fully acquired by the filter bank; if not, the acquisition module 201 is invoked; if so, all the obtained convolution output frames are input to the RNN layer, which outputs an output state value.
The fully-connected module 205 is configured to input the output state value to the fully-connected layer, which outputs the speech recognition result corresponding to the voice data of the preset feature length.
In the scenario chosen for this embodiment, on an end device a microphone array is typically used to capture and preprocess voice data in order to suppress interference (echo, reverberation and interfering sound sources) introduced from all directions by the surrounding environment and the propagation medium; the preprocessed voice data are then passed to the speech recognition module, i.e. the CRNN recognition network, for speech recognition of the preselected beam. In practice, the system of this embodiment can be used to recognize dynamically updated real-time speech data and is also suitable for speech recognition of historical data; no specific limitation is imposed here.
The CRNN recognition network is now described. In general, the input to a CRNN is an MFCC (Mel-frequency cepstral coefficient) feature data stream with a fixed number of frames, i.e. the preprocessed voice data, and the recognition result is obtained through multiple cycles until the iteration ends. During real-time recognition, the speech feature streams input in two successive cycles usually contain partially overlapping data. In each cycle, speech feature data are input and convolved by the convolutional layer into a fixed number of output frames; if the fixed number is N frames, this is called full convolution. The N frames are then input in sequence to the RNN layer to update its state value, which is called full RNN, and the last-layer state value of the RNN layer is input to the fully-connected layer to compute the output result.
Assume here that the CRNN input is 7 frames of feature data, the width of the convolutional layer's filter bank is 3 frames, the convolution stride in the time direction is 2 frames, and there is a single convolutional layer.
In general, as shown in Fig. 2, consider the first cycle: the speech feature data 110 of this cycle, frames i1-i7, are input to the convolutional layer 111. With the number of convolution frames, i.e. the filter-bank width, set to 3 frames, the filter bank slides as indicated by arrow 112 and stops at frame i7 after the convolution, producing 3 frames of convolution output data o1-o3. These are input frame by frame to the RNN layer 113 to update its state (state value); after o3 has updated the state of the RNN layer 113, the final state value of the RNN layer 113 is output to the fully-connected layer 114 for computation.
Assume the speech feature data 110 advance by two frames after each cycle, so in the second cycle they are updated to i3-i9. In the processing flow of a conventional CRNN, the second cycle would process i3-i9 of the speech feature data 110; comparing the second cycle with the first, frames i3-i7 are processed twice, so a large number of repeated operations are performed. The speech recognition method in this embodiment instead starts from the current position i7 of the filter bank and takes only the next three frames i7-i9 rather than recomputing all 7 frames i3-i9. As shown in Fig. 3, the filter bank advances in the direction of arrow 115 and stops at frame i9 after the convolution: the three frames i7-i9 are input to the convolutional layer 116, which produces one frame o4. The orientation of the filter bank is then updated, a new three frames of data are read from the updated orientation and input to the convolutional layer 116, which outputs one new frame of convolution output data, and so on, until the voice data of the preset feature length have been acquired. This is called incremental convolution, and the corresponding CRNN is defined as the incremental CRNN.
In this speech recognition system, each cycle acquires as processed voice data only the voice data within the preset filter-width range starting from the current-frame position of the preprocessed voice data to which the filter bank points. Compared with the conventional approach of reading all current frames of the preprocessed voice data into the convolutional layer in every cycle, the amount of data processed is greatly reduced, so the computation speed is improved, the memory footprint is reduced, and real-time speech recognition on end devices becomes feasible.
Example 4
This embodiment provides a CRNN-based speech recognition system. Compared with embodiment 3, the difference is that, as shown in Fig. 6, the speech recognition system further comprises an initialization unit 2041.
More specifically, the recognition module 204 is further configured to input the convolution output frame data of the current frame, together with the previously updated intermediate state value of the RNN layer, into the RNN layer, which outputs an updated intermediate state value; and to judge whether all the convolution output frames have been input; if not, to take the convolution output frame data of the next frame as that of the current frame, set the updated intermediate state value as the previously updated intermediate state value of the RNN layer, and return to inputting the convolution output frame data of the current frame and the previously updated intermediate state value into the RNN layer; if so, to take the intermediate state value of the last layer as the output state value.
The initialization unit 2041 is configured to initialize the state of the RNN layer to an initial state value. Typically, the voice data contain multiple groups of voice data of the preset feature length; whether all the voice data have been processed is judged, and if not, the acquisition module is invoked again and recognition of the subsequent voice data continues in a loop until all the voice data have been recognized.
In the first cycle, the loop steps of the conventional CRNN can be used, because the first cycle involves no large amount of repeated operations; the incremental CRNN computation can be applied from the second cycle onward.
A further optimization is to use the incremental CRNN in the first cycle as well: only the last two frames (i6, i7) of the preprocessed feature data in the first cycle are real preprocessed feature data, while i1-i5 are initialized to 0. With a filter-bank width of 3 frames, the two frames i6 and i7 do not satisfy the required number of convolution frames, so no convolution is performed in this cycle. In the second cycle, under the assumption that the speech feature data advance by two frames per cycle, the data are updated to (i4-i7); these 4 frames satisfy the convolution-frame condition and one convolution can be computed. Thus computation is only incurred from the second cycle, the computation of the whole neural network in the first cycle drops to 0, and the total amount of computation is further reduced.
There are two ways to deliver the convolutional layer output to the RNN layer:
1. Continuing the previous example, if the conventional approach is followed in the second cycle of the incremental CRNN, o2, o3 and o4 can be output frame by frame to the RNN layer. This is consistent with the conventional CRNN logic and, at the same time, demonstrates that the incremental convolution results agree with the full convolution results, i.e. the incremental convolution is correct.
2. Only the new frame o4 is output to the RNN layer.
The second way is possible because the RNN has the following property:
the state value at each time step is updated from the state of the previous time step and the convolution output frame of the current time step, and the update rule is identical at every step. The recomputation of o2 and o3 can therefore be dropped, and a single update can be performed directly using the latest state value of the previous cycle and the current input o4. This part is called the incremental RNN; the final state of the RNN layer is then output to the fully-connected layer for computation.
With the second way, the computation of the convolutional layer in the current and subsequent cycles is 1/3 of the original, and the computation of the RNN layer is likewise 1/3 of the original. The amount of computation is further reduced and the computational performance is further optimized.
For different dimensionalities m of the preprocessed voice data, numbers of updated frames u, filter widths w, convolution strides s and numbers of convolutional layers p, the computational savings differ. In the case of s < u (ensuring that at least one convolution can be performed in each cycle) and a single convolutional layer, the computation of the convolution and RNN parts of the incremental CRNN does not exceed (u + s)/(m - w) of that of the conventional CRNN, and when s >= u the computation is smaller still. In general, the number of input feature frames of a conventional CRNN, i.e. the dimensionality of the preprocessed voice data, is in the hundreds or more, and the number of feature frames updated per cycle is very small relative to that dimensionality, so the computational savings of the incremental CRNN are considerable. When the number of convolutional layers is greater than 1, each layer can likewise convolve only the updated portion of its input, so the savings apply to every layer.
This speech recognition system reduces the per-cycle computation by improving the computation logic of the CRNN in real-time recognition. The convolutional layer and the RNN layer adopt incremental convolution and incremental RNN respectively to reduce their computation, and the result is called the incremental CRNN. The small computation of the incremental CRNN relaxes the limits on the parameter size of the CRNN model, so a larger model with a higher recognition rate can be used in real-time recognition; the small computational footprint also relaxes the limits on the number of assignable and selectable beams of the microphone array, reducing the impact on localization accuracy. Likewise, the incremental CRNN lightens the load on the end device, so the whole voice wake-up system can run smoothly.
Example 5
Fig. 7 is a schematic structural diagram of an electronic device according to embodiment 5 of the present invention. The electronic device comprises a memory, a processor and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the CRNN-based speech recognition method of embodiment 1 is implemented. The electronic device 30 shown in Fig. 7 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.
As shown in Fig. 7, the electronic device 30 may take the form of a general-purpose computing device, for example a server device. The components of the electronic device 30 may include, but are not limited to: at least one processor 31, at least one memory 32, and a bus 33 connecting the various system components (including the memory 32 and the processor 31).
The bus 33 includes a data bus, an address bus, and a control bus.
The memory 32 may include volatile memory, such as random access memory (RAM) 321 and/or cache memory 322, and may further include read-only memory (ROM) 323.
The memory 32 may also include a program/utility 325 having a set (at least one) of program modules 324, such program modules 324 including, but not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
The processor 31 executes various functional applications and data processing, such as the CRNN-based speech recognition method provided in embodiment 1 of the present invention, by running the computer program stored in the memory 32.
The electronic device 30 may also communicate with one or more external devices 34 (e.g., a keyboard or a pointing device). Such communication may take place through input/output (I/O) interfaces 35. The electronic device 30 may also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) via a network adapter 36. As shown, the network adapter 36 communicates with the other modules of the electronic device 30 via the bus 33. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 30, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, data backup storage systems, etc.
It should be noted that although several units/modules or sub-units/modules of the electronic device are mentioned in the detailed description above, this division is merely exemplary and not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more of the units/modules described above may be embodied in a single unit/module; conversely, the features and functions of one unit/module described above may be further divided among multiple units/modules.
Example 6
The present embodiment provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements the steps of the CRNN-based speech recognition method provided in embodiment 1.
More specific examples of the readable storage medium may include, but are not limited to: a portable disk, a hard disk, random access memory, read-only memory, erasable programmable read-only memory, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In a possible implementation, the present invention can also be implemented as a program product comprising program code which, when the program product runs on a terminal device, causes the terminal device to execute the steps of the CRNN-based speech recognition method of embodiment 1.
The program code for carrying out the invention may be written in any combination of one or more programming languages, and may execute entirely on the user device, partly on the user device as a stand-alone software package, partly on the user device and partly on a remote device, or entirely on a remote device.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that these are examples only and that the scope of the invention is defined by the appended claims. Those skilled in the art may make various changes and modifications to these embodiments without departing from the principle and essence of the invention, and all such changes and modifications fall within the scope of the invention.

Claims (8)

1. A CRNN-based speech recognition method, the CRNN comprising a convolutional layer, an RNN layer and a fully-connected layer, the convolutional layer comprising a filter bank, the speech recognition method comprising:
acquiring processed voice data, wherein the processed voice data are the voice data within a preset filter-width range starting from the current-frame position of the preprocessed voice data to which the filter bank points;
inputting the processed voice data into the convolutional layer, which outputs one frame of convolution output frame data;
updating the orientation of the filter bank;
judging whether voice data of the preset feature length have been fully acquired by the filter bank; if not, returning to the step of acquiring processed voice data; if so, inputting all the obtained convolution output frames into the RNN layer, which outputs an output state value;
and inputting the output state value into the fully-connected layer, which outputs the speech recognition result corresponding to the voice data of the preset feature length.
2. The CRNN-based speech recognition method of claim 1, wherein the step of inputting all the obtained convolution output frames into the RNN layer to obtain the output state value comprises:
inputting the convolution output frame data of the current frame, together with the previously updated intermediate state value of the RNN layer, into the RNN layer, which outputs an updated intermediate state value;
judging whether all the convolution output frames have been input; if not, taking the convolution output frame data of the next frame as that of the current frame, setting the updated intermediate state value as the previously updated intermediate state value of the RNN layer, and returning to the previous step; if so, taking the final last-layer state value of the RNN layer as the output state value.
3. The CRNN-based speech recognition method of claim 2, wherein the step of inputting the convolution output frame data of the current frame and the previously updated intermediate state value of the RNN layer into the RNN layer is preceded by:
initializing the state of the RNN layer to an initial state value.
4. A CRNN-based speech recognition system, the CRNN comprising a convolutional layer, an RNN layer and a fully-connected layer, the convolutional layer comprising a filter bank, wherein the speech recognition system comprises an acquisition module, a convolution module, an update module, a recognition module and a fully-connected module;
the acquisition module is configured to acquire processed voice data, wherein the processed voice data are the voice data within a preset filter-width range starting from the current-frame position of the preprocessed voice data to which the filter bank points;
the convolution module is configured to input the processed voice data into the convolutional layer, which outputs one frame of convolution output frame data;
the update module is configured to update the orientation of the filter bank;
the recognition module is configured to judge whether voice data of the preset feature length have been fully acquired by the filter bank; if not, the acquisition module is invoked; if so, all the obtained convolution output frames are input to the RNN layer, which outputs an output state value;
and the fully-connected module is configured to input the output state value to the fully-connected layer, which outputs the speech recognition result corresponding to the voice data of the preset feature length.
5. The CRNN-based speech recognition system of claim 4, wherein the recognition module is further configured to input the convolution output frame data of the current frame, together with the previously updated intermediate state value of the RNN layer, into the RNN layer, which outputs an updated intermediate state value;
to judge whether all the convolution output frames have been input; if not, to take the convolution output frame data of the next frame as that of the current frame, set the updated intermediate state value as the previously updated intermediate state value of the RNN layer, and return to inputting the convolution output frame data of the current frame and the previously updated intermediate state value into the RNN layer; and if so, to take the final last-layer state value of the RNN layer as the output state value.
6. The CRNN-based speech recognition system of claim 5, wherein the recognition module comprises an initialization unit configured to initialize the state of the RNN layer to an initial state value.
7. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the CRNN-based speech recognition method of any one of claims 1-3 when executing the computer program.
8. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the CRNN-based speech recognition method of any one of claims 1-3.
CN201910177117.XA 2019-03-08 2019-03-08 Voice recognition method, system, storage medium and electronic equipment based on CRNN Active CN111667819B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910177117.XA CN111667819B (en) 2019-03-08 2019-03-08 Voice recognition method, system, storage medium and electronic equipment based on CRNN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910177117.XA CN111667819B (en) 2019-03-08 2019-03-08 Voice recognition method, system, storage medium and electronic equipment based on CRNN

Publications (2)

Publication Number Publication Date
CN111667819A (en) 2020-09-15
CN111667819B CN111667819B (en) 2023-09-01

Family

ID=72382405

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910177117.XA Active CN111667819B (en) 2019-03-08 2019-03-08 Voice recognition method, system, storage medium and electronic equipment based on CRNN

Country Status (1)

Country Link
CN (1) CN111667819B (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1426048A (en) * 2001-12-13 2003-06-25 中国科学院自动化研究所 End detection method based on entropy
CN103995292A (en) * 2014-06-09 2014-08-20 桂林电子科技大学 Transient electromagnetic early signal reconstruction method
CN106448696A (en) * 2016-12-20 2017-02-22 成都启英泰伦科技有限公司 Adaptive high-pass filtering speech noise reduction method based on background noise estimation
US20180182377A1 (en) * 2016-12-28 2018-06-28 Baidu Online Network Technology (Beijing) Co., Ltd Method and device for extracting speech feature based on artificial intelligence
US20180285715A1 (en) * 2017-03-28 2018-10-04 Samsung Electronics Co., Ltd. Convolutional neural network (cnn) processing method and apparatus
CN108009635A (en) * 2017-12-25 2018-05-08 大连理工大学 A kind of depth convolutional calculation model for supporting incremental update
CN108847244A (en) * 2018-08-22 2018-11-20 华东计算技术研究所(中国电子科技集团公司第三十二研究所) Voiceprint recognition method and system based on MFCC and improved BP neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIONG Changzhen; CHE Manqiang; WANG Runling: "Real-time visual tracking algorithm based on sparse convolutional features and correlation filtering", Journal of Computer Applications, vol. 38, no. 08, pages 2176-2179 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112259120A (en) * 2020-10-19 2021-01-22 成都明杰科技有限公司 Single-channel human voice and background voice separation method based on convolution cyclic neural network
CN112259120B (en) * 2020-10-19 2021-06-29 南京硅基智能科技有限公司 Single-channel human voice and background voice separation method based on convolution cyclic neural network
CN112259080A (en) * 2020-10-20 2021-01-22 成都明杰科技有限公司 Speech recognition method based on neural network model

Also Published As

Publication number Publication date
CN111667819B (en) 2023-09-01

Similar Documents

Publication Publication Date Title
US10971142B2 (en) Systems and methods for robust speech recognition using generative adversarial networks
US10032463B1 (en) Speech processing with learned representation of user interaction history
JP7109302B2 (en) Text generation model update method and text generation device
US6374212B2 (en) System and apparatus for recognizing speech
CN109891434A (en) Audio is generated using neural network
CN111066082B (en) Voice recognition system and method
Myer et al. Efficient keyword spotting using time delay neural networks
US20230368807A1 (en) Deep-learning based speech enhancement
JP5060006B2 (en) Automatic relearning of speech recognition systems
JP2020086436A (en) Decoding method in artificial neural network, speech recognition device, and speech recognition system
CN115362497A (en) Sequence-to-sequence speech recognition with delay threshold
CN114612749A (en) Neural network model training method and device, electronic device and medium
CN111667819B (en) Voice recognition method, system, storage medium and electronic equipment based on CRNN
CN112599141A (en) Neural network vocoder training method and device, electronic equipment and storage medium
US20200090657A1 (en) Adaptively recognizing speech using key phrases
CN114678032B (en) Training method, voice conversion method and device and electronic equipment
Coman et al. An incremental turn-taking model for task-oriented dialog systems
CN111128174A (en) Voice information processing method, device, equipment and medium
KR20220066962A (en) Training a neural network to generate structured embeddings
US11501759B1 (en) Method, system for speech recognition, electronic device and storage medium
CN114399992B (en) Voice instruction response method, device and storage medium
CN114512123A (en) Training method and device of VAD model and voice endpoint detection method and device
CN114882151A (en) Method and device for generating virtual image video, equipment, medium and product
WO2021047103A1 (en) Voice recognition method and device
CN113160795B (en) Language feature extraction model training method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant