CN114974228A - Rapid voice recognition method based on hierarchical recognition - Google Patents

Rapid voice recognition method based on hierarchical recognition

Info

Publication number
CN114974228A
CN114974228A
Authority
CN
China
Prior art keywords
shallow
network
speech recognition
networks
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210571189.4A
Other languages
Chinese (zh)
Other versions
CN114974228B (en)
Inventor
吕志强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mingri Dream Beijing Technology Co ltd
Original Assignee
Mingri Dream Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mingri Dream Beijing Technology Co ltd filed Critical Mingri Dream Beijing Technology Co ltd
Priority to CN202210571189.4A priority Critical patent/CN114974228B/en
Publication of CN114974228A publication Critical patent/CN114974228A/en
Application granted granted Critical
Publication of CN114974228B publication Critical patent/CN114974228B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A rapid speech recognition method based on hierarchical recognition routes speech of varying difficulty to different branches and decomposes the model level by level, so that models at different levels handle speech cases of matching difficulty. Through this hierarchical inference scheme, the invention addresses the heavy computing resources required by large-model inference, greatly reduces the overall inference complexity, saves computing resources, and reduces service latency.

Description

Rapid voice recognition method based on hierarchical recognition
Technical Field
The invention relates to the field of speech recognition, and in particular to a rapid speech recognition method based on hierarchical recognition.
Background
With the continuous growth of computing power and the accumulation of data, the accuracy of speech recognition systems has improved markedly; end-to-end modeling methods, represented by CTC and encoder-decoder architectures, make fuller use of massive data and offer stronger modeling capability. In the field of speech recognition, the convolution-augmented Conformer model proposed by Google in 2020 has repeatedly set new accuracy records and has become a standard choice for acoustic modeling. With massive training data, a multi-layer Conformer model has more parameters and has been shown to have stronger modeling capability; in general, a 12- to 24-layer Conformer model grows stronger as the number of layers increases, given sufficient training data. However, as the parameter count grows, model inference during speech recognition requires more computation, energy, latency, and resources, which limits the application of large models in practical scenarios. To make deep networks of this kind usable for speech recognition tasks, methods such as matrix decomposition are commonly used to reduce the number of hidden-layer neurons, the parameter count, and the computation, but these methods usually incur some performance loss. Moreover, the computational complexity of model inference during speech recognition still grows linearly with the number of Conformer layers.
Disclosure of Invention
The present invention is directed to a rapid speech recognition method based on hierarchical recognition, so as to solve the aforementioned problems in the prior art.
To achieve this purpose, the invention adopts the following technical scheme:
A rapid speech recognition method based on hierarchical recognition comprises the following steps:
S1, divide the deep network of the Conformer model: following the order from the bottom layer to the top layer, split the R-layer deep network into a shallow network every M layers, and lead a tap out of the last recognition layer of each shallow network to be decoded by a shallow Decoder, forming F Conformer models with shallow networks; wherein R and M denote numbers of network layers, F denotes the number of Conformer models with shallow networks, and F = R/M;
S2, divide the formed shallow networks into levels and order them from the bottom layer to the top layer of the deep network, forming a speech recognition model with F shallow networks; recognize input speech level by level according to the level of the shallow network in the speech recognition model, and judge the difficulty of the input speech;
S3, judge the difficulty of the input speech from its entropy after passing through a shallow network, and decide whether the input speech needs to be computed and recognized by the next-level shallow network; the smaller the entropy of the shallow network's output, the more certain the speech recognition result and the smaller its ambiguity; conversely, the larger the entropy, the more uncertain the result and the greater its ambiguity, requiring a network with stronger modeling capability for recognition.
A rapid speech recognition method based on hierarchical recognition comprises the following steps:
S1, divide the deep network of the Conformer model: following the order from the bottom layer to the top layer, split the R-layer deep network into a shallow network every M layers, and lead a tap out of the last recognition layer of each shallow network to be decoded by a shallow Decoder, forming F Conformer models with shallow networks; wherein R and M denote numbers of network layers, F denotes the number of Conformer models with shallow networks, and F = R/M;
S2, divide the formed shallow networks into levels and order them from the bottom layer to the top layer of the deep network, forming a speech recognition model with F shallow networks; recognize input speech level by level according to the level of the shallow network in the speech recognition model, and judge the difficulty of the input speech;
S3, select two adjacent shallow networks in order of level from low to high, and judge the consistency of the speech recognition results output by the two shallow networks; when the results of two adjacent shallow networks pass the consistency judgment, acoustic modeling is considered complete; otherwise, a network with stronger modeling capability is required for speech recognition.
Preferably, the entropy of the input speech after passing through a shallow network is calculated as:

E = -\frac{1}{L} \sum_{l=1}^{L} \sum_{i=1}^{N} p_{li} \log p_{li}

where E denotes the entropy, L the number of speech frames, N the total number of units required for speech recognition of the input speech, and p_{li} the probability of the i-th speech recognition unit in the l-th frame of the input speech.
Preferably, the difficulty of speech in step S3 is judged as follows: set an entropy threshold; when the entropy output by the shallow network at level f is below the threshold, the difficulty of the input speech is settled, and the recognition result of the input speech output by the level-f shallow network is taken as the final result; otherwise, the input speech continues to be recognized level by level through the shallow networks until the entropy of some shallow network falls below the threshold or the level reaches F; wherein f denotes the level of a shallow network within the speech recognition model.
Preferably, the difficulty of speech in step S3 is judged as follows: set a threshold on the difference between recognition results; when the difference between the speech recognition results of two levels of shallow networks is below the threshold, i.e. diff(result1, result2) < threshold, acoustic modeling is considered complete; if the difference is at or above the threshold, i.e. diff(result1, result2) ≥ threshold, the speech recognition model formed by the current shallow networks is considered to have insufficient modeling capability for the speech, and recognition continues upward level by level through the shallow models until the results of two adjacent shallow networks pass the consistency judgment or the level reaches F.
Preferably, in step S3, when the speech recognition results of two adjacent shallow networks pass the consistency judgment, the results of the current two levels of shallow networks are linearly weighted to produce the final output.
Preferably, in step S1 the deep network of the Conformer model is divided as follows: for an R-layer deep network, a tap is led out every M layers and decoded by a shallow Decoder; F shallow Decoders are provided, forming F shallow networks.
Preferably, for the speech recognition model with shallow networks formed in step S2, R/M branches are used to jointly train the level-progressive shallow networks in a multi-task fashion.
Preferably, models of different network depths share parameters within the shallow networks.
The invention has the following beneficial effects: the disclosed rapid speech recognition method based on hierarchical recognition routes speech of varying difficulty to different branches and decomposes the model level by level, so that models at different levels handle cases of matching difficulty; the hierarchical inference scheme addresses the heavy computing resources required by large-model inference, greatly reduces the overall inference complexity, saves computing resources, and reduces service latency.
Drawings
FIG. 1 is a flow diagram of a fast speech recognition process for hierarchical recognition;
FIG. 2 is a block diagram of a fast speech recognition architecture for hierarchical recognition;
FIG. 3 is a diagram of a decision criteria structure for hierarchical recognition;
FIG. 4 is a diagram of the result-based judgment structure of hierarchical recognition.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
A rapid speech recognition method based on hierarchical recognition divides a multi-layer Conformer speech model into multiple stages, judges the recognition difficulty of input speech from the perspective of the speech model, and determines the required recognition level from that difficulty; as shown in FIG. 1, the method comprises the following steps:
S1, divide the deep network of the Conformer model: following the order from the bottom layer to the top layer, split the R-layer deep network into a shallow network every M layers, and lead a tap out of the last recognition layer of each shallow network to be decoded by a shallow Decoder, forming R/M Conformer models with shallow networks;
A specific implementation is as follows: for an R-layer deep network, a tap is led out every M layers and decoded by a shallow Decoder, giving F shallow Decoders, where R and M denote numbers of network layers, F denotes the number of Conformer models with shallow networks, and F = R/M. As shown in FIG. 2, for a 12-layer Conformer model, a shallow Decoder is attached every 3 layers, with 4 shallow Decoders arranged in order from the bottom layer to the top layer, forming 4 Conformer models with shallow networks.
S2, divide the formed shallow networks into levels and order them from the bottom layer to the top layer of the deep network, forming a speech recognition model with R/M shallow networks; use R/M branches to jointly train the level-progressive shallow networks in a multi-task fashion, with parameters shared within the shallow networks;
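A sketch of the corresponding joint training objective, under the assumption (not stated in the patent) that each branch is trained with a CTC loss; the per-branch weights are likewise illustrative:

    import torch.nn.functional as functional  # renamed to avoid clashing with F = R/M

    def joint_multitask_loss(branch_log_probs, targets, input_lens, target_lens,
                             weights=None):
        """Weighted sum of per-branch CTC losses over the F shallow networks.

        branch_log_probs: list of (batch, frames, vocab) log-posteriors, one per
        level, e.g. the output of the hypothetical HierarchicalEncoder above.
        """
        weights = weights or [1.0] * len(branch_log_probs)
        loss = 0.0
        for w, log_probs in zip(weights, branch_log_probs):
            # functional.ctc_loss expects (frames, batch, vocab)
            loss = loss + w * functional.ctc_loss(
                log_probs.transpose(0, 1), targets, input_lens, target_lens,
                zero_infinity=True)
        return loss

Because every level reuses the same bottom-up stages, the shallower models are exactly the shared prefix of the deeper ones, which realizes the parameter sharing described above.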
S3, recognize the input speech level by level according to the level of the shallow network in the speech recognition model, and judge the difficulty of the input speech;
S41, as shown in FIG. 3, judge the difficulty of the speech from its entropy after passing through a shallow network, and decide whether the speech needs to be computed and recognized by the next-level shallow network;
the calculation formula of the entropy of the voice passing through the shallow network is as follows:
Figure BDA0003659249200000051
wherein E represents entropy, L represents number of speech frames, N represents total number of units required for speech recognition in input speech, and p li Representing the probability of the ith speech recognition unit in the input speech to perform speech recognition in the l frame; the smaller the entropy value of the shallow network output is, the more certain the speech recognition result of the shallow network output is, and the smaller the ambiguity of the speech recognition result is; conversely, the larger the entropy value is, the more uncertain the speech recognition result output by the shallow network is, the greater the ambiguity of the speech recognition result is, and the network with stronger modeling capability is required to recognize.
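A direct transcription of this formula, as a sketch that assumes the log-posteriors of one utterance from a single shallow tap:

    import torch

    def frame_averaged_entropy(log_probs):
        """E = -(1/L) * sum over frames l and units i of p_li * log p_li.

        log_probs: (L, N) log-posteriors for one utterance from a shallow tap.
        """
        p = log_probs.exp()
        return -(p * log_probs).sum(dim=-1).mean().item()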
Set an entropy threshold; when the entropy output by the shallow network at level f is below the threshold, the difficulty of the speech is settled, and the recognition result output by the level-f shallow network is taken as the final result; otherwise, the speech continues to be recognized upward level by level through the shallow networks until the entropy of some shallow network falls below the threshold or the level reaches F, where f denotes the level of a shallow network within the speech recognition model.
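Putting the pieces together, a sketch of the level-by-level early-exit inference loop; it reuses the hypothetical HierarchicalEncoder and frame_averaged_entropy above, and the threshold value is task-dependent rather than fixed by the patent:

    import torch

    @torch.no_grad()
    def recognize_with_early_exit(model, x, entropy_threshold):
        """Run stages bottom-up; stop at the first level whose entropy clears the threshold."""
        for f, (stage, decoder) in enumerate(
                zip(model.stages, model.shallow_decoders), start=1):
            for layer in stage:
                x = layer(x)                   # deeper levels run only when needed
            log_probs = decoder(x).log_softmax(dim=-1)
            if frame_averaged_entropy(log_probs[0]) < entropy_threshold or f == model.F:
                return log_probs, f            # decode this tap as the final result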
With the speech-difficulty judgment of step S41, the recognition result may still be wrong even when the entropy output by the current shallow network is below the threshold; to address this, the following speech-difficulty judgment method can be used instead:
S42, as shown in FIG. 4, select two adjacent shallow networks of different levels in order of level from low to high, and judge the consistency of the speech recognition results output by the two shallow networks. Set a threshold on the difference between recognition results: when the difference between the results of the two levels of shallow networks is below the threshold, i.e. diff(result1, result2) < threshold, acoustic modeling is considered complete, the recognition results tend to converge, and the results of the current two levels are linearly weighted to produce the final output; if the difference is at or above the threshold, i.e. diff(result1, result2) ≥ threshold, the speech recognition model formed by the current shallow networks is considered to have insufficient modeling capability for the speech, and recognition continues upward level by level through the shallow models until the results of two adjacent shallow networks pass the consistency judgment or the level reaches F.
Compared with the method of step S41, the judgment of step S42 requires more computation, since at least the first two levels of recognition must be run; but the recognition accuracy is better guaranteed, and the weighted fusion of the two levels' results also provides a model-averaging effect, improving decoding accuracy over a single speech recognition model of the same depth.
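A sketch of this consistency check and fusion. The patent leaves diff and the weights unspecified, so two assumptions are made here: diff is taken as the token-level edit distance between the two levels' decoded hypotheses, and the linear weighting is applied to the two levels' posteriors.

    import torch

    def edit_distance(a, b):
        """Token-level Levenshtein distance, one plausible choice for diff(result1, result2)."""
        dp = list(range(len(b) + 1))
        for i, ta in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, tb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ta != tb))
        return dp[-1]

    def fuse_or_escalate(log_probs_a, log_probs_b, hyp_a, hyp_b, threshold, w=0.5):
        """If two adjacent levels agree, return their linearly weighted posteriors; else None."""
        if edit_distance(hyp_a, hyp_b) < threshold:
            return (w * log_probs_a.exp() + (1 - w) * log_probs_b.exp()).log()
        return None  # escalate to the next shallow level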
Examples
In practical speech recognition tasks, most cases are simple ones with clean backgrounds and clear pronunciation; for these, a shallow network suffices to complete the recognition task.
Take a 12-layer Conformer deep network as an example, and suppose 20% of the tested speech requires the deep network for recognition. Even with the judgment method of step S42, the remaining 80% of cases can stop after the first two levels, i.e. about 50% of the layers, saving roughly 80% × 50% = 40% of the total computation, a considerable benefit.
Take a 24-layer Conformer deep network as another example, again with 20% of the tested cases requiring the deep network: 20% of cases need all 24 layers while 80% need no more than 12 layers. Relative to a 12-layer baseline, the added complexity is then at most 100% × 20% = 20% of the computation, far less than the 100% extra incurred by running every case through a 24-layer network.
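The arithmetic behind both examples, spelled out (the case shares and exit depths are the patent's stated assumptions):

    # 12-layer model with the S42 judgment: 80% of cases stop after two levels,
    # i.e. about 50% of the layers, so roughly 0.8 * 0.5 = 40% of compute is saved.
    saved = 0.8 * 0.5

    # 24-layer model vs. a 12-layer baseline: 80% of cases stop by layer 12,
    # 20% run all 24 layers.
    baseline = 12
    hierarchical = 0.8 * 12 + 0.2 * 24            # expected layers per case = 14.4
    extra = (hierarchical - baseline) / baseline  # 0.2 -> at most 20% extra compute
    print(saved, extra)  # 0.4 0.2; running 24 layers for every case would cost +100%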
By adopting the technical scheme disclosed by the invention, the following beneficial effects are obtained:
the invention discloses a rapid speech recognition method based on hierarchical recognition, which divides speech with different difficulties and disassembles models step by step, so that the models with different levels can process cases with different difficulties; the invention solves the problem of limited computing resources required by large model modeling by a hierarchical reasoning mode, greatly reduces the complexity of the whole reasoning, saves the computing resources and reduces the service delay.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and such modifications and improvements should also be considered within the scope of the present invention.

Claims (9)

1. A rapid speech recognition method based on hierarchical recognition, characterized by comprising the following steps:
S1, divide the deep network of the Conformer model: following the order from the bottom layer to the top layer, split the R-layer deep network into a shallow network every M layers, and lead a tap out of the last recognition layer of each shallow network to be decoded by a shallow Decoder, forming F Conformer models with shallow networks; wherein R and M denote numbers of network layers, F denotes the number of Conformer models with shallow networks, and F = R/M;
S2, divide the formed shallow networks into levels and order them from the bottom layer to the top layer of the deep network, forming a speech recognition model with F shallow networks; recognize input speech level by level according to the level of the shallow network in the speech recognition model, and judge the difficulty of the input speech;
S3, judge the difficulty of the input speech from its entropy after passing through a shallow network, and decide whether the input speech needs to be computed and recognized by the next-level shallow network; the smaller the entropy of the shallow network's output, the more certain the speech recognition result and the smaller its ambiguity; conversely, the larger the entropy, the more uncertain the result and the greater its ambiguity, requiring a network with stronger modeling capability for recognition.
2. A rapid speech recognition method based on hierarchical recognition, characterized by comprising the following steps:
S1, divide the deep network of the Conformer model: following the order from the bottom layer to the top layer, split the R-layer deep network into a shallow network every M layers, and lead a tap out of the last recognition layer of each shallow network to be decoded by a shallow Decoder, forming F Conformer models with shallow networks; wherein R and M denote numbers of network layers, F denotes the number of Conformer models with shallow networks, and F = R/M;
S2, divide the formed shallow networks into levels and order them from the bottom layer to the top layer of the deep network, forming a speech recognition model with F shallow networks; recognize input speech level by level according to the level of the shallow network in the speech recognition model, and judge the difficulty of the input speech;
S3, select two adjacent shallow networks in order of level from low to high, and judge the consistency of the speech recognition results output by the two shallow networks; when the results of two adjacent shallow networks pass the consistency judgment, acoustic modeling is considered complete; otherwise, a network with stronger modeling capability is required for speech recognition.
3. The method according to claim 1, wherein the entropy of the input speech passing through the shallow network is calculated by the following formula:
E = -\frac{1}{L} \sum_{l=1}^{L} \sum_{i=1}^{N} p_{li} \log p_{li}

wherein E denotes the entropy, L the number of speech frames, N the total number of units required for speech recognition of the input speech, and p_{li} the probability of the i-th speech recognition unit in the l-th frame of the input speech.
4. The rapid speech recognition method based on hierarchical recognition according to claim 1, wherein the difficulty of speech in step S3 is judged as follows: set an entropy threshold; when the entropy output by the shallow network at level f is below the threshold, the difficulty of the input speech is settled, and the recognition result of the input speech output by the level-f shallow network is taken as the final result; otherwise, the input speech continues to be recognized level by level through the shallow networks until the entropy of some shallow network falls below the threshold or the level reaches F; wherein f denotes the level of a shallow network within the speech recognition model.
5. The rapid speech recognition method based on hierarchical recognition according to claim 2, wherein the difficulty of speech in step S3 is judged as follows: set a threshold on the difference between recognition results; when the difference between the speech recognition results of two levels of shallow networks is below the threshold, i.e. diff(result1, result2) < threshold, acoustic modeling is considered complete; if the difference is at or above the threshold, i.e. diff(result1, result2) ≥ threshold, the speech recognition model formed by the current shallow networks is considered to have insufficient modeling capability for the speech, and recognition continues upward level by level through the shallow models until the results of two adjacent shallow networks pass the consistency judgment or the level reaches F.
6. The rapid speech recognition method based on hierarchical recognition according to claim 2, wherein in step S3, when the speech recognition results of two adjacent shallow networks pass the consistency judgment, the results of the current two levels of shallow networks are linearly weighted as the final output result.
7. The rapid speech recognition method based on hierarchical recognition according to claim 1 or 2, wherein in step S1 the deep network of the Conformer model is divided as follows: for an R-layer deep network, a tap is led out every M layers and decoded by a shallow Decoder; F shallow Decoders are provided, forming F shallow networks.
8. The rapid speech recognition method based on hierarchical recognition according to claim 1 or 2, wherein, for the speech recognition model with shallow networks formed in step S2, R/M branches are used to jointly train the level-progressive shallow networks in a multi-task fashion.
9. The rapid speech recognition method based on hierarchical recognition according to claim 1 or 2, wherein models of different network depths share parameters within the shallow networks.
CN202210571189.4A 2022-05-24 2022-05-24 Rapid voice recognition method based on hierarchical recognition Active CN114974228B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210571189.4A CN114974228B (en) 2022-05-24 2022-05-24 Rapid voice recognition method based on hierarchical recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210571189.4A CN114974228B (en) 2022-05-24 2022-05-24 Rapid voice recognition method based on hierarchical recognition

Publications (2)

Publication Number Publication Date
CN114974228A (en) 2022-08-30
CN114974228B (en) 2023-04-11

Family

ID=82955743

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210571189.4A Active CN114974228B (en) 2022-05-24 2022-05-24 Rapid voice recognition method based on hierarchical recognition

Country Status (1)

Country Link
CN (1) CN114974228B (en)

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0667698A (en) * 1992-06-19 1994-03-11 Seiko Epson Corp Speech recognizing device
US20040015329A1 (en) * 2002-07-19 2004-01-22 Med-Ed Innovations, Inc. Dba Nei, A California Corporation Method and apparatus for evaluating data and implementing training based on the evaluation of the data
CN105393302A (en) * 2013-07-17 2016-03-09 三星电子株式会社 Multi-level speech recognition
CN106844343A (en) * 2017-01-20 2017-06-13 上海傲硕信息科技有限公司 Instruction results screening plant
CN109785831A (en) * 2017-11-14 2019-05-21 奥迪股份公司 Check method, control device and the motor vehicle of the vehicle-mounted voice identifier of motor vehicle
JP2019095600A (en) * 2017-11-22 2019-06-20 日本電信電話株式会社 Acoustic model learning device, speech recognition device, and method and program for them
CN109074808A (en) * 2018-07-18 2018-12-21 深圳魔耳智能声学科技有限公司 Sound control method, control device and storage medium
CN109192194A (en) * 2018-08-22 2019-01-11 北京百度网讯科技有限公司 Voice data mask method, device, computer equipment and storage medium
CN110413997A (en) * 2019-07-16 2019-11-05 深圳供电局有限公司 For the new word discovery method and its system of power industry, readable storage medium storing program for executing
CN110472730A (en) * 2019-08-07 2019-11-19 交叉信息核心技术研究院(西安)有限公司 A kind of distillation training method and the scalable dynamic prediction method certainly of convolutional neural networks
WO2021023202A1 (en) * 2019-08-07 2021-02-11 交叉信息核心技术研究院(西安)有限公司 Self-distillation training method and device for convolutional neural network, and scalable dynamic prediction method
CN112182154A (en) * 2020-09-25 2021-01-05 中国人民大学 Personalized search model for eliminating keyword ambiguity by utilizing personal word vector
CN112434163A (en) * 2020-11-30 2021-03-02 北京沃东天骏信息技术有限公司 Risk identification method, model construction method, risk identification device, electronic equipment and medium
CN112908301A (en) * 2021-01-27 2021-06-04 科大讯飞(上海)科技有限公司 Voice recognition method, device, storage medium and equipment
CN113807499A (en) * 2021-09-15 2021-12-17 清华大学 Lightweight neural network model training method, system, device and storage medium
CN113870845A (en) * 2021-09-26 2021-12-31 平安科技(深圳)有限公司 Speech recognition model training method, device, equipment and medium

Also Published As

Publication number Publication date
CN114974228B (en) 2023-04-11

Similar Documents

Publication Publication Date Title
CN110164476B (en) BLSTM voice emotion recognition method based on multi-output feature fusion
CN112613303B (en) Knowledge distillation-based cross-modal image aesthetic quality evaluation method
CN112685597B (en) Weak supervision video clip retrieval method and system based on erasure mechanism
CN112633010B (en) Aspect-level emotion analysis method and system based on multi-head attention and graph convolution network
CN110619319A (en) Improved MTCNN model-based face detection method and system
Fang et al. EdgeKE: an on-demand deep learning IoT system for cognitive big data on industrial edge devices
CN111612147A (en) Quantization method of deep convolutional network
CN110443784B (en) Effective significance prediction model method
CN112733964B (en) Convolutional neural network quantization method for reinforcement learning automatic perception weight distribution
CN111950715A (en) 8-bit integer full-quantization inference method and device based on self-adaptive dynamic shift
CN116863320B (en) Underwater image enhancement method and system based on physical model
CN107910009B (en) Code element rewriting information hiding detection method and system based on Bayesian inference
CN115588237A (en) Three-dimensional hand posture estimation method based on monocular RGB image
CN116109920A (en) Remote sensing image building extraction method based on transducer
CN114943335A (en) Layer-by-layer optimization method of ternary neural network
CN114974228B (en) Rapid voice recognition method based on hierarchical recognition
CN117494762A (en) Training method of student model, material processing method, device and electronic equipment
CN116167015A (en) Dimension emotion analysis method based on joint cross attention mechanism
CN115828863A (en) Automatic generation method of emergency plan in chaotic engineering test scene
CN112380874B (en) Multi-person-to-speech analysis method based on graph convolution network
CN114758645A (en) Training method, device and equipment of speech synthesis model and storage medium
CN112184846A (en) Image generation method and device, computer equipment and readable storage medium
CN113076123A (en) Adaptive template updating system and method for target tracking
Wang et al. Exploring quantization in few-shot learning
CN117808083B (en) Distributed training communication method, device, system, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant