CN114974228B - Rapid voice recognition method based on hierarchical recognition - Google Patents

Rapid voice recognition method based on hierarchical recognition

Info

Publication number
CN114974228B
CN114974228B CN202210571189.4A
Authority
CN
China
Prior art keywords
shallow
network
networks
level
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210571189.4A
Other languages
Chinese (zh)
Other versions
CN114974228A (en)
Inventor
吕志强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mingri Dream Beijing Technology Co ltd
Original Assignee
Mingri Dream Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mingri Dream Beijing Technology Co ltd filed Critical Mingri Dream Beijing Technology Co ltd
Priority to CN202210571189.4A priority Critical patent/CN114974228B/en
Publication of CN114974228A publication Critical patent/CN114974228A/en
Application granted granted Critical
Publication of CN114974228B publication Critical patent/CN114974228B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A rapid speech recognition method based on hierarchical recognition routes speech of different difficulty to different levels of a step-by-step decomposed model, so that models of different levels handle speech cases of matching difficulty; through this hierarchical inference scheme, the invention addresses the heavy computing resources required by large-model inference, greatly reduces overall inference complexity, saves computing resources, and lowers service latency.

Description

Rapid voice recognition method based on hierarchical recognition
Technical Field
The invention relates to the field of voice recognition, in particular to a rapid voice recognition method based on hierarchical recognition.
Background
With the continuous growth of computing power and the accumulation of data, the performance of speech recognition systems has improved markedly; end-to-end modeling methods, represented by CTC and encoder-decoder approaches, make fuller use of massive data and have stronger modeling capability. In the field of speech recognition, Google proposed the convolution-augmented Conformer model in 2020, which has repeatedly set new records in recognition accuracy and has become a standard method for acoustic modeling in current speech recognition. With massive training data, multi-layer Conformer models with larger parameter counts have proven to have stronger modeling capability; in general, a 12-to-24-layer Conformer model becomes more capable as the number of layers increases, given sufficient training data. However, as the parameter count grows, model inference during speech recognition requires more computation, energy, latency, and resources, which limits the application of large models in practical scenarios. To make deep Conformer networks usable for speech recognition tasks, methods such as reducing the number of hidden-layer neurons or applying matrix decomposition are typically used to cut the parameter count and computation, but these methods usually bring some performance loss. Meanwhile, the computational complexity of model inference still grows linearly with the number of Conformer layers.
Disclosure of Invention
The present invention aims to provide a rapid speech recognition method based on hierarchical recognition, thereby solving the aforementioned problems in the prior art.
To achieve this purpose, the invention adopts the following technical scheme:
a quick speech recognition method based on hierarchical recognition comprises the following steps:
s1, dividing deep networks of a Conformer model, dividing the deep networks with R layers into shallow networks every M layers according to the sequence from a bottom layer to a top layer, and leading out a tap from a last layer identification network in each shallow network to decode by using a shallow Decoder to form F Conformer models with shallow networks; wherein R and M represent the number of network levels, F represents the number of Conformer models with shallow networks, F = R/M;
s2, according to the sequence from the bottom layer to the top layer in the deep layer network, carrying out level division and sequencing on the formed shallow layer network to form a voice recognition model with F shallow layer networks, carrying out level-by-level recognition on input voice according to the level of the shallow layer network in the voice recognition model, and judging the difficulty level of the input voice;
s3, judging the difficulty degree of the input voice according to the entropy of the input voice passing through the shallow network, and judging whether the input voice needs to be calculated and identified by the shallow network at the next level; the smaller the entropy value of the shallow network output is, the more certain the speech recognition result of the shallow network output is, the smaller the ambiguity of the speech recognition result is; conversely, the larger the entropy value is, the more uncertain the speech recognition result output by the shallow network is, the larger the ambiguity of the speech recognition result is, the network with stronger modeling capability is required to recognize.
A rapid speech recognition method based on hierarchical recognition comprises the following steps:
S1. Divide the deep network of a Conformer model: split the R-layer deep network, from the bottom layer to the top layer, into shallow networks of M layers each, lead out a tap from the last layer of each shallow network, and decode with a shallow Decoder, forming F Conformer models with shallow networks; wherein R and M denote numbers of network layers, F denotes the number of Conformer models with shallow networks, and F = R/M;
S2. Order the resulting shallow networks by level, from the bottom layer to the top layer of the deep network, to form a speech recognition model with F shallow networks; recognize input speech level by level according to the levels of the shallow networks in the model, and judge the difficulty of the input speech;
S3. Select two adjacent shallow networks in order of increasing level and check the consistency of their speech recognition results; when the results of two adjacent shallow networks pass the consistency check, the acoustic modeling is considered complete; otherwise, a network with stronger modeling capability is required for speech recognition.
Preferably, the entropy of the input speech after passing through a shallow network is computed as:
E = -(1/L) Σ_{l=1}^{L} Σ_{i=1}^{N} p_{li} log p_{li}
where E denotes the entropy, L the number of speech frames, N the total number of units required for recognizing the input speech, and p_{li} the probability of the i-th recognition unit in the l-th frame of the input speech.
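A minimal sketch of this entropy computation, assuming the frame-level posteriors are stored as an (L, N) array (the function name and array layout are illustrative, not taken from the patent):

```python
import numpy as np

def frame_entropy(posteriors):
    """Average per-frame entropy E of frame-level posteriors.

    posteriors: array of shape (L, N) -- L speech frames, N recognition
    units; each row is a probability distribution over the units.
    """
    eps = 1e-12                       # guard against log(0)
    p = np.clip(posteriors, eps, 1.0)
    return float(-(p * np.log(p)).sum() / posteriors.shape[0])
```

A uniform distribution over N units gives the maximum per-frame entropy log N, while a fully peaked distribution gives an entropy near zero, matching the certainty interpretation above.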
Preferably, the difficulty of the speech in step S3 is judged as follows: set an entropy threshold; when the entropy output by the f-th-level shallow network is below the threshold, the difficulty of the input speech is determined and the recognition result output by the f-th-level shallow network is taken as the final result; otherwise the input speech continues to be recognized level by level through the shallow networks until the entropy output by a shallow network falls below the threshold or the F-th level is reached; where f denotes the level of the shallow network within the speech recognition model.
Preferably, the difficulty of the speech in step S3 is judged as follows: set a recognition-result difference threshold; when the difference between the recognition results of the two levels of shallow networks is below the threshold, i.e. diff(result1, result2) < threshold, the acoustic modeling is considered complete; if the difference is at or above the threshold, i.e. diff(result1, result2) ≥ threshold, the speech recognition model formed by the current shallow networks is considered to lack sufficient modeling capability for this speech, and recognition continues upward, level by level, through the shallow networks until the results of two adjacent shallow networks pass the consistency check or the F-th level is reached.
Preferably, in step S3, when the speech recognition results of two adjacent shallow networks pass the consistency check, the recognition results of the current two levels of shallow networks are linearly weighted to give the final output result.
Preferably, in step S1 the deep network of the Conformer model is divided as follows: for the R-layer deep network, a tap is led out every M layers and decoded with a shallow Decoder, and F shallow Decoders are arranged, forming F shallow networks.
Preferably, for the speech recognition model with shallow networks formed in step S2, multi-task joint training is performed on the level-progressive shallow networks using R/M branches.
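A minimal sketch of such a multi-task joint objective: the losses computed at the R/M branch Decoders are combined into one training loss. Equal branch weights are an assumption for the sketch; the patent does not specify a weighting.

```python
def joint_multitask_loss(branch_losses, weights=None):
    """Combine the per-branch (per-tap) losses of the F shallow
    networks into a single joint training objective."""
    if weights is None:
        # assumption: uniform weighting across branches
        weights = [1.0 / len(branch_losses)] * len(branch_losses)
    return sum(w * loss for w, loss in zip(weights, branch_losses))
```

Because the lower layers are shared across branches, minimizing this sum trains every shallow network while the deepest branch still sees the full stack.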
Preferably, for models of different network depths, parameters are shared within the shallow networks.
The invention has the following beneficial effects: the invention discloses a rapid speech recognition method based on hierarchical recognition, which routes speech of different difficulty to different levels of a step-by-step decomposed model, so that models of different levels handle cases of matching difficulty; through this hierarchical inference scheme, the invention addresses the heavy computing resources required by large-model inference, greatly reduces overall inference complexity, saves computing resources, and lowers service latency.
Drawings
FIG. 1 is a flow diagram of the rapid speech recognition method based on hierarchical recognition;
FIG. 2 is a structural diagram of the rapid speech recognition architecture based on hierarchical recognition;
FIG. 3 is a structural diagram of the entropy-based judgment criterion used in hierarchical recognition;
FIG. 4 is a structural diagram of the result-consistency criterion used in hierarchical recognition.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
A rapid speech recognition method based on hierarchical recognition divides a multi-layer Conformer speech model into multiple stages, judges the recognition difficulty of the input speech from the model's perspective, and chooses the recognition level according to that difficulty; as shown in fig. 1, the method comprises the following steps:
s1, dividing deep networks of a Conformer model, dividing the deep networks with R layers into shallow networks every M layers according to the sequence from a bottom layer to a top layer, and leading out a tap from a last layer identification network in each shallow network to decode by using a shallow Decoder to form R/M Conformer models with shallow networks;
the specific implementation mode is as follows: for a deep network of an R layer, decoding is performed by using a shallow Decoder every other tap led out from the M layer, and F shallow decoders are configured, where R and M represent the number of network layers, F represents the number of provider models with a shallow network, and F = R/M. As shown in fig. 2, for the 12-layer former model, shallow decoders are connected every 3 layers, and 4 shallow decoders are sequentially arranged from the bottom layer to the top layer, thereby forming 4 former models having shallow networks.
S2. Order the resulting shallow networks by level, from the bottom layer to the top layer of the deep network, to form a speech recognition model with R/M shallow networks, and perform multi-task joint training on the level-progressive shallow networks using R/M branches, with parameters shared within the shallow networks;
S3. Recognize the input speech level by level according to the levels of the shallow networks in the speech recognition model, and judge the difficulty of the input speech;
S41. As shown in FIG. 3, judge the difficulty of the speech according to its entropy after passing through the shallow network, and decide whether it needs to be computed and recognized by the next-level shallow network;
the calculation formula of the entropy of the voice passing through the shallow network is as follows:
Figure BDA0003659249200000051
wherein E represents entropy, L represents number of speech frames, N represents total number of units required for speech recognition in input speech, and p li Representing the probability of the ith voice recognition unit in the input voice to perform voice recognition in the ith frame; the smaller the entropy value of the shallow network output is, the more certain the speech recognition result of the shallow network output is, the smaller the ambiguity of the speech recognition result is; conversely, the larger the entropy value is, the more uncertain the speech recognition result output by the shallow network is, and the greater the ambiguity of the speech recognition result is, the network with stronger modeling capability is required to recognize.
An entropy threshold is set; when the entropy output by the f-th-level shallow network is below the threshold, the difficulty of the speech is determined and the recognition result output by the f-th-level shallow network is taken as the final result; otherwise the speech continues to be recognized upward, level by level, through the shallow networks until the entropy output by a shallow network falls below the threshold or the F-th level is reached; where f denotes the level of the shallow network within the speech recognition model.
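Steps S3/S41 can be sketched as an early-exit loop; here `stages`, `decoders`, and `entropy_fn` are illustrative placeholders for the tapped encoder stages, their shallow Decoders, and the entropy of the preceding formula.

```python
import numpy as np

def hierarchical_decode(x, stages, decoders, entropy_fn, threshold):
    """Level-by-level recognition with an entropy-based early exit:
    run each shallow stage bottom-up, decode at its tap, and stop as
    soon as the output entropy drops below the threshold.  The level-F
    result is returned unconditionally."""
    level, result = 0, None
    for stage, decoder in zip(stages, decoders):
        level += 1
        for block in stage:               # run this stage's encoder layers
            x = block(x)
        posteriors = decoder(x)           # (L, N) frame-level posteriors
        result = posteriors.argmax(axis=1)
        if entropy_fn(posteriors) < threshold:
            break                         # easy utterance: exit at this level
    return result, level
```

Easy utterances thus pay only for the lowest levels, while hard ones fall through to the full-depth model.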
With the speech difficulty judgment of step S41, the recognition result may still be wrong even when the entropy output by the current shallow network is below the threshold; therefore, the following judgment method can be adopted instead:
S42. As shown in FIG. 4, select two adjacent shallow networks in order of increasing level and check the consistency of their speech recognition results. A recognition-result difference threshold is set: when the difference between the recognition results of the two levels of shallow networks is below the threshold, i.e. diff(result1, result2) < threshold, the acoustic modeling is considered complete, the recognition results tend to converge, and the results of the current two levels of shallow networks are linearly weighted to give the final output; if the difference is at or above the threshold, i.e. diff(result1, result2) ≥ threshold, the speech recognition model formed by the current shallow networks is considered to lack sufficient modeling capability for this speech, and recognition continues upward, level by level, through the shallow networks until the results of two adjacent shallow networks pass the consistency check or the F-th level is reached.
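A sketch of step S42, under two assumptions the patent leaves open: diff(result1, result2) is realised here as an edit distance over recognized unit sequences, and the fusion as an equal-weight linear combination of the two taps' frame posteriors.

```python
import numpy as np

def edit_distance(a, b):
    """Levenshtein distance between two token sequences -- one possible
    realisation of diff(result1, result2)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[m][n]

def consistent_fuse(post1, post2, threshold, w=0.5):
    """If two adjacent taps agree (diff below threshold), return the
    linearly weighted posteriors as the final result, else None to
    signal that recognition must continue at a higher level."""
    r1 = post1.argmax(axis=1)
    r2 = post2.argmax(axis=1)
    if edit_distance(list(r1), list(r2)) < threshold:
        return w * post1 + (1 - w) * post2
    return None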
Compared with the judgment method of step S41, the method of step S42 costs more computation, since at least the first two levels of recognition must be computed, but its accuracy is better guaranteed; moreover, the weighted fusion of the two levels' results also acts as a form of model averaging, which improves decoding accuracy at the same model level.
Examples
In practical speech recognition, most utterances are simple cases with a clean background and clear pronunciation, and for these a shallow network is sufficient to complete the recognition task.
Taking a 12-layer Conformer deep network as an example, suppose 20% of the tested utterances need the deep network for recognition; even with the judgment method of step S42, about 80% × 50% = 40% of the computation can be saved, a considerable gain.
Taking a 24-layer Conformer deep network as an example, again suppose 20% of the tested utterances need the deep network: if 20% of cases require the full 24 layers and 80% need no more than 12 layers, the added complexity is at most 100% × 20% = 20% of the computation, far less than the 100% increase of running every case through the 24-layer network.
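The arithmetic above can be reproduced with a small helper; the 50% figure corresponds to the assumption that easy utterances exit after half of the layers.

```python
def expected_layer_fraction(case_mix, total_layers):
    """Expected encoder cost per utterance, as a fraction of always
    running all total_layers; case_mix is a list of
    (fraction_of_cases, layers_used) pairs."""
    return sum(frac * layers for frac, layers in case_mix) / total_layers
```

For both examples the hierarchical scheme runs at about 60% of the full-depth cost, i.e. roughly 40% of the computation is saved.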
By adopting the technical scheme disclosed by the invention, the following beneficial effects are obtained:
the invention discloses a rapid speech recognition method based on hierarchical recognition, which is characterized in that speech with different difficulties is shunted to disassemble models step by step, so that the models with different levels can process cases with different difficulties; the invention solves the problem of limited computing resources required by large model modeling by a hierarchical reasoning mode, greatly reduces the complexity of the whole reasoning, saves the computing resources and reduces the service delay.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and such modifications and improvements should also be considered within the scope of the present invention.

Claims (1)

1. A rapid speech recognition method based on hierarchical recognition, characterized by comprising the following steps:
S1. Divide the deep network of a Conformer model: split the R-layer deep network, from the bottom layer to the top layer, into shallow networks of M layers each, and lead out a tap from the last layer of each shallow network for decoding with a shallow Decoder, forming F Conformer models with shallow networks; wherein R and M denote numbers of network layers, F denotes the number of Conformer models with shallow networks, and F = R/M;
S2. Order the resulting shallow networks by level, from the bottom layer to the top layer of the deep network, to form a speech recognition model with F shallow networks; recognize input speech level by level according to the levels of the shallow networks in the model, and judge the difficulty of the input speech;
S3. Judge the difficulty of the input speech from the entropy of its output after passing through a shallow network, and decide whether it needs to be computed and recognized by the shallow network at the next level; the smaller the entropy of a shallow network's output, the more certain and the less ambiguous its recognition result; conversely, the larger the entropy, the more uncertain and the more ambiguous the result, and a network with stronger modeling capability is required for recognition;
wherein the entropy of the input speech after passing through a shallow network is computed as:
E = -(1/L) Σ_{l=1}^{L} Σ_{i=1}^{N} p_{li} log p_{li}
where E denotes the entropy, L the number of speech frames, N the total number of units required for recognizing the input speech, and p_{li} the probability of the i-th recognition unit in the l-th frame of the input speech;
the difficulty of the speech in step S3 is judged as follows: set an entropy threshold; when the entropy output by the f-th-level shallow network is below the threshold, the difficulty of the input speech is determined and the recognition result output by the f-th-level shallow network is taken as the final result; otherwise the input speech continues to be recognized level by level through the shallow networks until the entropy output by a shallow network falls below the threshold or the F-th level is reached; wherein f denotes the level of the shallow network within the speech recognition model;
Alternatively, the method comprises the following steps:
S1. Divide the deep network of a Conformer model: split the R-layer deep network, from the bottom layer to the top layer, into shallow networks of M layers each, lead out a tap from the last layer of each shallow network, and decode with a shallow Decoder, forming F Conformer models with shallow networks; wherein R and M denote numbers of network layers, F denotes the number of Conformer models with shallow networks, and F = R/M;
S2. Order the resulting shallow networks by level, from the bottom layer to the top layer of the deep network, to form a speech recognition model with F shallow networks; recognize input speech level by level according to the levels of the shallow networks in the model, and judge the difficulty of the input speech;
S3. Select two adjacent shallow networks in order of increasing level and check the consistency of their speech recognition results; when the results of two adjacent shallow networks pass the consistency check, the acoustic modeling is considered complete; otherwise, a network with stronger modeling capability is required for recognition;
the difficulty of the speech in step S3 is judged as follows: set a recognition-result difference threshold; when the difference between the recognition results of the two levels of shallow networks is below the threshold, i.e. diff(result1, result2) < threshold, the acoustic modeling is considered complete; if the difference is at or above the threshold, i.e. diff(result1, result2) ≥ threshold, the speech recognition model formed by the current shallow networks is considered to lack sufficient modeling capability for this speech, and recognition continues upward, level by level, through the shallow networks until the results of two adjacent shallow networks pass the consistency check or the F-th level is reached;
in step S3, when the speech recognition results of two adjacent shallow networks pass the consistency check, the results of the current two levels of shallow networks are linearly weighted to give the final output;
in step S1, the deep network of the Conformer model is divided as follows: for the R-layer deep network, a tap is led out every M layers and decoded with a shallow Decoder, and F shallow Decoders are arranged, forming F shallow networks;
for the speech recognition model with shallow networks formed in step S2, multi-task joint training is performed on the level-progressive shallow networks using R/M branches;
for models of different network depths, all parameters are shared within the shallow networks.
CN202210571189.4A 2022-05-24 2022-05-24 Rapid voice recognition method based on hierarchical recognition Active CN114974228B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210571189.4A CN114974228B (en) 2022-05-24 2022-05-24 Rapid voice recognition method based on hierarchical recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210571189.4A CN114974228B (en) 2022-05-24 2022-05-24 Rapid voice recognition method based on hierarchical recognition

Publications (2)

Publication Number Publication Date
CN114974228A CN114974228A (en) 2022-08-30
CN114974228B true CN114974228B (en) 2023-04-11

Family

ID=82955743

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210571189.4A Active CN114974228B (en) 2022-05-24 2022-05-24 Rapid voice recognition method based on hierarchical recognition

Country Status (1)

Country Link
CN (1) CN114974228B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019095600A (en) * 2017-11-22 2019-06-20 日本電信電話株式会社 Acoustic model learning device, speech recognition device, and method and program for them
WO2021023202A1 (en) * 2019-08-07 2021-02-11 交叉信息核心技术研究院(西安)有限公司 Self-distillation training method and device for convolutional neural network, and scalable dynamic prediction method
CN113807499A (en) * 2021-09-15 2021-12-17 清华大学 Lightweight neural network model training method, system, device and storage medium

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3467556B2 (en) * 1992-06-19 2003-11-17 セイコーエプソン株式会社 Voice recognition device
US6795793B2 (en) * 2002-07-19 2004-09-21 Med-Ed Innovations, Inc. Method and apparatus for evaluating data and implementing training based on the evaluation of the data
US9305554B2 (en) * 2013-07-17 2016-04-05 Samsung Electronics Co., Ltd. Multi-level speech recognition
CN106844343B (en) * 2017-01-20 2019-11-19 上海傲硕信息科技有限公司 Instruction results screening plant
DE102017220266B3 (en) * 2017-11-14 2018-12-13 Audi Ag Method for checking an onboard speech recognizer of a motor vehicle and control device and motor vehicle
WO2020014899A1 (en) * 2018-07-18 2020-01-23 深圳魔耳智能声学科技有限公司 Voice control method, central control device, and storage medium
CN109192194A (en) * 2018-08-22 2019-01-11 北京百度网讯科技有限公司 Voice data mask method, device, computer equipment and storage medium
CN110413997B (en) * 2019-07-16 2023-04-07 深圳供电局有限公司 New word discovery method, system and readable storage medium for power industry
CN112182154B (en) * 2020-09-25 2023-10-10 中国人民大学 Personalized search model for eliminating keyword ambiguity by using personal word vector
CN112434163A (en) * 2020-11-30 2021-03-02 北京沃东天骏信息技术有限公司 Risk identification method, model construction method, risk identification device, electronic equipment and medium
CN112908301A (en) * 2021-01-27 2021-06-04 科大讯飞(上海)科技有限公司 Voice recognition method, device, storage medium and equipment
CN113870845A (en) * 2021-09-26 2021-12-31 平安科技(深圳)有限公司 Speech recognition model training method, device, equipment and medium


Also Published As

Publication number Publication date
CN114974228A (en) 2022-08-30

Similar Documents

Publication Publication Date Title
CN109977212B (en) Reply content generation method of conversation robot and terminal equipment
CN112613303B (en) Knowledge distillation-based cross-modal image aesthetic quality evaluation method
CN112685597B (en) Weak supervision video clip retrieval method and system based on erasure mechanism
CN111128137A (en) Acoustic model training method and device, computer equipment and storage medium
US11651578B2 (en) End-to-end modelling method and system
Fang et al. EdgeKE: An on-demand deep learning IoT system for cognitive big data on industrial edge devices
CN111916058A (en) Voice recognition method and system based on incremental word graph re-scoring
CN113257248B (en) Streaming and non-streaming mixed voice recognition system and streaming voice recognition method
CN112733964B (en) Convolutional neural network quantization method for reinforcement learning automatic perception weight distribution
CN112967739B (en) Voice endpoint detection method and system based on long-term and short-term memory network
CN115376491A (en) Voice confidence calculation method, system, electronic equipment and medium
CN114974228B (en) Rapid voice recognition method based on hierarchical recognition
Sterpu et al. Should we hard-code the recurrence concept or learn it instead? Exploring the Transformer architecture for Audio-Visual Speech Recognition
US11501759B1 (en) Method, system for speech recognition, electronic device and storage medium
CN116109920A (en) Remote sensing image building extraction method based on transducer
CN116028823A (en) Loss calculation method and device based on multi-mode semantic interaction
CN115828863A (en) Automatic generation method of emergency plan in chaotic engineering test scene
CN113160801B (en) Speech recognition method, device and computer readable storage medium
CN114758645A (en) Training method, device and equipment of speech synthesis model and storage medium
CN110674280B (en) Answer selection algorithm based on enhanced question importance representation
CN113076123A (en) Adaptive template updating system and method for target tracking
Wang et al. Few-shot short utterance speaker verification using meta-learning
CN112184846A (en) Image generation method and device, computer equipment and readable storage medium
CN117113063B (en) Encoding and decoding system for nanopore signals
CN117217499B (en) Campus electric scooter dispatching optimization method based on multi-source data driving

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant