CN117558264A - Dialect voice recognition training method and system based on self-knowledge distillation


Info

Publication number: CN117558264A
Authority: CN (China)
Prior art keywords: self-distillation, dialect, model, posterior probability
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202410044546.0A
Other languages: Chinese (zh)
Inventors: 赵文博, 吕召彪, 杜量, 许程冲, 肖清
Current Assignee (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list): China Unicom Guangdong Industrial Internet Co Ltd
Original Assignee: China Unicom Guangdong Industrial Internet Co Ltd
Priority date (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed): 2024-01-12
Filing date: 2024-01-12
Publication date: 2024-02-13
Application filed by China Unicom Guangdong Industrial Internet Co Ltd
Priority to CN202410044546.0A
Publication of CN117558264A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/005: Language recognition
    • G10L 15/02: Feature extraction for speech recognition; selection of recognition unit
    • G10L 15/06: Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks


Abstract

The invention relates to the field of speech recognition, and in particular to a dialect speech recognition training method and system based on self-knowledge distillation, comprising the following steps. S1: obtain the dialect speech signal I. S2: extract the MFCC features of I, denoted X. S3: input X into a Transformer model for dialect speech recognition training. Step S3 further includes obtaining the intermediate-layer representation R_M of the Transformer model to perform posterior-probability self-distillation and representation self-distillation. By performing posterior-probability-level and representation-level self-distillation during training, the degree of overfitting in model training is reduced, and the accuracy and robustness of low-resource dialect speech recognition are improved.

Description

Dialect voice recognition training method and system based on self-knowledge distillation
Technical Field
The invention relates to the field of voice recognition, in particular to a dialect voice recognition training method and system based on self-knowledge distillation.
Background
With the continuing development of speech recognition technology, it plays an increasingly important role in fields such as big data analysis and human-computer interaction, providing an important interface for intelligent and automated services and bringing great convenience to daily life. A well-performing speech recognition module usually needs a large amount of speech-text data for training in order to meet high-accuracy and high-robustness requirements. This can be satisfied in Mandarin speech recognition applications, because Mandarin data is relatively easy to obtain. For low-resource languages spoken by minority communities, such as Hakka, however, collecting a large amount of speech-text data is very difficult.
The traditional end-to-end speech recognition approach is based on deep learning and needs a large amount of training data to achieve good performance and robustness, because deep models require enough data to learn abstract features of the speech signal and build an effective recognizer. In low-resource scenarios, however, the available data is usually limited, which leads to poor performance, poor robustness, and overfitting. Traditional end-to-end schemes therefore need further improvement and optimization to overcome the problems caused by insufficient data.
To solve the above problems, the invention adopts a self-distillation training method for dialect speech recognition: acoustic features of the dialect are extracted, and posterior-probability-level and representation-level self-distillation are performed during training, reducing the overfitting of model training and improving the accuracy and robustness of dialect speech recognition.
Disclosure of Invention
The invention aims to overcome at least one deficiency of the prior art by providing a dialect speech recognition training method and system based on self-knowledge distillation, addressing the poor performance, poor robustness, and overfitting of traditional end-to-end schemes caused by the small amount of dialect data, and thereby improving the accuracy and robustness of low-resource dialect speech recognition.
The technical scheme adopted by the invention is a dialect speech recognition training method based on self-knowledge distillation, comprising the following steps:
S1: obtain the dialect speech signal I;
S2: extract the MFCC features of I, denoted X;
S3: input X into a Transformer model for dialect speech recognition training;
wherein step S3 further includes obtaining the intermediate-layer representation R_M of the Transformer model to perform posterior-probability self-distillation and representation self-distillation.
Performing posterior-probability-level and representation-level self-distillation during training reduces the overfitting of model training and improves the accuracy and robustness of low-resource dialect speech recognition.
Preferably, in step S3, posterior-probability self-distillation specifically comprises passing the intermediate-layer representation R_M of the Transformer model through a linear transformation layer to obtain the intermediate-layer posterior probability P_M, and then computing the posterior-probability self-distillation loss between P_M and the final output posterior probability P of the Transformer model with a standard MSE loss function.
Posterior-probability-level self-distillation pushes the intermediate-layer output toward the posterior distribution of the model's final output, so that the intermediate layers learn deeper knowledge and the model's fitting ability is strengthened when the amount of dialect data is small.
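As an illustration, here is a minimal PyTorch sketch of posterior-probability self-distillation as described above; the tensor shapes, the dimensions of the linear transformation layer, and the choice to detach the final posterior as the teacher signal are assumptions for the example, not details fixed by the patent.

```python
import torch
import torch.nn.functional as F

# Assumed shapes: T time steps, B batch size, d model width, V output units.
T, B, d, V = 100, 8, 256, 4000

R_M = torch.randn(T, B, d, requires_grad=True)   # intermediate-layer representation
P = torch.softmax(torch.randn(T, B, V), dim=-1)  # final output posterior (placeholder)

proj = torch.nn.Linear(d, V)                     # linear transformation layer
P_M = torch.softmax(proj(R_M), dim=-1)           # intermediate-layer posterior

# L_P = MSE(P, P_M); the final posterior acts as the teacher here, so it is
# detached and gradients flow only into the intermediate layer and projection.
L_P = F.mse_loss(P_M, P.detach())
```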
Preferably, in step S3, representation self-distillation specifically comprises obtaining the intermediate-layer representation R_M of the Transformer model and then computing the representation self-distillation loss between R_M and the last-layer representation R output by the Transformer model with a standard MSE loss function.
Through representation self-distillation, the intermediate layers of the model learn the higher-order features of the model's last layer, strengthening their understanding and representation of acoustic information and further improving the overall performance of the model, especially when the amount of low-resource dialect data is small.
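In the same spirit, a minimal sketch of representation self-distillation under the same assumed shapes; treating the last-layer representation R as a detached teacher is again an assumption of the example.

```python
import torch
import torch.nn.functional as F

T, B, d = 100, 8, 256                           # assumed shapes, as above

R_M = torch.randn(T, B, d, requires_grad=True)  # intermediate-layer representation
R = torch.randn(T, B, d)                        # last-layer representation (teacher)

# L_R = MSE(R, R_M); only the intermediate layer receives gradient here.
L_R = F.mse_loss(R_M, R.detach())
```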
Further preferably,
the posterior-probability self-distillation loss is computed as L_P = MSE(P, P_M);
the representation self-distillation loss is computed as L_R = MSE(R, R_M);
and the loss function L of the final model is computed from L_P and L_R as L = L_CTC + L_P + L_R, where L_CTC denotes the CTC loss of the model's final output posterior probability P, computed from P and the ground-truth labeling result y.
The loss function L of the final model is computed by the formula above, and backpropagation is performed according to the gradient of L to update the parameters of the speech recognition model, so that the model output approaches or reaches an optimum, the model converges, and its stability is improved.
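A sketch of the combined objective follows; the CTC wiring (blank index, sequence lengths) and the stand-in values for L_P and L_R are assumptions for the example.

```python
import torch
import torch.nn.functional as F

T, B, V, U = 100, 8, 4000, 20   # assumed: time steps, batch, vocabulary, label length

log_probs = F.log_softmax(torch.randn(T, B, V, requires_grad=True), dim=-1)
y = torch.randint(1, V, (B, U))                       # true labeling result y
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), U, dtype=torch.long)

# L_CTC: CTC loss between the model's final output posterior P and labels y.
L_CTC = torch.nn.CTCLoss(blank=0)(log_probs, y, input_lengths, target_lengths)

# Stand-ins for the two self-distillation losses from the sketches above.
L_P = torch.tensor(0.01)
L_R = torch.tensor(0.02)

L = L_CTC + L_P + L_R   # final loss L = L_CTC + L_P + L_R
L.backward()            # backpropagate to update the recognition model
```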
Preferably, step S2 specifically comprises the following steps:
S21: perform front-end signal processing on the dialect speech signal I;
S22: apply a fast Fourier transform (FFT) to the speech signal obtained in step S21, then apply Mel filtering;
S23: take the logarithm of the signal obtained in step S22;
S24: apply a discrete cosine transform to the log signal to decorrelate the filter-bank coefficients and generate a compressed representation of the filter bank;
S25: extract the MFCC features and first-order difference parameters from the speech signal processed by the above steps.
In this scheme, the MFCC features of the dialect speech signal serve as the input of the acoustic model. MFCC is short for Mel-frequency cepstral coefficients. Unlike ordinary cepstral analysis on the actual frequency axis, MFCC is oriented to the auditory characteristics of the human ear: the pitch we hear is not proportional to the physical frequency, so the Mel frequency scale matches human hearing better. The Mel scale is logarithmically related to the actual frequency; the standard relation is Mel(f) = 2595 × log10(1 + f/700), with f in Hz.
through the operation, the MFCC (multi-frequency component carrier) characteristics of the dialect voice signals can be effectively extracted, and the hearing characteristics which are more in line with the human ears are trained, so that the accuracy of finally trained dialect small language voice recognition is higher.
Further preferably, step S21 specifically comprises the following steps:
S211: pre-emphasize the dialect speech signal I with a high-pass filter;
S212: frame the pre-emphasized signal, taking 256 sampling points as one frame;
S213: multiply each frame by a Hamming window.
Pre-emphasis boosts the high-frequency part so that the spectrum of the signal becomes flatter, while windowing increases the continuity at the left and right ends of each frame and reduces spectral leakage, making the characteristics of the dialect speech signal more distinct.
Still further preferably, in step S212 there is an overlapping region of 128 sampling points between adjacent frames, which avoids excessive change between two neighboring frames and improves the stability of the low-resource dialect speech signal.
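A minimal sketch of steps S211-S213 with the 128-sample overlap; the pre-emphasis coefficient 0.97 is a common default assumed here, not a value from the patent.

```python
import numpy as np

def front_end(signal, alpha=0.97):
    """S211-S213: pre-emphasis, framing (256 samples per frame, 128-sample
    overlap between adjacent frames), and Hamming windowing."""
    # S211: first-order high-pass filter y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # S212: 256-sample frames with a hop of 128 (adjacent frames share 128 points)
    frame_len, hop = 256, 128
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx]

    # S213: multiply each frame by a Hamming window
    return frames * np.hamming(frame_len)

# Example: one second of 16 kHz audio -> frames ready for the MFCC sketch above.
frames = front_end(np.random.randn(16000))
```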
In a second aspect, this scheme also provides a dialect speech recognition training system based on self-knowledge distillation, comprising:
a signal acquisition module, which obtains the dialect speech signal I;
a feature extraction module, which extracts the MFCC features of I, denoted X;
a feature training module, which inputs X into a Transformer model for dialect speech recognition training;
wherein the feature training module also obtains the intermediate-layer representation R_M of the Transformer model to perform posterior-probability self-distillation and representation self-distillation.
Performing posterior-probability-level and representation-level self-distillation during training reduces the overfitting of model training and improves the accuracy and robustness of low-resource dialect speech recognition.
Preferably, in this system posterior-probability self-distillation specifically comprises passing the intermediate-layer representation R_M of the Transformer model through a linear transformation layer to obtain the intermediate-layer posterior probability P_M, and then computing the posterior-probability self-distillation loss between P_M and the final output posterior probability P of the Transformer model with a standard MSE loss function;
representation self-distillation specifically comprises obtaining the intermediate-layer representation R_M of the Transformer model and then computing the representation self-distillation loss between R_M and the last-layer representation R output by the Transformer model with the standard MSE loss function;
the posterior-probability self-distillation loss is computed as L_P = MSE(P, P_M);
the representation self-distillation loss is computed as L_R = MSE(R, R_M);
and the loss function L of the final model is computed from L_P and L_R as L = L_CTC + L_P + L_R, where L_CTC denotes the CTC loss of the model's final output posterior probability P, computed from P and the ground-truth labeling result y.
Posterior-probability-level self-distillation pushes the intermediate-layer output toward the posterior distribution of the model's final output, so that the intermediate layers learn deeper knowledge and the model fits better when dialect data is scarce. Through representation self-distillation, the intermediate layers learn the higher-order features of the model's last layer, strengthening their understanding and representation of acoustic information. Meanwhile, the loss function L of the final model is computed by the formula above, and backpropagation according to the gradient of L updates the parameters of the speech recognition model, so that the model output approaches or reaches an optimum, the model converges, its stability is improved, and the overall performance of the model is further improved, especially when the amount of dialect data is small.
Preferably, the feature extraction module further comprises:
a front-end signal processing unit, which performs front-end signal processing on the dialect speech signal I;
a Fourier transform and filtering unit, which applies a fast Fourier transform (FFT) to the speech signal obtained in step S21 and then applies Mel filtering;
a logarithm processing unit, which takes the logarithm of the signal obtained in step S22;
a discrete cosine transform processing unit, which applies a discrete cosine transform to the log signal to decorrelate the filter-bank coefficients and generate a compressed representation of the filter bank;
a feature extraction unit, which extracts the MFCC features and first-order difference parameters from the speech signal processed by the above steps;
wherein the front-end signal processing unit specifically comprises:
a pre-emphasis component, which pre-emphasizes the dialect speech signal I with a high-pass filter;
a framing component, which frames the pre-emphasized signal, taking 256 sampling points as one frame;
a windowing component, which multiplies each frame by a Hamming window.
Pre-emphasis boosts the high-frequency part so that the spectrum of the signal becomes flatter; an overlapping region of 128 sampling points between adjacent frames avoids excessive change between two neighboring frames and improves the stability of the low-resource dialect speech signal; and windowing increases the continuity at both ends of each frame and reduces spectral leakage, making the characteristics of the dialect speech signal more distinct. Through these operations, the MFCC features of the dialect speech signal can be extracted effectively, and training on features that better match human hearing makes the final low-resource dialect speech recognition more accurate.
Compared with the prior art, the beneficial effects of the invention are as follows:
by adopting self-distillation at both the posterior-probability and representation levels, the invention introduces information from deep model layers into the shallow layers. The accuracy and robustness of dialect speech recognition can be improved using only the existing training data, without additional computing or data resources, effectively improving the training pipeline for dialect speech recognition and markedly improving the accuracy and robustness of low-resource dialect speech recognition.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Fig. 2 is a schematic structural view of the present invention.
FIG. 3 is a schematic diagram of posterior-probability self-distillation according to the present invention.
FIG. 4 is a schematic diagram of representation self-distillation according to the present invention.
Fig. 5 is a schematic view of the overall feature acquisition of the present invention.
Fig. 6 is a schematic diagram of a feature extraction process according to the present invention.
Fig. 7 is a schematic diagram of front-end signal processing according to the present invention.
FIG. 8 is a schematic diagram of a system module according to the present invention.
Fig. 9 is a schematic diagram of a feature extraction module unit according to the present invention.
Fig. 10 is a schematic structural diagram of the electronic device of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the invention. For better illustration of the following embodiments, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the actual product dimensions; it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
Example 1
As shown in fig. 1 and fig. 2, the technical solution adopted in this embodiment is a dialect speech recognition training method based on self-knowledge distillation, comprising the following steps:
S1: obtain the dialect speech signal I;
S2: extract the MFCC features of I, denoted X;
S3: input X into a Transformer model for dialect speech recognition training;
wherein step S3 further includes obtaining the intermediate-layer representation R_M of the Transformer model to perform posterior-probability self-distillation and representation self-distillation.
In step S1, the dialect speech signal generally comes from three sources: general training-set data published in the industry, data recorded by specific users, and data synthesized with TTS technology. Feature extraction and training are then performed on the obtained dialect speech signal, with posterior-probability-level and representation-level self-distillation applied during training, reducing the overfitting of model training and improving the accuracy and robustness of low-resource dialect speech recognition.
Preferably, as shown in fig. 3, which is a schematic diagram of posterior-probability self-distillation in this embodiment, in step S3 posterior-probability self-distillation specifically comprises passing the intermediate-layer representation R_M of the Transformer model through a linear transformation layer to obtain the intermediate-layer posterior probability P_M, and then computing the posterior-probability self-distillation loss between P_M and the final output posterior probability P of the Transformer model with a standard MSE loss function.
Posterior-probability-level self-distillation pushes the intermediate-layer output toward the posterior distribution of the model's final output, so that the intermediate layers learn deeper knowledge and the model's fitting ability is strengthened when the amount of dialect data is small.
Preferably, as shown in fig. 4, which is a schematic diagram of representation self-distillation in this embodiment, in step S3 representation self-distillation specifically comprises obtaining the intermediate-layer representation R_M of the Transformer model and then computing the representation self-distillation loss between R_M and the last-layer representation R output by the Transformer model with a standard MSE loss function.
Through representation self-distillation, the intermediate layers of the model learn the higher-order features of the model's last layer, strengthening their understanding and representation of acoustic information and further improving the overall performance of the model, especially when the amount of low-resource dialect data is small.
Further preferably, in this embodiment,
the posterior-probability self-distillation loss is computed as L_P = MSE(P, P_M);
the representation self-distillation loss is computed as L_R = MSE(R, R_M);
and the loss function of the final model is computed from L_P and L_R as L = L_CTC + L_P + L_R, where L_CTC denotes the CTC loss of the model's final output posterior probability P, computed from P and the ground-truth labeling result y.
The loss function in this embodiment, also called the cost function, measures the difference between the model's predictions and the true values of the data, and is an important indicator of how well the trained model generalizes. The loss function is the objective of neural network optimization: training or optimizing a neural network is the process of minimizing the loss function, and the smaller the loss value, the closer the predictions are to the actual results.
In addition, CTC loss denotes a loss function commonly used in speech recognition, mainly to handle input and output sequences that differ in length and cannot be aligned. The loss function L of the final model is computed by the formula above, and backpropagation is performed according to the gradient of L to update the parameters of the speech recognition model, so that the model output approaches or reaches an optimum, the model converges, and its stability is improved.
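To make the update step concrete, a minimal sketch of one training iteration; `DialectASRModel` is a hypothetical stand-in for the patent's Transformer recognizer, and the optimizer, learning rate, and shapes are assumptions of the example.

```python
import torch

class DialectASRModel(torch.nn.Module):
    """Hypothetical placeholder for the Transformer acoustic model."""
    def __init__(self, d_in=26, d=256, vocab=4000):
        super().__init__()
        self.enc = torch.nn.Linear(d_in, d)   # stands in for the Transformer stack
        self.out = torch.nn.Linear(d, vocab)

    def forward(self, X):                     # X: (T, B, d_in) MFCC features
        return torch.log_softmax(self.out(self.enc(X)), dim=-1)

model = DialectASRModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

X = torch.randn(100, 8, 26)                  # dummy batch: 13 MFCCs + 13 deltas
y = torch.randint(1, 4000, (8, 20))          # dummy label sequences
log_probs = model(X)
L_CTC = torch.nn.CTCLoss(blank=0)(
    log_probs, y,
    torch.full((8,), 100, dtype=torch.long),
    torch.full((8,), 20, dtype=torch.long))

L = L_CTC          # plus L_P and L_R in the full method
optimizer.zero_grad()
L.backward()       # backpropagate the gradient of L
optimizer.step()   # update the speech recognition model's parameters
```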
Preferably, in this embodiment the MFCC features of the dialect speech signal serve as the input of the acoustic model. The overall flow for obtaining the MFCC features is shown in fig. 5; through this flow, MFCCs that better match the auditory characteristics of the human ear can be extracted for training, making the trained speech recognition model more accurate.
Further, in this embodiment, as shown in fig. 6, the feature extraction process in step S2 specifically comprises the following steps:
S21: perform front-end signal processing on the dialect speech signal I;
S22: apply a fast Fourier transform (FFT) to the speech signal obtained in step S21, then apply Mel filtering;
S23: take the logarithm of the signal obtained in step S22;
S24: apply a discrete cosine transform (DCT) to the log signal to decorrelate the filter-bank coefficients and generate a compressed representation of the filter bank;
S25: extract the MFCC features and first-order difference parameters from the speech signal processed by the above steps.
In this scheme, the MFCC features of the dialect speech signal serve as the input of the acoustic model. MFCC is short for Mel-frequency cepstral coefficients. Unlike ordinary cepstral analysis on the actual frequency axis, MFCC is oriented to the auditory characteristics of the human ear: the pitch we hear is not proportional to the physical frequency, so the Mel frequency scale matches human hearing better. The Mel scale is logarithmically related to the actual frequency; the standard relation is Mel(f) = 2595 × log10(1 + f/700), with f in Hz.
through the operation, the MFCC (multi-frequency component carrier) characteristics of the dialect voice signals can be effectively extracted, and the hearing characteristics which are more in line with the human ears are trained, so that the accuracy of finally trained dialect small language voice recognition is higher.
Further preferably, in this embodiment the front-end signal processing flow is as shown in fig. 7, and step S21 specifically comprises the following steps:
S211: pre-emphasize the dialect speech signal I with a high-pass filter;
S212: frame the pre-emphasized signal, taking 256 sampling points as one frame;
S213: multiply each frame by a Hamming window.
Pre-emphasis boosts the high-frequency part so that the spectrum of the signal becomes flatter, while windowing increases the continuity at the left and right ends of each frame and reduces spectral leakage, making the characteristics of the dialect speech signal more distinct.
Still further preferably, in step S212 there is an overlapping region of 128 sampling points between adjacent frames, which avoids excessive change between two neighboring frames and improves the stability of the low-resource dialect speech signal.
In a second aspect, as shown in fig. 8, which is a schematic diagram of the system modules of this scheme, a dialect speech recognition training system based on self-knowledge distillation is provided, comprising:
a signal acquisition module 01, which obtains the dialect speech signal I;
a feature extraction module 02, which extracts the MFCC features of I, denoted X;
a feature training module 03, which inputs X into a Transformer model for dialect speech recognition training;
wherein the feature training module also obtains the intermediate-layer representation R_M of the Transformer model to perform posterior-probability self-distillation and representation self-distillation.
Performing posterior-probability-level and representation-level self-distillation during training reduces the overfitting of model training and improves the accuracy and robustness of low-resource dialect speech recognition.
Preferably, in this embodiment posterior-probability self-distillation in the system specifically comprises passing the intermediate-layer representation R_M of the Transformer model through a linear transformation layer to obtain the intermediate-layer posterior probability P_M, and then computing the posterior-probability self-distillation loss between P_M and the final output posterior probability P of the Transformer model with a standard MSE loss function;
representation self-distillation specifically comprises obtaining the intermediate-layer representation R_M of the Transformer model and then computing the representation self-distillation loss between R_M and the last-layer representation R output by the Transformer model with the standard MSE loss function;
the posterior-probability self-distillation loss is computed as L_P = MSE(P, P_M);
the representation self-distillation loss is computed as L_R = MSE(R, R_M);
and the loss function of the final model is computed from L_P and L_R as L = L_CTC + L_P + L_R, where L_CTC denotes the CTC loss of the model's final output posterior probability P, computed from P and the ground-truth labeling result y.
The loss function described in this embodiment, also called the cost function, measures the difference between the model's predictions and the true values of the data, and is an important indicator of how well the trained model generalizes. The loss function is the objective of neural network optimization: training or optimizing a neural network is the process of minimizing the loss function, and the smaller the loss value, the closer the predictions are to the actual results.
In addition, CTC loss denotes a loss function commonly used in speech recognition, mainly to handle input and output sequences that differ in length and cannot be aligned. The loss function L of the final model is computed by the formula above, and backpropagation is performed according to the gradient of L to update the parameters of the speech recognition model, so that the model output approaches or reaches an optimum, the model converges, and its stability is improved.
Posterior-probability-level self-distillation pushes the intermediate-layer output toward the posterior distribution of the model's final output, so that the intermediate layers learn deeper knowledge and the model fits better when dialect data is scarce; meanwhile, through representation self-distillation, the intermediate layers learn the higher-order features of the model's last layer, strengthening their understanding and representation of acoustic information and further improving the overall performance of the model, especially when the amount of dialect data is small.
Preferably, as shown in fig. 9, the feature extraction module further comprises:
a front-end signal processing unit, which performs front-end signal processing on the dialect speech signal I;
a Fourier transform and filtering unit, which applies a fast Fourier transform (FFT) to the speech signal obtained in step S21 and then applies Mel filtering;
a logarithm processing unit, which takes the logarithm of the signal obtained in step S22;
a discrete cosine transform processing unit, which applies a discrete cosine transform to the log signal to decorrelate the filter-bank coefficients and generate a compressed representation of the filter bank;
a feature extraction unit, which extracts the MFCC features and first-order difference parameters from the speech signal processed by the above steps;
wherein the front-end signal processing unit specifically comprises:
a pre-emphasis component, which pre-emphasizes the dialect speech signal I with a high-pass filter;
a framing component, which frames the pre-emphasized signal, taking 256 sampling points as one frame;
a windowing component, which multiplies each frame by a Hamming window.
Pre-emphasis boosts the high-frequency part so that the spectrum of the signal becomes flatter; an overlapping region of 128 sampling points between adjacent frames avoids excessive change between two neighboring frames and improves the stability of the low-resource dialect speech signal; and windowing increases the continuity at both ends of each frame and reduces spectral leakage, making the characteristics of the dialect speech signal more distinct. Finally, through these operations, the MFCC features of the dialect speech signal can be extracted effectively, and training on features that better match human hearing makes the final low-resource dialect speech recognition more accurate.
In a third aspect, fig. 10 is a schematic structural diagram of an electronic device provided in this embodiment. As shown in fig. 10, the electronic device may include a processor 10, a communications interface 20, a memory 30, and a communication bus 40, where the processor 10, the communications interface 20, and the memory 30 communicate with each other through the communication bus 40. The processor 10 may invoke logic instructions in the memory 30 to execute a dialect speech recognition training method based on self-knowledge distillation, the method comprising: obtaining the dialect speech signal I; extracting the MFCC features of I, denoted X; inputting X into a Transformer model for dialect speech recognition training; and obtaining the intermediate-layer representation R_M of the Transformer model to perform posterior-probability self-distillation and representation self-distillation.
Further, the logic instructions in the memory 30 may be implemented as software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on this understanding, the technical solution, in essence or in the part contributing to the prior art, may be embodied as a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of this scheme. The aforementioned storage medium includes media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
In a fourth aspect, the present invention also provides a computer program product comprising a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, it can perform the dialect speech recognition training method based on self-knowledge distillation provided by the methods above, the method comprising: obtaining the dialect speech signal I; extracting the MFCC features of I, denoted X; inputting X into a Transformer model for dialect speech recognition training; and obtaining the intermediate-layer representation R_M of the Transformer model to perform posterior-probability self-distillation and representation self-distillation.
In a fifth aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program that, when executed by a processor, performs the dialect speech recognition training method based on self-knowledge distillation provided by the methods above, the method comprising: obtaining the dialect speech signal I; extracting the MFCC features of I, denoted X; inputting X into a Transformer model for dialect speech recognition training; and obtaining the intermediate-layer representation R_M of the Transformer model to perform posterior-probability self-distillation and representation self-distillation.
The apparatus embodiments described above are merely illustrative: components described as separate may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, which those of ordinary skill in the art can understand and implement without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by software plus a necessary general hardware platform, or of course by hardware. Based on this understanding, the foregoing technical solution, in essence or in the part contributing to the prior art, may be embodied as a software product stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, comprising several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or in parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
It should be understood that the foregoing examples of the present invention are merely illustrative of the present invention and are not intended to limit the present invention to the specific embodiments thereof. Any modification, equivalent replacement, improvement, etc. that comes within the spirit and principle of the claims of the present invention should be included in the protection scope of the claims of the present invention.

Claims (10)

1. A dialect speech recognition training method based on self-knowledge distillation, comprising the following steps:
S1: obtain the dialect speech signal I;
S2: extract the MFCC features of I, denoted X;
S3: input X into a Transformer model for dialect speech recognition training;
characterized in that step S3 further comprises obtaining the intermediate-layer representation R_M of the Transformer model to perform posterior-probability self-distillation and representation self-distillation.
2. The dialect speech recognition training method based on self-knowledge distillation according to claim 1, wherein in step S3 posterior-probability self-distillation specifically comprises passing the intermediate-layer representation R_M of the Transformer model through a linear transformation layer to obtain the intermediate-layer posterior probability P_M, and then computing the posterior-probability self-distillation loss between P_M and the final output posterior probability P of the Transformer model with a standard MSE loss function.
3. The dialect speech recognition training method based on self-knowledge distillation according to claim 2, wherein in step S3 representation self-distillation specifically comprises obtaining the intermediate-layer representation R_M of the Transformer model and then computing the representation self-distillation loss between R_M and the last-layer representation R output by the Transformer model with a standard MSE loss function.
4. The dialect speech recognition training method based on self-knowledge distillation according to claim 3, wherein
the posterior-probability self-distillation loss is computed as L_P = MSE(P, P_M);
the representation self-distillation loss is computed as L_R = MSE(R, R_M);
and the loss function L of the final model is computed from L_P and L_R as L = L_CTC + L_P + L_R, where L_CTC denotes the CTC loss of the model's final output posterior probability P, computed from P and the ground-truth labeling result y.
5. The dialect speech recognition training method based on self-knowledge distillation according to claim 1, wherein step S2 specifically comprises the following steps:
S21: perform front-end signal processing on the dialect speech signal I;
S22: apply a fast Fourier transform (FFT) to the speech signal obtained in step S21, then apply Mel filtering;
S23: take the logarithm of the signal obtained in step S22;
S24: apply a discrete cosine transform to the log signal to decorrelate the filter-bank coefficients and generate a compressed representation of the filter bank;
S25: extract the MFCC features and first-order difference parameters from the speech signal processed by the above steps.
6. The dialect speech recognition training method based on self-knowledge distillation according to claim 5, wherein step S21 further comprises the following steps:
S211: pre-emphasize the dialect speech signal I with a high-pass filter;
S212: frame the pre-emphasized signal, taking 256 sampling points as one frame;
S213: multiply each frame by a Hamming window.
7. The dialect speech recognition training method based on self-knowledge distillation according to claim 6, wherein in step S212 there is an overlapping region of 128 sampling points between adjacent frames.
8. A dialect speech recognition training system based on self-knowledge distillation, comprising:
a signal acquisition module, which obtains the dialect speech signal I;
a feature extraction module, which extracts the MFCC features of I, denoted X;
a feature training module, which inputs X into a Transformer model for dialect speech recognition training;
characterized in that the feature training module also obtains the intermediate-layer representation R_M of the Transformer model to perform posterior-probability self-distillation and representation self-distillation.
9. The dialect speech recognition training system based on self-knowledge distillation according to claim 8, wherein
posterior-probability self-distillation specifically comprises passing the intermediate-layer representation R_M of the Transformer model through a linear transformation layer to obtain the intermediate-layer posterior probability P_M, and then computing the posterior-probability self-distillation loss between P_M and the final output posterior probability P of the Transformer model with a standard MSE loss function;
representation self-distillation specifically comprises obtaining the intermediate-layer representation R_M of the Transformer model and then computing the representation self-distillation loss between R_M and the last-layer representation R output by the Transformer model with the standard MSE loss function;
the posterior-probability self-distillation loss is computed as L_P = MSE(P, P_M);
the representation self-distillation loss is computed as L_R = MSE(R, R_M);
and the loss function L of the final model is computed from L_P and L_R as L = L_CTC + L_P + L_R, where L_CTC denotes the CTC loss of the model's final output posterior probability P, computed from P and the ground-truth labeling result y.
10. The dialect speech recognition training system based on self-knowledge distillation according to claim 9, wherein the feature extraction module further comprises:
a front-end signal processing unit, which performs front-end signal processing on the dialect speech signal I;
a Fourier transform and filtering unit, which applies a fast Fourier transform (FFT) to the speech signal obtained in step S21 and then applies Mel filtering;
a logarithm processing unit, which takes the logarithm of the signal obtained in step S22;
a discrete cosine transform processing unit, which applies a discrete cosine transform to the log signal to decorrelate the filter-bank coefficients and generate a compressed representation of the filter bank;
a feature extraction unit, which extracts the MFCC features and first-order difference parameters from the speech signal processed by the above steps;
wherein the front-end signal processing unit specifically comprises:
a pre-emphasis component, which pre-emphasizes the dialect speech signal I with a high-pass filter;
a framing component, which frames the pre-emphasized signal, taking 256 sampling points as one frame;
a windowing component, which multiplies each frame by a Hamming window.
Application CN202410044546.0A, filed 2024-01-12: Dialect voice recognition training method and system based on self-knowledge distillation, published as CN117558264A (status: Pending)

Priority Applications (1)

• CN202410044546.0A (priority date 2024-01-12, filing date 2024-01-12): Dialect voice recognition training method and system based on self-knowledge distillation


Publications (1)

• CN117558264A, published 2024-02-13

Family ID: 89811535


Country Status (1): CN, CN117558264A


Patent Citations (8)

* Cited by examiner, † Cited by third party

• CN112509555A * (priority 2020-11-25, published 2021-03-16): Dialect voice recognition method, dialect voice recognition device, dialect voice recognition medium and electronic equipment
• WO2021213161A1 * (priority 2020-11-25, published 2021-10-28): Dialect speech recognition method, apparatus, medium, and electronic device
• CN113516968A * (priority 2021-06-07, published 2021-10-19): End-to-end long-term speech recognition method
• CN114822518A * (priority 2022-04-29, published 2022-07-29): Knowledge distillation method, electronic device, and storage medium
• CN115064155A * (priority 2022-06-09, published 2022-09-16): End-to-end voice recognition incremental learning method and system based on knowledge distillation
• CN115222059A * (priority 2022-06-24, published 2022-10-21): Self-distillation model compression algorithm based on high-level information supervision
• CN116312628A * (priority 2023-02-09, published 2023-06-23): False audio detection method and system based on self knowledge distillation
• CN116844529A * (priority 2023-05-25, published 2023-10-03): Speech recognition method, device and computer storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party

• 张治民 (Zhang Zhimin), "Scene Text Recognition Based on Attention Mechanism and Knowledge Distillation" (《基于注意力机制和知识蒸馏的场景文本识别》), China Master's Theses Full-text Database, Information Science and Technology series, 15 January 2022, pages 138-1480 *


Legal Events

• PB01: Publication
• SE01: Entry into force of request for substantive examination