CN117558264A - Dialect voice recognition training method and system based on self-knowledge distillation


Info

Publication number: CN117558264A
Authority: CN (China)
Prior art keywords: self-distillation, dialect, model, posterior probability
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202410044546.0A
Other languages: Chinese (zh)
Inventors: 赵文博, 吕召彪, 杜量, 许程冲, 肖清
Current Assignee (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list): China Unicom Guangdong Industrial Internet Co Ltd
Original Assignee: China Unicom Guangdong Industrial Internet Co Ltd
Priority date (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed): 2024-01-12
Filing date: 2024-01-12
Publication date: 2024-02-13
Application filed by China Unicom Guangdong Industrial Internet Co Ltd
Priority to CN202410044546.0A
Publication of CN117558264A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/005: Language recognition
    • G10L 15/02: Feature extraction for speech recognition; selection of recognition unit
    • G10L 15/06: Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks


Abstract

The invention relates to the field of speech recognition, and in particular to a dialect speech recognition training method and system based on self-knowledge distillation, comprising the following steps. S1: obtain the dialect speech signal I. S2: extract the MFCC features of I, denoted X. S3: input X into a Transformer model for dialect speech recognition training. Step S3 further includes obtaining the intermediate-layer representation R_M of the Transformer model to perform posterior-probability self-distillation and representation self-distillation. By performing posterior-probability-level and representation-level self-distillation during training, the degree of overfitting in model training is reduced, and the accuracy and robustness of low-resource dialect speech recognition are improved.

Description

Dialect voice recognition training method and system based on self-knowledge distillation
Technical Field
The invention relates to the field of voice recognition, in particular to a dialect voice recognition training method and system based on self-knowledge distillation.
Background
With the continuing development of speech recognition technology, it plays an increasingly important role in fields such as big data analysis and human-computer interaction, providing an important interface for intelligent and automated services and bringing great convenience to daily life. A well-performing speech recognition module usually needs a large amount of speech-text data for training in order to meet high-accuracy and high-robustness requirements. This can be satisfied in Mandarin speech recognition applications, because Mandarin data is relatively easy to obtain. For low-resource languages spoken by minority communities, such as Hakka, however, collecting a large amount of speech-text data is very difficult.
The traditional end-to-end speech recognition approach is based on deep learning and needs a large amount of training data to achieve good performance and robustness, because deep models require enough data to learn abstract features of the speech signal and build an effective recognizer. In low-resource scenarios, however, the available data is usually limited, which leads to poor performance, poor robustness, and overfitting. Traditional end-to-end schemes therefore need further improvement and optimization to overcome the problems caused by insufficient data.
To solve the above problems, the invention adopts a self-distillation training method for dialect speech recognition: acoustic features of the dialect are extracted, and posterior-probability-level and representation-level self-distillation are performed during training, reducing the overfitting of model training and improving the accuracy and robustness of dialect speech recognition.
Disclosure of Invention
The invention aims to overcome at least one deficiency of the prior art by providing a dialect speech recognition training method and system based on self-knowledge distillation, addressing the poor performance, poor robustness, and overfitting of traditional end-to-end schemes caused by the small amount of dialect data, and thereby improving the accuracy and robustness of low-resource dialect speech recognition.
The technical scheme adopted by the invention is a dialect speech recognition training method based on self-knowledge distillation, comprising the following steps:
S1: obtain the dialect speech signal I;
S2: extract the MFCC features of I, denoted X;
S3: input X into a Transformer model for dialect speech recognition training;
wherein step S3 further includes obtaining the intermediate-layer representation R_M of the Transformer model to perform posterior-probability self-distillation and representation self-distillation.
Performing posterior-probability-level and representation-level self-distillation during training reduces the overfitting of model training and improves the accuracy and robustness of low-resource dialect speech recognition.
Preferably, in step S3, posterior-probability self-distillation specifically comprises passing the intermediate-layer representation R_M of the Transformer model through a linear transformation layer to obtain the intermediate-layer posterior probability P_M, and then computing the posterior-probability self-distillation loss between P_M and the final output posterior probability P of the Transformer model with a standard MSE loss function.
Posterior-probability-level self-distillation pushes the intermediate-layer output toward the posterior distribution of the model's final output, so that the intermediate layers learn deeper knowledge and the model's fitting ability is strengthened when the amount of dialect data is small.
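As an illustration, here is a minimal PyTorch sketch of posterior-probability self-distillation as described above; the tensor shapes, the dimensions of the linear transformation layer, and the choice to detach the final posterior as the teacher signal are assumptions for the example, not details fixed by the patent.

```python
import torch
import torch.nn.functional as F

# Assumed shapes: T time steps, B batch size, d model width, V output units.
T, B, d, V = 100, 8, 256, 4000

R_M = torch.randn(T, B, d, requires_grad=True)   # intermediate-layer representation
P = torch.softmax(torch.randn(T, B, V), dim=-1)  # final output posterior (placeholder)

proj = torch.nn.Linear(d, V)                     # linear transformation layer
P_M = torch.softmax(proj(R_M), dim=-1)           # intermediate-layer posterior

# L_P = MSE(P, P_M); the final posterior acts as the teacher here, so it is
# detached and gradients flow only into the intermediate layer and projection.
L_P = F.mse_loss(P_M, P.detach())
```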
Preferably, in step S3, representation self-distillation specifically comprises obtaining the intermediate-layer representation R_M of the Transformer model and then computing the representation self-distillation loss between R_M and the last-layer representation R output by the Transformer model with a standard MSE loss function.
Through representation self-distillation, the intermediate layers of the model learn the higher-order features of the model's last layer, strengthening their understanding and representation of acoustic information and further improving the overall performance of the model, especially when the amount of low-resource dialect data is small.
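In the same spirit, a minimal sketch of representation self-distillation under the same assumed shapes; treating the last-layer representation R as a detached teacher is again an assumption of the example.

```python
import torch
import torch.nn.functional as F

T, B, d = 100, 8, 256                           # assumed shapes, as above

R_M = torch.randn(T, B, d, requires_grad=True)  # intermediate-layer representation
R = torch.randn(T, B, d)                        # last-layer representation (teacher)

# L_R = MSE(R, R_M); only the intermediate layer receives gradient here.
L_R = F.mse_loss(R_M, R.detach())
```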
Further preferably,
the posterior-probability self-distillation loss is computed as L_P = MSE(P, P_M);
the representation self-distillation loss is computed as L_R = MSE(R, R_M);
and the loss function L of the final model is computed from L_P and L_R as L = L_CTC + L_P + L_R, where L_CTC denotes the CTC loss of the model's final output posterior probability P, computed from P and the ground-truth labeling result y.
The loss function L of the final model is computed by the formula above, and backpropagation is performed according to the gradient of L to update the parameters of the speech recognition model, so that the model output approaches or reaches an optimum, the model converges, and its stability is improved.
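A sketch of the combined objective follows; the CTC wiring (blank index, sequence lengths) and the stand-in values for L_P and L_R are assumptions for the example.

```python
import torch
import torch.nn.functional as F

T, B, V, U = 100, 8, 4000, 20   # assumed: time steps, batch, vocabulary, label length

log_probs = F.log_softmax(torch.randn(T, B, V, requires_grad=True), dim=-1)
y = torch.randint(1, V, (B, U))                       # true labeling result y
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), U, dtype=torch.long)

# L_CTC: CTC loss between the model's final output posterior P and labels y.
L_CTC = torch.nn.CTCLoss(blank=0)(log_probs, y, input_lengths, target_lengths)

# Stand-ins for the two self-distillation losses from the sketches above.
L_P = torch.tensor(0.01)
L_R = torch.tensor(0.02)

L = L_CTC + L_P + L_R   # final loss L = L_CTC + L_P + L_R
L.backward()            # backpropagate to update the recognition model
```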
Preferably, step S2 specifically comprises the following steps:
S21: perform front-end signal processing on the dialect speech signal I;
S22: apply a fast Fourier transform (FFT) to the speech signal obtained in step S21, then apply Mel filtering;
S23: take the logarithm of the signal obtained in step S22;
S24: apply a discrete cosine transform to the log signal to decorrelate the filter-bank coefficients and generate a compressed representation of the filter bank;
S25: extract the MFCC features and first-order difference parameters from the speech signal processed by the above steps.
In this scheme, the MFCC features of the dialect speech signal serve as the input of the acoustic model. MFCC is short for Mel-frequency cepstral coefficients. Unlike ordinary cepstral analysis on the actual frequency axis, MFCC is oriented to the auditory characteristics of the human ear: the pitch we hear is not proportional to the physical frequency, so the Mel frequency scale matches human hearing better. The Mel scale is logarithmically related to the actual frequency; the standard relation is Mel(f) = 2595 × log10(1 + f/700), with f in Hz.
through the operation, the MFCC (multi-frequency component carrier) characteristics of the dialect voice signals can be effectively extracted, and the hearing characteristics which are more in line with the human ears are trained, so that the accuracy of finally trained dialect small language voice recognition is higher.
Further preferably, step S21 specifically comprises the following steps:
S211: pre-emphasize the dialect speech signal I with a high-pass filter;
S212: frame the pre-emphasized signal, taking 256 sampling points as one frame;
S213: multiply each frame by a Hamming window.
Pre-emphasis boosts the high-frequency part so that the spectrum of the signal becomes flatter, while windowing increases the continuity at the left and right ends of each frame and reduces spectral leakage, making the characteristics of the dialect speech signal more distinct.
Still further preferably, in step S212 there is an overlapping region of 128 sampling points between adjacent frames, which avoids excessive change between two neighboring frames and improves the stability of the low-resource dialect speech signal.
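A minimal sketch of steps S211-S213 with the 128-sample overlap; the pre-emphasis coefficient 0.97 is a common default assumed here, not a value from the patent.

```python
import numpy as np

def front_end(signal, alpha=0.97):
    """S211-S213: pre-emphasis, framing (256 samples per frame, 128-sample
    overlap between adjacent frames), and Hamming windowing."""
    # S211: first-order high-pass filter y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # S212: 256-sample frames with a hop of 128 (adjacent frames share 128 points)
    frame_len, hop = 256, 128
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx]

    # S213: multiply each frame by a Hamming window
    return frames * np.hamming(frame_len)

# Example: one second of 16 kHz audio -> frames ready for the MFCC sketch above.
frames = front_end(np.random.randn(16000))
```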
In a second aspect, this scheme also provides a dialect speech recognition training system based on self-knowledge distillation, comprising:
a signal acquisition module, which obtains the dialect speech signal I;
a feature extraction module, which extracts the MFCC features of I, denoted X;
a feature training module, which inputs X into a Transformer model for dialect speech recognition training;
wherein the feature training module also obtains the intermediate-layer representation R_M of the Transformer model to perform posterior-probability self-distillation and representation self-distillation.
Performing posterior-probability-level and representation-level self-distillation during training reduces the overfitting of model training and improves the accuracy and robustness of low-resource dialect speech recognition.
Preferably, in this system posterior-probability self-distillation specifically comprises passing the intermediate-layer representation R_M of the Transformer model through a linear transformation layer to obtain the intermediate-layer posterior probability P_M, and then computing the posterior-probability self-distillation loss between P_M and the final output posterior probability P of the Transformer model with a standard MSE loss function;
representation self-distillation specifically comprises obtaining the intermediate-layer representation R_M of the Transformer model and then computing the representation self-distillation loss between R_M and the last-layer representation R output by the Transformer model with the standard MSE loss function;
the posterior-probability self-distillation loss is computed as L_P = MSE(P, P_M);
the representation self-distillation loss is computed as L_R = MSE(R, R_M);
and the loss function L of the final model is computed from L_P and L_R as L = L_CTC + L_P + L_R, where L_CTC denotes the CTC loss of the model's final output posterior probability P, computed from P and the ground-truth labeling result y.
Posterior-probability-level self-distillation pushes the intermediate-layer output toward the posterior distribution of the model's final output, so that the intermediate layers learn deeper knowledge and the model fits better when dialect data is scarce. Through representation self-distillation, the intermediate layers learn the higher-order features of the model's last layer, strengthening their understanding and representation of acoustic information. Meanwhile, the loss function L of the final model is computed by the formula above, and backpropagation according to the gradient of L updates the parameters of the speech recognition model, so that the model output approaches or reaches an optimum, the model converges, its stability is improved, and the overall performance of the model is further improved, especially when the amount of dialect data is small.
Preferably, the feature extraction module further comprises:
a front-end signal processing unit, which performs front-end signal processing on the dialect speech signal I;
a Fourier transform and filtering unit, which applies a fast Fourier transform (FFT) to the speech signal obtained in step S21 and then applies Mel filtering;
a logarithm processing unit, which takes the logarithm of the signal obtained in step S22;
a discrete cosine transform processing unit, which applies a discrete cosine transform to the log signal to decorrelate the filter-bank coefficients and generate a compressed representation of the filter bank;
a feature extraction unit, which extracts the MFCC features and first-order difference parameters from the speech signal processed by the above steps;
wherein the front-end signal processing unit specifically comprises:
a pre-emphasis component, which pre-emphasizes the dialect speech signal I with a high-pass filter;
a framing component, which frames the pre-emphasized signal, taking 256 sampling points as one frame;
a windowing component, which multiplies each frame by a Hamming window.
Pre-emphasis boosts the high-frequency part so that the spectrum of the signal becomes flatter; an overlapping region of 128 sampling points between adjacent frames avoids excessive change between two neighboring frames and improves the stability of the low-resource dialect speech signal; and windowing increases the continuity at both ends of each frame and reduces spectral leakage, making the characteristics of the dialect speech signal more distinct. Through these operations, the MFCC features of the dialect speech signal can be extracted effectively, and training on features that better match human hearing makes the final low-resource dialect speech recognition more accurate.
Compared with the prior art, the beneficial effects of the invention are as follows:
by adopting self-distillation at both the posterior-probability and representation levels, the invention introduces information from deep model layers into the shallow layers. The accuracy and robustness of dialect speech recognition can be improved using only the existing training data, without additional computing or data resources, effectively improving the training pipeline for dialect speech recognition and markedly improving the accuracy and robustness of low-resource dialect speech recognition.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Fig. 2 is a schematic structural view of the present invention.
FIG. 3 is a schematic diagram of posterior-probability self-distillation according to the present invention.
FIG. 4 is a schematic diagram of representation self-distillation according to the present invention.
Fig. 5 is a schematic view of the overall feature acquisition of the present invention.
Fig. 6 is a schematic diagram of a feature extraction process according to the present invention.
Fig. 7 is a schematic diagram of front-end signal processing according to the present invention.
FIG. 8 is a schematic diagram of a system module according to the present invention.
Fig. 9 is a schematic diagram of a feature extraction module unit according to the present invention.
Fig. 10 is a schematic structural diagram of the electronic device of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the invention. For better illustration of the following embodiments, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the actual product dimensions; it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
Example 1
As shown in fig. 1 and fig. 2, the technical solution adopted in this embodiment is a dialect speech recognition training method based on self-knowledge distillation, comprising the following steps:
S1: obtain the dialect speech signal I;
S2: extract the MFCC features of I, denoted X;
S3: input X into a Transformer model for dialect speech recognition training;
wherein step S3 further includes obtaining the intermediate-layer representation R_M of the Transformer model to perform posterior-probability self-distillation and representation self-distillation.
In step S1, the dialect speech signal generally comes from three sources: general training-set data published in the industry, data recorded by specific users, and data synthesized with TTS technology. Feature extraction and training are then performed on the obtained dialect speech signal, with posterior-probability-level and representation-level self-distillation applied during training, reducing the overfitting of model training and improving the accuracy and robustness of low-resource dialect speech recognition.
Preferably, as shown in fig. 3, which is a schematic diagram of posterior-probability self-distillation in this embodiment, in step S3 posterior-probability self-distillation specifically comprises passing the intermediate-layer representation R_M of the Transformer model through a linear transformation layer to obtain the intermediate-layer posterior probability P_M, and then computing the posterior-probability self-distillation loss between P_M and the final output posterior probability P of the Transformer model with a standard MSE loss function.
Posterior-probability-level self-distillation pushes the intermediate-layer output toward the posterior distribution of the model's final output, so that the intermediate layers learn deeper knowledge and the model's fitting ability is strengthened when the amount of dialect data is small.
Preferably, as shown in fig. 4, which is a schematic diagram of representation self-distillation in this embodiment, in step S3 representation self-distillation specifically comprises obtaining the intermediate-layer representation R_M of the Transformer model and then computing the representation self-distillation loss between R_M and the last-layer representation R output by the Transformer model with a standard MSE loss function.
Through representation self-distillation, the intermediate layers of the model learn the higher-order features of the model's last layer, strengthening their understanding and representation of acoustic information and further improving the overall performance of the model, especially when the amount of low-resource dialect data is small.
Further preferably, in this embodiment,
the posterior-probability self-distillation loss is computed as L_P = MSE(P, P_M);
the representation self-distillation loss is computed as L_R = MSE(R, R_M);
and the loss function of the final model is computed from L_P and L_R as L = L_CTC + L_P + L_R, where L_CTC denotes the CTC loss of the model's final output posterior probability P, computed from P and the ground-truth labeling result y.
The loss function in this embodiment, also called the cost function, measures the difference between the model's predictions and the true values of the data, and is an important indicator of how well the trained model generalizes. The loss function is the objective of neural network optimization: training or optimizing a neural network is the process of minimizing the loss function, and the smaller the loss value, the closer the predictions are to the actual results.
In addition, CTC loss denotes a loss function commonly used in speech recognition, mainly to handle input and output sequences that differ in length and cannot be aligned. The loss function L of the final model is computed by the formula above, and backpropagation is performed according to the gradient of L to update the parameters of the speech recognition model, so that the model output approaches or reaches an optimum, the model converges, and its stability is improved.
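To make the update step concrete, a minimal sketch of one training iteration; `DialectASRModel` is a hypothetical stand-in for the patent's Transformer recognizer, and the optimizer, learning rate, and shapes are assumptions of the example.

```python
import torch

class DialectASRModel(torch.nn.Module):
    """Hypothetical placeholder for the Transformer acoustic model."""
    def __init__(self, d_in=26, d=256, vocab=4000):
        super().__init__()
        self.enc = torch.nn.Linear(d_in, d)   # stands in for the Transformer stack
        self.out = torch.nn.Linear(d, vocab)

    def forward(self, X):                     # X: (T, B, d_in) MFCC features
        return torch.log_softmax(self.out(self.enc(X)), dim=-1)

model = DialectASRModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

X = torch.randn(100, 8, 26)                  # dummy batch: 13 MFCCs + 13 deltas
y = torch.randint(1, 4000, (8, 20))          # dummy label sequences
log_probs = model(X)
L_CTC = torch.nn.CTCLoss(blank=0)(
    log_probs, y,
    torch.full((8,), 100, dtype=torch.long),
    torch.full((8,), 20, dtype=torch.long))

L = L_CTC          # plus L_P and L_R in the full method
optimizer.zero_grad()
L.backward()       # backpropagate the gradient of L
optimizer.step()   # update the speech recognition model's parameters
```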
Preferably, in this embodiment the MFCC features of the dialect speech signal serve as the input of the acoustic model. The overall flow for obtaining the MFCC features is shown in fig. 5; through this flow, MFCCs that better match the auditory characteristics of the human ear can be extracted for training, making the trained speech recognition model more accurate.
Further, in this embodiment, as shown in fig. 6, the feature extraction process in step S2 specifically comprises the following steps:
S21: perform front-end signal processing on the dialect speech signal I;
S22: apply a fast Fourier transform (FFT) to the speech signal obtained in step S21, then apply Mel filtering;
S23: take the logarithm of the signal obtained in step S22;
S24: apply a discrete cosine transform (DCT) to the log signal to decorrelate the filter-bank coefficients and generate a compressed representation of the filter bank;
S25: extract the MFCC features and first-order difference parameters from the speech signal processed by the above steps.
In this scheme, the MFCC features of the dialect speech signal serve as the input of the acoustic model. MFCC is short for Mel-frequency cepstral coefficients. Unlike ordinary cepstral analysis on the actual frequency axis, MFCC is oriented to the auditory characteristics of the human ear: the pitch we hear is not proportional to the physical frequency, so the Mel frequency scale matches human hearing better. The Mel scale is logarithmically related to the actual frequency; the standard relation is Mel(f) = 2595 × log10(1 + f/700), with f in Hz.
through the operation, the MFCC (multi-frequency component carrier) characteristics of the dialect voice signals can be effectively extracted, and the hearing characteristics which are more in line with the human ears are trained, so that the accuracy of finally trained dialect small language voice recognition is higher.
Further preferably, in this embodiment the front-end signal processing flow is as shown in fig. 7, and step S21 specifically comprises the following steps:
S211: pre-emphasize the dialect speech signal I with a high-pass filter;
S212: frame the pre-emphasized signal, taking 256 sampling points as one frame;
S213: multiply each frame by a Hamming window.
Pre-emphasis boosts the high-frequency part so that the spectrum of the signal becomes flatter, while windowing increases the continuity at the left and right ends of each frame and reduces spectral leakage, making the characteristics of the dialect speech signal more distinct.
Still further preferably, in step S212 there is an overlapping region of 128 sampling points between adjacent frames, which avoids excessive change between two neighboring frames and improves the stability of the low-resource dialect speech signal.
In a second aspect, as shown in fig. 8, which is a schematic diagram of the system modules of this scheme, a dialect speech recognition training system based on self-knowledge distillation is provided, comprising:
a signal acquisition module 01, which obtains the dialect speech signal I;
a feature extraction module 02, which extracts the MFCC features of I, denoted X;
a feature training module 03, which inputs X into a Transformer model for dialect speech recognition training;
wherein the feature training module also obtains the intermediate-layer representation R_M of the Transformer model to perform posterior-probability self-distillation and representation self-distillation.
Performing posterior-probability-level and representation-level self-distillation during training reduces the overfitting of model training and improves the accuracy and robustness of low-resource dialect speech recognition.
Preferably, in this embodiment posterior-probability self-distillation in the system specifically comprises passing the intermediate-layer representation R_M of the Transformer model through a linear transformation layer to obtain the intermediate-layer posterior probability P_M, and then computing the posterior-probability self-distillation loss between P_M and the final output posterior probability P of the Transformer model with a standard MSE loss function;
representation self-distillation specifically comprises obtaining the intermediate-layer representation R_M of the Transformer model and then computing the representation self-distillation loss between R_M and the last-layer representation R output by the Transformer model with the standard MSE loss function;
the posterior-probability self-distillation loss is computed as L_P = MSE(P, P_M);
the representation self-distillation loss is computed as L_R = MSE(R, R_M);
and the loss function of the final model is computed from L_P and L_R as L = L_CTC + L_P + L_R, where L_CTC denotes the CTC loss of the model's final output posterior probability P, computed from P and the ground-truth labeling result y.
The loss function described in this embodiment, also called the cost function, measures the difference between the model's predictions and the true values of the data, and is an important indicator of how well the trained model generalizes. The loss function is the objective of neural network optimization: training or optimizing a neural network is the process of minimizing the loss function, and the smaller the loss value, the closer the predictions are to the actual results.
In addition, CTC loss denotes a loss function commonly used in speech recognition, mainly to handle input and output sequences that differ in length and cannot be aligned. The loss function L of the final model is computed by the formula above, and backpropagation is performed according to the gradient of L to update the parameters of the speech recognition model, so that the model output approaches or reaches an optimum, the model converges, and its stability is improved.
Posterior-probability-level self-distillation pushes the intermediate-layer output toward the posterior distribution of the model's final output, so that the intermediate layers learn deeper knowledge and the model fits better when dialect data is scarce; meanwhile, through representation self-distillation, the intermediate layers learn the higher-order features of the model's last layer, strengthening their understanding and representation of acoustic information and further improving the overall performance of the model, especially when the amount of dialect data is small.
Preferably, as shown in fig. 9, the feature extraction module further comprises:
a front-end signal processing unit, which performs front-end signal processing on the dialect speech signal I;
a Fourier transform and filtering unit, which applies a fast Fourier transform (FFT) to the speech signal obtained in step S21 and then applies Mel filtering;
a logarithm processing unit, which takes the logarithm of the signal obtained in step S22;
a discrete cosine transform processing unit, which applies a discrete cosine transform to the log signal to decorrelate the filter-bank coefficients and generate a compressed representation of the filter bank;
a feature extraction unit, which extracts the MFCC features and first-order difference parameters from the speech signal processed by the above steps;
wherein the front-end signal processing unit specifically comprises:
a pre-emphasis component, which pre-emphasizes the dialect speech signal I with a high-pass filter;
a framing component, which frames the pre-emphasized signal, taking 256 sampling points as one frame;
a windowing component, which multiplies each frame by a Hamming window.
Pre-emphasis boosts the high-frequency part so that the spectrum of the signal becomes flatter; an overlapping region of 128 sampling points between adjacent frames avoids excessive change between two neighboring frames and improves the stability of the low-resource dialect speech signal; and windowing increases the continuity at both ends of each frame and reduces spectral leakage, making the characteristics of the dialect speech signal more distinct. Finally, through these operations, the MFCC features of the dialect speech signal can be extracted effectively, and training on features that better match human hearing makes the final low-resource dialect speech recognition more accurate.
In a third aspect, fig. 10 is a schematic structural diagram of an electronic device provided in this embodiment. As shown in fig. 10, the electronic device may include a processor 10, a communications interface 20, a memory 30, and a communication bus 40, where the processor 10, the communications interface 20, and the memory 30 communicate with each other through the communication bus 40. The processor 10 may invoke logic instructions in the memory 30 to execute a dialect speech recognition training method based on self-knowledge distillation, the method comprising: obtaining the dialect speech signal I; extracting the MFCC features of I, denoted X; inputting X into a Transformer model for dialect speech recognition training; and obtaining the intermediate-layer representation R_M of the Transformer model to perform posterior-probability self-distillation and representation self-distillation.
Further, the logic instructions in the memory 30 may be implemented as software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on this understanding, the technical solution, in essence or in the part contributing to the prior art, may be embodied as a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of this scheme. The aforementioned storage medium includes media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
In a fourth aspect, the present invention also provides a computer program product comprising a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, it can perform the dialect speech recognition training method based on self-knowledge distillation provided by the methods above, the method comprising: obtaining the dialect speech signal I; extracting the MFCC features of I, denoted X; inputting X into a Transformer model for dialect speech recognition training; and obtaining the intermediate-layer representation R_M of the Transformer model to perform posterior-probability self-distillation and representation self-distillation.
In a fifth aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program that, when executed by a processor, performs the dialect speech recognition training method based on self-knowledge distillation provided by the methods above, the method comprising: obtaining the dialect speech signal I; extracting the MFCC features of I, denoted X; inputting X into a Transformer model for dialect speech recognition training; and obtaining the intermediate-layer representation R_M of the Transformer model to perform posterior-probability self-distillation and representation self-distillation.
The apparatus embodiments described above are merely illustrative: components described as separate may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, which those of ordinary skill in the art can understand and implement without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by software plus a necessary general hardware platform, or of course by hardware. Based on this understanding, the foregoing technical solution, in essence or in the part contributing to the prior art, may be embodied as a software product stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, comprising several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or in parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
It should be understood that the foregoing examples of the present invention are merely illustrative of the present invention and are not intended to limit the present invention to the specific embodiments thereof. Any modification, equivalent replacement, improvement, etc. that comes within the spirit and principle of the claims of the present invention should be included in the protection scope of the claims of the present invention.

Claims (10)

1. A dialect speech recognition training method based on self-knowledge distillation, comprising the following steps:
S1: obtain the dialect speech signal I;
S2: extract the MFCC features of I, denoted X;
S3: input X into a Transformer model for dialect speech recognition training;
characterized in that step S3 further comprises obtaining the intermediate-layer representation R_M of the Transformer model to perform posterior-probability self-distillation and representation self-distillation.
2. The dialect speech recognition training method based on self-knowledge distillation according to claim 1, wherein in step S3 posterior-probability self-distillation specifically comprises passing the intermediate-layer representation R_M of the Transformer model through a linear transformation layer to obtain the intermediate-layer posterior probability P_M, and then computing the posterior-probability self-distillation loss between P_M and the final output posterior probability P of the Transformer model with a standard MSE loss function.
3. The dialect speech recognition training method based on self-knowledge distillation according to claim 2, wherein in step S3 representation self-distillation specifically comprises obtaining the intermediate-layer representation R_M of the Transformer model and then computing the representation self-distillation loss between R_M and the last-layer representation R output by the Transformer model with a standard MSE loss function.
4. The dialect speech recognition training method based on self-knowledge distillation according to claim 3, wherein
the posterior-probability self-distillation loss is computed as L_P = MSE(P, P_M);
the representation self-distillation loss is computed as L_R = MSE(R, R_M);
and the loss function L of the final model is computed from L_P and L_R as L = L_CTC + L_P + L_R, where L_CTC denotes the CTC loss of the model's final output posterior probability P, computed from P and the ground-truth labeling result y.
5. The dialect speech recognition training method based on self-knowledge distillation according to claim 1, wherein step S2 specifically comprises the following steps:
S21: perform front-end signal processing on the dialect speech signal I;
S22: apply a fast Fourier transform (FFT) to the speech signal obtained in step S21, then apply Mel filtering;
S23: take the logarithm of the signal obtained in step S22;
S24: apply a discrete cosine transform to the log signal to decorrelate the filter-bank coefficients and generate a compressed representation of the filter bank;
S25: extract the MFCC features and first-order difference parameters from the speech signal processed by the above steps.
6. The dialect speech recognition training method based on self-knowledge distillation according to claim 5, wherein step S21 further comprises the following steps:
S211: pre-emphasize the dialect speech signal I with a high-pass filter;
S212: frame the pre-emphasized signal, taking 256 sampling points as one frame;
S213: multiply each frame by a Hamming window.
7. The dialect speech recognition training method based on self-knowledge distillation according to claim 6, wherein in step S212 there is an overlapping region of 128 sampling points between adjacent frames.
8. A dialect speech recognition training system based on self-knowledge distillation, comprising:
a signal acquisition module, which obtains the dialect speech signal I;
a feature extraction module, which extracts the MFCC features of I, denoted X;
a feature training module, which inputs X into a Transformer model for dialect speech recognition training;
characterized in that the feature training module also obtains the intermediate-layer representation R_M of the Transformer model to perform posterior-probability self-distillation and representation self-distillation.
9. The dialect speech recognition training system based on self-knowledge distillation according to claim 8, wherein
posterior-probability self-distillation specifically comprises passing the intermediate-layer representation R_M of the Transformer model through a linear transformation layer to obtain the intermediate-layer posterior probability P_M, and then computing the posterior-probability self-distillation loss between P_M and the final output posterior probability P of the Transformer model with a standard MSE loss function;
representation self-distillation specifically comprises obtaining the intermediate-layer representation R_M of the Transformer model and then computing the representation self-distillation loss between R_M and the last-layer representation R output by the Transformer model with the standard MSE loss function;
the posterior-probability self-distillation loss is computed as L_P = MSE(P, P_M);
the representation self-distillation loss is computed as L_R = MSE(R, R_M);
and the loss function L of the final model is computed from L_P and L_R as L = L_CTC + L_P + L_R, where L_CTC denotes the CTC loss of the model's final output posterior probability P, computed from P and the ground-truth labeling result y.
10. The dialect speech recognition training system based on self-knowledge distillation according to claim 9, wherein the feature extraction module further comprises:
a front-end signal processing unit, which performs front-end signal processing on the dialect speech signal I;
a Fourier transform and filtering unit, which applies a fast Fourier transform (FFT) to the speech signal obtained in step S21 and then applies Mel filtering;
a logarithm processing unit, which takes the logarithm of the signal obtained in step S22;
a discrete cosine transform processing unit, which applies a discrete cosine transform to the log signal to decorrelate the filter-bank coefficients and generate a compressed representation of the filter bank;
a feature extraction unit, which extracts the MFCC features and first-order difference parameters from the speech signal processed by the above steps;
wherein the front-end signal processing unit specifically comprises:
a pre-emphasis component, which pre-emphasizes the dialect speech signal I with a high-pass filter;
a framing component, which frames the pre-emphasized signal, taking 256 sampling points as one frame;
a windowing component, which multiplies each frame by a Hamming window.
Application CN202410044546.0A, filed 2024-01-12: Dialect voice recognition training method and system based on self-knowledge distillation, published as CN117558264A (status: Pending)

Priority Applications (1)

• CN202410044546.0A (priority date 2024-01-12, filing date 2024-01-12): Dialect voice recognition training method and system based on self-knowledge distillation


Publications (1)

• CN117558264A, published 2024-02-13

Family ID: 89811535


Country Status (1): CN, CN117558264A


Patent Citations (8)

* Cited by examiner, † Cited by third party

• CN112509555A * (priority 2020-11-25, published 2021-03-16): Dialect voice recognition method, dialect voice recognition device, dialect voice recognition medium and electronic equipment
• WO2021213161A1 * (priority 2020-11-25, published 2021-10-28): Dialect speech recognition method, apparatus, medium, and electronic device
• CN113516968A * (priority 2021-06-07, published 2021-10-19): End-to-end long-term speech recognition method
• CN114822518A * (priority 2022-04-29, published 2022-07-29): Knowledge distillation method, electronic device, and storage medium
• CN115064155A * (priority 2022-06-09, published 2022-09-16): End-to-end voice recognition incremental learning method and system based on knowledge distillation
• CN115222059A * (priority 2022-06-24, published 2022-10-21): Self-distillation model compression algorithm based on high-level information supervision
• CN116312628A * (priority 2023-02-09, published 2023-06-23): False audio detection method and system based on self knowledge distillation
• CN116844529A * (priority 2023-05-25, published 2023-10-03): Speech recognition method, device and computer storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party

• 张治民 (Zhang Zhimin), "Scene Text Recognition Based on Attention Mechanism and Knowledge Distillation" (《基于注意力机制和知识蒸馏的场景文本识别》), China Master's Theses Full-text Database, Information Science and Technology series, 15 January 2022, pages 138-1480 *


Legal Events

• PB01: Publication
• SE01: Entry into force of request for substantive examination