WO2021213161A1 - Dialect speech recognition method, apparatus, medium, and electronic device - Google Patents

Dialect speech recognition method, apparatus, medium, and electronic device

Info

Publication number
WO2021213161A1
Authority
WO
WIPO (PCT)
Prior art keywords
output
dialect
model
dimensional
module
Application number
PCT/CN2021/084305
Other languages
English (en)
French (fr)
Inventor
魏文琦
王健宗
张之勇
程宁
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2021213161A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/005 - Language recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks

Definitions

  • This application relates to the field of communication technology, and in particular to a dialect speech recognition method, device, medium and electronic equipment.
  • With the gradual development of artificial intelligence, neural network models have been widely applied.
  • During model training, a model's performance depends on the one hand on the algorithm used, and on the other hand on the amount of training sample data.
  • The inventor realized that, in speech recognition, because more Mandarin training samples can be obtained, trained Mandarin recognition models are usually fairly accurate.
  • However, regional dialects differ from Mandarin to some extent, so a Mandarin recognition model cannot accurately recognize dialects; and when training speech recognition models for the various dialects, a sufficient number of samples for each dialect cannot be guaranteed, so dialect speech cannot be recognized accurately.
  • This application aims to provide a dialect speech recognition method, device, medium, and electronic device that can improve the accuracy of dialect speech recognition to a certain extent.
  • According to one aspect of the embodiments of the present application, a dialect speech recognition method is provided, including: acquiring the dialect speech to be recognized; and inputting the dialect speech to be recognized into a coding model to obtain a low-dimensional sequence to be recognized corresponding to that speech, the coding model being obtained based on a first comparison model trained using a Mandarin training sample set and a second comparison model trained using a dialect training sample set.
  • The first comparison model includes: a plurality of first feature extraction modules for extracting Mandarin speech features from Mandarin speech samples, each first feature extraction module including a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, where there are multiple convolutional layers connected to one another by skip connections; a first high-dimensional encoding module, connected to the output of the last of the first feature extraction modules and used to perform high-dimensional encoding of the Mandarin speech features to obtain a Mandarin high-dimensional sequence; and a first regression module, connected to the output of the first high-dimensional encoding module and used to convert the Mandarin high-dimensional sequence into a Mandarin low-dimensional sequence.
  • The first regression module includes multiple recurrent neural network hidden layers, with a focusing layer set between every two adjacent hidden layers.
  • The second comparison model includes: a plurality of second feature extraction modules for extracting dialect speech features from dialect speech samples, each second feature extraction module including a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, where there are multiple convolutional layers connected to one another by skip connections; a second high-dimensional encoding module, connected to the output of the last of the second feature extraction modules and used to perform high-dimensional encoding of the dialect speech features to obtain a dialect high-dimensional sequence; and a second regression module, connected to the output of the second high-dimensional encoding module and used to convert the dialect high-dimensional sequence into a dialect low-dimensional sequence. The second regression module likewise includes multiple recurrent neural network hidden layers, with a focusing layer set between every two adjacent hidden layers.
  • The process of obtaining the coding model based on the first comparison model and the second comparison model includes: performing knowledge distillation from the first comparison model to the second comparison model through an unsupervised knowledge distillation method to obtain the coding model; inputting example Mandarin speech into the first comparison model, and inputting dialect speech with the same semantics as the example Mandarin speech into the coding model; obtaining the output of the first comparison model and the output of the coding model, and calculating the degree of difference between them; and adjusting the coding model based on that degree of difference. The method further includes decoding the low-dimensional sequence to be recognized to obtain the text corresponding to the dialect speech to be recognized.
  • According to one aspect of the embodiments of the present application, a dialect speech recognition apparatus is provided, including: an acquiring unit configured to acquire the dialect speech to be recognized; and an input unit configured to input the dialect speech to be recognized into the coding model to obtain the low-dimensional sequence to be recognized corresponding to that speech, the coding model being obtained based on a first comparison model trained using a Mandarin training sample set and a second comparison model trained using a dialect training sample set.
  • The first comparison model includes: a plurality of first feature extraction modules for extracting Mandarin speech features from Mandarin speech samples, each including a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, where there are multiple convolutional layers connected by skip connections; a first high-dimensional encoding module, connected to the output of the last of the first feature extraction modules and used to perform high-dimensional encoding of the Mandarin speech features to obtain a Mandarin high-dimensional sequence; and a first regression module, connected to the output of the first high-dimensional encoding module and used to convert the Mandarin high-dimensional sequence into a Mandarin low-dimensional sequence. The first regression module includes multiple recurrent neural network hidden layers, with a focusing layer set between every two adjacent hidden layers.
  • The second comparison model includes: a plurality of second feature extraction modules for extracting dialect speech features from dialect speech samples, each including a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, where there are multiple convolutional layers connected by skip connections; a second high-dimensional encoding module, connected to the output of the last of the second feature extraction modules and used to perform high-dimensional encoding of the dialect speech features to obtain a dialect high-dimensional sequence; and a second regression module, connected to the output of the second high-dimensional encoding module and used to convert the dialect high-dimensional sequence into a dialect low-dimensional sequence. The second regression module includes multiple recurrent neural network hidden layers, with a focusing layer set between every two adjacent hidden layers.
  • The process of obtaining the coding model based on the first comparison model and the second comparison model includes: performing knowledge distillation from the first comparison model to the second comparison model through an unsupervised knowledge distillation method to obtain the coding model; inputting example Mandarin speech into the first comparison model, and inputting dialect speech with the same semantics as the example Mandarin speech into the coding model; obtaining the output of the first comparison model and the output of the coding model, and calculating the degree of difference between them; and adjusting the coding model based on that degree of difference. The apparatus further includes a decoding unit configured to decode the low-dimensional sequence to be recognized to obtain the text corresponding to the dialect speech to be recognized.
  • According to one aspect of the embodiments of the present application, a computer-readable program medium is provided, which stores computer program instructions that, when executed by a computer, cause the computer to execute the following: acquiring the dialect speech to be recognized; and inputting the dialect speech to be recognized into the coding model to obtain the low-dimensional sequence to be recognized corresponding to that speech, the coding model being obtained based on a first comparison model trained using a Mandarin training sample set and a second comparison model trained using a dialect training sample set, where:
  • the first comparison model includes:
  • a plurality of first feature extraction modules for extracting Mandarin speech features from Mandarin speech samples, each of the first feature extraction modules including a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, where there are multiple convolutional layers connected by skip connections;
  • the first high-dimensional encoding module is connected to the output of the last module of the plurality of first feature extraction modules, and is used to perform high-dimensional encoding of the Mandarin speech features to obtain a Mandarin high-dimensional sequence;
  • a first regression module, connected to the output of the first high-dimensional encoding module and used to convert the Mandarin high-dimensional sequence into a Mandarin low-dimensional sequence; the first regression module includes multiple recurrent neural network hidden layers, with a focusing layer set between every two adjacent hidden layers;
  • the second comparison model includes:
  • a plurality of second feature extraction modules for extracting dialect speech features from dialect speech samples, each of the second feature extraction modules including a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, where there are multiple convolutional layers connected by skip connections;
  • a second high-dimensional encoding module, connected to the output of the last module of the plurality of second feature extraction modules and used to perform high-dimensional encoding of the dialect speech features to obtain a dialect high-dimensional sequence;
  • a second regression module, connected to the output of the second high-dimensional encoding module and used to convert the dialect high-dimensional sequence into a dialect low-dimensional sequence; the second regression module includes multiple recurrent neural network hidden layers, with a focusing layer set between every two adjacent hidden layers;
  • the process of obtaining the coding model based on the first comparison model and the second comparison model includes: performing knowledge distillation from the first comparison model to the second comparison model through an unsupervised knowledge distillation method to obtain the coding model; inputting example Mandarin speech into the first comparison model, and inputting dialect speech with the same semantics as the example Mandarin speech into the coding model; obtaining the output of the first comparison model and the output of the coding model, and calculating the degree of difference between them; and adjusting the coding model based on that degree of difference;
  • finally, the low-dimensional sequence to be recognized is decoded to obtain the text corresponding to the dialect speech to be recognized.
  • According to one aspect of the embodiments of the present application, an electronic device is provided, including: a processor; and a memory storing computer-readable instructions that, when executed by the processor, implement the following: acquiring the dialect speech to be recognized; inputting the dialect speech to be recognized into the coding model to obtain the low-dimensional sequence to be recognized corresponding to that speech, the coding model being obtained based on a first comparison model trained using a Mandarin training sample set and a second comparison model trained using a dialect training sample set, where the first comparison model, the second comparison model, and the process of obtaining the coding model from them are as described above; and decoding the low-dimensional sequence to be recognized to obtain the text corresponding to the dialect speech to be recognized.
  • Compared with the prior art, the embodiments of the present application have the following beneficial effect: in the technical solutions provided by some embodiments of the present application, since the codes output by the first comparison model and the second comparison model already contain the features of the training samples, no decoding is required before knowledge distillation, which reduces the time needed to train the coding model, and no decoding is required in the process of obtaining the coding model. Compared with decoding separately in the first comparison model and the second comparison model, decoding once after the coding model simplifies two decodings into one, which improves the efficiency of dialect speech recognition while obtaining a coding model that accurately recognizes dialect speech.
  • FIG. 1 shows a schematic diagram of an exemplary system architecture to which the technical solutions of the embodiments of the present application can be applied;
  • FIG. 2 schematically shows a flowchart of a dialect speech recognition method according to an embodiment of the present application;
  • FIG. 3 schematically shows a structural diagram of a first feature extraction module according to an embodiment of the present application;
  • FIG. 4 schematically shows a structural diagram of a first regression module according to an embodiment of the present application;
  • FIG. 5 schematically shows a block diagram of a dialect speech recognition device according to an embodiment of the present application;
  • FIG. 6 is a hardware diagram of an electronic device according to an exemplary embodiment;
  • FIG. 7 shows a computer-readable storage medium for implementing the above method according to an exemplary embodiment.
  • FIG. 1 shows a schematic diagram of an exemplary system architecture 100 to which the technical solutions of the embodiments of the present application can be applied.
  • the system architecture 100 may include a terminal device 101 (the terminal device may be one or more of a smart phone, a tablet computer, a portable computer, and a desktop computer), a network 102 and a server 103.
  • The network 102 serves as the medium for providing a communication link between the terminal device 101 and the server 103.
  • the network 102 may include various connection types, such as wired communication links, wireless communication links, and so on.
  • the numbers of the terminal device 101, the network 102, and the server 103 in FIG. 1 are merely illustrative. According to implementation needs, there may be any number of terminal devices 101, networks 102, and servers 103.
  • the server 103 may be a server cluster composed of multiple servers.
  • In an embodiment of the present application, the server 103 may acquire the dialect speech to be recognized and input it into the coding model to obtain the low-dimensional sequence to be recognized corresponding to that speech. The coding model is obtained based on a first comparison model trained using a Mandarin training sample set and a second comparison model trained using a dialect training sample set.
  • The first comparison model includes: a plurality of first feature extraction modules for extracting Mandarin speech features from Mandarin speech samples, each including a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, where the multiple convolutional layers are connected by skip connections so as to capture features over a longer range; a first high-dimensional encoding module, connected to the output of the last of the first feature extraction modules, which performs high-dimensional encoding of the Mandarin speech features to obtain a Mandarin high-dimensional sequence, so as to obtain features of multiple dimensions from the Mandarin speech samples; and a first regression module, connected to the output of the first high-dimensional encoding module, which converts the Mandarin high-dimensional sequence into a Mandarin low-dimensional sequence, extracting key features from the multi-dimensional features. The first regression module includes multiple recurrent neural network hidden layers, with a focusing layer set between every two adjacent hidden layers so as to extract sample features more accurately.
  • The second comparison model is structured in the same way for dialect speech: a plurality of second feature extraction modules extract dialect speech features from dialect speech samples, each with a convolutional layer, a pooling layer connected to the output of the convolutional layer, a fully connected layer connected to the output of the pooling layer, and skip connections between the multiple convolutional layers; a second high-dimensional encoding module, connected to the output of the last of the second feature extraction modules, performs high-dimensional encoding of the dialect speech features to obtain a dialect high-dimensional sequence; and a second regression module, connected to the output of the second high-dimensional encoding module, converts the dialect high-dimensional sequence into a dialect low-dimensional sequence, likewise using multiple recurrent neural network hidden layers with a focusing layer between every two adjacent hidden layers.
  • The process of obtaining the coding model based on the first comparison model and the second comparison model includes: performing knowledge distillation from the first comparison model to the second comparison model through an unsupervised knowledge distillation method to obtain the coding model; inputting example Mandarin speech into the first comparison model, and inputting dialect speech with the same semantics as the example Mandarin speech into the coding model; obtaining the output of the first comparison model and the output of the coding model, and calculating the degree of difference between them; and adjusting the coding model based on that degree of difference.
  • Although there are certain differences between Mandarin and dialects, there are also many similarities, so Mandarin samples can make up for the insufficient number of dialect speech samples. Unsupervised distillation requires no data labeling, which makes the training process simpler; at the same time, adjusting the coding model based on the degree of difference also corrects the differences between the Mandarin speech samples and the dialect speech samples, supplementing the number of samples while improving the coding model's ability to recognize dialect speech, so that the resulting coding model can improve the accuracy of dialect speech recognition to a certain extent. The low-dimensional sequence to be recognized is then decoded to obtain the text corresponding to the dialect speech to be recognized. Since the codes output by the first comparison model and the second comparison model already contain the features of the training samples, no decoding is required before knowledge distillation, which reduces the time needed to train the coding model; compared with decoding separately in the two comparison models, decoding once after the coding model simplifies two decodings into one, improving the efficiency of dialect speech recognition while obtaining a coding model that accurately recognizes dialect speech.
  • It should be noted that the dialect speech recognition method provided by the embodiments of the present application is generally executed by the server 103, and correspondingly, the dialect speech recognition device is generally provided in the server 103. However, in other embodiments of the present application, the terminal device 101 may also have functions similar to those of the server 103, so as to execute the dialect speech recognition method provided by the embodiments of the present application.
  • FIG. 2 schematically shows a flowchart of a dialect speech recognition method according to an embodiment of the present application.
  • the execution subject of the dialect speech recognition method may be a server, for example, the server 103 shown in FIG. 1.
  • the dialect speech recognition method includes at least step S210 to step S230, which are described in detail as follows:
  • In step S210, the dialect speech to be recognized is acquired.
  • In an embodiment of the present application, the dialect may be a regional variety of Chinese, such as Sichuanese, Cantonese, or Hokkien; the dialect may also be a minority language, such as Uyghur or Korean.
  • In step S220, the dialect speech to be recognized is input into the coding model to obtain the low-dimensional sequence to be recognized corresponding to the dialect speech to be recognized. The coding model is obtained based on a first comparison model trained using a Mandarin training sample set and a second comparison model trained using a dialect training sample set, where the first comparison model includes:
  • a plurality of first feature extraction modules for extracting Mandarin speech features from Mandarin speech samples; each first feature extraction module includes a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, where there are multiple convolutional layers connected by skip connections;
  • the first high-dimensional encoding module is connected to the output of the last module among the multiple first feature extraction modules, and is used to perform high-dimensional encoding of Mandarin speech features to obtain a Mandarin high-dimensional sequence;
  • a first regression module, connected to the output of the first high-dimensional encoding module and used to convert the Mandarin high-dimensional sequence into a Mandarin low-dimensional sequence; the first regression module includes multiple recurrent neural network hidden layers, with a focusing layer set between every two adjacent hidden layers;
  • the second comparison model includes:
  • a plurality of second feature extraction modules for extracting dialect speech features from dialect speech samples; each second feature extraction module includes a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, where there are multiple convolutional layers connected by skip connections;
  • the second high-dimensional encoding module is connected to the output of the last module among the plurality of second feature extraction modules, and is used to perform high-dimensional encoding of dialect speech features to obtain a dialect high-dimensional sequence;
  • a second regression module, connected to the output of the second high-dimensional encoding module and used to convert the dialect high-dimensional sequence into a dialect low-dimensional sequence; the second regression module includes multiple recurrent neural network hidden layers, with a focusing layer set between every two adjacent hidden layers;
  • the process of obtaining the coding model based on the first comparison model and the second comparison model includes: performing knowledge distillation from the first comparison model to the second comparison model through an unsupervised knowledge distillation method to obtain the coding model; inputting example Mandarin speech into the first comparison model, and inputting dialect speech with the same semantics as the example Mandarin speech into the coding model; obtaining the output of the first comparison model and the output of the coding model, and calculating the degree of difference between them; and adjusting the coding model based on that degree of difference.
  • In an embodiment of the present application, the window length of the dialect speech to be recognized may be obtained, the speech may be divided into frames according to the window length, and the framed speech may be input into the coding model; recognizing the dialect speech one window at a time can make the recognition result more accurate.
  • The speech can be processed without using the Fourier transform: the speech sampling points are divided into frames directly according to the window length, which does not affect the speech recognition effect and also improves the efficiency of speech recognition.
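  • As a rough illustration of this direct time-domain framing, the sketch below splits raw sample points into fixed-length windows; the window length and sampling rate used are assumptions for illustration, not values given in the application:

```python
# A minimal framing sketch (window length and sampling rate are assumed values):
# split raw sample points into fixed-length windows instead of computing a spectrogram.
import numpy as np

def frame_speech(samples: np.ndarray, window_len: int) -> np.ndarray:
    """Split a 1-D array of speech sample points into consecutive frames of window_len."""
    n_frames = len(samples) // window_len
    return samples[: n_frames * window_len].reshape(n_frames, window_len)

# Example: one second of 16 kHz audio framed into 25 ms windows (400 samples each).
audio = np.random.randn(16000)
frames = frame_speech(audio, 400)   # shape: (40, 400)
```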
  • the number of first feature extraction modules may be five.
  • Each first feature extraction module includes skip connections, which can capture a wider and deeper receptive field.
  • Each first feature extraction module also includes a pooling layer and a fully connected layer, which can compute weights for the different dimensions so as to obtain the best result.
  • FIG. 3 schematically shows a schematic structural diagram of a first feature extraction module according to an embodiment of the present application.
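  • A minimal sketch of one such feature extraction module, following the structure described (convolutions joined by skip connections, then a pooling layer and a fully connected layer), is given below; the channel counts, kernel sizes, and two-convolution depth are assumptions for illustration:

```python
# A sketch of a feature extraction module: convolutional layers joined by skip
# connections, a pooling layer on the convolution output, and a fully connected
# layer on the pooling output. Channel counts and kernel sizes are assumed.
import torch
import torch.nn as nn

class FeatureExtractionModule(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.pool = nn.MaxPool1d(kernel_size=2)   # pooling layer after the convolutions
        self.fc = nn.Linear(channels, channels)   # fully connected layer after pooling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # skip connections: each convolution's input is added to its output
        h = x + torch.relu(self.conv1(x))
        h = h + torch.relu(self.conv2(h))
        h = self.pool(h)                                   # (batch, channels, time/2)
        return self.fc(h.transpose(1, 2)).transpose(1, 2)  # per-step fully connected

# Five such modules can be chained, as the embodiment suggests:
# stack = nn.Sequential(*[FeatureExtractionModule() for _ in range(5)])
```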
  • the first high-dimensional encoding module may be an embedding layer composed of a one-dimensional convolutional layer.
  • the first high-dimensional encoding module may be a high-dimensional nonlinear encoder.
  • In an embodiment of the present application, the first regression module may include five recurrent neural network (RNN) hidden layers, each containing 4096 hidden states, with a focusing module added between every two RNN hidden layers so that the hidden layers learn which positions to attend to, allowing the resulting low-dimensional sequence to better capture the sample features.
  • FIG. 4 schematically shows a structural diagram of a first regression module according to an embodiment of the present application. The focusing module may use an attention mechanism, which computes the attention of each hidden unit by weighting, thereby achieving the focusing effect.
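  • The sketch below illustrates this arrangement: stacked RNN hidden layers with an attention-style focusing layer between adjacent layers. The particular attention scoring and the reduced sizes are assumptions; the application itself specifies only five layers of 4096 hidden states with a focusing module between every two layers:

```python
# A sketch of the regression module: stacked RNN hidden layers with a focusing
# (attention) layer between adjacent layers. Dimensions are illustrative.
import torch
import torch.nn as nn

class FocusingLayer(nn.Module):
    """Weights each time step of the RNN output by a learned attention score."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        attn = torch.softmax(self.score(h), dim=1)   # (batch, time, 1)
        return h * attn                              # focus on informative positions

class RegressionModule(nn.Module):
    def __init__(self, in_dim: int = 128, hidden: int = 256, layers: int = 5):
        super().__init__()
        self.rnns = nn.ModuleList(
            nn.RNN(in_dim if i == 0 else hidden, hidden, batch_first=True)
            for i in range(layers))
        self.focus = nn.ModuleList(FocusingLayer(hidden) for _ in range(layers - 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for i, rnn in enumerate(self.rnns):
            x, _ = rnn(x)
            if i < len(self.focus):
                x = self.focus[i](x)    # focusing layer between two hidden layers
        return x                        # the low-dimensional sequence
```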
  • In an embodiment of the present application, the first comparison model may further include a first output module, connected to the output of the first regression module and used to output the Mandarin low-dimensional sequence. The loss function of the first output module is a noise-contrastive estimation loss function.
  • This loss function judges predictions of future information: only the original result is treated as a positive sample and all other results are treated as negative samples, and the model is trained by continually contrasting the difference between positive and negative samples.
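  • The application gives the loss formula only as equation images, which are not reproduced in this text; a contrastive loss of the kind described, with the original result as the positive sample and other results as negatives, might be sketched as follows (the in-batch negative scheme is an assumption):

```python
# A hedged sketch of a noise-contrastive estimation style loss: the true target
# encoding is the positive sample; the other rows of the batch act as negatives.
import torch
import torch.nn.functional as F

def contrastive_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """pred, target: (batch, dim); row i of target is the positive for row i of pred."""
    logits = pred @ target.t()                          # pairwise similarity scores
    labels = torch.arange(pred.size(0), device=pred.device)
    return F.cross_entropy(logits, labels)              # positives lie on the diagonal
```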
  • the second high-dimensional encoding module may be an embedding layer composed of a one-dimensional convolutional layer.
  • the second high-dimensional encoding module may be a high-dimensional nonlinear encoder.
  • In an embodiment of the present application, the structure of the second feature extraction module may be the same as that of the first feature extraction module, the structure of the second high-dimensional encoding module may be the same as that of the first high-dimensional encoding module, and the structure of the second regression module may be the same as that of the first regression module.
  • In an embodiment of the present application, the second comparison model may further include a second output module, connected to the output of the second regression module and used to output the dialect low-dimensional sequence. The loss function of the second output module is likewise a noise-contrastive estimation loss function.
  • In an embodiment of the present application, the KL divergence between the output of the first comparison model and the output of the coding model may be calculated as the degree of difference between the two outputs.
  • In an embodiment of the present application, the cosine similarity distance between the output of the first comparison model and the output of the coding model may be calculated as the degree of difference between the two outputs.
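  • Both difference measures are standard; a sketch of each is given below, where the softmax normalization used for the KL variant is an assumption about how the model outputs are turned into distributions:

```python
# Sketches of the two difference degrees mentioned: KL divergence and cosine
# similarity distance between the first comparison model (teacher) output and
# the coding model (student) output.
import torch
import torch.nn.functional as F

def kl_difference(teacher_out: torch.Tensor, student_out: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student), treating the last dimension as a distribution."""
    p = F.log_softmax(teacher_out, dim=-1)
    q = F.log_softmax(student_out, dim=-1)
    return F.kl_div(q, p, log_target=True, reduction="batchmean")

def cosine_difference(teacher_out: torch.Tensor, student_out: torch.Tensor) -> torch.Tensor:
    """1 - cosine similarity, averaged over the batch."""
    return (1 - F.cosine_similarity(teacher_out, student_out, dim=-1)).mean()
```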
  • In an embodiment of the present application, the trained first comparison model is distilled into the second comparison model through an unsupervised knowledge distillation method to obtain the coding model. Example Mandarin speech is input into the first comparison model, and dialect speech with the same semantics as the example Mandarin speech is input into the coding model; the output of the first comparison model and the output of the coding model are obtained, the degree of difference between the two outputs is calculated, and the coding model is adjusted according to that degree of difference. No labeling is required during this training process: the difference between the two distributions can be computed clearly without high-quality annotation labels, while the accuracy of dialect speech recognition is still improved.
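  • Putting these pieces together, one adjustment step of this kind might look like the following sketch; the teacher/student naming, the optimizer, and the reuse of the KL difference from above are assumptions, since the application specifies only the paired inputs, the degree of difference, and the adjustment:

```python
# A minimal sketch of one coding-model adjustment step during unsupervised
# distillation. `teacher` is the first comparison model, `student` the coding
# model; the batches are semantically paired Mandarin and dialect utterances.
import torch

def adjust_coding_model(teacher, student, optimizer, mandarin_batch, dialect_batch):
    teacher.eval()
    with torch.no_grad():
        teacher_out = teacher(mandarin_batch)        # example Mandarin speech
    student_out = student(dialect_batch)             # same-semantics dialect speech
    loss = kl_difference(teacher_out, student_out)   # degree of difference (see above)
    optimizer.zero_grad()
    loss.backward()                                  # adjust the coding model
    optimizer.step()
    return loss.item()
```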
  • In an embodiment of the present application, the process of obtaining the coding model based on the first comparison model and the second comparison model may instead include: inputting example Mandarin speech into the first comparison model, and inputting dialect speech with the same semantics as the example Mandarin speech into the second comparison model; obtaining the output of the first comparison model and the output of the second comparison model, and calculating the degree of difference between them; and, based on that degree of difference, performing knowledge distillation from the first comparison model to the second comparison model to obtain the coding model.
  • In this case, the KL divergence between the output of the first comparison model and the output of the second comparison model may be calculated as the degree of difference between the two outputs.
  • Alternatively, the cosine similarity distance between the output of the first comparison model and the output of the second comparison model may be calculated as the degree of difference between the two outputs.
  • an unsupervised training method may be used to train the first comparison model, and an unsupervised training method may be used to train the second comparison model.
  • In step S230, the low-dimensional sequence to be recognized is decoded to obtain the text corresponding to the dialect speech to be recognized.
  • In an embodiment of the present application, a trained GRU neural network may be used to decode the low-dimensional sequence to be recognized.
  • In an embodiment of the present application, the GRU neural network may be trained using a dialect sample training set and a Mandarin sample training set adjusted according to the degree of difference.
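  • A minimal decoder sketch in this spirit is shown below; the vocabulary size, the dimensions, and the per-step classification head are assumptions, as the application states only that a trained GRU network decodes the low-dimensional sequence into text:

```python
# A sketch of decoding the low-dimensional sequence with a GRU network into
# per-step scores over a text vocabulary. All sizes are assumed values.
import torch
import torch.nn as nn

class GRUDecoder(nn.Module):
    def __init__(self, in_dim: int = 128, hidden: int = 256, vocab: int = 5000):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        h, _ = self.gru(seq)     # (batch, time, hidden)
        return self.out(h)       # per-step scores over the text vocabulary
```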
  • In an embodiment of the present application, the dialect may include multiple languages; each dialect corresponds to its own coding model, and when training the coding model corresponding to a dialect, the dialect training sample set of that dialect is used.
  • In the embodiment of FIG. 2, the dialect speech to be recognized is acquired and input into the coding model to obtain the corresponding low-dimensional sequence to be recognized, the coding model being obtained based on the first comparison model trained using the Mandarin training sample set and the second comparison model trained using the dialect training sample set, with the structures of the two comparison models and the unsupervised distillation process by which the coding model is obtained being as described above.
  • Although there are certain differences between Mandarin and dialects, there are also many similarities, so Mandarin samples can make up for the insufficient number of dialect speech samples. Unsupervised distillation requires no data labeling, which makes the training process simpler; at the same time, adjusting the coding model based on the degree of difference also corrects the differences between the Mandarin speech samples and the dialect speech samples, supplementing the number of samples while improving the coding model's ability to recognize dialect speech, so that the resulting coding model can improve the accuracy of dialect speech recognition to a certain extent. The low-dimensional sequence to be recognized is then decoded to obtain the text corresponding to the dialect speech to be recognized.
  • Since the codes output by the first comparison model and the second comparison model already contain the features of the training samples, no decoding is required before knowledge distillation, which reduces the time needed to train the coding model, and no decoding is required in the process of obtaining the coding model; compared with decoding separately in the first comparison model and the second comparison model, decoding once after the coding model simplifies two decodings into one, which improves the efficiency of dialect speech recognition while obtaining a coding model that accurately recognizes dialect speech.
  • In the embodiments of the present application, the problem of scarce training data in minority-language speech recognition is addressed with an unsupervised method.
  • The unsupervised method can obtain low-dimensional features common to speech across dialects, thereby alleviating the scarcity of dialect speech data.
  • Distilling the Mandarin model parameters, trained on a large amount of data, into the dialect model through knowledge distillation can greatly improve the model's performance. The approach proposed in this application therefore not only addresses the scarcity of dialect training data, but also obtains a more robust representation of the general underlying features, and uses knowledge distillation to improve data utilization and network performance.
  • Fig. 5 schematically shows a block diagram of a dialect speech recognition device according to an embodiment of the present application.
  • the dialect speech recognition device 500 includes an acquiring unit 501, an input unit 502, and a decoding unit 503.
  • the acquiring unit 501 is configured to acquire the voice of the dialect to be recognized;
  • The input unit 502 is configured to input the dialect speech to be recognized into the coding model to obtain the low-dimensional sequence to be recognized corresponding to the dialect speech to be recognized. The coding model is obtained based on a first comparison model trained using a Mandarin training sample set and a second comparison model trained using a dialect training sample set.
  • The first comparison model includes: a plurality of first feature extraction modules for extracting Mandarin speech features from Mandarin speech samples, each including a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, where the multiple convolutional layers are connected by skip connections; a first high-dimensional encoding module, connected to the output of the last of the first feature extraction modules and used to perform high-dimensional encoding of the Mandarin speech features to obtain a Mandarin high-dimensional sequence; and a first regression module, connected to the output of the first high-dimensional encoding module and used to convert the Mandarin high-dimensional sequence into a Mandarin low-dimensional sequence. The first regression module includes multiple recurrent neural network hidden layers, with a focusing layer set between every two adjacent hidden layers.
  • The second comparison model includes: a plurality of second feature extraction modules for extracting dialect speech features from dialect speech samples, each including a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, where the multiple convolutional layers are connected by skip connections; a second high-dimensional encoding module, connected to the output of the last of the second feature extraction modules and used to perform high-dimensional encoding of the dialect speech features to obtain a dialect high-dimensional sequence; and a second regression module, connected to the output of the second high-dimensional encoding module and used to convert the dialect high-dimensional sequence into a dialect low-dimensional sequence. The second regression module includes multiple recurrent neural network hidden layers, with a focusing layer set between every two adjacent hidden layers.
  • The process of obtaining the coding model based on the first comparison model and the second comparison model includes: performing knowledge distillation from the first comparison model to the second comparison model through an unsupervised knowledge distillation method to obtain the coding model; inputting example Mandarin speech into the first comparison model, and inputting dialect speech with the same semantics as the example Mandarin speech into the coding model; obtaining the output of the first comparison model and the output of the coding model, and calculating the degree of difference between them; and adjusting the coding model based on that degree of difference.
  • the decoding unit 503 is configured to decode the low-dimensional sequence to be recognized to obtain text corresponding to the voice of the dialect to be recognized.
  • In an embodiment of the present application, the acquiring unit 501 is configured to: acquire the window length of the dialect speech to be recognized; divide the dialect speech to be recognized into frames according to the window length; and input the framed dialect speech into the coding model.
  • the first high-dimensional encoding module is an embedding layer composed of a one-dimensional convolutional layer.
  • In an embodiment of the present application, the first comparison model further includes a first output module, connected to the output of the first regression module and used to output the Mandarin low-dimensional sequence; the loss function of the first output module is a noise-contrastive estimation loss function.
  • the second high-dimensional encoding module is an embedding layer composed of a one-dimensional convolutional layer.
  • In an embodiment of the present application, the second comparison model further includes a second output module, connected to the output of the second regression module and used to output the dialect low-dimensional sequence; the loss function of the second output module is a noise-contrastive estimation loss function.
  • In an embodiment of the present application, the acquiring unit 501 is configured to: input example Mandarin speech into the first comparison model, and input dialect speech with the same semantics as the example Mandarin speech into the second comparison model; obtain the output of the first comparison model and the output of the second comparison model, and calculate the degree of difference between them; and, based on that degree of difference, perform knowledge distillation from the first comparison model to the second comparison model to obtain the coding model.
  • the electronic device 60 according to this embodiment of the present application will be described below with reference to FIG. 6.
  • the electronic device 60 shown in FIG. 6 is only an example, and should not bring any limitation to the function and scope of use of the embodiments of the present application.
  • the electronic device 60 is represented in the form of a general-purpose computing device.
  • the components of the electronic device 60 may include, but are not limited to: the aforementioned at least one processing unit 61, the aforementioned at least one storage unit 62, a bus 63 connecting different system components (including the storage unit 62 and the processing unit 61), and a display unit 64.
  • The storage unit stores program code that can be executed by the processing unit 61, so that the processing unit 61 executes the steps of the various exemplary embodiments described in the "Exemplary Method" section of this specification.
  • the storage unit 62 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 621 and/or a cache storage unit 622, and may further include a read-only storage unit (ROM) 623.
  • The storage unit 62 may also include a program/utility tool 624 having a set of (at least one) program modules 625, which include, but are not limited to: an operating system, one or more application programs, other program modules, and program data; each or some combination of these examples may include an implementation of a network environment.
  • The bus 63 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local bus using any of a variety of bus structures.
  • The electronic device 60 can also communicate with one or more external devices (such as keyboards, pointing devices, Bluetooth devices, etc.), with one or more devices that enable a user to interact with the electronic device 60, and/or with any device that enables the electronic device 60 to communicate with one or more other computing devices (such as a router, modem, etc.). Such communication can be performed through an input/output (I/O) interface 65.
  • the electronic device 60 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through the network adapter 66. As shown in the figure, the network adapter 66 communicates with other modules of the electronic device 60 through the bus 63.
  • The example embodiments described here can be implemented by software, or by software combined with the necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (a CD-ROM, USB flash drive, removable hard disk, etc.) or on a network, and which includes several instructions to make a computing device (a personal computer, a server, a terminal device, or a network device, etc.) execute the method according to the embodiments of the present application.
  • According to an embodiment of the present application, a computer-readable storage medium is also provided. The computer-readable storage medium may be non-volatile or volatile, and stores a program product capable of implementing the method described above in this specification.
  • In some possible implementations, various aspects of the present application can also be implemented in the form of a program product including program code; when the program product runs on a terminal device, the program code causes the terminal device to execute the steps according to the various exemplary embodiments of the present application described in the "Exemplary Method" section of this specification.
  • A program product 70 for implementing the above method according to an embodiment of the present application is described. It may take the form of a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device, for example a personal computer.
  • However, the program product of the present application is not limited to this. In this document, a readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by, or in combination with, an instruction execution system, apparatus, or device.
  • the program product can use any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • The readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, and readable program code is carried therein. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the readable signal medium may also be any readable medium other than a readable storage medium, and the readable medium may send, propagate, or transmit a program for use by or in combination with the instruction execution system, apparatus, or device.
  • the program code contained on the readable medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the foregoing.
  • The program code used to perform the operations of the present application can be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • The program code can be executed entirely on the user's computing device, partly on the user's device, as an independent software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
  • The remote computing device can be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (for example, via the Internet using an Internet service provider).

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The present application provides a dialect speech recognition method, apparatus, medium, and electronic device. The method includes: acquiring dialect speech to be recognized; inputting the dialect speech to be recognized into a coding model to obtain a low-dimensional sequence to be recognized corresponding to the dialect speech, the coding model being obtained based on a first comparison model trained using a Mandarin training sample set and a second comparison model trained using a dialect training sample set; and decoding the low-dimensional sequence to be recognized to obtain the text corresponding to the dialect speech to be recognized, which can improve the accuracy of dialect speech recognition to a certain extent.

Description

Dialect speech recognition method, apparatus, medium, and electronic device
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on November 25, 2020, with application number 202011339518.X and the invention title "Dialect speech recognition method, apparatus, medium and electronic device", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of communication technology, and in particular to a dialect speech recognition method, apparatus, medium, and electronic device.
Background
With the gradual development of artificial intelligence, neural network models have been widely applied. During model training, the performance of a model depends on the one hand on the algorithm used, and on the other hand on the amount of training sample data.
Technical Problem
In summary, the inventor realized that in speech recognition, because more Mandarin training samples can be obtained, trained Mandarin recognition models are usually fairly accurate. However, regional dialects differ from Mandarin to some extent, so a Mandarin recognition model cannot accurately recognize dialects; and when training speech recognition models corresponding to the various dialects, the number of samples for each dialect cannot be guaranteed, so dialect speech cannot be accurately recognized.
Technical Solution
The present application aims to provide a dialect speech recognition method, apparatus, medium, and electronic device that can improve the accuracy of dialect speech recognition to a certain extent.
According to one aspect of the embodiments of the present application, a dialect speech recognition method is provided, including: acquiring dialect speech to be recognized; inputting the dialect speech to be recognized into a coding model to obtain a low-dimensional sequence to be recognized corresponding to the dialect speech to be recognized, the coding model being obtained based on a first comparison model trained using a Mandarin training sample set and a second comparison model trained using a dialect training sample set, where the first comparison model includes: a plurality of first feature extraction modules for extracting Mandarin speech features from Mandarin speech samples, each first feature extraction module including a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, where there are multiple convolutional layers connected to one another by skip connections; a first high-dimensional encoding module, connected to the output of the last of the first feature extraction modules and used to perform high-dimensional encoding of the Mandarin speech features to obtain a Mandarin high-dimensional sequence; and a first regression module, connected to the output of the first high-dimensional encoding module and used to convert the Mandarin high-dimensional sequence into a Mandarin low-dimensional sequence, the first regression module including multiple recurrent neural network hidden layers with a focusing layer set between every two adjacent hidden layers; the second comparison model includes: a plurality of second feature extraction modules for extracting dialect speech features from dialect speech samples, each second feature extraction module including a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, where there are multiple convolutional layers connected to one another by skip connections; a second high-dimensional encoding module, connected to the output of the last of the second feature extraction modules and used to perform high-dimensional encoding of the dialect speech features to obtain a dialect high-dimensional sequence; and a second regression module, connected to the output of the second high-dimensional encoding module and used to convert the dialect high-dimensional sequence into a dialect low-dimensional sequence, the second regression module including multiple recurrent neural network hidden layers with a focusing layer set between every two adjacent hidden layers; the process of obtaining the coding model based on the first comparison model and the second comparison model includes: performing knowledge distillation from the first comparison model to the second comparison model through an unsupervised knowledge distillation method to obtain the coding model; inputting example Mandarin speech into the first comparison model, and inputting dialect speech with the same semantics as the example Mandarin speech into the coding model; obtaining the output of the first comparison model and the output of the coding model, and calculating the degree of difference between the output of the first comparison model and the output of the coding model; and adjusting the coding model based on the degree of difference between the output of the first comparison model and the output of the coding model; and decoding the low-dimensional sequence to be recognized to obtain the text corresponding to the dialect speech to be recognized.
According to one aspect of the embodiments of the present application, a dialect speech recognition apparatus is provided, including: an acquiring unit configured to acquire dialect speech to be recognized; an input unit configured to input the dialect speech to be recognized into the coding model to obtain the low-dimensional sequence to be recognized corresponding to the dialect speech to be recognized, the coding model being obtained based on a first comparison model trained using a Mandarin training sample set and a second comparison model trained using a dialect training sample set, where the first comparison model, the second comparison model, and the process of obtaining the coding model based on the two comparison models are as described above for the method; and a decoding unit configured to decode the low-dimensional sequence to be recognized to obtain the text corresponding to the dialect speech to be recognized.
According to one aspect of the embodiments of the present application, a computer-readable program medium is provided, which stores computer program instructions that, when executed by a computer, cause the computer to execute the following:
acquiring dialect speech to be recognized;
inputting the dialect speech to be recognized into a coding model to obtain a low-dimensional sequence to be recognized corresponding to the dialect speech to be recognized, the coding model being obtained based on a first comparison model trained using a Mandarin training sample set and a second comparison model trained using a dialect training sample set, where:
the first comparison model includes:
a plurality of first feature extraction modules for extracting Mandarin speech features from Mandarin speech samples, each first feature extraction module including a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, where there are multiple convolutional layers connected to one another by skip connections;
a first high-dimensional encoding module, connected to the output of the last of the first feature extraction modules and used to perform high-dimensional encoding of the Mandarin speech features to obtain a Mandarin high-dimensional sequence; and
a first regression module, connected to the output of the first high-dimensional encoding module and used to convert the Mandarin high-dimensional sequence into a Mandarin low-dimensional sequence, the first regression module including multiple recurrent neural network hidden layers with a focusing layer set between every two adjacent hidden layers;
the second comparison model includes:
a plurality of second feature extraction modules for extracting dialect speech features from dialect speech samples, each second feature extraction module including a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, where there are multiple convolutional layers connected to one another by skip connections;
a second high-dimensional encoding module, connected to the output of the last of the second feature extraction modules and used to perform high-dimensional encoding of the dialect speech features to obtain a dialect high-dimensional sequence; and
a second regression module, connected to the output of the second high-dimensional encoding module and used to convert the dialect high-dimensional sequence into a dialect low-dimensional sequence, the second regression module including multiple recurrent neural network hidden layers with a focusing layer set between every two adjacent hidden layers;
the process of obtaining the coding model based on the first comparison model and the second comparison model includes:
performing knowledge distillation from the first comparison model to the second comparison model through an unsupervised knowledge distillation method to obtain the coding model;
inputting example Mandarin speech into the first comparison model, and inputting dialect speech with the same semantics as the example Mandarin speech into the coding model;
obtaining the output of the first comparison model and the output of the coding model, and calculating the degree of difference between the output of the first comparison model and the output of the coding model; and
adjusting the coding model based on the degree of difference between the output of the first comparison model and the output of the coding model;
and decoding the low-dimensional sequence to be recognized to obtain the text corresponding to the dialect speech to be recognized.
According to one aspect of the embodiments of the present application, an electronic device is provided, including: a processor; and a memory storing computer-readable instructions that, when executed by the processor, implement the following: acquiring dialect speech to be recognized; inputting the dialect speech to be recognized into a coding model to obtain a low-dimensional sequence to be recognized corresponding to the dialect speech to be recognized, the coding model being obtained based on a first comparison model trained using a Mandarin training sample set and a second comparison model trained using a dialect training sample set, where the first comparison model, the second comparison model, and the process of obtaining the coding model based on the two comparison models are as described above; and decoding the low-dimensional sequence to be recognized to obtain the text corresponding to the dialect speech to be recognized.
Beneficial Effects
Compared with the prior art, the embodiments of the present application have the following beneficial effect: in the technical solutions provided by some embodiments of the present application, since the codes output by the first comparison model and the second comparison model already contain the features of the training samples, no decoding is required before knowledge distillation, which reduces the time needed to train the coding model, and no decoding is required in the process of obtaining the coding model. Compared with decoding separately in the first comparison model and the second comparison model, decoding once after the coding model simplifies two decodings into one, which improves the efficiency of dialect speech recognition while obtaining a coding model that accurately recognizes dialect speech.
It should be understood that the above general description and the following detailed description are merely exemplary and do not limit the present application.
Brief Description of the Drawings
The accompanying drawings herein are incorporated into and constitute a part of this specification, illustrate embodiments consistent with the present application, and together with the specification serve to explain the principles of the present application.
FIG. 1 shows a schematic diagram of an exemplary system architecture to which the technical solutions of the embodiments of the present application can be applied;
FIG. 2 schematically shows a flowchart of a dialect speech recognition method according to an embodiment of the present application;
FIG. 3 schematically shows a structural diagram of a first feature extraction module according to an embodiment of the present application;
FIG. 4 schematically shows a structural diagram of a first regression module according to an embodiment of the present application;
FIG. 5 schematically shows a block diagram of a dialect speech recognition device according to an embodiment of the present application;
FIG. 6 is a hardware diagram of an electronic device according to an exemplary embodiment;
FIG. 7 shows a computer-readable storage medium for implementing the above method according to an exemplary embodiment.
Embodiments of the Present Invention
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the example embodiments can be implemented in many forms and should not be construed as limited to the examples set forth here; rather, these embodiments are provided so that this application will be more thorough and complete and will fully convey the concepts of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, many specific details are provided to give a full understanding of the embodiments of the present application. However, those skilled in the art will realize that the technical solutions of the present application may be practiced without one or more of the specific details, or other methods, components, devices, steps, and so on may be adopted. In other cases, well-known methods, devices, implementations, or operations are not shown or described in detail so as not to obscure aspects of the present application.
The block diagrams shown in the drawings are merely functional entities and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flowcharts shown in the drawings are merely illustrative; they do not necessarily include all of the contents and operations/steps, nor must they be executed in the order described. For example, some operations/steps may be decomposed, while others may be combined or partially combined, so the order actually executed may change according to the actual situation.
FIG. 1 shows a schematic diagram of an exemplary system architecture 100 to which the technical solutions of the embodiments of the present application can be applied.
As shown in FIG. 1, the system architecture 100 may include a terminal device 101 (which may be one or more of a smartphone, a tablet computer, a portable computer, and a desktop computer), a network 102, and a server 103. The network 102 serves as the medium for providing a communication link between the terminal device 101 and the server 103, and may include various connection types, such as wired communication links, wireless communication links, and so on.
It should be understood that the numbers of terminal devices 101, networks 102, and servers 103 in FIG. 1 are merely illustrative. According to implementation needs, there may be any number of terminal devices 101, networks 102, and servers 103; for example, the server 103 may be a server cluster composed of multiple servers.
In an embodiment of the present application, the server 103 may acquire the dialect speech to be recognized and input it into a coding model to obtain the low-dimensional sequence to be recognized corresponding to the dialect speech, the coding model being obtained based on a first comparison model trained using a Mandarin training sample set and a second comparison model trained using a dialect training sample set. The first comparison model includes: a plurality of first feature extraction modules for extracting Mandarin speech features from Mandarin speech samples, each first feature extraction module including a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, where the multiple convolutional layers are connected by skip connections so as to capture features over a longer range; a first high-dimensional encoding module, connected to the output of the last of the first feature extraction modules and used to perform high-dimensional encoding of the Mandarin speech features to obtain a Mandarin high-dimensional sequence, so as to obtain features of multiple dimensions from the Mandarin speech samples; and a first regression module, connected to the output of the first high-dimensional encoding module and used to convert the Mandarin high-dimensional sequence into a Mandarin low-dimensional sequence, extracting key features from the multi-dimensional features; the first regression module includes multiple recurrent neural network hidden layers with a focusing layer set between every two adjacent hidden layers, so as to extract sample features more accurately. The second comparison model includes: a plurality of second feature extraction modules for extracting dialect speech features from dialect speech samples, each second feature extraction module including a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, where the multiple convolutional layers are connected by skip connections so as to capture features over a longer range; a second high-dimensional encoding module, connected to the output of the last of the second feature extraction modules and used to perform high-dimensional encoding of the dialect speech features to obtain a dialect high-dimensional sequence, so as to obtain features of multiple dimensions from the dialect speech samples; and a second regression module, connected to the output of the second high-dimensional encoding module and used to convert the dialect high-dimensional sequence into a dialect low-dimensional sequence, extracting key features from the multi-dimensional features; the second regression module includes multiple recurrent neural network hidden layers with a focusing layer set between every two adjacent hidden layers, so as to extract sample features more accurately. The process of obtaining the coding model based on the first comparison model and the second comparison model includes: performing knowledge distillation from the first comparison model to the second comparison model through an unsupervised knowledge distillation method to obtain the coding model; inputting example Mandarin speech into the first comparison model, and inputting dialect speech with the same semantics as the example Mandarin speech into the coding model; obtaining the output of the first comparison model and the output of the coding model, and calculating the degree of difference between the two outputs; and adjusting the coding model based on that degree of difference. Although there are certain differences between Mandarin and dialects, there are also many similarities, so Mandarin samples can make up for the insufficient number of dialect speech samples; unsupervised distillation requires no data labeling, which makes the training process simpler, and adjusting the coding based on the degree of difference also corrects the differences between the Mandarin speech samples and the dialect speech samples, supplementing the number of samples while improving the coding model's ability to recognize dialect speech, so that the resulting coding model can improve the accuracy of dialect speech recognition to a certain extent. The low-dimensional sequence to be recognized is decoded to obtain the text corresponding to the dialect speech to be recognized. Since the codes output by the first comparison model and the second comparison model already contain the features of the training samples, no decoding is required before knowledge distillation, which reduces the time needed to train the coding model, and no decoding is required in the process of obtaining the coding model; compared with decoding separately in the first comparison model and the second comparison model, decoding once after the coding model simplifies two decodings into one, which improves the efficiency of dialect speech recognition while obtaining a coding model that accurately recognizes dialect speech.
It should be noted that the dialect speech recognition method provided by the embodiments of this application is generally executed by the server 103; accordingly, the dialect speech recognition apparatus is generally arranged in the server 103. In other embodiments of this application, however, the terminal device 101 may have functions similar to those of the server 103 and thus execute the dialect speech recognition method provided by the embodiments of this application.
The implementation details of the technical solutions of the embodiments of this application are elaborated below:
Fig. 2 schematically shows a flowchart of a dialect speech recognition method according to an embodiment of this application; the execution subject of the method may be a server, for example the server 103 shown in Fig. 1.
Referring to Fig. 2, the dialect speech recognition method includes at least steps S210 to S230, described in detail as follows:
In step S210, the dialect speech to be recognized is acquired.
In an embodiment of this application, a dialect may be a regional variety of Chinese, such as Sichuanese, Cantonese, or Hokkien; it may also be a minority language, such as Uyghur or Korean.
In step S220, the dialect speech to be recognized is input into a coding model to obtain a low-dimensional sequence to be recognized corresponding to the dialect speech to be recognized, the coding model being obtained based on a first alignment model trained using a Mandarin training sample set and a second alignment model trained using a dialect training sample set, wherein the first alignment model includes:
multiple first feature extraction modules, used to extract Mandarin speech features from Mandarin speech samples, each first feature extraction module including a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, wherein there are multiple convolutional layers connected to one another by skip connections;
a first high-dimensional encoding module, connected to the output of the last of the multiple first feature extraction modules, used to perform high-dimensional encoding of the Mandarin speech features to obtain a Mandarin high-dimensional sequence;
a first regression module, connected to the output of the first high-dimensional encoding module, used to convert the Mandarin high-dimensional sequence into a Mandarin low-dimensional sequence, the first regression module including multiple hidden layers of a recurrent neural network, with a focusing layer arranged between every two adjacent hidden layers;
the second alignment model includes:
multiple second feature extraction modules, used to extract dialect speech features from dialect speech samples, each second feature extraction module including a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, wherein there are multiple convolutional layers connected to one another by skip connections;
a second high-dimensional encoding module, connected to the output of the last of the multiple second feature extraction modules, used to perform high-dimensional encoding of the dialect speech features to obtain a dialect high-dimensional sequence;
a second regression module, connected to the output of the second high-dimensional encoding module, used to convert the dialect high-dimensional sequence into a dialect low-dimensional sequence, the second regression module including multiple hidden layers of a recurrent neural network, with a focusing layer arranged between every two adjacent hidden layers;
the process of obtaining the coding model based on the first alignment model and the second alignment model includes:
distilling knowledge from the first alignment model into the second alignment model by an unsupervised knowledge distillation method to obtain the coding model; inputting example Mandarin speech into the first alignment model, and inputting dialect speech with the same semantics as the example Mandarin speech into the coding model; obtaining the output of the first alignment model and the output of the coding model, and calculating the degree of difference between them; and adjusting the coding model based on the degree of difference between the output of the first alignment model and the output of the coding model.
In an embodiment of this application, the window length of the dialect speech to be recognized may be acquired, the dialect speech to be recognized may be split into frames according to the window length, and the framed dialect speech may be input into the coding model; recognizing the dialect speech one window at a time can make the recognition result more accurate. The speech may be processed without a Fourier transform, framing directly on the speech sample points according to the window length, which does not affect the recognition quality and improves the efficiency of speech recognition.
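As a minimal sketch of this framing step (assuming fixed-length, non-overlapping windows applied directly to the raw sample points; the window length, zero-padding of the tail, and function name are illustrative assumptions, not the patented implementation):

```python
import numpy as np

def frame_speech(samples: np.ndarray, window_len: int) -> np.ndarray:
    """Split raw speech samples into fixed-length frames by window length.

    Framing is done directly on the sample points, without a Fourier
    transform; the tail that does not fill a whole window is zero-padded.
    """
    n_frames = int(np.ceil(len(samples) / window_len))
    padded = np.zeros(n_frames * window_len, dtype=samples.dtype)
    padded[: len(samples)] = samples
    return padded.reshape(n_frames, window_len)

# Example: one second of 16 kHz audio with 25 ms windows (400 samples each).
frames = frame_speech(np.random.randn(16000).astype(np.float32), 400)
```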
In an embodiment of this application, the number of first feature extraction modules may be five. Each first feature extraction module includes skip connections, which can capture a wider and deeper receptive field, and each includes a pooling layer and a fully connected layer, which can compute weights for different dimensions so as to obtain the best result, as shown in Fig. 3, which schematically shows the structure of a first feature extraction module according to an embodiment of this application.
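A rough PyTorch sketch of one such feature extraction module follows. The text fixes only the layer types, the skip connections, and the convolution-pooling-fully-connected ordering; the channel width, kernel size, number of convolutions, and pooled size here are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FeatureExtractionModule(nn.Module):
    """One feature extraction module: stacked 1-D convolutions joined by
    skip connections, then a pooling layer and a fully connected layer."""

    def __init__(self, channels: int = 128, n_convs: int = 3, pooled: int = 64):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=3, padding=1)
            for _ in range(n_convs)
        ])
        self.pool = nn.AdaptiveAvgPool1d(pooled)  # pooling on the conv output
        self.fc = nn.Linear(pooled, pooled)       # fully connected on the pooled output

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Skip connections: each convolution's output is added to its input,
        # letting deeper layers see a wider (more distant) receptive field.
        for conv in self.convs:
            x = x + torch.relu(conv(x))
        return self.fc(self.pool(x))
```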
In an embodiment of this application, the first high-dimensional encoding module may be an embedding layer composed of one-dimensional convolutional layers.
In an embodiment of this application, the first high-dimensional encoding module may be a high-dimensional nonlinear encoder.
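A minimal sketch of such an embedding layer built from a one-dimensional convolution; the channel dimensions (128 to 512) are illustrative assumptions:

```python
import torch.nn as nn

# High-dimensional encoding module as an embedding layer of 1-D convolutions;
# adding a nonlinearity makes it a simple high-dimensional nonlinear encoder.
high_dim_encoder = nn.Sequential(
    nn.Conv1d(in_channels=128, out_channels=512, kernel_size=1),
    nn.ReLU(),
)
```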
In an embodiment of this application, the first regression module may contain five hidden layers of a recurrent neural network (RNN), each with 4096 hidden states, with a focusing module added between every two RNN hidden layers so that the RNN hidden layers learn which positions to attend to, allowing the resulting low-dimensional sequence to better capture the sample features, as shown in Fig. 4, which schematically shows the structure of a first regression module according to an embodiment of this application. The focusing module may use an attention mechanism, which computes an attention weight for each hidden unit to achieve the focusing effect.
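The PyTorch sketch below mirrors this description: five RNN hidden layers of 4096 hidden states each, with an attention-style focusing layer between every two adjacent layers. The input and output dimensions, and the exact form of the attention weighting, are assumptions not fixed by the text:

```python
import torch
import torch.nn as nn

class FocusLayer(nn.Module):
    """Focusing layer between adjacent RNN hidden layers: attention weights
    are computed per time step and used to reweight the hidden sequence."""

    def __init__(self, hidden: int):
        super().__init__()
        self.score = nn.Linear(hidden, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.score(h), dim=1)  # h: (batch, time, hidden)
        return h * w

class RegressionModule(nn.Module):
    """Five RNN hidden layers (4096 states each) with a focusing layer
    between every two adjacent layers, projecting the high-dimensional
    sequence down to a low-dimensional one."""

    def __init__(self, in_dim: int = 512, hidden: int = 4096, out_dim: int = 256):
        super().__init__()
        dims = [in_dim] + [hidden] * 5
        self.rnns = nn.ModuleList([
            nn.RNN(dims[i], dims[i + 1], batch_first=True) for i in range(5)
        ])
        self.focus = nn.ModuleList([FocusLayer(hidden) for _ in range(4)])
        self.proj = nn.Linear(hidden, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for i, rnn in enumerate(self.rnns):
            x, _ = rnn(x)
            if i < len(self.focus):  # focusing layer between adjacent layers
                x = self.focus[i](x)
        return self.proj(x)
```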
In an embodiment of this application, the first alignment model may further include a first output module connected to the output of the first regression module and used to output the Mandarin low-dimensional sequence; the loss function of the first output module is a noise contrastive estimation loss function.
[Formula image in the original: Figure PCTCN2021084305-appb-000001]
This loss function is used to judge the prediction of future information: only the original result is treated as the positive sample, and all other results are treated as negative samples; the model is trained by continually training on the difference between positive and negative samples.
[Formula image in the original: Figure PCTCN2021084305-appb-000002]
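The two formula images above are not reproduced in this text. As a hedged reconstruction, a noise contrastive estimation loss of the kind just described, in which only the original future sample counts as the positive and all other results as negatives, is commonly written in the InfoNCE form:

```latex
\mathcal{L}_{\mathrm{NCE}}
  = -\,\mathbb{E}_{X}\!\left[
      \log \frac{\exp\big(f(x_{t+k},\, c_t)\big)}
                {\sum_{x_j \in X} \exp\big(f(x_j,\, c_t)\big)}
    \right]
```

Here c_t is the encoded context at time t, x_{t+k} is the true future sample, X is the set containing that positive together with the negative samples, and f scores their compatibility. Whether the patent's images show exactly this form cannot be confirmed from the placeholders.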
In an embodiment of this application, the second high-dimensional encoding module may be an embedding layer composed of one-dimensional convolutional layers.
In an embodiment of this application, the second high-dimensional encoding module may be a high-dimensional nonlinear encoder.
In an embodiment of this application, the structure of the second feature extraction module may be the same as that of the first feature extraction module, the structure of the second high-dimensional encoding module may be the same as that of the first high-dimensional encoding module, and the structure of the first regression module may be the same as that of the second regression module.
In an embodiment of this application, the second alignment model may further include a second output module connected to the output of the second regression module and used to output the dialect low-dimensional sequence; the loss function of the second output module is a noise contrastive estimation loss function.
In an embodiment of this application, the KL divergence between the output of the first alignment model and the output of the coding model may be computed as the degree of difference between the two outputs.
In an embodiment of this application, the cosine similarity distance between the output of the first alignment model and the output of the coding model may be computed as the degree of difference between the two outputs.
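Both variants of the degree of difference can be sketched as follows; treating the model outputs as distributions over the last dimension for the KL variant is an assumption:

```python
import torch
import torch.nn.functional as F

def output_difference(teacher_out: torch.Tensor,
                      student_out: torch.Tensor,
                      method: str = "kl") -> torch.Tensor:
    """Degree of difference between the first alignment model's output
    (teacher) and the coding model's output (student)."""
    if method == "kl":
        # KL(teacher || student), with both outputs normalized to
        # distributions over the last dimension.
        p = F.log_softmax(teacher_out, dim=-1)
        q = F.log_softmax(student_out, dim=-1)
        return F.kl_div(q, p, log_target=True, reduction="batchmean")
    # Cosine similarity distance: 1 - cos(teacher, student), averaged.
    return (1.0 - F.cosine_similarity(teacher_out, student_out, dim=-1)).mean()
```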
In an embodiment of this application, knowledge is distilled from the trained first alignment model into the second alignment model by an unsupervised knowledge distillation method to obtain the coding model; example Mandarin speech is input into the first alignment model, dialect speech with the same semantics as the example Mandarin speech is input into the coding model, the output of the first alignment model and the output of the coding model are obtained, the degree of difference between them is calculated, and the coding model is adjusted according to this degree of difference. No label annotation is needed during training: the difference between the two distributions can be computed clearly without high-quality annotated labels, while the accuracy of dialect speech recognition is still improved.
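One such distillation step might look like the sketch below, which reuses the output_difference helper above. Freezing the teacher and pairing each Mandarin batch with a same-semantics dialect batch in the data loader are assumptions; the patent does not fix these details:

```python
import torch

def distillation_step(teacher, student, mandarin_batch, dialect_batch, optimizer):
    """One unsupervised distillation step: example Mandarin speech goes
    through the frozen first alignment model, the same-semantics dialect
    speech through the coding model, and the coding model is adjusted by
    the degree of difference between the two outputs. No labels are used."""
    teacher.eval()
    with torch.no_grad():
        teacher_out = teacher(mandarin_batch)  # example Mandarin speech
    student_out = student(dialect_batch)       # same-semantics dialect speech

    loss = output_difference(teacher_out, student_out, method="kl")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```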
In an embodiment of this application, the process of obtaining the coding model based on the first alignment model and the second alignment model may include: inputting the example Mandarin speech into the first alignment model, and inputting dialect speech with the same semantics as the example Mandarin speech into the second alignment model; obtaining the output of the first alignment model and the output of the second alignment model, and calculating the degree of difference between them; and, based on that degree of difference, distilling knowledge from the first alignment model into the second alignment model to obtain the coding model.
In an embodiment of this application, the KL divergence between the output of the first alignment model and the output of the second alignment model may be computed as the degree of difference between the two outputs.
In an embodiment of this application, the cosine similarity distance between the output of the first alignment model and the output of the second alignment model may be computed as the degree of difference between the two outputs.
In an embodiment of this application, the first alignment model may be trained with an unsupervised training method, and the second alignment model may likewise be trained with an unsupervised training method.
Continuing with Fig. 2, in step S230, the low-dimensional sequence to be recognized is decoded to obtain text corresponding to the dialect speech to be recognized.
In an embodiment of this application, a trained GRU neural network may be used to decode the low-dimensional sequence to be recognized.
In an embodiment of this application, the GRU neural network may be trained using the dialect training sample set and a Mandarin training sample set adjusted according to the degree of difference.
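A minimal sketch of such a GRU decoder follows; the vocabulary size, hidden width, and greedy argmax decoding are illustrative assumptions standing in for whatever search the deployed system uses:

```python
import torch
import torch.nn as nn

class GRUDecoder(nn.Module):
    """Decode the low-dimensional sequence into per-step token logits."""

    def __init__(self, in_dim: int = 256, hidden: int = 512, vocab: int = 5000):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, low_dim_seq: torch.Tensor) -> torch.Tensor:
        h, _ = self.gru(low_dim_seq)
        return self.out(h)  # (batch, time, vocab) logits

# Greedy decoding: pick the most likely token at each time step.
decoder = GRUDecoder()
logits = decoder(torch.randn(1, 100, 256))
token_ids = logits.argmax(dim=-1)
```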
In an embodiment of this application, the dialects may include multiple varieties; each dialect corresponds to its own coding model, and when the coding model for a given dialect is trained, the dialect training sample set of that dialect is used.
In the embodiment of Fig. 2, dialect speech to be recognized is acquired and input into a coding model to obtain a low-dimensional sequence to be recognized corresponding to the dialect speech to be recognized. The coding model is obtained based on a first alignment model trained using a Mandarin training sample set and a second alignment model trained using a dialect training sample set. The first alignment model includes: multiple first feature extraction modules, used to extract Mandarin speech features from Mandarin speech samples, each first feature extraction module including a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, wherein there are multiple convolutional layers connected to one another by skip connections so as to capture features over a longer range; a first high-dimensional encoding module, connected to the output of the last of the first feature extraction modules and used to perform high-dimensional encoding of the Mandarin speech features to obtain a Mandarin high-dimensional sequence, so as to obtain features of multiple dimensions from the Mandarin speech samples; and a first regression module, connected to the output of the first high-dimensional encoding module and used to convert the Mandarin high-dimensional sequence into a Mandarin low-dimensional sequence, extracting the key features from the multi-dimensional features, the first regression module including multiple hidden layers of a recurrent neural network, with a focusing layer arranged between every two adjacent hidden layers so as to extract sample features more accurately. The second alignment model includes: multiple second feature extraction modules, used to extract dialect speech features from dialect speech samples, each second feature extraction module including a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, wherein there are multiple convolutional layers connected to one another by skip connections so as to capture features over a longer range; a second high-dimensional encoding module, connected to the output of the last of the second feature extraction modules and used to perform high-dimensional encoding of the dialect speech features to obtain a dialect high-dimensional sequence, so as to obtain features of multiple dimensions from the dialect speech samples; and a second regression module, connected to the output of the second high-dimensional encoding module and used to convert the dialect high-dimensional sequence into a dialect low-dimensional sequence, extracting the key features from the multi-dimensional features, the second regression module including multiple hidden layers of a recurrent neural network, with a focusing layer arranged between every two adjacent hidden layers so as to extract sample features more accurately. The process of obtaining the coding model based on the first alignment model and the second alignment model includes: distilling knowledge from the first alignment model into the second alignment model by an unsupervised knowledge distillation method to obtain the coding model; inputting example Mandarin speech into the first alignment model and inputting dialect speech with the same semantics as the example Mandarin speech into the coding model; obtaining the output of the first alignment model and the output of the coding model, and calculating the degree of difference between them; and adjusting the coding model based on that degree of difference. Although Mandarin and dialects differ to some extent, they also have much in common, so Mandarin samples can make up for the insufficient number of dialect speech samples; unsupervised distillation requires no data labeling, which simplifies the training process, and adjusting the coding based on the degree of difference also corrects for the differences between the Mandarin speech samples and the dialect speech samples, so that while the sample count is supplemented, the coding model's ability to recognize dialect speech is improved, allowing the resulting coding model to improve the accuracy of dialect speech recognition to a certain extent. The low-dimensional sequence to be recognized is decoded to obtain text corresponding to the dialect speech to be recognized. Because the codes output by the first alignment model and the second alignment model already contain the features of the training samples, no decoding is needed before knowledge distillation, which reduces the time required to train the coding model; no decoding is needed while the coding model is being obtained, and, compared with decoding separately in the first alignment model and the second alignment model, decoding after the coding model simplifies two decoding passes into one, so that a coding model that accurately recognizes dialect speech is obtained while the efficiency of dialect speech recognition is improved.
In this application, an unsupervised method is used to address the small amount of training data available in low-resource speech recognition. The unsupervised method can obtain general low-dimensional dialect features of speech, which resolves the scarcity of dialect speech data. In addition, using knowledge distillation to distill the parameters of a Mandarin model trained on a large amount of data into the dialect model can substantially improve the model's performance. The proposed algorithm therefore both solves the problem of limited training data in dialect training, so that the general low-level features obtained have a more robust representation, and uses knowledge distillation to improve data utilization and network performance.
The following describes apparatus embodiments of this application, which can be used to execute the dialect speech recognition method in the above embodiments of this application. For details not disclosed in the apparatus embodiments, please refer to the embodiments of the dialect speech recognition method described above.
Fig. 5 schematically shows a block diagram of a dialect speech recognition apparatus according to an embodiment of this application.
Referring to Fig. 5, a dialect speech recognition apparatus 500 according to an embodiment of this application includes an acquisition unit 501, an input unit 502, and a decoding unit 503.
In some embodiments of this application, based on the foregoing solutions, the acquisition unit 501 is configured to acquire dialect speech to be recognized; the input unit 502 is configured to input the dialect speech to be recognized into a coding model to obtain a low-dimensional sequence to be recognized corresponding to the dialect speech to be recognized, the coding model being obtained based on a first alignment model trained using a Mandarin training sample set and a second alignment model trained using a dialect training sample set, wherein the first alignment model includes: multiple first feature extraction modules, used to extract Mandarin speech features from Mandarin speech samples, each first feature extraction module including a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, wherein there are multiple convolutional layers connected to one another by skip connections; a first high-dimensional encoding module, connected to the output of the last of the first feature extraction modules and used to perform high-dimensional encoding of the Mandarin speech features to obtain a Mandarin high-dimensional sequence; and a first regression module, connected to the output of the first high-dimensional encoding module and used to convert the Mandarin high-dimensional sequence into a Mandarin low-dimensional sequence, the first regression module including multiple hidden layers of a recurrent neural network, with a focusing layer arranged between every two adjacent hidden layers; the second alignment model includes: multiple second feature extraction modules, used to extract dialect speech features from dialect speech samples, each second feature extraction module including a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, wherein there are multiple convolutional layers connected to one another by skip connections; a second high-dimensional encoding module, connected to the output of the last of the second feature extraction modules and used to perform high-dimensional encoding of the dialect speech features to obtain a dialect high-dimensional sequence; and a second regression module, connected to the output of the second high-dimensional encoding module and used to convert the dialect high-dimensional sequence into a dialect low-dimensional sequence, the second regression module including multiple hidden layers of a recurrent neural network, with a focusing layer arranged between every two adjacent hidden layers; the process of obtaining the coding model based on the first alignment model and the second alignment model includes: distilling knowledge from the first alignment model into the second alignment model by an unsupervised knowledge distillation method to obtain the coding model; inputting example Mandarin speech into the first alignment model and inputting dialect speech with the same semantics as the example Mandarin speech into the coding model; obtaining the output of the first alignment model and the output of the coding model, and calculating the degree of difference between them; and adjusting the coding model based on that degree of difference; and the decoding unit 503 is configured to decode the low-dimensional sequence to be recognized to obtain text corresponding to the dialect speech to be recognized.
In some embodiments of this application, based on the foregoing solutions, the acquisition unit 501 is configured to: acquire the window length of the dialect speech to be recognized; split the dialect speech to be recognized into frames according to the window length; and input the framed dialect speech into the coding model.
In some embodiments of this application, based on the foregoing solutions, the first high-dimensional encoding module is an embedding layer composed of one-dimensional convolutional layers.
In some embodiments of this application, based on the foregoing solutions, the first alignment model further includes a first output module connected to the output of the first regression module and used to output the Mandarin low-dimensional sequence; the loss function of the first output module is a noise contrastive estimation loss function.
In some embodiments of this application, based on the foregoing solutions, the second high-dimensional encoding module is an embedding layer composed of one-dimensional convolutional layers.
In some embodiments of this application, based on the foregoing solutions, the second alignment model further includes a second output module connected to the output of the second regression module and used to output the dialect low-dimensional sequence; the loss function of the second output module is a noise contrastive estimation loss function.
In some embodiments of this application, based on the foregoing solutions, the acquisition unit 501 is configured to: input the example Mandarin speech into the first alignment model, and input dialect speech with the same semantics as the example Mandarin speech into the second alignment model; obtain the output of the first alignment model and the output of the second alignment model, and calculate the degree of difference between them; and, based on that degree of difference, distill knowledge from the first alignment model into the second alignment model to obtain the coding model.
Those skilled in the art can understand that the various aspects of this application may be implemented as a system, a method, or a program product. Therefore, the various aspects of this application may be embodied in the following forms: an entirely hardware implementation, an entirely software implementation (including firmware, microcode, etc.), or an implementation combining hardware and software, which may be collectively referred to here as a "circuit", "module", or "system".
An electronic device 60 according to this implementation of this application is described below with reference to Fig. 6. The electronic device 60 shown in Fig. 6 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of this application.
As shown in Fig. 6, the electronic device 60 takes the form of a general-purpose computing device. Its components may include, but are not limited to: the at least one processing unit 61, the at least one storage unit 62, a bus 63 connecting different system components (including the storage unit 62 and the processing unit 61), and a display unit 64.
The storage unit stores program code that can be executed by the processing unit 61, so that the processing unit 61 performs the steps according to various exemplary implementations of this application described in the "Embodiment Methods" section above of this specification.
The storage unit 62 may include readable media in the form of volatile storage units, such as a random access memory (RAM) 621 and/or a cache 622, and may further include a read-only memory (ROM) 623.
The storage unit 62 may also include a program/utility 624 having a set of (at least one) program modules 625, including but not limited to: an operating system, one or more applications, other program modules, and program data; each or some combination of these examples may include an implementation of a network environment.
The bus 63 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.
The electronic device 60 may also communicate with one or more external devices (such as a keyboard, a pointing device, or a Bluetooth device), with one or more devices that enable a user to interact with the electronic device 60, and/or with any device (such as a router or modem) that enables the electronic device 60 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 65. The electronic device 60 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 66. As shown, the network adapter 66 communicates with the other modules of the electronic device 60 through the bus 63. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 60, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
Through the above description of the implementations, those skilled in the art will readily understand that the example implementations described here may be realized in software, or in software combined with the necessary hardware. Therefore, the technical solutions according to the implementations of this application may be embodied in a software product, which may be stored on a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) or on a network, and which includes instructions that cause a computing device (such as a personal computer, a server, a terminal apparatus, or a network device) to execute the method according to the implementations of this application.
According to an embodiment of this application, a computer-readable storage medium is also provided; the computer-readable storage medium may be non-volatile or volatile, and stores a program product capable of implementing the above method of this specification. In some possible implementations, the various aspects of this application may also be implemented in the form of a program product including program code; when the program product runs on a terminal device, the program code causes the terminal device to execute the steps according to various exemplary implementations of this application described in the "Exemplary Methods" section above of this specification.
Referring to Fig. 7, a program product 70 for implementing the above method according to an implementation of this application is described; it may take the form of a portable compact disc read-only memory (CD-ROM), includes program code, and can run on a terminal device such as a personal computer. However, the program product of this application is not limited to this; in this document, a readable storage medium may be any tangible medium that contains or stores a program usable by, or in connection with, an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate, or transmit a program for use by, or in connection with, an instruction execution system, apparatus, or device.
The program code contained on a readable medium may be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, or any suitable combination of the above.
Program code for carrying out the operations of this application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server. Where a remote computing device is involved, it may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
In addition, the above drawings are merely schematic illustrations of the processing included in the methods according to the exemplary embodiments of this application and are not intended to be limiting. It is easy to understand that the processing shown in the above drawings does not indicate or limit the temporal order of these processes, and that these processes may be executed, for example, synchronously or asynchronously in multiple modules.
It should be understood that this application is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of this application is limited only by the appended claims.

Claims (20)

  1. A dialect speech recognition method, wherein:
    acquiring dialect speech to be recognized;
    inputting the dialect speech to be recognized into a coding model to obtain a low-dimensional sequence to be recognized corresponding to the dialect speech to be recognized, the coding model being obtained based on a first alignment model trained using a Mandarin training sample set and a second alignment model trained using a dialect training sample set, wherein
    the first alignment model includes:
    multiple first feature extraction modules, used to extract Mandarin speech features from Mandarin speech samples, each of the first feature extraction modules including a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, wherein there are multiple convolutional layers, connected to one another by skip connections;
    a first high-dimensional encoding module, connected to the output of the last of the multiple first feature extraction modules, and used to perform high-dimensional encoding of the Mandarin speech features to obtain a Mandarin high-dimensional sequence;
    a first regression module, connected to the output of the first high-dimensional encoding module, and used to convert the Mandarin high-dimensional sequence into a Mandarin low-dimensional sequence, the first regression module including hidden layers of a recurrent neural network, wherein there are multiple hidden layers, with a focusing layer arranged between every two adjacent hidden layers;
    the second alignment model includes:
    multiple second feature extraction modules, used to extract dialect speech features from dialect speech samples, each of the second feature extraction modules including a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, wherein there are multiple convolutional layers, connected to one another by skip connections;
    a second high-dimensional encoding module, connected to the output of the last of the multiple second feature extraction modules, and used to perform high-dimensional encoding of the dialect speech features to obtain a dialect high-dimensional sequence;
    a second regression module, connected to the output of the second high-dimensional encoding module, and used to convert the dialect high-dimensional sequence into a dialect low-dimensional sequence, the second regression module including hidden layers of a recurrent neural network, wherein there are multiple hidden layers, with a focusing layer arranged between every two adjacent hidden layers;
    a process of obtaining the coding model based on the first alignment model and the second alignment model includes:
    distilling knowledge from the first alignment model into the second alignment model by an unsupervised knowledge distillation method to obtain the coding model;
    inputting example Mandarin speech into the first alignment model, and inputting dialect speech with the same semantics as the example Mandarin speech into the coding model;
    obtaining an output of the first alignment model and an output of the coding model, and calculating a degree of difference between the output of the first alignment model and the output of the coding model;
    adjusting the coding model based on the degree of difference between the output of the first alignment model and the output of the coding model; and
    decoding the low-dimensional sequence to be recognized to obtain text corresponding to the dialect speech to be recognized.
  2. The dialect speech recognition method according to claim 1, wherein inputting the dialect speech to be recognized into the coding model to obtain the low-dimensional sequence to be recognized corresponding to the dialect speech to be recognized includes:
    acquiring a window length of the dialect speech to be recognized;
    splitting the dialect speech to be recognized into frames according to the window length; and
    inputting the framed dialect speech to be recognized into the coding model.
  3. The dialect speech recognition method according to claim 1, wherein the first high-dimensional encoding module is an embedding layer composed of one-dimensional convolutional layers.
  4. The dialect speech recognition method according to claim 1, wherein the first alignment model further includes a first output module, the first output module being connected to the output of the first regression module and used to output the Mandarin low-dimensional sequence, and a loss function of the first output module being a noise contrastive estimation loss function.
  5. The dialect speech recognition method according to claim 1, wherein the second high-dimensional encoding module is an embedding layer composed of one-dimensional convolutional layers.
  6. The dialect speech recognition method according to claim 1, wherein the second alignment model further includes a second output module, the second output module being connected to the output of the second regression module and used to output the dialect low-dimensional sequence, and a loss function of the second output module being a noise contrastive estimation loss function.
  7. The dialect speech recognition method according to claim 1, wherein the process of obtaining the coding model based on the first alignment model and the second alignment model includes:
    inputting the example Mandarin speech into the first alignment model, and inputting dialect speech with the same semantics as the example Mandarin speech into the second alignment model;
    obtaining the output of the first alignment model and an output of the second alignment model, and calculating a degree of difference between the output of the first alignment model and the output of the second alignment model; and
    based on the degree of difference between the output of the first alignment model and the output of the second alignment model, distilling knowledge from the first alignment model into the second alignment model to obtain the coding model.
  8. A dialect speech recognition apparatus, wherein the apparatus includes:
    an acquisition unit, configured to acquire dialect speech to be recognized;
    an input unit, configured to input the dialect speech to be recognized into a coding model to obtain a low-dimensional sequence to be recognized corresponding to the dialect speech to be recognized, the coding model being obtained based on a first alignment model trained using a Mandarin training sample set and a second alignment model trained using a dialect training sample set, wherein
    the first alignment model includes:
    multiple first feature extraction modules, used to extract Mandarin speech features from Mandarin speech samples, each of the first feature extraction modules including a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, wherein there are multiple convolutional layers, connected to one another by skip connections;
    a first high-dimensional encoding module, connected to the output of the last of the multiple first feature extraction modules, and used to perform high-dimensional encoding of the Mandarin speech features to obtain a Mandarin high-dimensional sequence;
    a first regression module, connected to the output of the first high-dimensional encoding module, and used to convert the Mandarin high-dimensional sequence into a Mandarin low-dimensional sequence, the first regression module including hidden layers of a recurrent neural network, wherein there are multiple hidden layers, with a focusing layer arranged between every two adjacent hidden layers;
    the second alignment model includes:
    multiple second feature extraction modules, used to extract dialect speech features from dialect speech samples, each of the second feature extraction modules including a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, wherein there are multiple convolutional layers, connected to one another by skip connections;
    a second high-dimensional encoding module, connected to the output of the last of the multiple second feature extraction modules, and used to perform high-dimensional encoding of the dialect speech features to obtain a dialect high-dimensional sequence;
    a second regression module, connected to the output of the second high-dimensional encoding module, and used to convert the dialect high-dimensional sequence into a dialect low-dimensional sequence, the second regression module including hidden layers of a recurrent neural network, wherein there are multiple hidden layers, with a focusing layer arranged between every two adjacent hidden layers;
    a process of obtaining the coding model based on the first alignment model and the second alignment model includes:
    distilling knowledge from the first alignment model into the second alignment model by an unsupervised knowledge distillation method to obtain the coding model;
    inputting example Mandarin speech into the first alignment model, and inputting dialect speech with the same semantics as the example Mandarin speech into the coding model;
    obtaining an output of the first alignment model and an output of the coding model, and calculating a degree of difference between the output of the first alignment model and the output of the coding model;
    adjusting the coding model based on the degree of difference between the output of the first alignment model and the output of the coding model; and
    a decoding unit, configured to decode the low-dimensional sequence to be recognized to obtain text corresponding to the dialect speech to be recognized.
  9. A computer-readable program medium storing computer program instructions, wherein:
    when executed by a computer, the computer program instructions cause the computer to execute:
    acquiring dialect speech to be recognized;
    inputting the dialect speech to be recognized into a coding model to obtain a low-dimensional sequence to be recognized corresponding to the dialect speech to be recognized, the coding model being obtained based on a first alignment model trained using a Mandarin training sample set and a second alignment model trained using a dialect training sample set, wherein
    the first alignment model includes:
    multiple first feature extraction modules, used to extract Mandarin speech features from Mandarin speech samples, each of the first feature extraction modules including a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, wherein there are multiple convolutional layers, connected to one another by skip connections;
    a first high-dimensional encoding module, connected to the output of the last of the multiple first feature extraction modules, and used to perform high-dimensional encoding of the Mandarin speech features to obtain a Mandarin high-dimensional sequence;
    a first regression module, connected to the output of the first high-dimensional encoding module, and used to convert the Mandarin high-dimensional sequence into a Mandarin low-dimensional sequence, the first regression module including hidden layers of a recurrent neural network, wherein there are multiple hidden layers, with a focusing layer arranged between every two adjacent hidden layers;
    the second alignment model includes:
    multiple second feature extraction modules, used to extract dialect speech features from dialect speech samples, each of the second feature extraction modules including a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, wherein there are multiple convolutional layers, connected to one another by skip connections;
    a second high-dimensional encoding module, connected to the output of the last of the multiple second feature extraction modules, and used to perform high-dimensional encoding of the dialect speech features to obtain a dialect high-dimensional sequence;
    a second regression module, connected to the output of the second high-dimensional encoding module, and used to convert the dialect high-dimensional sequence into a dialect low-dimensional sequence, the second regression module including hidden layers of a recurrent neural network, wherein there are multiple hidden layers, with a focusing layer arranged between every two adjacent hidden layers;
    a process of obtaining the coding model based on the first alignment model and the second alignment model includes:
    distilling knowledge from the first alignment model into the second alignment model by an unsupervised knowledge distillation method to obtain the coding model;
    inputting example Mandarin speech into the first alignment model, and inputting dialect speech with the same semantics as the example Mandarin speech into the coding model;
    obtaining an output of the first alignment model and an output of the coding model, and calculating a degree of difference between the output of the first alignment model and the output of the coding model;
    adjusting the coding model based on the degree of difference between the output of the first alignment model and the output of the coding model; and
    decoding the low-dimensional sequence to be recognized to obtain text corresponding to the dialect speech to be recognized.
  10. The computer-readable program medium according to claim 9, wherein inputting the dialect speech to be recognized into the coding model to obtain the low-dimensional sequence to be recognized corresponding to the dialect speech to be recognized includes:
    acquiring a window length of the dialect speech to be recognized;
    splitting the dialect speech to be recognized into frames according to the window length; and
    inputting the framed dialect speech to be recognized into the coding model.
  11. The computer-readable program medium according to claim 9, wherein the first high-dimensional encoding module is an embedding layer composed of one-dimensional convolutional layers.
  12. The computer-readable program medium according to claim 9, wherein the first alignment model further includes a first output module, the first output module being connected to the output of the first regression module and used to output the Mandarin low-dimensional sequence, and a loss function of the first output module being a noise contrastive estimation loss function.
  13. The computer-readable program medium according to claim 9, wherein the second high-dimensional encoding module is an embedding layer composed of one-dimensional convolutional layers.
  14. The computer-readable program medium according to claim 9, wherein the second alignment model further includes a second output module, the second output module being connected to the output of the second regression module and used to output the dialect low-dimensional sequence, and a loss function of the second output module being a noise contrastive estimation loss function.
  15. The computer-readable program medium according to claim 9, wherein the process of obtaining the coding model based on the first alignment model and the second alignment model includes:
    inputting the example Mandarin speech into the first alignment model, and inputting dialect speech with the same semantics as the example Mandarin speech into the second alignment model;
    obtaining the output of the first alignment model and an output of the second alignment model, and calculating a degree of difference between the output of the first alignment model and the output of the second alignment model; and
    based on the degree of difference between the output of the first alignment model and the output of the second alignment model, distilling knowledge from the first alignment model into the second alignment model to obtain the coding model.
  16. An electronic device, wherein the electronic device includes:
    a processor; and
    a memory storing computer-readable instructions; wherein, when executed by the processor, the computer-readable instructions implement:
    acquiring dialect speech to be recognized;
    inputting the dialect speech to be recognized into a coding model to obtain a low-dimensional sequence to be recognized corresponding to the dialect speech to be recognized, the coding model being obtained based on a first alignment model trained using a Mandarin training sample set and a second alignment model trained using a dialect training sample set, wherein
    the first alignment model includes:
    multiple first feature extraction modules, used to extract Mandarin speech features from Mandarin speech samples, each of the first feature extraction modules including a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, wherein there are multiple convolutional layers, connected to one another by skip connections;
    a first high-dimensional encoding module, connected to the output of the last of the multiple first feature extraction modules, and used to perform high-dimensional encoding of the Mandarin speech features to obtain a Mandarin high-dimensional sequence;
    a first regression module, connected to the output of the first high-dimensional encoding module, and used to convert the Mandarin high-dimensional sequence into a Mandarin low-dimensional sequence, the first regression module including hidden layers of a recurrent neural network, wherein there are multiple hidden layers, with a focusing layer arranged between every two adjacent hidden layers;
    the second alignment model includes:
    multiple second feature extraction modules, used to extract dialect speech features from dialect speech samples, each of the second feature extraction modules including a convolutional layer, a pooling layer connected to the output of the convolutional layer, and a fully connected layer connected to the output of the pooling layer, wherein there are multiple convolutional layers, connected to one another by skip connections;
    a second high-dimensional encoding module, connected to the output of the last of the multiple second feature extraction modules, and used to perform high-dimensional encoding of the dialect speech features to obtain a dialect high-dimensional sequence;
    a second regression module, connected to the output of the second high-dimensional encoding module, and used to convert the dialect high-dimensional sequence into a dialect low-dimensional sequence, the second regression module including hidden layers of a recurrent neural network, wherein there are multiple hidden layers, with a focusing layer arranged between every two adjacent hidden layers;
    a process of obtaining the coding model based on the first alignment model and the second alignment model includes:
    distilling knowledge from the first alignment model into the second alignment model by an unsupervised knowledge distillation method to obtain the coding model;
    inputting example Mandarin speech into the first alignment model, and inputting dialect speech with the same semantics as the example Mandarin speech into the coding model;
    obtaining an output of the first alignment model and an output of the coding model, and calculating a degree of difference between the output of the first alignment model and the output of the coding model;
    adjusting the coding model based on the degree of difference between the output of the first alignment model and the output of the coding model; and
    decoding the low-dimensional sequence to be recognized to obtain text corresponding to the dialect speech to be recognized.
  17. The electronic device according to claim 16, wherein inputting the dialect speech to be recognized into the coding model to obtain the low-dimensional sequence to be recognized corresponding to the dialect speech to be recognized includes:
    acquiring a window length of the dialect speech to be recognized;
    splitting the dialect speech to be recognized into frames according to the window length; and
    inputting the framed dialect speech to be recognized into the coding model.
  18. The electronic device according to claim 16, wherein the first high-dimensional encoding module is an embedding layer composed of one-dimensional convolutional layers.
  19. The electronic device according to claim 16, wherein the first alignment model further includes a first output module, the first output module being connected to the output of the first regression module and used to output the Mandarin low-dimensional sequence, and a loss function of the first output module being a noise contrastive estimation loss function.
  20. The electronic device according to claim 16, wherein the second high-dimensional encoding module is an embedding layer composed of one-dimensional convolutional layers.
PCT/CN2021/084305 2020-11-25 2021-03-31 Dialect speech recognition method and apparatus, medium, and electronic device WO2021213161A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011339518.XA CN112509555B (zh) 2020-11-25 2020-11-25 Dialect speech recognition method and apparatus, medium, and electronic device
CN202011339518.X 2020-11-25

Publications (1)

Publication Number Publication Date
WO2021213161A1 true WO2021213161A1 (zh) 2021-10-28

Family

ID=74958592

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/084305 WO2021213161A1 (zh) 2020-11-25 2021-03-31 Dialect speech recognition method and apparatus, medium, and electronic device

Country Status (2)

Country Link
CN (1) CN112509555B (zh)
WO (1) WO2021213161A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117558264A (zh) * 2024-01-12 2024-02-13 联通(广东)产业互联网有限公司 Dialect speech recognition training method and system based on self-knowledge distillation

Families Citing this family (7)

Publication number Priority date Publication date Assignee Title
CN112509555B (zh) 2020-11-25 2023-05-23 平安科技(深圳)有限公司 Dialect speech recognition method and apparatus, medium, and electronic device
JP7381814B2 (ja) * 2020-12-15 2023-11-16 之江実験室 Automatic compression method and platform for multi-task pre-trained language models
CN113345451B (zh) * 2021-04-26 2023-08-22 北京搜狗科技发展有限公司 Voice conversion method and apparatus, and electronic device
CN113192491B (zh) * 2021-04-28 2024-05-03 平安科技(深圳)有限公司 Acoustic model generation method and apparatus, computer device, and storage medium
CN112885351B (zh) * 2021-04-30 2021-07-23 浙江非线数联科技股份有限公司 Dialect speech recognition method and apparatus based on transfer learning
CN113327600B (zh) * 2021-06-30 2024-07-23 北京有竹居网络技术有限公司 Training method, apparatus, and device for a speech recognition model
CN114171013B (zh) * 2021-12-31 2024-10-18 西安讯飞超脑信息科技有限公司 Speech recognition method, apparatus, device, and storage medium

Citations (5)

Publication number Priority date Publication date Assignee Title
CN108172218A (zh) * 2016-12-05 2018-06-15 中国移动通信有限公司研究院 Speech modeling method and apparatus
CN110211565A (zh) * 2019-05-06 2019-09-06 平安科技(深圳)有限公司 Dialect recognition method and apparatus, and computer-readable storage medium
US20200013391A1 (en) * 2019-06-18 2020-01-09 Lg Electronics Inc. Acoustic information based language modeling system and method
CN111243575A (zh) * 2020-01-15 2020-06-05 北京工业大学 Dialect variety recognition method based on dilated convolutional neural networks
CN112509555A (zh) * 2020-11-25 2021-03-16 平安科技(深圳)有限公司 Dialect speech recognition method and apparatus, medium, and electronic device

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
CN109961775A (zh) * 2017-12-15 2019-07-02 中国移动通信集团安徽有限公司 HMM-based dialect recognition method, apparatus, device, and medium
CN110033757A (zh) * 2019-04-04 2019-07-19 行知技术有限公司 Human voice recognition algorithm
CN110033760B (zh) * 2019-04-15 2021-01-29 北京百度网讯科技有限公司 Modeling method, apparatus, and device for speech recognition
CN110706690B (zh) * 2019-09-16 2024-06-25 平安科技(深圳)有限公司 Speech recognition method and apparatus
CN111145728B (zh) * 2019-12-05 2022-10-28 厦门快商通科技股份有限公司 Speech recognition model training method and system, mobile terminal, and storage medium
CN111326157B (zh) * 2020-01-20 2023-09-08 抖音视界有限公司 Text generation method and apparatus, electronic device, and computer-readable medium
CN111540367B (zh) * 2020-04-17 2023-03-31 合肥讯飞数码科技有限公司 Speech feature extraction method and apparatus, electronic device, and storage medium


Also Published As

Publication number Publication date
CN112509555B (zh) 2023-05-23
CN112509555A (zh) 2021-03-16


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21793822

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21793822

Country of ref document: EP

Kind code of ref document: A1