WO2022133915A1 - Speech recognition system and method automatically trained by means of speech synthesis method - Google Patents

Speech recognition system and method automatically trained by means of speech synthesis method

Info

Publication number
WO2022133915A1
WO2022133915A1
Authority
WO
WIPO (PCT)
Prior art keywords
module
data
speech
training
speech recognition
Prior art date
Application number
PCT/CN2020/139051
Other languages
French (fr)
Chinese (zh)
Inventor
范小朋
苏充则
严伟玮
Original Assignee
杭州中科先进技术研究院有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 杭州中科先进技术研究院有限公司
Priority to PCT/CN2020/139051
Publication of WO2022133915A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice

Definitions

  • The present application belongs to the technical field of artificial intelligence, and in particular relates to a speech recognition system and method automatically trained by a speech synthesis method.
  • A neural-network-based speech recognition model relies on a large amount of training data. If the training data set is not large enough, the model trains poorly and the recognition rate is low.
  • Methods based on machine learning and deep learning perform remarkably well in artificial intelligence applications such as image recognition and speech recognition.
  • These capabilities rest on neural networks and large amounts of data, and neural networks place high demands on the volume of training data.
  • The data in Table 1 show that training a speech recognition model usually requires thousands of hours of speech data with corresponding label data before the model reaches a high recognition rate. Because a neural-network speech recognition model depends on a large training data set, an insufficient data set leads to poor training and a low recognition rate.
  • To address this problem, the present application provides a speech recognition system and method automatically trained by a speech synthesis method.
  • The system comprises a speech collection module, a speech recognition module, a user error correction module, a collector and a speech synthesis module. The speech collection module, the speech recognition module, the user error correction module, the collector and the speech recognition module are communicatively connected in sequence; the speech collection module, the speech synthesis module, the collector and the speech recognition module are likewise communicatively connected in sequence.
  • The speech recognition module includes a test set sub-module and a training set sub-module; the test set sub-module is communicatively connected with the user error correction module, and the training set sub-module with the collector.
  • The speech collection module, the test set sub-module, the user error correction module, the collector and the training set sub-module are communicatively connected in sequence; the speech collection module, the speech synthesis module, the collector and the training set sub-module are likewise connected in sequence.
  • The speech synthesis module includes a text collection sub-module, which is used to collect text data.
  • The collector includes a first collecting part and a second collecting part, each communicatively connected with the training set sub-module.
  • The present application further provides a speech recognition method automatically trained by the speech synthesis method, in which the above speech recognition system is applied.
  • The method includes the following steps. Step 1: collect (target) user voice data. Step 2: extract the voice features of the voice data and perform speech synthesis to obtain speech synthesis data and the label data corresponding to it; collect the speech synthesis data and the label data, and verify them against each other to obtain first training data. Step 3: perform speech recognition on the voice data, detect and correct the recognition results to obtain error correction data, collect the voice data together with its error correction data, and verify them against each other to obtain second training data. Step 4: train on the first training data and the second training data, and update the automatically trained speech recognition system according to the training results.
  • Step 2 includes checking whether the speech synthesis data has been updated; if so, the updated speech synthesis data and the corresponding label data are verified, and once the verification passes, the speech synthesis data and the label data are collected.
  • The first training data and the second training data are input to the training set sub-module. When the amount of data in the training set sub-module reaches a specific threshold, training is performed automatically on the training data, and the speech recognition model is updated according to the training results.
  • The speech recognition module checks the state of the automatic training process: if training is in progress, it waits for the training to finish; if training has stopped, it inspects the stopped training process.
  • Inspecting a stopped training process includes verifying the number of training rounds. If all rounds have been completed, the automatically trained speech recognition model is saved, the whole process ends, and the system returns to its initial state to wait for the next round to start. If the rounds are not complete, the training process is judged to have been interrupted, and training is adjusted according to the cause of the interruption and restarted.
  • The speech recognition system and method provided by the present application overcome the poor training effect and low recognition rate that a neural-network-based speech recognition model suffers when the training set is too small.
  • In the provided method, speech data is generated automatically from the user's voice characteristics by the speech synthesis method. The resulting data set contains speech data carrying the user's voice characteristics together with the corresponding labels, so it can be used directly as a training set in the automatically trained speech recognition system. At the same time, the user's speech is recognized by the automatically trained speech recognition model; after the user corrects the recognition results, this second data set (the user's voice data plus the corrected labels) can also be used as a training set.
  • The method avoids manually labeling speech data, saving labor and time.
  • It can train a new speech recognition model quickly, without searching for a suitable data set, so it is efficient and widely applicable.
  • The speech recognition model can be trained automatically on the synthesized speech data, which is highly efficient.
  • The method adds an active user error correction step, so the recognition rate can improve continuously based on test results.
  • FIG. 1 is a schematic diagram of a speech recognition system automatically trained by a speech synthesis method of the present application;
  • FIG. 2 is a schematic diagram of the working process of the speech recognition system automatically trained by the speech synthesis method of the present application.
  • The present application provides a speech recognition system automatically trained by a speech synthesis method, comprising a speech collection module 1, a speech recognition module 2, a user error correction module 3, a collector 4 and a speech synthesis module 5. The speech collection module 1, the speech recognition module 2, the user error correction module 3, the collector 4 and the speech recognition module 2 are communicatively connected in sequence; the speech collection module 1, the speech synthesis module 5, the collector 4 and the speech recognition module 2 are likewise connected in sequence.
  • The speech recognition module 2 here covers both a speech recognition process and a speech recognition model training process (hereinafter: the recognition process and the training process).
  • Data corrected by the user error correction module 3 re-enters the speech recognition module 2 through the collector 4 for the training process.
  • Data synthesized by the speech synthesis module 5 likewise enters the speech recognition module 2 through the collector 4 for the training process.
  • In this application, the speech recognition module 2 is a module built around a speech recognition method, and the speech synthesis module 5 is a module built around a speech synthesis method.
  • The purpose of this application is to provide a speech recognition method automatically trained by a speech synthesis method, with which a small amount of speech data suffices to automatically generate a large amount of speech data carrying the speaker's voice characteristics. This speech data and its corresponding labels are added to the training set automatically, and the speech recognition system is trained automatically, overcoming the poor training effect and low recognition rate caused by an insufficient training data set. The approach greatly reduces the amount of voice data that must be collected to train the system, avoids the tedious process of labeling a voice data set by hand, and offers a new way to train and test a speech recognition system during its development.
  • The speech recognition module 2 includes a test set sub-module and a training set sub-module; the test set sub-module is communicatively connected with the user error correction module, and the training set sub-module with the collector.
  • The voice collection module 1, the test set sub-module, the user error correction module 3, the collector 4 and the training set sub-module are communicatively connected in sequence; the voice collection module 1, the voice synthesis module 5, the collector 4 and the training set sub-module are likewise connected in sequence.
  • The speech synthesis module 5 includes a text collection sub-module used to collect text data.
  • The collector 4 includes a first collecting part and a second collecting part, each communicatively connected with the training set sub-module.
  • The voice collection module 1, the test set sub-module, the user error correction module 3, the first collecting part and the training set sub-module are communicatively connected in sequence; the voice collection module 1, the voice synthesis module 5, the second collecting part and the training set sub-module are likewise connected in sequence.
  • Step 1: collect (target) user voice data. Specifically: step 1.1, collect user voice data with a recording device (the voice collection module) according to the specifications of the voice data set, forming data set A; step 1.2, if the data set is to be used for speech synthesis, go to step 2; if it is to serve as a test set in the automatically trained speech recognition system, go to step 3.
  • Step 2: perform speech synthesis on the speech data from step 1 through the speech synthesis module. Specifically: step 2.1, extract speech features from data set A with a feature extraction method (for example, MFCC); step 2.2, use the extracted features as parameters of the speech synthesis method (for example, SV2TTS or GAN-TTS), take the text data gathered by the text collection sub-module as data set B, and feed data set A into the speech synthesis module for synthesis; step 2.3, the module checks whether the synthesis of step 2.2 is complete, returning to step 2.2 if not and proceeding otherwise; step 2.4, the speech data generated by the synthesis method becomes data set C; step 2.5, the first collecting part periodically checks whether the generated speech data has been updated, proceeding to step 2.6 if so and doing nothing otherwise; step 2.6, the first collecting part verifies the generated speech data (data set C) against the corresponding label data (data set B), proceeding to step 2.7 when every utterance has a label and returning to step 2.1 otherwise; step 2.7, the first collecting part adds the speech data and the corresponding label data to the training set of the automatically trained speech recognition system, and the flow continues at step 4.
  • Step 3: perform speech recognition on the voice data from step 1 through the speech recognition module. Specifically: step 3.1, store the user voice data collected in step 1 (data set A) in the test set used by the recognition process of the speech recognition module; step 3.2, run the speech recognition model of the automatically trained system on that test set, i.e. the speech recognition process; step 3.3, the recognition module checks whether recognition has finished, continuing if not, saving the recognition results and proceeding to step 3.4 if so, and returning to step 3.2 if recognition was interrupted; step 3.4, carry out the user error correction process on the recognition results of step 3.3 and proceed to step 3.5; step 3.5, the user error correction module checks whether correction is complete, and if so saves the corrected data as labels corresponding to data set A (data set D) and proceeds to step 3.6, otherwise it returns to step 3.4; step 3.6, the second collecting part checks whether the user voice data (data set A) has corresponding user-corrected labels (data set D), proceeding to step 3.7 if so and returning to step 3.4 if not; step 3.7, the second collecting part puts the user voice data (data set A) and the corrected labels (data set D) into the training set of the speech recognition module, and the flow continues at step 4.
  • Step 4: train through the speech recognition module on the data added to its training set in step 2 (speech synthesis module) or step 3 (speech recognition module). Specifically: step 4.1, check the data added to the training set in step 2 or step 3; once the amount of data reaches a certain threshold, the speech recognition module automatically trains on the training set; step 4.2, the module checks the state of the automatic training process, waiting if training is in progress and proceeding to step 4.3 once it has stopped; step 4.3, the module inspects the stopped training process; if all training rounds are complete, the automatically trained speech recognition model is saved, the whole process ends, and the system returns to the initial state (step 1) to wait for the next round; if the rounds are incomplete, the training process is judged interrupted, adjustments are made according to the cause, training restarts, and the flow returns to step 4.1.
  • The present application also provides a speech recognition method automatically trained by the speech synthesis method, in which the speech recognition system described above is applied. The method includes the steps detailed below: collecting user voice data, synthesizing labeled speech from the user's voice features, testing and correcting recognition results, and training on the collected data.
  • Step 2 includes checking whether the speech synthesis data has been updated; if so, the updated speech synthesis data and the corresponding label data are verified, and after verification passes, the speech synthesis data and the label data are collected.
  • The first training data and the second training data are input to the training set sub-module; when the amount of data in the training set sub-module reaches a specific threshold, training is performed on the data.
  • The speech recognition module checks the state of the automatic training process: if training is in progress, it waits for it to finish; if training has stopped, it checks the first training data and the second training data and processes the data.
  • Processing the data includes inspecting the stopped automatic training process. If the training rounds are complete, the automatically trained speech recognition model is saved, the whole process ends, and the system returns to the initial state to wait for the next round; if the rounds are incomplete, the training process is judged interrupted and training restarts according to the cause of the interruption.
  • The present application combines neural-network-based speech recognition with speech synthesis: speech data is generated by the synthesis method and used to train and test the recognizer automatically, solving the shortage of training data faced by existing neural-network speech recognition methods, and user error correction is used to fix recognition results and feed them back into the training set.
  • Neural-network-based speech recognition and speech synthesis are two independent research fields that earlier work had not combined. To solve the shortage of training data faced by neural-network speech recognition, the present application proposes a speech recognition method automatically trained by a speech synthesis method. Speech data is generated automatically from the user's voice characteristics; because this data carries the user's voice characteristics and comes with corresponding labels, it can be used directly as a training set. Meanwhile, the user's speech is recognized by the automatically trained recognizer, and after the user's error correction, that data (the voice data plus corrected labels) can also be used as a training set.
  • This application takes a test on real voice data as an example: a small amount of real voice data is used to synthesize speech carrying the real voice's characteristics. Experiments demonstrate the feasibility and efficiency of the proposed method, which provides an effective speech data set for speech recognition automatically trained by the speech synthesis method.
  • Voice features are extracted from the user's voice data (data set A), and the speech synthesis module generates voice data (data set C) from the input text (data set B) according to those features. After the collector verifies that data sets B and C correspond to each other, both are stored in the training set of the speech recognition module. This verifies steps 1 and 2.
  • The speech recognition module trains on the data in the training set and, when training completes, saves the new speech recognition model into the automatically trained system. This independently verifies step 4.
  • The speech recognition module runs the recognition process on the user's speech data (data set A) to obtain recognition results; the user error correction module corrects these results and saves them as data set D; after the collector verifies that data sets A and D correspond, both are stored in the training set; the module then trains on the training set and saves the new model when training completes. This verifies steps 3 and 4.
  • The speech recognition and speech synthesis methods are not limited to any particular neural network structure (for example, LAS, CTC, RNN-T, RNN-T with BLSTM, or RNN-T with GRU may be used); the deep learning framework used to implement them is not limited to TensorFlow or PyTorch, and the programming language is not limited to Python, Java, C++, and so on.
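Since CTC and PyTorch are both named as admissible choices, the following is a minimal, hedged sketch of one CTC training step in PyTorch; the random tensors stand in for a real acoustic model's per-frame outputs, and none of this is prescribed by the application.

import torch
import torch.nn as nn

# Stand-in for an acoustic model's output: per-frame class
# log-probabilities of shape (frames, batch, classes incl. blank).
T, N, C = 50, 4, 30
log_probs = torch.randn(T, N, C).log_softmax(2).requires_grad_()

targets = torch.randint(1, C, (N, 10), dtype=torch.long)   # label ids
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # gradients for one training step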

Abstract

A speech recognition system and method automatically trained by means of a speech synthesis method, belonging to the technical field of artificial intelligence. A neural-network-based speech recognition model relies on a large amount of training data; if the training data set is not large enough, the model trains poorly and the recognition rate is relatively low. The speech recognition system automatically trained by means of the speech synthesis method comprises a speech collection module (1), a speech recognition module (2), a user error correction module (3), a collector (4) and a speech synthesis module (5), which are sequentially communicatively connected. The system overcomes the poor training effect and low recognition rate that a neural-network-based speech recognition model suffers when the amount of training set data is insufficient.

Description

A speech recognition system and method automatically trained by a speech synthesis method

Technical Field
The present application belongs to the technical field of artificial intelligence, and in particular relates to a speech recognition system and method automatically trained by a speech synthesis method.
Background Art
Methods based on machine learning and deep learning have shown astonishing artificial intelligence capabilities in many applications, especially image recognition and speech recognition, where they exceed human vision and hearing. Most of these capabilities are owed to neural networks, which place very high demands on the amount of training data. For example, in object detection within image recognition, training a detection model requires tens of thousands of relevant images before the model achieves a high recognition rate. The same holds in speech recognition: training a speech recognition model usually requires thousands of hours of speech data with corresponding labels.
From the detailed data in Table 1 it can be concluded that a neural-network-based speech recognition model relies on a large training data set. If the training data set is not large enough, the model trains poorly and the recognition rate is low.
Table 1. Speech recognition methods and information on the corresponding training data sets
(Table 1 is reproduced as image PCTCN2020139051-appb-000001 in the original publication.)
SUMMARY OF THE INVENTION
1. Technical problems to be solved
Methods based on machine learning and deep learning perform remarkably well in artificial intelligence applications such as image recognition and speech recognition. These capabilities rest on neural networks and large amounts of data, and neural networks place high demands on the volume of training data. The data in Table 1 show that training a speech recognition model usually requires thousands of hours of speech data with corresponding label data before the model reaches a high recognition rate. A neural-network speech recognition model depends on a large training data set; if the data set is not large enough, training suffers and the recognition rate is low. To address this, the present application provides a speech recognition system and method automatically trained by a speech synthesis method.
2. Technical solutions
To achieve the above purpose, the present application provides a speech recognition system and method automatically trained by a speech synthesis method, comprising a speech collection module, a speech recognition module, a user error correction module, a collector and a speech synthesis module. The speech collection module, the speech recognition module, the user error correction module, the collector and the speech recognition module are communicatively connected in sequence; the speech collection module, the speech synthesis module, the collector and the speech recognition module are likewise connected in sequence.
In another embodiment provided by this application, the speech recognition module includes a test set sub-module and a training set sub-module; the test set sub-module is communicatively connected with the user error correction module, and the training set sub-module with the collector.
The voice collection module, the test set sub-module, the user error correction module, the collector and the training set sub-module are communicatively connected in sequence; the voice collection module, the voice synthesis module, the collector and the training set sub-module are likewise connected in sequence.
In another embodiment, the speech synthesis module includes a text collection sub-module used to collect text data.
In another embodiment, the collector includes a first collecting part and a second collecting part, each communicatively connected with the training set sub-module.
The present application provides a speech recognition method automatically trained by the speech synthesis method, in which the speech recognition system described above is applied.
In another embodiment, the method includes the following steps. Step 1: collect (target) user voice data. Step 2: extract the voice features of the voice data and perform speech synthesis to obtain speech synthesis data and its corresponding label data; collect the speech synthesis data and the label data, and verify them against each other to obtain first training data. Step 3: perform speech recognition on the voice data, detect and correct the recognition results to obtain error correction data, collect the voice data together with its error correction data, and verify them against each other to obtain second training data. Step 4: train on the first training data and the second training data, and update the automatically trained speech recognition system according to the training results.
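As a rough illustration only, the four steps can be strung together as in the Python sketch below; every callable named here is a hypothetical placeholder for the modules described above, not an API defined by this application.

# Hedged sketch of the four-step loop. The five callables are
# stand-ins for the modules described above and must be supplied
# by a real system.

def training_loop(collect_speech, synthesize, recognize, correct, train,
                  threshold=1000):
    training_set = []

    # Step 1: collect (target) user voice data (data set A).
    user_audio = collect_speech()

    # Step 2: synthesize labeled speech from the user's voice features;
    # keep only utterances whose label survives verification.
    synthesized, labels = synthesize(user_audio)
    training_set += [(a, t) for a, t in zip(synthesized, labels) if t]

    # Step 3: recognize the user's speech and let the user correct the
    # results; pair each utterance with its corrected label.
    corrected = correct(recognize(user_audio))
    training_set += list(zip(user_audio, corrected))

    # Step 4: train once enough data has accumulated.
    if len(training_set) >= threshold:
        train(training_set)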
In another embodiment, step 2 includes checking whether the speech synthesis data has been updated; if so, the updated speech synthesis data and the corresponding label data are verified, and once the verification passes, the speech synthesis data and the label data are collected.
In another embodiment, the first training data and the second training data are input to the training set sub-module; when the amount of data in the training set sub-module reaches a specific threshold, training is performed automatically on the training data, and the speech recognition model is updated according to the training results.
In another embodiment, the speech recognition module checks the state of the automatic training process: if training is in progress, it waits for the training to finish; if training has stopped, it inspects the stopped training process.
In another embodiment, inspecting the stopped automatic training process includes verifying the number of training rounds. If all rounds have been completed, the automatically trained speech recognition model is saved, the whole process ends, and the system returns to the initial state to wait for the next round; if the rounds are not complete, the training process is judged interrupted, adjustments are made according to the cause, and training restarts.
3. Beneficial effects
Compared with the prior art, the speech recognition system and method automatically trained by a speech synthesis method provided by the present application have the following beneficial effects:
The system overcomes the poor training effect and low recognition rate that a neural-network-based speech recognition model suffers when the training set is too small.
The method generates speech data automatically from the user's voice characteristics through speech synthesis. The resulting data set contains speech carrying the user's voice characteristics together with corresponding labels, so it can be used directly as a training set in the automatically trained speech recognition system. At the same time, the user's speech is recognized by the automatically trained model; after the user corrects the recognition results, that data set (the user's voice data plus corrected labels) can also be used as a training set.
The method avoids manually labeling speech data, saving labor and time.
The method can train a new speech recognition model quickly, without searching for a suitable data set, so it is efficient and widely applicable.
The speech recognition model can be trained automatically on the synthesized speech data, achieving very high efficiency.
The method adds an active user error correction step, so the recognition rate can improve continuously based on test results.
Description of Drawings
FIG. 1 is a schematic diagram of the speech recognition system automatically trained by the speech synthesis method of the present application;
FIG. 2 is a schematic diagram of the working process of the speech recognition system automatically trained by the speech synthesis method of the present application.
Detailed Description of Embodiments
Hereinafter, specific embodiments of the present application are described in detail with reference to the accompanying drawings, from which those skilled in the art can clearly understand and implement the present application. Without departing from the principles of the present application, features of the various embodiments may be combined to obtain new embodiments, or may replace certain features of other embodiments to obtain further preferred embodiments.
Referring to FIGS. 1 and 2, the present application provides a speech recognition system automatically trained by a speech synthesis method, comprising a speech collection module 1, a speech recognition module 2, a user error correction module 3, a collector 4 and a speech synthesis module 5. The speech collection module 1, the speech recognition module 2, the user error correction module 3, the collector 4 and the speech recognition module 2 are communicatively connected in sequence; the speech collection module 1, the speech synthesis module 5, the collector 4 and the speech recognition module 2 are likewise connected in sequence.
The speech recognition module 2 here covers both a speech recognition process and a speech recognition model training process (hereinafter: the recognition process and the training process). Data corrected by the user error correction module 3 re-enters the speech recognition module 2 through the collector 4 for the training process, and data synthesized by the speech synthesis module 5 likewise enters the speech recognition module 2 through the collector 4 for the training process.
In this application, the speech recognition module 2 is a module built around a speech recognition method, and the speech synthesis module 5 is a module built around a speech synthesis method.
The purpose of this application is to provide a speech recognition method automatically trained by a speech synthesis method, with which a small amount of speech data suffices to automatically generate a large amount of speech data carrying the speaker's voice characteristics. This speech data and its corresponding labels are added to the training set automatically and the speech recognition system is trained automatically, overcoming the poor training effect and low recognition rate caused by an insufficient training data set. The approach greatly reduces the amount of voice data that must be collected to train the system, avoids the tedious process of labeling a voice data set by hand, and offers a new way to train and test a speech recognition system during its development.
Further, the speech recognition module 2 includes a test set sub-module and a training set sub-module; the test set sub-module is communicatively connected with the user error correction module, and the training set sub-module with the collector.
The voice collection module 1, the test set sub-module, the user error correction module 3, the collector 4 and the training set sub-module are communicatively connected in sequence; the voice collection module 1, the voice synthesis module 5, the collector 4 and the training set sub-module are likewise connected in sequence.
Further, the speech synthesis module 5 includes a text collection sub-module used to collect text data.
Further, the collector 4 includes a first collecting part and a second collecting part, each communicatively connected with the training set sub-module.
The voice collection module 1, the test set sub-module, the user error correction module 3, the first collecting part and the training set sub-module are communicatively connected in sequence; the voice collection module 1, the voice synthesis module 5, the second collecting part and the training set sub-module are likewise connected in sequence.
Step 1: collect (target) user voice data. Specifically: step 1.1, collect user voice data with a recording device (the voice collection module) according to the specifications of the voice data set, forming data set A; step 1.2, if the data set is to be used for speech synthesis, go to step 2; if it is to serve as a test set in the automatically trained speech recognition system, go to step 3.
Step 2: perform speech synthesis on the speech data from step 1 through the speech synthesis module. Specifically: step 2.1, extract speech features from data set A with a feature extraction method (for example, MFCC); step 2.2, use the extracted features as parameters of the speech synthesis method (for example, SV2TTS or GAN-TTS), take the text data gathered by the text collection sub-module as data set B, and feed data set A into the speech synthesis module for synthesis; step 2.3, the speech synthesis module checks whether the synthesis of step 2.2 is complete, returning to step 2.2 if not and proceeding to step 2.4 otherwise; step 2.4, the speech data generated by the synthesis method becomes data set C; step 2.5, the first collecting part periodically checks whether the generated speech data has been updated, proceeding to step 2.6 if so and doing nothing otherwise; step 2.6, the first collecting part verifies the generated speech data (data set C) against the corresponding label data (data set B), proceeding to step 2.7 when the generated speech data has corresponding labels and returning to step 2.1 otherwise; step 2.7, the first collecting part adds the speech data and the corresponding label data to the training set of the automatically trained speech recognition system, and the flow continues at step 4.
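Step 2.1 names MFCC as one admissible feature extraction method. A minimal sketch using the librosa library (one common choice, assumed here rather than specified by the application) could be:

import librosa

def extract_mfcc(wav_path, n_mfcc=13):
    # sr=None preserves the file's native sampling rate.
    audio, sr = librosa.load(wav_path, sr=None)
    # Returns a matrix of shape (n_mfcc, n_frames) for data set A.
    return librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)

In step 2.2 these features would condition the synthesizer (for instance the speaker encoder in SV2TTS) so that the generated data set C carries the target user's voice characteristics.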
Step 3: perform speech recognition on the voice data from step 1 through the speech recognition module. Specifically: step 3.1, store the user voice data collected in step 1 (data set A) in the test set of the recognition process in the speech recognition module; step 3.2, run the speech recognition model of the automatically trained system on that test set, i.e. the speech recognition process; step 3.3, the speech recognition module checks whether recognition has finished, continuing if not, saving the recognition results and proceeding to step 3.4 if so, and returning to step 3.2 if recognition was interrupted; step 3.4, carry out the user error correction process on the recognition results of step 3.3 and proceed to step 3.5; step 3.5, the user error correction module checks whether correction is complete, and if so saves the corrected data as labels corresponding to data set A (data set D) and proceeds to step 3.6, otherwise it returns to step 3.4; step 3.6, the second collecting part checks whether the user voice data (data set A) has corresponding user-corrected labels, proceeding to step 3.7 if data set A has corrected labels (data set D) and returning to step 3.4 if not; step 3.7, the second collecting part puts the user voice data (data set A) and the corrected labels (data set D) into the training set of the speech recognition module, and the flow continues at step 4.
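Steps 3.4-3.7 amount to pairing each utterance in data set A with a user-approved transcript (data set D). The console prompt below is an illustrative assumption; any user interface would do.

def correct_and_collect(utterances, hypotheses):
    # Show each recognition result; an empty reply accepts it as-is.
    training_pairs = []
    for audio, hyp in zip(utterances, hypotheses):
        edited = input("Recognized: %r -- correction? " % hyp)
        label = edited.strip() or hyp            # entry of data set D
        training_pairs.append((audio, label))    # A paired with D
    return training_pairs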
Step 4: train through the speech recognition module on the data added to its training set in step 2 (speech synthesis module) or step 3 (speech recognition module). Specifically: step 4.1, check the data added to the training set in step 2 or step 3; once the amount of data reaches a certain threshold, the speech recognition module automatically trains on the training set; step 4.2, the speech recognition module checks the state of the automatic training process, waiting if training is in progress and proceeding to step 4.3 once it has stopped; step 4.3, the speech recognition module inspects the stopped training process; if all training rounds are complete, the automatically trained speech recognition model is saved, the whole process ends, and the system returns to the initial state (step 1) to wait for the next round; if the rounds are incomplete, the training process is judged interrupted, adjustments are made according to the cause, training restarts, and the flow returns to step 4.1.
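The checks of steps 4.1-4.3 can be expressed as a small state inspection; the TrainingState fields below are illustrative assumptions about how a concrete implementation might track progress.

from dataclasses import dataclass

@dataclass
class TrainingState:
    running: bool
    rounds_done: int
    rounds_total: int

def check_training(training_set, threshold, state):
    if len(training_set) < threshold:             # step 4.1
        return "waiting for more data"
    if state.running:                             # step 4.2
        return "training in progress; wait"
    if state.rounds_done >= state.rounds_total:   # step 4.3
        return "all rounds done: save model, return to step 1"
    return "interrupted: adjust by cause and restart (step 4.1)"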
The present application also provides a speech recognition method automatically trained by the speech synthesis method, applying the speech recognition system described above to the automatically trained speech recognition training.
Further, the method includes the following steps:
1) Collect a small amount of user voice data.
2) Extract the user's voice features with a feature extraction method, and generate voice data carrying those features through speech synthesis.
3) Test the recognition rate of the automatically trained speech recognition system on the user's speech set, and through error correction add the corrected recognition results to the system's training set.
4) Collect a certain amount of data from two sources in the training set (first: speech data from the synthesis results with the corresponding labels; second: user voice data with the user-corrected labels), and use this speech data as the training set to train the speech recognition system.
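Assuming each utterance is stored as a .wav file with a parallel .txt label file (a layout chosen here purely for illustration), the two training-set parts can be collected and merged as follows; utterances without labels are skipped, mirroring the collector's verification.

from pathlib import Path

def paired(wav_dir, label_dir):
    # Yield (audio, label) pairs; skip any utterance whose label
    # file is missing, as the collector's verification requires.
    for wav in sorted(Path(wav_dir).glob("*.wav")):
        label = Path(label_dir) / (wav.stem + ".txt")
        if label.exists():
            yield wav, label.read_text().strip()

# Part one: synthesized speech (data set C) paired with its texts (B);
# part two: user speech (data set A) paired with corrected labels (D).
# The directory names are illustrative assumptions.
training_set = list(paired("synth_wav", "synth_labels")) \
             + list(paired("user_wav", "corrected_labels"))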
Further, step 2 includes checking whether the speech synthesis data has been updated; if so, the updated speech synthesis data and the corresponding label data are verified, and once the verification passes, the speech synthesis data and the label data are collected.
Further, the first training data and the second training data are input to the training set sub-module; when the amount of data in the training set sub-module reaches a specific threshold, the data is trained on.
Further, the speech recognition module checks the state of the automatic training process: if training is in progress, it waits for it to finish; if training has stopped, it checks the first training data and the second training data and processes the data.
Further, processing the data includes inspecting the stopped automatic training process. If the training rounds are complete, the automatically trained speech recognition model is saved, the whole process ends, and the system returns to the initial state to wait for the next round; if the rounds are incomplete, the training process is judged interrupted and training restarts according to the cause of the interruption.
The present application combines neural-network-based speech recognition with speech synthesis to generate speech data and to train and test a speech recognition system automatically by means of a speech synthesis method; it synthesizes speech and uses the synthesis results for speech recognition; it thereby solves the problem that existing neural-network-based speech recognition methods face an insufficient amount of training-set data; and it corrects speech recognition results through user error correction and feeds the corrected results back into the speech recognition training set.
Neural-network-based speech recognition and speech synthesis have been two independent research fields, and earlier work has not combined the two techniques. To solve the problem of insufficient training-set data in neural-network-based speech recognition, the present application proposes a speech recognition method automatically trained by means of a speech synthesis method. Speech data are generated automatically from the user's voice characteristics by speech synthesis; because these data carry the user's voice characteristics together with the corresponding labels, they can be used directly as a training set for the automatically trained speech recognition method. At the same time, the user's speech can be passed through the automatically trained speech recognition method to obtain recognition results; after the user's error-correction process, these data (the speech data together with the corrected labels) can likewise be used as a training set for the automatically trained speech recognition system.
Embodiment
Taking a test on real speech data as an example, the present application uses a small amount of real speech data to synthesize, through a speech synthesis method, speech data carrying the characteristics of the real speech. The experiments demonstrate the feasibility and efficiency of the proposed method and provide an effective speech dataset for speech recognition automatically trained by means of speech synthesis.
1 Experimental environment
1) Computer hardware environment
Server model: Dell EMC PowerEdge R740
CPU: Intel Xeon Silver 4116
GPU: Tesla P100
2) Computer software environment
System environment: Ubuntu 18.04; GCC version: 7.5.0; development language: Python 3.7.6; GPU driver version: 410.129; CUDA version: 10.0; cuDNN version: 7.6.4; PyTorch version: 1.2. The software environment was verified with the following shell session:
$ uname -r
4.15.0-55-generic
$ python --version
Python 3.7.6
$ gcc --version
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ nvidia-smi
Thu May 14 13:27:39 2020
[Figure PCTCN2020139051-appb-000002: nvidia-smi output table]
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
$ cat /usr/local/cuda/version.txt
CUDA Version 10.0.130
$ cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 4
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
#include "driver_types.h"
2 Experimental steps
1) Construction of the speech recognition module. The method adopted in this application for the speech recognition automatically trained by the speech synthesis method is Listen, Attend and Spell (LAS).
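LAS itself is not spelled out in this disclosure; as a rough orientation only, its listener (the pyramidal BLSTM encoder) can be sketched in PyTorch as follows. The feature dimension, hidden size, and number of pyramid layers are illustrative assumptions, not the configuration used in the experiments.

import torch
import torch.nn as nn

class PyramidalBLSTM(nn.Module):
    """One pBLSTM layer: halves the time resolution, concatenating frame pairs."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.rnn = nn.LSTM(input_dim * 2, hidden_dim,
                           bidirectional=True, batch_first=True)

    def forward(self, x):                      # x: (batch, time, feat)
        b, t, f = x.shape
        if t % 2:                              # drop an odd trailing frame
            x = x[:, :-1, :]
            t -= 1
        x = x.reshape(b, t // 2, f * 2)        # concatenate adjacent frames
        out, _ = self.rnn(x)
        return out                             # (batch, time/2, 2*hidden)

class Listener(nn.Module):
    """LAS encoder: a BLSTM followed by a stack of pBLSTM layers."""
    def __init__(self, feat_dim=40, hidden_dim=256, pyramid_layers=3):
        super().__init__()
        self.base = nn.LSTM(feat_dim, hidden_dim,
                            bidirectional=True, batch_first=True)
        self.pyramid = nn.ModuleList(
            [PyramidalBLSTM(hidden_dim * 2, hidden_dim)
             for _ in range(pyramid_layers)])

    def forward(self, feats):                  # feats: (batch, time, feat_dim)
        h, _ = self.base(feats)
        for layer in self.pyramid:
            h = layer(h)
        return h                               # 8x shorter in time with 3 layers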
2) Construction of the speech synthesis module. The method adopted by the speech synthesis module in this application is generative-adversarial-network-based text-to-speech (GAN-TTS).
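Similarly, the GAN-TTS idea — a feed-forward generator mapping conditioning features to waveform samples, judged by a discriminator over random audio windows — can be sketched as follows. All channel counts and window sizes here are assumptions for illustration, not the published GAN-TTS configuration.

import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, cond_dim=80, channels=256, upsample=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(cond_dim, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Upsample(scale_factor=upsample),   # frames -> audio samples
            nn.Conv1d(channels, 1, kernel_size=3, padding=1),
            nn.Tanh(),                            # waveform in [-1, 1]
        )

    def forward(self, cond):        # cond: (batch, cond_dim, frames)
        return self.net(cond)       # (batch, 1, frames * upsample)

class RandomWindowDiscriminator(nn.Module):
    def __init__(self, window=2400, channels=128):
        super().__init__()
        self.window = window
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=15, stride=4),
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, kernel_size=15, stride=4),
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, 1, kernel_size=3),
        )

    def forward(self, wav):         # wav: (batch, 1, samples)
        # Score a randomly chosen window rather than the full waveform.
        start = torch.randint(0, max(1, wav.size(-1) - self.window), (1,)).item()
        return self.net(wav[..., start:start + self.window]).mean(dim=-1)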
3 Experimental results
First, voice features are extracted from the user speech data (dataset A); the speech synthesis module takes input text data (dataset B) and, based on the voice features, generates speech data (dataset C). After the collector verifies that dataset B and dataset C correspond to each other, the two datasets are stored in the training set of the speech recognition module. This validates steps 1 and 2. The speech recognition module then trains on the data in the training set and, after training completes, saves the new speech recognition model of the system automatically trained by the speech synthesis method; this independently validates step 4. Next, the speech recognition module produces recognition results from the user speech data (dataset A); the user error-correction module corrects these results and saves them as dataset D; after the collector verifies that dataset A and dataset D correspond to each other, the two datasets are stored in the training set of the speech recognition module; the module trains on the training-set data and, after training completes, saves the new speech recognition model. This validates steps 3 and 4.
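Schematically, one experimental round as just described can be expressed as follows, using the dataset names A–D from the text. Every function here is an assumed placeholder standing in for the corresponding module, not an API from this disclosure.

def extract_features(speech):
    """Stub: derive voice features from user speech (dataset A)."""
    return {"speaker": "user"}

def synthesize(text_data, features):
    """Stub: speech synthesis module, producing dataset C from dataset B."""
    return [f"audio({t}, {features['speaker']})" for t in text_data]

def recognize(speech):
    """Stub: speech recognition module producing hypotheses for dataset A."""
    return [f"hyp({s})" for s in speech]

def correct(hypotheses):
    """Stub: user error-correction module producing dataset D."""
    return [h.replace("hyp", "ref") for h in hypotheses]

def verified_pairs(labels, audio):
    """Stub: the collector, pairing audio with labels after verification."""
    assert len(labels) == len(audio)
    return list(zip(audio, labels))

def experiment_round(dataset_a, dataset_b):
    feats = extract_features(dataset_a)
    dataset_c = synthesize(dataset_b, feats)          # steps 1-2
    training_set = verified_pairs(dataset_b, dataset_c)
    dataset_d = correct(recognize(dataset_a))         # step 3
    training_set += verified_pairs(dataset_d, dataset_a)
    return training_set                               # input to step 4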
The speech recognition and speech synthesis methods in this application are not limited to approaches based on neural network structures; the neural speech synthesis method is not limited to the GAN-TTS method used here, and the neural speech recognition method is not limited to the methods mentioned herein (LAS, CTC, RNN-T, RNN-T with BLSTM, RNN-T with GRU, etc.); the deep learning framework used to implement the speech synthesis and speech recognition methods is not limited to TensorFlow or PyTorch, and the programming language used is not limited to Python, Java, C++, etc.
Although the present application has been described above with reference to specific embodiments, those skilled in the art will understand that many modifications may be made to the configurations and details disclosed herein within the principles and scope of the present disclosure. The scope of protection of the present application is determined by the appended claims, which are intended to cover all modifications within the literal meaning or scope of equivalents of the technical features of the claims.

Claims (10)

  1. A speech recognition system automatically trained by means of a speech synthesis method, characterized by comprising a speech collection module, a speech recognition module, a user error-correction module, a collector and a speech synthesis module, wherein the speech collection module, the speech recognition module, the user error-correction module, the collector and the speech recognition module are sequentially connected in communication;
    the speech collection module, the speech synthesis module, the collector and the speech recognition module are sequentially connected in communication.
  2. The speech recognition system automatically trained by means of a speech synthesis method according to claim 1, characterized in that the speech recognition module comprises a test-set sub-module and a training-set sub-module, the test-set sub-module being connected in communication with the user error-correction module, and the training-set sub-module being connected in communication with the collector;
    the speech collection module, the test-set sub-module, the user error-correction module, the collector and the training-set sub-module are sequentially connected in communication; the speech collection module, the speech synthesis module, the collector and the training-set sub-module are sequentially connected in communication.
  3. The speech recognition system automatically trained by means of a speech synthesis method according to claim 1, characterized in that the speech synthesis module comprises a text collection sub-module, the text collection sub-module being used to collect text data.
  4. The speech recognition system automatically trained by means of a speech synthesis method according to claim 2, characterized in that the collector comprises a first collection part and a second collection part, the first collection part being connected in communication with the training-set sub-module, and the second collection part being connected in communication with the training-set sub-module.
  5. A speech recognition method automatically trained by means of a speech synthesis method, characterized in that the speech recognition system automatically trained by means of a speech synthesis method according to any one of claims 1 to 4 is applied to speech recognition training automatically performed by means of the speech synthesis method.
  6. The speech recognition method automatically trained by means of a speech synthesis method according to claim 5, characterized in that the method comprises the following steps:
    Step 1: collecting user speech data;
    Step 2: extracting speech features from the speech data and performing speech synthesis to obtain speech synthesis data and label data corresponding to the speech synthesis data; collecting the speech synthesis data and the label data; and verifying the speech synthesis data against the label data to obtain first training data;
    Step 3: performing speech recognition on the speech data, and detecting and correcting errors in the speech recognition results to obtain error-correction data; collecting the speech data and the error-correction data corresponding to the speech data; and verifying the speech data against the corresponding error-correction data to obtain second training data;
    Step 4: training on the first training data and the second training data, and updating the speech recognition system automatically trained by the speech synthesis method according to the training results.
  7. The speech recognition method automatically trained by means of a speech synthesis method according to claim 6, characterized in that step 2 includes checking whether the speech synthesis data has been updated; if it has, the updated speech synthesis data and the corresponding label data are verified, and after the verification passes, the speech synthesis data and the label data are collected.
  8. The speech recognition method automatically trained by means of a speech synthesis method according to claim 6, characterized in that the first training data are input to the training-set sub-module and the second training data are input to the training-set sub-module, and when the amount of data in the training-set sub-module reaches a specific threshold, the data are trained on.
  9. The speech recognition method automatically trained by means of a speech synthesis method according to claim 8, characterized in that the speech recognition module checks the state of the automatic training process; if the training process is in progress, it waits for the training process to end; if the training process has stopped, it checks the first training data and the second training data and processes the data.
  10. The speech recognition method automatically trained by means of a speech synthesis method according to claim 6, characterized in that processing the data includes checking the stopped automatic training process; if the number of training rounds has been completed, the speech recognition model automatically trained by the speech synthesis method is saved, the whole process ends, and the system returns to the initial state to wait for the next round of the process to begin; if the number of training rounds has not been completed, the training process is judged to have been interrupted, adjustments are made according to the cause of the interruption, and training is restarted.
PCT/CN2020/139051 2020-12-24 2020-12-24 Speech recognition system and method automatically trained by means of speech synthesis method WO2022133915A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/139051 WO2022133915A1 (en) 2020-12-24 2020-12-24 Speech recognition system and method automatically trained by means of speech synthesis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/139051 WO2022133915A1 (en) 2020-12-24 2020-12-24 Speech recognition system and method automatically trained by means of speech synthesis method

Publications (1)

Publication Number Publication Date
WO2022133915A1 true WO2022133915A1 (en) 2022-06-30

Family

ID=82157207

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/139051 WO2022133915A1 (en) 2020-12-24 2020-12-24 Speech recognition system and method automatically trained by means of speech synthesis method

Country Status (1)

Country Link
WO (1) WO2022133915A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180247640A1 (en) * 2013-12-06 2018-08-30 Speech Morphing Systems, Inc. Method and apparatus for an exemplary automatic speech recognition system
CN109147799A (en) * 2018-10-18 2019-01-04 广州势必可赢网络科技有限公司 A kind of method, apparatus of speech recognition, equipment and computer storage medium
CN110265028A (en) * 2019-06-20 2019-09-20 百度在线网络技术(北京)有限公司 Construction method, device and the equipment of corpus of speech synthesis
CN110675862A (en) * 2019-09-25 2020-01-10 招商局金融科技有限公司 Corpus acquisition method, electronic device and storage medium
CN111540345A (en) * 2020-05-09 2020-08-14 北京大牛儿科技发展有限公司 Weakly supervised speech recognition model training method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20966487

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20966487

Country of ref document: EP

Kind code of ref document: A1