WO2022133915A1 - Speech recognition system and method automatically trained by means of speech synthesis method - Google Patents

Speech recognition system and method automatically trained by means of speech synthesis method

Info

Publication number
WO2022133915A1
WO2022133915A1
Authority
WO
WIPO (PCT)
Prior art keywords
module
data
speech
training
speech recognition
Prior art date
Application number
PCT/CN2020/139051
Other languages
French (fr)
Chinese (zh)
Inventor
范小朋
苏充则
严伟玮
Original Assignee
杭州中科先进技术研究院有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 杭州中科先进技术研究院有限公司
Priority to PCT/CN2020/139051
Publication of WO2022133915A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice

Definitions

  • The present application belongs to the technical field of artificial intelligence, and in particular relates to a speech recognition system and method automatically trained by a speech synthesis method.
  • A neural-network-based speech recognition model relies on a large amount of training data. If the training data set is not large enough, the model trains poorly and the recognition rate is low.
  • Methods based on machine learning and deep learning perform remarkably well in artificial intelligence applications such as image recognition and speech recognition.
  • These capabilities rest on neural networks and large amounts of data, and neural networks place high demands on the volume of training data.
  • The data in Table 1 show that training a speech recognition model usually requires thousands of hours of speech data with corresponding label data before the model reaches a high recognition rate. Because a neural-network speech recognition model depends on a large training data set, an insufficient data set leads to poor training and a low recognition rate.
  • To address this problem, the present application provides a speech recognition system and method automatically trained by a speech synthesis method.
  • The system comprises a speech collection module, a speech recognition module, a user error correction module, a collector and a speech synthesis module. The speech collection module, the speech recognition module, the user error correction module, the collector and the speech recognition module are communicatively connected in sequence; the speech collection module, the speech synthesis module, the collector and the speech recognition module are likewise communicatively connected in sequence.
  • The speech recognition module includes a test set sub-module and a training set sub-module; the test set sub-module is communicatively connected with the user error correction module, and the training set sub-module with the collector.
  • The speech collection module, the test set sub-module, the user error correction module, the collector and the training set sub-module are communicatively connected in sequence; the speech collection module, the speech synthesis module, the collector and the training set sub-module are likewise connected in sequence.
  • The speech synthesis module includes a text collection sub-module, which is used to collect text data.
  • The collector includes a first collecting part and a second collecting part, each communicatively connected with the training set sub-module.
  • The present application further provides a speech recognition method automatically trained by the speech synthesis method, in which the above speech recognition system is applied.
  • The method includes the following steps. Step 1: collect (target) user voice data. Step 2: extract the voice features of the voice data and perform speech synthesis to obtain speech synthesis data and the label data corresponding to it; collect the speech synthesis data and the label data, and verify them against each other to obtain first training data. Step 3: perform speech recognition on the voice data, detect and correct the recognition results to obtain error correction data, collect the voice data together with its error correction data, and verify them against each other to obtain second training data. Step 4: train on the first training data and the second training data, and update the automatically trained speech recognition system according to the training results.
  • Step 2 includes checking whether the speech synthesis data has been updated; if so, the updated speech synthesis data and the corresponding label data are verified, and once the verification passes, the speech synthesis data and the label data are collected.
  • The first training data and the second training data are input to the training set sub-module. When the amount of data in the training set sub-module reaches a specific threshold, training is performed automatically on the training data, and the speech recognition model is updated according to the training results.
  • The speech recognition module checks the state of the automatic training process: if training is in progress, it waits for the training to finish; if training has stopped, it inspects the stopped training process.
  • Inspecting a stopped training process includes verifying the number of training rounds. If all rounds have been completed, the automatically trained speech recognition model is saved, the whole process ends, and the system returns to its initial state to wait for the next round to start. If the rounds are not complete, the training process is judged to have been interrupted, and training is adjusted according to the cause of the interruption and restarted.
  • The speech recognition system and method provided by the present application overcome the poor training effect and low recognition rate that a neural-network-based speech recognition model suffers when the training set is too small.
  • In the provided method, speech data is generated automatically from the user's voice characteristics by the speech synthesis method. The resulting data set contains speech data carrying the user's voice characteristics together with the corresponding labels, so it can be used directly as a training set in the automatically trained speech recognition system. At the same time, the user's speech is recognized by the automatically trained speech recognition model; after the user corrects the recognition results, this second data set (the user's voice data plus the corrected labels) can also be used as a training set.
  • The method avoids manually labeling speech data, saving labor and time.
  • It can train a new speech recognition model quickly, without searching for a suitable data set, so it is efficient and widely applicable.
  • The speech recognition model can be trained automatically on the synthesized speech data, which is highly efficient.
  • The method adds an active user error correction step, so the recognition rate can improve continuously based on test results.
  • FIG. 1 is a schematic diagram of a speech recognition system automatically trained by a speech synthesis method of the present application;
  • FIG. 2 is a schematic diagram of the working process of the speech recognition system automatically trained by the speech synthesis method of the present application.
  • The present application provides a speech recognition system automatically trained by a speech synthesis method, comprising a speech collection module 1, a speech recognition module 2, a user error correction module 3, a collector 4 and a speech synthesis module 5. The speech collection module 1, the speech recognition module 2, the user error correction module 3, the collector 4 and the speech recognition module 2 are communicatively connected in sequence; the speech collection module 1, the speech synthesis module 5, the collector 4 and the speech recognition module 2 are likewise connected in sequence.
  • The speech recognition module 2 here covers both a speech recognition process and a speech recognition model training process (hereinafter: the recognition process and the training process).
  • Data corrected by the user error correction module 3 re-enters the speech recognition module 2 through the collector 4 for the training process.
  • Data synthesized by the speech synthesis module 5 likewise enters the speech recognition module 2 through the collector 4 for the training process.
  • In this application, the speech recognition module 2 is a module built around a speech recognition method, and the speech synthesis module 5 is a module built around a speech synthesis method.
  • The purpose of this application is to provide a speech recognition method automatically trained by a speech synthesis method, with which a small amount of speech data suffices to automatically generate a large amount of speech data carrying the speaker's voice characteristics. This speech data and its corresponding labels are added to the training set automatically, and the speech recognition system is trained automatically, overcoming the poor training effect and low recognition rate caused by an insufficient training data set. The approach greatly reduces the amount of voice data that must be collected to train the system, avoids the tedious process of labeling a voice data set by hand, and offers a new way to train and test a speech recognition system during its development.
  • The speech recognition module 2 includes a test set sub-module and a training set sub-module; the test set sub-module is communicatively connected with the user error correction module, and the training set sub-module with the collector.
  • The voice collection module 1, the test set sub-module, the user error correction module 3, the collector 4 and the training set sub-module are communicatively connected in sequence; the voice collection module 1, the voice synthesis module 5, the collector 4 and the training set sub-module are likewise connected in sequence.
  • The speech synthesis module 5 includes a text collection sub-module used to collect text data.
  • The collector 4 includes a first collecting part and a second collecting part, each communicatively connected with the training set sub-module.
  • The voice collection module 1, the test set sub-module, the user error correction module 3, the first collecting part and the training set sub-module are communicatively connected in sequence; the voice collection module 1, the voice synthesis module 5, the second collecting part and the training set sub-module are likewise connected in sequence.
  • Step 1: collect (target) user voice data. Specifically: step 1.1, collect user voice data with a recording device (the voice collection module) according to the specifications of the voice data set, forming data set A; step 1.2, if the data set is to be used for speech synthesis, go to step 2; if it is to serve as a test set in the automatically trained speech recognition system, go to step 3.
  • Step 2: perform speech synthesis on the speech data from step 1 through the speech synthesis module. Specifically: step 2.1, extract speech features from data set A with a feature extraction method (for example, MFCC); step 2.2, use the extracted features as parameters of the speech synthesis method (for example, SV2TTS or GAN-TTS), take the text data gathered by the text collection sub-module as data set B, and feed data set A into the speech synthesis module for synthesis; step 2.3, the module checks whether the synthesis of step 2.2 is complete, returning to step 2.2 if not and proceeding otherwise; step 2.4, the speech data generated by the synthesis method becomes data set C; step 2.5, the first collecting part periodically checks whether the generated speech data has been updated, proceeding to step 2.6 if so and doing nothing otherwise; step 2.6, the first collecting part verifies the generated speech data (data set C) against the corresponding label data (data set B), proceeding to step 2.7 when every utterance has a label and returning to step 2.1 otherwise; step 2.7, the first collecting part adds the speech data and the corresponding label data to the training set of the automatically trained speech recognition system, and the flow continues at step 4.
  • Step 3: perform speech recognition on the voice data from step 1 through the speech recognition module. Specifically: step 3.1, store the user voice data collected in step 1 (data set A) in the test set used by the recognition process of the speech recognition module; step 3.2, run the speech recognition model of the automatically trained system on that test set, i.e. the speech recognition process; step 3.3, the recognition module checks whether recognition has finished, continuing if not, saving the recognition results and proceeding to step 3.4 if so, and returning to step 3.2 if recognition was interrupted; step 3.4, carry out the user error correction process on the recognition results of step 3.3 and proceed to step 3.5; step 3.5, the user error correction module checks whether correction is complete, and if so saves the corrected data as labels corresponding to data set A (data set D) and proceeds to step 3.6, otherwise it returns to step 3.4; step 3.6, the second collecting part checks whether the user voice data (data set A) has corresponding user-corrected labels (data set D), proceeding to step 3.7 if so and returning to step 3.4 if not; step 3.7, the second collecting part puts the user voice data (data set A) and the corrected labels (data set D) into the training set of the speech recognition module, and the flow continues at step 4.
  • Step 4: train through the speech recognition module on the data added to its training set in step 2 (speech synthesis module) or step 3 (speech recognition module). Specifically: step 4.1, check the data added to the training set in step 2 or step 3; once the amount of data reaches a certain threshold, the speech recognition module automatically trains on the training set; step 4.2, the module checks the state of the automatic training process, waiting if training is in progress and proceeding to step 4.3 once it has stopped; step 4.3, the module inspects the stopped training process; if all training rounds are complete, the automatically trained speech recognition model is saved, the whole process ends, and the system returns to the initial state (step 1) to wait for the next round; if the rounds are incomplete, the training process is judged interrupted, adjustments are made according to the cause, training restarts, and the flow returns to step 4.1.
  • The present application also provides a speech recognition method automatically trained by the speech synthesis method, in which the speech recognition system described above is applied. The method includes the steps detailed below: collecting user voice data, synthesizing labeled speech from the user's voice features, testing and correcting recognition results, and training on the collected data.
  • Step 2 includes checking whether the speech synthesis data has been updated; if so, the updated speech synthesis data and the corresponding label data are verified, and after verification passes, the speech synthesis data and the label data are collected.
  • The first training data and the second training data are input to the training set sub-module; when the amount of data in the training set sub-module reaches a specific threshold, training is performed on the data.
  • The speech recognition module checks the state of the automatic training process: if training is in progress, it waits for it to finish; if training has stopped, it checks the first training data and the second training data and processes the data.
  • Processing the data includes inspecting the stopped automatic training process. If the training rounds are complete, the automatically trained speech recognition model is saved, the whole process ends, and the system returns to the initial state to wait for the next round; if the rounds are incomplete, the training process is judged interrupted and training restarts according to the cause of the interruption.
  • The present application combines neural-network-based speech recognition with speech synthesis: speech data is generated by the synthesis method and used to train and test the recognizer automatically, solving the shortage of training data faced by existing neural-network speech recognition methods, and user error correction is used to fix recognition results and feed them back into the training set.
  • Neural-network-based speech recognition and speech synthesis are two independent research fields that earlier work had not combined. To solve the shortage of training data faced by neural-network speech recognition, the present application proposes a speech recognition method automatically trained by a speech synthesis method. Speech data is generated automatically from the user's voice characteristics; because this data carries the user's voice characteristics and comes with corresponding labels, it can be used directly as a training set. Meanwhile, the user's speech is recognized by the automatically trained recognizer, and after the user's error correction, that data (the voice data plus corrected labels) can also be used as a training set.
  • This application takes a test on real voice data as an example: a small amount of real voice data is used to synthesize speech carrying the real voice's characteristics. Experiments demonstrate the feasibility and efficiency of the proposed method, which provides an effective speech data set for speech recognition automatically trained by the speech synthesis method.
  • Voice features are extracted from the user's voice data (data set A), and the speech synthesis module generates voice data (data set C) from the input text (data set B) according to those features. After the collector verifies that data sets B and C correspond to each other, both are stored in the training set of the speech recognition module. This verifies steps 1 and 2.
  • The speech recognition module trains on the data in the training set and, when training completes, saves the new speech recognition model into the automatically trained system. This independently verifies step 4.
  • The speech recognition module runs the recognition process on the user's speech data (data set A) to obtain recognition results; the user error correction module corrects these results and saves them as data set D; after the collector verifies that data sets A and D correspond, both are stored in the training set; the module then trains on the training set and saves the new model when training completes. This verifies steps 3 and 4.
  • The speech recognition and speech synthesis methods are not limited to any particular neural network structure (for example, LAS, CTC, RNN-T, RNN-T with BLSTM, or RNN-T with GRU may be used); the deep learning framework used to implement them is not limited to TensorFlow or PyTorch, and the programming language is not limited to Python, Java, C++, and so on.
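Since CTC and PyTorch are both named as admissible choices, the following is a minimal, hedged sketch of one CTC training step in PyTorch; the random tensors stand in for a real acoustic model's per-frame outputs, and none of this is prescribed by the application.

import torch
import torch.nn as nn

# Stand-in for an acoustic model's output: per-frame class
# log-probabilities of shape (frames, batch, classes incl. blank).
T, N, C = 50, 4, 30
log_probs = torch.randn(T, N, C).log_softmax(2).requires_grad_()

targets = torch.randint(1, C, (N, 10), dtype=torch.long)   # label ids
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # gradients for one training step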

Abstract

A speech recognition system and method automatically trained by means of a speech synthesis method, belonging to the technical field of artificial intelligence. A neural-network-based speech recognition model relies on a large amount of training data; if the training data set is not large enough, the model trains poorly and the recognition rate is relatively low. The speech recognition system automatically trained by means of the speech synthesis method comprises a speech collection module (1), a speech recognition module (2), a user error correction module (3), a collector (4) and a speech synthesis module (5), which are sequentially communicatively connected. The system overcomes the poor training effect and low recognition rate that a neural-network-based speech recognition model suffers when the amount of training set data is insufficient.

Description

A speech recognition system and method automatically trained by a speech synthesis method

Technical Field
The present application belongs to the technical field of artificial intelligence, and in particular relates to a speech recognition system and method automatically trained by a speech synthesis method.
Background Art
Methods based on machine learning and deep learning have shown astonishing artificial intelligence capabilities in many applications, especially image recognition and speech recognition, where they exceed human vision and hearing. Most of these capabilities are owed to neural networks, which place very high demands on the amount of training data. For example, in object detection within image recognition, training a detection model requires tens of thousands of relevant images before the model achieves a high recognition rate. The same holds in speech recognition: training a speech recognition model usually requires thousands of hours of speech data with corresponding labels.
From the detailed data in Table 1 it can be concluded that a neural-network-based speech recognition model relies on a large training data set. If the training data set is not large enough, the model trains poorly and the recognition rate is low.
Table 1. Speech recognition methods and information on the corresponding training data sets
(Table 1 is reproduced as image PCTCN2020139051-appb-000001 in the original publication.)
SUMMARY OF THE INVENTION
1. Technical problems to be solved
Methods based on machine learning and deep learning perform remarkably well in artificial intelligence applications such as image recognition and speech recognition. These capabilities rest on neural networks and large amounts of data, and neural networks place high demands on the volume of training data. The data in Table 1 show that training a speech recognition model usually requires thousands of hours of speech data with corresponding label data before the model reaches a high recognition rate. A neural-network speech recognition model depends on a large training data set; if the data set is not large enough, training suffers and the recognition rate is low. To address this, the present application provides a speech recognition system and method automatically trained by a speech synthesis method.
2. Technical solutions
To achieve the above purpose, the present application provides a speech recognition system and method automatically trained by a speech synthesis method, comprising a speech collection module, a speech recognition module, a user error correction module, a collector and a speech synthesis module. The speech collection module, the speech recognition module, the user error correction module, the collector and the speech recognition module are communicatively connected in sequence; the speech collection module, the speech synthesis module, the collector and the speech recognition module are likewise connected in sequence.
In another embodiment provided by this application, the speech recognition module includes a test set sub-module and a training set sub-module; the test set sub-module is communicatively connected with the user error correction module, and the training set sub-module with the collector.
The voice collection module, the test set sub-module, the user error correction module, the collector and the training set sub-module are communicatively connected in sequence; the voice collection module, the voice synthesis module, the collector and the training set sub-module are likewise connected in sequence.
In another embodiment, the speech synthesis module includes a text collection sub-module used to collect text data.
In another embodiment, the collector includes a first collecting part and a second collecting part, each communicatively connected with the training set sub-module.
The present application provides a speech recognition method automatically trained by the speech synthesis method, in which the speech recognition system described above is applied.
In another embodiment, the method includes the following steps. Step 1: collect (target) user voice data. Step 2: extract the voice features of the voice data and perform speech synthesis to obtain speech synthesis data and its corresponding label data; collect the speech synthesis data and the label data, and verify them against each other to obtain first training data. Step 3: perform speech recognition on the voice data, detect and correct the recognition results to obtain error correction data, collect the voice data together with its error correction data, and verify them against each other to obtain second training data. Step 4: train on the first training data and the second training data, and update the automatically trained speech recognition system according to the training results.
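As a rough illustration only, the four steps can be strung together as in the Python sketch below; every callable named here is a hypothetical placeholder for the modules described above, not an API defined by this application.

# Hedged sketch of the four-step loop. The five callables are
# stand-ins for the modules described above and must be supplied
# by a real system.

def training_loop(collect_speech, synthesize, recognize, correct, train,
                  threshold=1000):
    training_set = []

    # Step 1: collect (target) user voice data (data set A).
    user_audio = collect_speech()

    # Step 2: synthesize labeled speech from the user's voice features;
    # keep only utterances whose label survives verification.
    synthesized, labels = synthesize(user_audio)
    training_set += [(a, t) for a, t in zip(synthesized, labels) if t]

    # Step 3: recognize the user's speech and let the user correct the
    # results; pair each utterance with its corrected label.
    corrected = correct(recognize(user_audio))
    training_set += list(zip(user_audio, corrected))

    # Step 4: train once enough data has accumulated.
    if len(training_set) >= threshold:
        train(training_set)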
In another embodiment, step 2 includes checking whether the speech synthesis data has been updated; if so, the updated speech synthesis data and the corresponding label data are verified, and once the verification passes, the speech synthesis data and the label data are collected.
In another embodiment, the first training data and the second training data are input to the training set sub-module; when the amount of data in the training set sub-module reaches a specific threshold, training is performed automatically on the training data, and the speech recognition model is updated according to the training results.
In another embodiment, the speech recognition module checks the state of the automatic training process: if training is in progress, it waits for the training to finish; if training has stopped, it inspects the stopped training process.
In another embodiment, inspecting the stopped automatic training process includes verifying the number of training rounds. If all rounds have been completed, the automatically trained speech recognition model is saved, the whole process ends, and the system returns to the initial state to wait for the next round; if the rounds are not complete, the training process is judged interrupted, adjustments are made according to the cause, and training restarts.
3. Beneficial effects
Compared with the prior art, the speech recognition system and method automatically trained by a speech synthesis method provided by the present application have the following beneficial effects:
The system overcomes the poor training effect and low recognition rate that a neural-network-based speech recognition model suffers when the training set is too small.
The method generates speech data automatically from the user's voice characteristics through speech synthesis. The resulting data set contains speech carrying the user's voice characteristics together with corresponding labels, so it can be used directly as a training set in the automatically trained speech recognition system. At the same time, the user's speech is recognized by the automatically trained model; after the user corrects the recognition results, that data set (the user's voice data plus corrected labels) can also be used as a training set.
The method avoids manually labeling speech data, saving labor and time.
The method can train a new speech recognition model quickly, without searching for a suitable data set, so it is efficient and widely applicable.
The speech recognition model can be trained automatically on the synthesized speech data, achieving very high efficiency.
The method adds an active user error correction step, so the recognition rate can improve continuously based on test results.
Description of Drawings
FIG. 1 is a schematic diagram of the speech recognition system automatically trained by the speech synthesis method of the present application;
FIG. 2 is a schematic diagram of the working process of the speech recognition system automatically trained by the speech synthesis method of the present application.
Detailed Description of Embodiments
Hereinafter, specific embodiments of the present application are described in detail with reference to the accompanying drawings, from which those skilled in the art can clearly understand and implement the present application. Without departing from the principles of the present application, features of the various embodiments may be combined to obtain new embodiments, or may replace certain features of other embodiments to obtain further preferred embodiments.
Referring to FIGS. 1 and 2, the present application provides a speech recognition system automatically trained by a speech synthesis method, comprising a speech collection module 1, a speech recognition module 2, a user error correction module 3, a collector 4 and a speech synthesis module 5. The speech collection module 1, the speech recognition module 2, the user error correction module 3, the collector 4 and the speech recognition module 2 are communicatively connected in sequence; the speech collection module 1, the speech synthesis module 5, the collector 4 and the speech recognition module 2 are likewise connected in sequence.
The speech recognition module 2 here covers both a speech recognition process and a speech recognition model training process (hereinafter: the recognition process and the training process). Data corrected by the user error correction module 3 re-enters the speech recognition module 2 through the collector 4 for the training process, and data synthesized by the speech synthesis module 5 likewise enters the speech recognition module 2 through the collector 4 for the training process.
In this application, the speech recognition module 2 is a module built around a speech recognition method, and the speech synthesis module 5 is a module built around a speech synthesis method.
The purpose of this application is to provide a speech recognition method automatically trained by a speech synthesis method, with which a small amount of speech data suffices to automatically generate a large amount of speech data carrying the speaker's voice characteristics. This speech data and its corresponding labels are added to the training set automatically and the speech recognition system is trained automatically, overcoming the poor training effect and low recognition rate caused by an insufficient training data set. The approach greatly reduces the amount of voice data that must be collected to train the system, avoids the tedious process of labeling a voice data set by hand, and offers a new way to train and test a speech recognition system during its development.
Further, the speech recognition module 2 includes a test set sub-module and a training set sub-module; the test set sub-module is communicatively connected with the user error correction module, and the training set sub-module with the collector.
The voice collection module 1, the test set sub-module, the user error correction module 3, the collector 4 and the training set sub-module are communicatively connected in sequence; the voice collection module 1, the voice synthesis module 5, the collector 4 and the training set sub-module are likewise connected in sequence.
Further, the speech synthesis module 5 includes a text collection sub-module used to collect text data.
Further, the collector 4 includes a first collecting part and a second collecting part, each communicatively connected with the training set sub-module.
The voice collection module 1, the test set sub-module, the user error correction module 3, the first collecting part and the training set sub-module are communicatively connected in sequence; the voice collection module 1, the voice synthesis module 5, the second collecting part and the training set sub-module are likewise connected in sequence.
Step 1: collect (target) user voice data. Specifically: step 1.1, collect user voice data with a recording device (the voice collection module) according to the specifications of the voice data set, forming data set A; step 1.2, if the data set is to be used for speech synthesis, go to step 2; if it is to serve as a test set in the automatically trained speech recognition system, go to step 3.
Step 2: perform speech synthesis on the speech data from step 1 through the speech synthesis module. Specifically: step 2.1, extract speech features from data set A with a feature extraction method (for example, MFCC); step 2.2, use the extracted features as parameters of the speech synthesis method (for example, SV2TTS or GAN-TTS), take the text data gathered by the text collection sub-module as data set B, and feed data set A into the speech synthesis module for synthesis; step 2.3, the speech synthesis module checks whether the synthesis of step 2.2 is complete, returning to step 2.2 if not and proceeding to step 2.4 otherwise; step 2.4, the speech data generated by the synthesis method becomes data set C; step 2.5, the first collecting part periodically checks whether the generated speech data has been updated, proceeding to step 2.6 if so and doing nothing otherwise; step 2.6, the first collecting part verifies the generated speech data (data set C) against the corresponding label data (data set B), proceeding to step 2.7 when the generated speech data has corresponding labels and returning to step 2.1 otherwise; step 2.7, the first collecting part adds the speech data and the corresponding label data to the training set of the automatically trained speech recognition system, and the flow continues at step 4.
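Step 2.1 names MFCC as one admissible feature extraction method. A minimal sketch using the librosa library (one common choice, assumed here rather than specified by the application) could be:

import librosa

def extract_mfcc(wav_path, n_mfcc=13):
    # sr=None preserves the file's native sampling rate.
    audio, sr = librosa.load(wav_path, sr=None)
    # Returns a matrix of shape (n_mfcc, n_frames) for data set A.
    return librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)

In step 2.2 these features would condition the synthesizer (for instance the speaker encoder in SV2TTS) so that the generated data set C carries the target user's voice characteristics.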
Step 3: perform speech recognition on the voice data from step 1 through the speech recognition module. Specifically: step 3.1, store the user voice data collected in step 1 (data set A) in the test set of the recognition process in the speech recognition module; step 3.2, run the speech recognition model of the automatically trained system on that test set, i.e. the speech recognition process; step 3.3, the speech recognition module checks whether recognition has finished, continuing if not, saving the recognition results and proceeding to step 3.4 if so, and returning to step 3.2 if recognition was interrupted; step 3.4, carry out the user error correction process on the recognition results of step 3.3 and proceed to step 3.5; step 3.5, the user error correction module checks whether correction is complete, and if so saves the corrected data as labels corresponding to data set A (data set D) and proceeds to step 3.6, otherwise it returns to step 3.4; step 3.6, the second collecting part checks whether the user voice data (data set A) has corresponding user-corrected labels, proceeding to step 3.7 if data set A has corrected labels (data set D) and returning to step 3.4 if not; step 3.7, the second collecting part puts the user voice data (data set A) and the corrected labels (data set D) into the training set of the speech recognition module, and the flow continues at step 4.
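Steps 3.4-3.7 amount to pairing each utterance in data set A with a user-approved transcript (data set D). The console prompt below is an illustrative assumption; any user interface would do.

def correct_and_collect(utterances, hypotheses):
    # Show each recognition result; an empty reply accepts it as-is.
    training_pairs = []
    for audio, hyp in zip(utterances, hypotheses):
        edited = input("Recognized: %r -- correction? " % hyp)
        label = edited.strip() or hyp            # entry of data set D
        training_pairs.append((audio, label))    # A paired with D
    return training_pairs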
Step 4: train through the speech recognition module on the data added to its training set in step 2 (speech synthesis module) or step 3 (speech recognition module). Specifically: step 4.1, check the data added to the training set in step 2 or step 3; once the amount of data reaches a certain threshold, the speech recognition module automatically trains on the training set; step 4.2, the speech recognition module checks the state of the automatic training process, waiting if training is in progress and proceeding to step 4.3 once it has stopped; step 4.3, the speech recognition module inspects the stopped training process; if all training rounds are complete, the automatically trained speech recognition model is saved, the whole process ends, and the system returns to the initial state (step 1) to wait for the next round; if the rounds are incomplete, the training process is judged interrupted, adjustments are made according to the cause, training restarts, and the flow returns to step 4.1.
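The checks of steps 4.1-4.3 can be expressed as a small state inspection; the TrainingState fields below are illustrative assumptions about how a concrete implementation might track progress.

from dataclasses import dataclass

@dataclass
class TrainingState:
    running: bool
    rounds_done: int
    rounds_total: int

def check_training(training_set, threshold, state):
    if len(training_set) < threshold:             # step 4.1
        return "waiting for more data"
    if state.running:                             # step 4.2
        return "training in progress; wait"
    if state.rounds_done >= state.rounds_total:   # step 4.3
        return "all rounds done: save model, return to step 1"
    return "interrupted: adjust by cause and restart (step 4.1)"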
The present application also provides a speech recognition method automatically trained by the speech synthesis method, applying the speech recognition system described above to the automatically trained speech recognition training.
Further, the method includes the following steps:
1) Collect a small amount of user voice data.
2) Extract the user's voice features with a feature extraction method, and generate voice data carrying those features through speech synthesis.
3) Test the recognition rate of the automatically trained speech recognition system on the user's speech set, and through error correction add the corrected recognition results to the system's training set.
4) Collect a certain amount of data from two sources in the training set (first: speech data from the synthesis results with the corresponding labels; second: user voice data with the user-corrected labels), and use this speech data as the training set to train the speech recognition system.
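Assuming each utterance is stored as a .wav file with a parallel .txt label file (a layout chosen here purely for illustration), the two training-set parts can be collected and merged as follows; utterances without labels are skipped, mirroring the collector's verification.

from pathlib import Path

def paired(wav_dir, label_dir):
    # Yield (audio, label) pairs; skip any utterance whose label
    # file is missing, as the collector's verification requires.
    for wav in sorted(Path(wav_dir).glob("*.wav")):
        label = Path(label_dir) / (wav.stem + ".txt")
        if label.exists():
            yield wav, label.read_text().strip()

# Part one: synthesized speech (data set C) paired with its texts (B);
# part two: user speech (data set A) paired with corrected labels (D).
# The directory names are illustrative assumptions.
training_set = list(paired("synth_wav", "synth_labels")) \
             + list(paired("user_wav", "corrected_labels"))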
Further, step 2 includes checking whether the speech synthesis data has been updated; if so, the updated speech synthesis data and the corresponding label data are verified, and once the verification passes, the speech synthesis data and the label data are collected.
Further, the first training data and the second training data are input to the training set sub-module; when the amount of data in the training set sub-module reaches a specific threshold, the data is trained on.
Further, the speech recognition module checks the state of the automatic training process: if training is in progress, it waits for it to finish; if training has stopped, it checks the first training data and the second training data and processes the data.
Further, processing the data includes inspecting the stopped automatic training process. If the training rounds are complete, the automatically trained speech recognition model is saved, the whole process ends, and the system returns to the initial state to wait for the next round; if the rounds are incomplete, the training process is judged interrupted and training restarts according to the cause of the interruption.
The present application combines neural-network-based speech recognition with speech synthesis to generate speech data and to train and test a speech recognition system automatically by means of a speech synthesis method; it synthesizes speech and uses the synthesis results for speech recognition; it thereby solves the problem that existing neural-network-based speech recognition methods face an insufficient amount of training-set data; and it corrects speech recognition results through user error correction and feeds the corrected results back into the speech recognition training set.
Neural-network-based speech recognition and speech synthesis have been two independent research fields, and earlier work has not combined the two techniques. To solve the problem of insufficient training-set data in neural-network-based speech recognition, the present application proposes a speech recognition method automatically trained by means of a speech synthesis method. Speech data are generated automatically from the user's voice characteristics by speech synthesis; because these data carry the user's voice characteristics together with the corresponding labels, they can be used directly as a training set for the automatically trained speech recognition method. At the same time, the user's speech can be passed through the automatically trained speech recognition method to obtain recognition results; after the user's error-correction process, these data (the speech data together with the corrected labels) can likewise be used as a training set for the automatically trained speech recognition system.
Embodiment
Taking a test on real speech data as an example, the present application uses a small amount of real speech data to synthesize, through a speech synthesis method, speech data carrying the characteristics of the real speech. The experiments demonstrate the feasibility and efficiency of the proposed method and provide an effective speech dataset for speech recognition automatically trained by means of speech synthesis.
1 Experimental environment
1) Computer hardware environment
Server model: Dell EMC PowerEdge R740
CPU: Intel Xeon Silver 4116
GPU: Tesla P100
2) Computer software environment
System environment: Ubuntu 18.04; GCC version: 7.5.0; development language: Python 3.7.6; GPU driver version: 410.129; CUDA version: 10.0; cuDNN version: 7.6.4; PyTorch version: 1.2. The software environment was verified with the following shell session:
$ uname -r
4.15.0-55-generic
$ python --version
Python 3.7.6
$ gcc --version
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ nvidia-smi
Thu May 14 13:27:39 2020
[Figure PCTCN2020139051-appb-000002: nvidia-smi output table]
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
$ cat /usr/local/cuda/version.txt
CUDA Version 10.0.130
$ cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 4
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
#include "driver_types.h"
2 Experimental steps
1) Construction of the speech recognition module. The method adopted in this application for the speech recognition automatically trained by the speech synthesis method is Listen, Attend and Spell (LAS).
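LAS itself is not spelled out in this disclosure; as a rough orientation only, its listener (the pyramidal BLSTM encoder) can be sketched in PyTorch as follows. The feature dimension, hidden size, and number of pyramid layers are illustrative assumptions, not the configuration used in the experiments.

import torch
import torch.nn as nn

class PyramidalBLSTM(nn.Module):
    """One pBLSTM layer: halves the time resolution, concatenating frame pairs."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.rnn = nn.LSTM(input_dim * 2, hidden_dim,
                           bidirectional=True, batch_first=True)

    def forward(self, x):                      # x: (batch, time, feat)
        b, t, f = x.shape
        if t % 2:                              # drop an odd trailing frame
            x = x[:, :-1, :]
            t -= 1
        x = x.reshape(b, t // 2, f * 2)        # concatenate adjacent frames
        out, _ = self.rnn(x)
        return out                             # (batch, time/2, 2*hidden)

class Listener(nn.Module):
    """LAS encoder: a BLSTM followed by a stack of pBLSTM layers."""
    def __init__(self, feat_dim=40, hidden_dim=256, pyramid_layers=3):
        super().__init__()
        self.base = nn.LSTM(feat_dim, hidden_dim,
                            bidirectional=True, batch_first=True)
        self.pyramid = nn.ModuleList(
            [PyramidalBLSTM(hidden_dim * 2, hidden_dim)
             for _ in range(pyramid_layers)])

    def forward(self, feats):                  # feats: (batch, time, feat_dim)
        h, _ = self.base(feats)
        for layer in self.pyramid:
            h = layer(h)
        return h                               # 8x shorter in time with 3 layers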
2) Construction of the speech synthesis module. The method adopted by the speech synthesis module in this application is generative-adversarial-network-based text-to-speech (GAN-TTS).
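Similarly, the GAN-TTS idea — a feed-forward generator mapping conditioning features to waveform samples, judged by a discriminator over random audio windows — can be sketched as follows. All channel counts and window sizes here are assumptions for illustration, not the published GAN-TTS configuration.

import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, cond_dim=80, channels=256, upsample=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(cond_dim, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Upsample(scale_factor=upsample),   # frames -> audio samples
            nn.Conv1d(channels, 1, kernel_size=3, padding=1),
            nn.Tanh(),                            # waveform in [-1, 1]
        )

    def forward(self, cond):        # cond: (batch, cond_dim, frames)
        return self.net(cond)       # (batch, 1, frames * upsample)

class RandomWindowDiscriminator(nn.Module):
    def __init__(self, window=2400, channels=128):
        super().__init__()
        self.window = window
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=15, stride=4),
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, kernel_size=15, stride=4),
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, 1, kernel_size=3),
        )

    def forward(self, wav):         # wav: (batch, 1, samples)
        # Score a randomly chosen window rather than the full waveform.
        start = torch.randint(0, max(1, wav.size(-1) - self.window), (1,)).item()
        return self.net(wav[..., start:start + self.window]).mean(dim=-1)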
3 Experimental results
First, voice features are extracted from the user speech data (dataset A); the speech synthesis module takes input text data (dataset B) and, based on the voice features, generates speech data (dataset C). After the collector verifies that dataset B and dataset C correspond to each other, the two datasets are stored in the training set of the speech recognition module. This validates steps 1 and 2. The speech recognition module then trains on the data in the training set and, after training completes, saves the new speech recognition model of the system automatically trained by the speech synthesis method; this independently validates step 4. Next, the speech recognition module produces recognition results from the user speech data (dataset A); the user error-correction module corrects these results and saves them as dataset D; after the collector verifies that dataset A and dataset D correspond to each other, the two datasets are stored in the training set of the speech recognition module; the module trains on the training-set data and, after training completes, saves the new speech recognition model. This validates steps 3 and 4.
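Schematically, one experimental round as just described can be expressed as follows, using the dataset names A–D from the text. Every function here is an assumed placeholder standing in for the corresponding module, not an API from this disclosure.

def extract_features(speech):
    """Stub: derive voice features from user speech (dataset A)."""
    return {"speaker": "user"}

def synthesize(text_data, features):
    """Stub: speech synthesis module, producing dataset C from dataset B."""
    return [f"audio({t}, {features['speaker']})" for t in text_data]

def recognize(speech):
    """Stub: speech recognition module producing hypotheses for dataset A."""
    return [f"hyp({s})" for s in speech]

def correct(hypotheses):
    """Stub: user error-correction module producing dataset D."""
    return [h.replace("hyp", "ref") for h in hypotheses]

def verified_pairs(labels, audio):
    """Stub: the collector, pairing audio with labels after verification."""
    assert len(labels) == len(audio)
    return list(zip(audio, labels))

def experiment_round(dataset_a, dataset_b):
    feats = extract_features(dataset_a)
    dataset_c = synthesize(dataset_b, feats)          # steps 1-2
    training_set = verified_pairs(dataset_b, dataset_c)
    dataset_d = correct(recognize(dataset_a))         # step 3
    training_set += verified_pairs(dataset_d, dataset_a)
    return training_set                               # input to step 4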
The speech recognition and speech synthesis methods in this application are not limited to approaches based on neural network structures; the neural speech synthesis method is not limited to the GAN-TTS method used here, and the neural speech recognition method is not limited to the methods mentioned herein (LAS, CTC, RNN-T, RNN-T with BLSTM, RNN-T with GRU, etc.); the deep learning framework used to implement the speech synthesis and speech recognition methods is not limited to TensorFlow or PyTorch, and the programming language used is not limited to Python, Java, C++, etc.
Although the present application has been described above with reference to specific embodiments, those skilled in the art will understand that many modifications may be made to the configurations and details disclosed herein within the principles and scope of the present disclosure. The scope of protection of the present application is determined by the appended claims, which are intended to cover all modifications within the literal meaning or scope of equivalents of the technical features of the claims.

Claims (10)

  1. A speech recognition system automatically trained by means of a speech synthesis method, characterized by comprising a speech collection module, a speech recognition module, a user error-correction module, a collector and a speech synthesis module, wherein the speech collection module, the speech recognition module, the user error-correction module, the collector and the speech recognition module are sequentially connected in communication;
    the speech collection module, the speech synthesis module, the collector and the speech recognition module are sequentially connected in communication.
  2. The speech recognition system automatically trained by means of a speech synthesis method according to claim 1, characterized in that the speech recognition module comprises a test-set sub-module and a training-set sub-module, the test-set sub-module being connected in communication with the user error-correction module, and the training-set sub-module being connected in communication with the collector;
    the speech collection module, the test-set sub-module, the user error-correction module, the collector and the training-set sub-module are sequentially connected in communication; the speech collection module, the speech synthesis module, the collector and the training-set sub-module are sequentially connected in communication.
  3. The speech recognition system automatically trained by means of a speech synthesis method according to claim 1, characterized in that the speech synthesis module comprises a text collection sub-module, the text collection sub-module being used to collect text data.
  4. The speech recognition system automatically trained by means of a speech synthesis method according to claim 2, characterized in that the collector comprises a first collection part and a second collection part, the first collection part being connected in communication with the training-set sub-module, and the second collection part being connected in communication with the training-set sub-module.
  5. A speech recognition method automatically trained by means of a speech synthesis method, characterized in that the speech recognition system automatically trained by means of a speech synthesis method according to any one of claims 1 to 4 is applied to speech recognition training automatically performed by means of the speech synthesis method.
  6. The speech recognition method automatically trained by means of a speech synthesis method according to claim 5, characterized in that the method comprises the following steps:
    Step 1: collecting user speech data;
    Step 2: extracting speech features from the speech data and performing speech synthesis to obtain speech synthesis data and label data corresponding to the speech synthesis data; collecting the speech synthesis data and the label data; and verifying the speech synthesis data against the label data to obtain first training data;
    Step 3: performing speech recognition on the speech data, and detecting and correcting errors in the speech recognition results to obtain error-correction data; collecting the speech data and the error-correction data corresponding to the speech data; and verifying the speech data against the corresponding error-correction data to obtain second training data;
    Step 4: training on the first training data and the second training data, and updating the speech recognition system automatically trained by the speech synthesis method according to the training results.
  7. The speech recognition method automatically trained by means of a speech synthesis method according to claim 6, characterized in that step 2 includes checking whether the speech synthesis data has been updated; if it has, the updated speech synthesis data and the corresponding label data are verified, and after the verification passes, the speech synthesis data and the label data are collected.
  8. The speech recognition method automatically trained by means of a speech synthesis method according to claim 6, characterized in that the first training data are input to the training-set sub-module and the second training data are input to the training-set sub-module, and when the amount of data in the training-set sub-module reaches a specific threshold, the data are trained on.
  9. The speech recognition method automatically trained by means of a speech synthesis method according to claim 8, characterized in that the speech recognition module checks the state of the automatic training process; if the training process is in progress, it waits for the training process to end; if the training process has stopped, it checks the first training data and the second training data and processes the data.
  10. The speech recognition method automatically trained by means of a speech synthesis method according to claim 6, characterized in that processing the data includes checking the stopped automatic training process; if the number of training rounds has been completed, the speech recognition model automatically trained by the speech synthesis method is saved, the whole process ends, and the system returns to the initial state to wait for the next round of the process to begin; if the number of training rounds has not been completed, the training process is judged to have been interrupted, adjustments are made according to the cause of the interruption, and training is restarted.
PCT/CN2020/139051 2020-12-24 2020-12-24 Speech recognition system and method automatically trained by means of speech synthesis method WO2022133915A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/139051 WO2022133915A1 (en) 2020-12-24 2020-12-24 Speech recognition system and method automatically trained by means of speech synthesis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/139051 WO2022133915A1 (en) 2020-12-24 2020-12-24 Speech recognition system and method automatically trained by means of speech synthesis method

Publications (1)

Publication Number Publication Date
WO2022133915A1 true WO2022133915A1 (en) 2022-06-30

Family

ID=82157207

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/139051 WO2022133915A1 (en) 2020-12-24 2020-12-24 Speech recognition system and method automatically trained by means of speech synthesis method

Country Status (1)

Country Link
WO (1) WO2022133915A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180247640A1 (en) * 2013-12-06 2018-08-30 Speech Morphing Systems, Inc. Method and apparatus for an exemplary automatic speech recognition system
CN109147799A (en) * 2018-10-18 2019-01-04 广州势必可赢网络科技有限公司 A kind of method, apparatus of speech recognition, equipment and computer storage medium
CN110265028A (en) * 2019-06-20 2019-09-20 百度在线网络技术(北京)有限公司 Construction method, device and the equipment of corpus of speech synthesis
CN110675862A (en) * 2019-09-25 2020-01-10 招商局金融科技有限公司 Corpus acquisition method, electronic device and storage medium
CN111540345A (en) * 2020-05-09 2020-08-14 北京大牛儿科技发展有限公司 Weakly supervised speech recognition model training method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20966487

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20966487

Country of ref document: EP

Kind code of ref document: A1