CN115240654A - Speech recognition model training method, device, equipment and storage medium - Google Patents


Info

Publication number: CN115240654A
Application number: CN202210843040.7A
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: 翁羽, 文连
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to: CN202210843040.7A
Publication of: CN115240654A
Legal status: Pending
Prior art keywords: training, recognition model, state information, speech recognition, sub

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 2015/0635 - Training updating or merging of old and new templates; Mean values; Weighting

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a speech recognition model training method, apparatus, device and storage medium, relating to the technical field of artificial intelligence. The method includes: obtaining training data and the number N of devices to be used for distributed parallel training, where N is a positive integer; encapsulating a speech recognition model training task by using a deep learning training optimization library starter; initializing the distributed training starter and loading multi-process information for distributed parallel training, the multi-process information including the number of processes of the distributed parallel training, which is N; performing distributed parallel training of the speech recognition models of the N processes by using the deep learning training optimization library according to the training data and the multi-process information; and determining the speech recognition model of any process as the speech recognition model obtained through training. Therefore, the training process of the speech recognition model can be simplified and the training speed improved.

Description

Speech recognition model training method, device, equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of artificial intelligence, in particular to a method, a device, equipment and a storage medium for training a speech recognition model.
Background
Automatic Speech Recognition (ASR) is the process of converting audio collected by a microphone into text. End-to-end speech recognition is a current research focus of ASR tasks. The PyTorch-based end-to-end speech processing toolkit Espnet2 is a mainstream speech toolkit in the industry and integrates training scenarios such as speech recognition and speech synthesis (TTS).
At present, the specific process by which Espnet2 implements ASR task training includes: data preparation, feature extraction, data format conversion, language model training, speech recognition model training, and recognition and scoring. The training of the speech recognition model uses a dictionary, a training set and a test set to train the acoustic part, based on the architecture of a connectionist temporal classification (CTC) model, an attention mechanism (Attention) model and a feature extractor that supports parallel computation (Transformer); this complex CTC + Transformer model structure often requires a large amount of data for distributed training.
In the prior art, a data-parallel distributed training method is used to train the speech recognition model. The method adopts a multi-process distributed starting mode to split the training of the speech recognition model into N processes (N being the number of GPUs); the N processes start training in parallel, and finally one process is used to save the trained model, so the training process is complex and the training speed is slow.
Disclosure of Invention
The application provides a method, a device, equipment and a storage medium for training a voice recognition model, which can simplify the training process of the voice recognition model and improve the training speed.
In a first aspect, the present application provides a method for training a speech recognition model, including:
acquiring training data and the number N of devices to be used in distributed parallel training, wherein N is a positive integer;
packaging a speech recognition model training task by using a deep learning training optimization library starter;
initializing a distributed training starter, and loading multi-process information for distributed parallel training, wherein the multi-process information comprises the number of processes of the distributed parallel training, and the number of the processes of the distributed parallel training is N;
performing distributed parallel training of the voice recognition models of the N processes by using a deep learning training optimization library according to the training data and the multi-process information;
and determining the speech recognition model of any process as the speech recognition model obtained by training.
In a second aspect, the present application provides a speech recognition model training apparatus, including:
the acquisition module is used for acquiring training data and the number N of devices to be used in distributed parallel training, wherein N is a positive integer;
the processing module is used for encapsulating the speech recognition model training task by using a deep learning training optimization library starter;
the loading module is used for initializing the distributed training starter and loading multi-process information for performing distributed parallel training, wherein the multi-process information comprises the number of processes of the distributed parallel training, and the number of the processes of the distributed parallel training is N;
the training module is used for performing distributed parallel training on the voice recognition models of the N processes by using a deep learning training optimization library according to the training data and the multi-process information;
and the determining module is used for determining the voice recognition model of any process as the trained voice recognition model.
In a third aspect, the present application provides an electronic device, comprising: a processor and a memory, the memory for storing a computer program, the processor for invoking and executing the computer program stored in the memory to perform the method of the first aspect.
In a fourth aspect, the present application provides a computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of the first aspect.
In a fifth aspect, the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the first aspect.
To sum up, in the present application, by obtaining training data and the number N of devices to be used for distributed parallel training, a deep learning training optimization library initiator is used to encapsulate a speech recognition model training task, the distributed training initiator is initialized, multi-process information for distributed parallel training is loaded, according to the training data and the multi-process information, the deep learning training optimization library is used to perform distributed parallel training of the speech recognition models of N processes, and finally, the speech recognition model of any process is determined as the speech recognition model obtained by training. After the distributed training starter is used for initialization, the distributed parallel training of the voice recognition models of the N processes is carried out by using the deep learning training optimization library, so that the model training processes of the multiple processes can be realized according to single-process logic, redundant training codes are simplified, the codes are clear, and the training speed is improved.
Further, in the application, when distributed parallel training is performed, the speech recognition model of each process is trained by using a data-parallel zero redundancy optimizer, and the state information of the whole model can be partitioned and distributed to each device of the parallel training, thereby reducing memory consumption; the training speed can then be increased by adjusting the number of samples per training step (batch size). Thereby, the training speed is further improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic view of an application scenario of a speech recognition model training method according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for training a speech recognition model according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for training a speech recognition model according to an embodiment of the present application;
fig. 4 is a schematic diagram illustrating a memory consumption comparison according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a speech recognition model training apparatus according to an embodiment of the present application;
fig. 6 is a schematic block diagram of an electronic device 700 provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in other sequences than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Before the technical scheme of the application is introduced, the related knowledge of the application is introduced as follows:
1. Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing and machine learning/deep learning.
2. Machine Learning (ML): the method is a multi-field cross subject and relates to a plurality of subjects such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The method specially studies how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve the performance of the computer. Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning.
3. Deep Learning (DL): is a branch of machine learning and is an algorithm that attempts to perform high-level abstraction of data using multiple processing layers that contain complex structures or consist of multiple non-linear transformations. Deep learning is to learn the intrinsic rules and the expression levels of training sample data, and the information obtained in the learning process is very helpful to the interpretation of data such as characters, images and sounds. The final goal of deep learning is to make a machine capable of human-like analytical learning, and to recognize data such as characters, images, and sounds. Deep learning is a complex machine learning algorithm, and achieves the effect in speech and image recognition far exceeding the prior related art.
4. The key technologies of speech technology (Speech Technology) are automatic speech recognition (ASR), speech synthesis (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the future development direction of human-computer interaction.
5. Espnet2 is a PyTorch-based end-to-end speech processing toolbox that integrates training scenarios such as speech recognition and speech synthesis (TTS).
6. The deep learning training optimization library (DeepSpeed) is a lightweight PyTorch-based framework used as a distributed training tool for very large models.
The technical scheme provided by the embodiment of the application mainly relates to technologies such as a speech processing technology and deep learning in an artificial intelligence technology, and particularly relates to an ASR technology. Specific examples are illustrated by the following examples.
In the prior art, a data-parallel distributed training method is used to train the speech recognition model; its training process is complex and its training speed is slow, and redundant memory consumption further limits the training speed of the speech recognition model. These two factors make the training of the speech recognition model slow. To solve this technical problem, in the embodiments of the application, training data and the number N of devices to be used in distributed parallel training are obtained, a deep learning training optimization library starter is used to encapsulate the speech recognition model training task, the distributed training starter is initialized, multi-process information for distributed parallel training is loaded, the deep learning training optimization library is used to perform distributed parallel training of the speech recognition models of the N processes according to the training data and the multi-process information, and finally the speech recognition model of any process is determined as the trained speech recognition model. After initialization with the distributed training starter, the distributed parallel training of the speech recognition models of the N processes is performed with the deep learning training optimization library, so the multi-process model training flow can be implemented with single-process logic. Compared with the hand-written multi-process definitions used in the prior art, condition judgments within the processes are reduced, redundant training code is simplified, the code is clearer, the training flow of the speech recognition model can be simplified, and the training speed is improved.
In the prior art, during data parallelism, the state information of the whole speech recognition model is copied onto the N GPUs trained in parallel, occupying the memory of each GPU and causing a large amount of redundant memory consumption. As model complexity and data set size increase, this memory redundancy limits the training speed of the speech recognition model, making its training slower.
To solve this problem, in the embodiments of the application, during distributed parallel training, the speech recognition model of each process is trained with a data-parallel zero redundancy optimizer, so the state information of the whole speech recognition model can be partitioned and distributed to each device of the parallel training, reducing memory consumption; the training speed can then be increased by adjusting the number of samples per training step (batch size). Thus, the training speed is further improved.
Fig. 1 is a schematic view of an application scenario of a speech recognition model training method provided in an embodiment of the present application, as shown in fig. 1, the implementation scenario of the embodiment of the present application relates to a server 1 and a terminal device 2, and the terminal device 2 may perform data communication with the server 1 through a communication network.
In some implementation manners, the terminal device 2 refers to a device that has a rich man-machine interaction manner, has internet access capability, is usually equipped with various operating systems, and has a strong processing capability. The terminal device may be a terminal device such as a smart phone, a tablet computer, a laptop computer, a desktop computer, or a telephone watch, but is not limited thereto. Optionally, in this embodiment of the application, a client of the speech recognition software is installed in the terminal device 2, and a user may input corresponding speech information to be recognized through the client.
In some implementations, the terminal device 2 includes, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, an intelligent household appliance, a vehicle-mounted terminal, and the like. Illustratively, the intelligent voice interaction device may be an intelligent sound box, an intelligent television box, an online voice interaction system, an intelligent voice assistant, an on-board intelligent voice device, an intelligent voice device with a simultaneous interpretation function or installed with a voice input method, and the like.
The server 1 in fig. 1 may be an independent physical server, may be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server providing a cloud computing service. This is not limited by the present application.
Illustratively, the server 1 is configured to lay out and train a speech recognition model, deploy the trained speech recognition model in a corresponding terminal device, and process speech information in the use environment, such as performing speech recognition, by using the deployed speech recognition model through the terminal device (e.g., the terminal device 2).
It can be understood that before the speech recognition model processes the speech information in the use environment, the speech recognition model needs to be trained, and the speech recognition model training method provided in the embodiment of the present application may be specifically used. The speech recognition model training method provided by the embodiment of the application is beneficial to the accelerated training of the speech recognition model, reduces the whole time of the speech recognition model training and improves the training speed.
In some implementations, fig. 1 exemplarily shows one terminal device and one server, and may actually include other numbers of terminal devices and servers, which are not limited in this application.
In some implementation manners, the speech recognition model training method provided by the embodiment of the application can be applied to an ASR training task developed based on Espnet2, and is beneficial to accelerated training of the speech recognition model, reducing the whole time of speech recognition model training and improving the training speed.
In some implementations, a specific flow of implementing the ASR training task based on Espnet2 may include:
1. and preparing data, specifically downloading data and decompressing the data.
2. Feature extraction, speech features are extracted using Kaldi (which is also an open source ASR tool).
3. And converting the data format, and converting the intermediate data into a JSON format.
4. And training the language model.
5. The training of the speech recognition model, which may specifically be based on a CTC model, an Attention architecture and a Transformer decoder, training the acoustic part using a dictionary, a training set and a test set.
6. And identifying and scoring, and scoring by combining a Transformer model, a CTC model and an RNN language model.
Optionally, the speech recognition model training method provided in the embodiment of the present application may be applied to the 5 th process in the above processes, that is, training the speech recognition model. The training speed of the speech recognition model can be increased, and thus the training speed of the overall ASR training task can be increased.
Optionally, the speech recognition model training method provided in the embodiment of the present application may also be used for training alone.
The following describes the speech recognition model training method provided in the embodiments of the present application in detail with reference to the accompanying drawings.
Fig. 2 is a flowchart of a method for training a speech recognition model according to an embodiment of the present application, where an execution subject of the method may be various electronic devices that operate a speech recognition model training apparatus, such as a terminal device with a speech recognition function, a server with a speech recognition model training function, or a server cluster. As shown in fig. 2, the method of this embodiment may include:
S101, acquiring training data and the number N of devices to be used in distributed parallel training, where N is a positive integer.
Specifically, acquiring the training data and the number N of devices to be used in distributed parallel training may be receiving input training data and the number N of devices to be used in distributed parallel training. The device to be used may be a Graphics Processing Unit (GPU). The number of processes of the distributed parallel training is equal to the number N of devices to be used.
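For illustration only (not part of the original disclosure), the following is a minimal sketch of how the training entry might receive the training data path and the device count N; the argument names are assumptions:

    import argparse
    import torch

    def parse_training_args():
        # Hypothetical argument names; the real entry point is defined by the training task.
        parser = argparse.ArgumentParser(description="speech recognition model training")
        parser.add_argument("--train_data", type=str, required=True,
                            help="path to the prepared speech training data")
        parser.add_argument("--num_devices", type=int, default=torch.cuda.device_count(),
                            help="number N of GPUs to be used for distributed parallel training")
        args = parser.parse_args()
        assert args.num_devices > 0, "N must be a positive integer"
        return args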
And S102, encapsulating the speech recognition model training task by using a deep learning training optimization library starter.
Specifically, after the training data and the number N of devices to be used in the distributed parallel training are acquired, the speech recognition model training task needs to be started. The speech recognition model training task is encapsulated by using the deep learning training optimization library starter (the DeepSpeed launcher), so that some functions of the deep learning training optimization library can be used in the subsequent distributed parallel training.
As an implementable manner, the speech recognition model training task is encapsulated by using a deep learning training optimization library initiator, which may be:
and modifying the training entry of the speech recognition model training task by using a deep learning training optimization library starter.
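A minimal sketch, under our own assumptions, of what rewriting the training entry for the DeepSpeed launcher could look like; the script name train_asr.py is hypothetical, while the deepspeed command-line launcher and deepspeed.add_config_arguments are existing DeepSpeed facilities:

    # Launching (shell): deepspeed --num_gpus=N train_asr.py --deepspeed_config ds_config.json
    import argparse
    import deepspeed

    def build_parser():
        parser = argparse.ArgumentParser(description="speech recognition model training task")
        # The launcher passes each process's local rank to the rewritten training entry.
        parser.add_argument("--local_rank", type=int, default=-1,
                            help="rank injected by the distributed launcher")
        # Adds the standard DeepSpeed arguments (e.g. --deepspeed, --deepspeed_config).
        parser = deepspeed.add_config_arguments(parser)
        return parser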
S103, initializing the distributed training starter, and loading multi-process information for distributed parallel training, wherein the multi-process information comprises the number of processes of the distributed parallel training, and the number of the processes of the distributed parallel training is N.
Specifically, the PyTorch multi-process management starter adopted in the existing flow starts distributed parallel training; it requires the user to manually control the processes with basic information and submit it to the whole task flow, and the code is implemented in a multi-branch parallel manner, so there is more code and the speed is lower. In the embodiment of the application, the distributed training starter (torch.distributed.launch) is used to start distributed parallel training; the user can implement the training flow with single-process logic, the code is clear and fast, and since the distributed training starter follows a standardized flow, the maintenance cost can be reduced.
The distributed training starter (torch.distributed.launch) is initialized, and multi-process information for distributed parallel training is loaded, where the multi-process information includes the number of processes of the distributed parallel training, which is N. Optionally, the multi-process information may further include at least one of an identifier of each of the N processes, an identifier of each of the N devices, and an association relationship between the identifier of each process and the identifier of each device. Optionally, the identifier of each of the N processes, the identifier of each of the N devices, and the association relationship between them may be received from the user after S101, or may be set according to the number N of devices after S101.
Optionally, in this embodiment, parameters of the distributed training starter need to be modified to adapt to the training characteristic of ASR training tasks that the data lengths vary greatly. Experiments verified that the memory limits on the model size used when the sub-state information is gathered need to be modified; the parameters to be modified may include the reduce bucket size (reduce_bucket_size) and the allgather bucket size (allgather_bucket_size).
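For reference, the bucket-size parameters mentioned above correspond to keys in the zero_optimization section of a DeepSpeed configuration; the sketch below uses real key names, but the concrete values are assumptions rather than values disclosed by the application:

    # Sketch of a DeepSpeed configuration dict; the numbers are placeholder assumptions
    # chosen with variable-length ASR batches in mind.
    ds_config = {
        "train_micro_batch_size_per_gpu": 1,   # assumption: samples per process per step
        "zero_optimization": {
            "stage": 1,                        # ZeRO-1: partition optimizer states only
            "reduce_bucket_size": 5e7,         # assumption: smaller reduce bucket
            "allgather_bucket_size": 5e7,      # assumption: smaller allgather bucket
        },
    }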
And S104, performing distributed parallel training of the voice recognition models of the N processes by using a deep learning training optimization library according to the training data and the multi-process information.
Specifically, after initializing a distributed training initiator and loading multi-process information for distributed parallel training, the distributed parallel training can be started, that is, the distributed parallel training of the voice recognition models of N processes is started.
As an implementable manner, according to training data and multi-process information, a deep learning training optimization library is used to perform distributed parallel training of the speech recognition models of N processes, which specifically may be:
S1041, dividing the training data into N parts of sub-training data by using a preset segmentation algorithm.
The training data is speech training data, and because the lengths of speech training samples vary, when the training data is divided into N parts of sub-training data, the amounts of data in the N parts of sub-training data may be equal or unequal.
Specifically, the DeepSpeed framework starts an equal split of the training data by default, which is not suitable for speech training data, so the code needs to be modified: the data-reading steps in the DeepSpeed framework are commented out and replaced with code corresponding to the preset segmentation algorithm, which can be customized.
S1042, based on N parts of sub-training data, performing distributed parallel training of the voice recognition models of N processes by using a deep learning training optimization library, wherein one part of sub-training data is used for training the voice recognition model of one process.
Specifically, a deep learning training optimization library is used for carrying out distributed parallel training on the voice recognition models of N processes according to N parts of sub-training data, the N processes are carried out simultaneously, and one part of sub-training data is used for training the voice recognition models of one process.
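A minimal sketch of dividing the training data into N parts of sub-training data, one per process; the round-robin policy below is an assumption standing in for the preset segmentation algorithm:

    from typing import List, Sequence, TypeVar

    T = TypeVar("T")

    def split_training_data(samples: Sequence[T], num_procs: int) -> List[List[T]]:
        # Round-robin split as a stand-in for the preset segmentation algorithm; because
        # utterance lengths vary, the N parts may hold unequal total amounts of speech.
        shards: List[List[T]] = [[] for _ in range(num_procs)]
        for idx, sample in enumerate(samples):
            shards[idx % num_procs].append(sample)
        return shards

    # The process with rank r then trains its own copy of the model on shards[r].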
Further, as an implementable manner, the speech recognition model of each of the N processes is model-trained by using a data-parallel zero redundancy optimizer. In this embodiment, model training is performed by using a data-parallel Zero Redundancy Optimizer (ZeRO), so that the state information of the entire speech recognition model can be partitioned and distributed to each device of the parallel training, thereby reducing memory consumption. Because the state information of the speech recognition model includes optimization process information, gradient information and parameter information, optionally, in this embodiment, only the optimization process information may be partitioned and distributed to each device of the parallel training, in which case model training may be performed using stage 1 of the data-parallel zero redundancy optimizer (ZeRO-1); alternatively, the optimization process information and the gradient information may be partitioned and distributed to each device of the parallel training; or the optimization process information and the parameter information may be partitioned and distributed to each device of the parallel training; or the optimization process information, the gradient information and the parameter information may all be partitioned and distributed to each device of the parallel training.
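For reference, a sketch of how these partitioning choices map onto DeepSpeed's ZeRO stages; the stage numbering is a DeepSpeed convention, and the application's other combinations (for example, partitioning optimization process information and parameter information but not gradient information) do not correspond to a standard stage:

    # Known ZeRO stages and what they partition across the N devices.
    ZERO_STAGES = {
        1: "optimizer states (optimization process information)",
        2: "optimizer states + gradients",
        3: "optimizer states + gradients + parameters",
    }

    def zero_config(stage: int) -> dict:
        # Sketch: select how much of the model state information is sharded.
        assert stage in ZERO_STAGES, "stage must be 1, 2 or 3"
        return {"zero_optimization": {"stage": stage}}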
Further, as an implementable manner, the speech recognition model of one of the N processes is obtained by training, based on one of the N sub-training data, in the following manner:
S10421, initializing an interface of an engine of the deep learning training optimization library.
S10422, constructing a voice recognition model, an optimizer and a scheduler, loading an engine of a deep learning training optimization library, and encapsulating the constructed voice recognition model, the optimizer and the scheduler to obtain an encapsulated voice recognition model, an encapsulated optimizer and an encapsulated scheduler.
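A sketch of how loading the engine and encapsulating the constructed speech recognition model, optimizer and scheduler might look with DeepSpeed; build_asr_model, build_optimizer and build_scheduler are hypothetical helpers, while deepspeed.initialize is the real encapsulation call:

    import deepspeed

    def wrap_with_deepspeed(ds_config):
        model = build_asr_model()               # hypothetical: CTC + Attention + Transformer model
        optimizer = build_optimizer(model)      # hypothetical: e.g. Adam over the model parameters
        scheduler = build_scheduler(optimizer)  # hypothetical: learning-rate scheduler

        # deepspeed.initialize returns the encapsulated (engine-wrapped) model, the
        # encapsulated optimizer, an optional dataloader and the encapsulated scheduler.
        model_engine, optimizer, _, scheduler = deepspeed.initialize(
            model=model,
            optimizer=optimizer,
            lr_scheduler=scheduler,
            config=ds_config,
        )
        return model_engine, optimizer, scheduler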
S10423, loading the pre-training model and loading a piece of sub-training data.
Specifically, the process of loading the pre-training model and loading one part of sub-training data may be the same as in the existing flow; the processing logic of the deep learning training optimization library is not used for these steps, because of the characteristic of speech training data that its length is variable (a sentence may be long or short).
S10424, loading the check point, and storing the first state information of the check point in a distributed manner, wherein the first state information comprises the state information of the packaged voice recognition model.
The state information of the encapsulated speech recognition model includes optimization process information, or the state information of the encapsulated speech recognition model may include optimization process information and at least one of gradient information and parameter information. That is, the state information of the encapsulated speech recognition model may include optimization process information, or the state information of the encapsulated speech recognition model may include optimization process information and gradient information, or the state information of the encapsulated speech recognition model may include optimization process information and parameter information, or the state information of the encapsulated speech recognition model may include optimization process information, gradient information and parameter information.
In the embodiment of the application, the first state information or the second state information of the distributed storage checkpoint refers to that the first state information or the second state information is stored on the N devices respectively after being segmented.
In the embodiment of the application, after the checkpoint is loaded, the state information of the encapsulated optimizer of the checkpoint and the state information of the encapsulated scheduler need to be stored, and the two pieces of information occupy a small memory and can be completely copied and stored on each of the N devices.
Optionally, the first state information of the distributed storage checkpoint may be that the first state information is divided into N equal sub-state information, and the N sub-state information is stored on N devices, respectively.
S10425, according to a copy of sub-training data, using a zero redundancy optimizer based on data parallelism to carry out training iteration on the packaged voice recognition model, and obtaining a trained voice recognition model of a process and second state information of a check point.
Specifically, in an implementable manner, S10425 may specifically be:
the method comprises the steps of training a packaged voice recognition model according to a piece of sub-training data, conducting forward propagation, backward propagation and parameter updating in the iterative process of the packaged voice recognition model training, updating and storing sub-state information stored in equipment, determining that the packaged voice recognition model training is finished, obtaining a trained voice recognition model of a process, obtaining sub-state information stored in other N-1 pieces of equipment respectively, obtaining N-1 pieces of sub-state information, summarizing the sub-state information and the N-1 pieces of sub-state information stored in the equipment, and obtaining second state information of a check point. Optionally, the collection of sub-status information herein uses an allgather algorithm to perform the transmission and synchronous reception of the sub-status information.
Specifically, the first state information is divided into N equal sub-state information, the N sub-state information is respectively stored on N devices, in the parallel training process of the model, each device only stores and updates 1/N of the first state information and only updates 1/N of the parameters, and when the training is finished, the sub-state information stored on each device is summarized to obtain complete updated state information and updated parameters.
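A minimal sketch of the per-process training iteration described above; model_engine.backward and model_engine.step are existing DeepSpeed engine calls, and the assumption that the encapsulated model's forward pass returns the loss is ours:

    def train_one_process(model_engine, shard):
        # Each process iterates over its own part of the sub-training data.
        for batch in shard:
            loss = model_engine(batch)      # forward propagation (assumed to return the loss)
            model_engine.backward(loss)     # backward propagation; gradients are synchronized
            model_engine.step()             # parameter update; only the 1/N optimizer-state
                                            # shard held by this device is updated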
And S10426, storing second state information in a distributed mode, wherein the second state information comprises state information of a trained speech recognition model of a process.
Optionally, the second state information of the distributed storage checkpoint may be obtained by dividing the second state information into N equal sub-state information, and storing the N sub-state information on the N devices, respectively.
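For illustration, the distributed storage of the checkpoint state can rely on the engine's collective checkpoint calls, where every rank reads and writes only its own shard; the directory and tag below are assumptions:

    CKPT_DIR = "exp/asr_checkpoint"  # hypothetical path

    def save_and_restore(model_engine):
        # save_checkpoint is collective: each of the N processes stores its own piece of
        # sub-state information, so the second state information is stored distributedly.
        model_engine.save_checkpoint(CKPT_DIR, tag="latest")

        # load_checkpoint is likewise collective; each rank reads back only its shard.
        load_path, client_state = model_engine.load_checkpoint(CKPT_DIR, tag="latest")
        return load_path, client_state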
It should be noted that the trained speech recognition model of one process is a model trained by the encapsulated speech recognition model, and the content type included in the state information of the trained speech recognition model of one process is the same as the content type included in the state information of the encapsulated speech recognition model, for example, if the state information of the encapsulated speech recognition model includes optimization process information, then the state information of the trained speech recognition model of one process also includes optimization process information.
In the embodiment of the application, the encapsulated speech recognition model is trained iteratively by using the data-parallel zero redundancy optimizer, so that the state information is partitioned across the N devices and memory consumption during training is reduced.
and S105, determining the voice recognition model of any process as the voice recognition model obtained by training.
According to the speech recognition model training method provided by the embodiment, training data and the number N of devices to be used in distributed parallel training are obtained, a deep learning training optimization library starter is used for packaging a speech recognition model training task, the distributed training starter is initialized, multi-process information for distributed parallel training is loaded, the deep learning training optimization library is used for distributed parallel training of the speech recognition models of N processes according to the training data and the multi-process information, and finally the speech recognition model of any process is determined to be the speech recognition model obtained through training. After the distributed training starter is used for initialization, the distributed parallel training of the voice recognition models of the N processes is carried out by using the deep learning training optimization library, so that the model training processes of the multiple processes can be realized according to single-process logic, redundant training codes are simplified, the codes are clear, and the training speed is improved.
The technical solutions provided in the embodiments of the present application are described in detail below with reference to a specific embodiment.
Fig. 3 is a flowchart of a method for training a speech recognition model according to an embodiment of the present application, where an execution subject of the method may be various electronic devices that operate a speech recognition model training apparatus, such as a terminal device with a speech recognition function, a server with a speech recognition model training function, or a server cluster. As shown in fig. 3, the method of this embodiment may include:
S201, acquiring training data and the number N of devices to be used in distributed parallel training, where N is a positive integer.
Specifically, acquiring the training data and the number N of devices to be used in distributed parallel training may be receiving input training data and the number N of devices to be used in distributed parallel training. The device to be used may be a GPU. The number of processes of the distributed parallel training is equal to the number N of devices to be used.
S202, packaging the speech recognition model training task by using a deep learning training optimization library starter.
Specifically, after the training data and the number N of devices to be used in the distributed parallel training are acquired, the speech recognition model training task needs to be started. The speech recognition model training task is encapsulated by using the deep learning training optimization library starter (the DeepSpeed launcher), so that the subsequent distributed parallel training can be performed by using the flow of the deep learning training optimization library.
S203, initializing the distributed training starter, and loading multi-process information for distributed parallel training.
The multi-process information may include the number of processes (N) of the distributed parallel training, an identifier of each process in the N processes, an identifier of each device in the N devices, and an association relationship between the identifier of each process and the identifier of each device.
Optionally, the identifier of each process in the N processes, the identifier of each device in the N devices, and the association relationship between the identifier of each process and the identifier of each device may be received by a user after S101, or may be set according to the number N of devices after S101.
S204, dividing the training data into N parts of sub-training data by using a preset segmentation algorithm, and performing distributed parallel training of the voice recognition models of the N processes by using a deep learning training optimization library based on the N parts of sub-training data, wherein one part of sub-training data is used for training the voice recognition model of one process.
As an implementable manner, S204 may specifically include:
S2041, initializing an interface of an engine of the deep learning training optimization library.
Specifically, an interface of an engine of the deep learning training optimization library (DeepSpeed) is initialized.
S2042, constructing a voice recognition model, an optimizer and a scheduler, loading an engine of a deep learning training optimization library, and encapsulating the constructed voice recognition model, the optimizer and the scheduler to obtain an encapsulated voice recognition model, an encapsulated optimizer and an encapsulated scheduler.
S2043, loading a pre-training model.
S2044, loading a piece of sub-training data.
Specifically, S2043 and S2044 may use the original data loading scheme of Espnet2 to override DeepSpeed's default behavior of forcibly loading the data.
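As a hedged sketch of this override (not from the original disclosure), one way to keep Espnet2's own data loading is simply not to hand the dataset to DeepSpeed at initialization and to iterate the existing Espnet2 iterator directly; the iterator and device names are assumptions:

    def training_batches(espnet_iterator, device):
        # Keep Espnet2's own (variable-length-aware) batch iterator and only move tensors
        # to this process's device, instead of passing training_data to deepspeed.initialize
        # and letting DeepSpeed build its own evenly split dataloader.
        for batch in espnet_iterator:
            yield {name: tensor.to(device) for name, tensor in batch.items()}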
S2045, loading the check points, and storing first state information of the check points in a distributed mode, wherein the first state information comprises state information of the packaged voice recognition model.
Optionally, the state information of the encapsulated speech recognition model may include gradient information, optimization process information and parameter information. Among the state information of the speech recognition model during training, the optimization process information consumes most of the memory.
In an implementable manner, the state information of the encapsulated speech recognition model includes the optimization process information; that is, in this embodiment, the optimization process information is stored in a distributed manner, while the gradient information and parameter information are stored on each device. Accordingly, the zero redundancy optimizer may employ ZeRO-1. In this embodiment, storing the optimization process state information in a distributed manner can reduce memory consumption.
Fig. 4 is a schematic diagram of memory consumption comparison according to an embodiment of the present application. As shown in Fig. 4, in the prior art, the state information of the speech recognition model (including gradient information, optimization process information and parameter information) is copied and stored on the N GPUs trained in parallel (GPU_0 to GPU_{N-1}), occupying the memory of each GPU. In this embodiment, the optimization process information in the state information of the speech recognition model is stored in a distributed manner; specifically, the optimization process information is divided into N equal partitions and stored on the N GPUs, while the gradient information and parameter information are copied and stored on the N GPUs trained in parallel. With the method of this embodiment, the memory consumption after partitioning the optimization process information is reduced from the existing 4Ψ + KΨ to 4Ψ + KΨ/Nd.
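As a worked example (not part of the original filing), taking the usual mixed-precision assumption that parameters and gradients occupy 2Ψ bytes each and the optimizer states KΨ bytes, with K around 12 for an Adam-style optimizer:

    def per_device_state_bytes(psi: float, k: float = 12, n_d: int = 8, zero1: bool = True) -> float:
        # Evaluates the two expressions above: 4*psi + k*psi without partitioning,
        # and 4*psi + k*psi / n_d when the optimization process information is partitioned.
        base = 4 * psi                       # parameters + gradients, replicated on every GPU
        opt = k * psi / n_d if zero1 else k * psi
        return base + opt

    # e.g. a 100-million-parameter model on 8 GPUs (assumed figures, for illustration only):
    #   without partitioning: 4e8 + 12e8     = 1.6e9 bytes of model state per GPU
    #   with ZeRO-1:          4e8 + 12e8 / 8 = 5.5e8 bytes of model state per GPU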
In an implementation manner, the state information of the encapsulated speech recognition model includes gradient information and optimization process information, that is, in the embodiment, the gradient information and the optimization process information are stored in a distributed manner, and parameter information is stored on each device.
In an implementable manner, the state information of the encapsulated speech recognition model includes parameter information, optimization process information, and gradient information, that is, in this embodiment, the parameter information, the optimization process information, and the gradient information are stored in a distributed manner.
Optionally, the first state information of the distributed storage checkpoint may be that the first state information is divided into N equal sub-state information, the N sub-state information is stored in N devices, and one device stores one sub-state information.
Optionally, the first state information may further include at least one of encapsulated state information of the optimizer and encapsulated state information of the scheduler, and at least one of encapsulated state information of the optimizer and encapsulated state information of the scheduler is also stored in a distributed manner, so that consumption of the memory may be further reduced.
In this embodiment, the problem of imbalance of distributed training video memory of the ASR task can also be solved to a certain extent by storing the first state information of the check point in a distributed manner.
S2046, according to a copy of sub-training data, training iteration is carried out on the packaged voice recognition model through a zero redundancy optimizer based on data parallelism, and second state information of the trained voice recognition model of a process and a check point is obtained.
Specifically, in an implementable manner, S2046 may specifically be:
the method comprises the steps of training a packaged voice recognition model according to a piece of sub-training data, conducting forward propagation, backward propagation and parameter updating in the iterative process of the packaged voice recognition model training, updating and storing sub-state information stored in equipment, determining that the packaged voice recognition model training is finished, obtaining a trained voice recognition model of a process, obtaining sub-state information stored in other N-1 pieces of equipment respectively, obtaining N-1 pieces of sub-state information, collecting the sub-state information stored in the equipment and the N-1 pieces of sub-state information, and obtaining second state information of a check point.
S2047, storing second state information of the check point in a distributed mode, wherein the second state information comprises state information of a trained speech recognition model of a process.
Optionally, the second state information of the distributed storage checkpoint may be obtained by dividing the second state information into N equal sub-state information, and storing the N sub-state information on the N devices, respectively.
Optionally, the second state information further includes at least one of state information of the trained optimizer and state information of the trained scheduler, and at least one of the state information of the trained optimizer and the state information of the trained scheduler is also stored in a distributed manner, so that memory consumption can be further reduced.
It is understood that the second status information comprises the same content as the first status information.
S205, determining the voice recognition model of any process as the voice recognition model obtained by training.
By using the speech recognition model training method provided by the embodiment of the application, after initialization with the distributed training starter, the deep learning training optimization library is used to perform distributed parallel training of the speech recognition models of the N processes, so the multi-process model training flow can be implemented with single-process logic, the training process of the speech recognition model can be simplified, and the training speed is improved. In addition, the speech recognition model of each process is trained with a data-parallel zero redundancy optimizer, which reduces memory consumption, and the training speed can be increased by adjusting the number of samples per training step (batch size). The combination of these two points can further improve the training speed.
In one embodiment, taking English training data as an example, in single-machine eight-GPU distributed parallel training, the speedup ratio can reach 1.35, the overall training time is reduced, and the training speed is increased by more than 20%, as shown in Table 1 below:
Table 1: Training time comparison (the table is provided as an image in the original publication)
In the embodiment of the present application, the speech recognition model of each process is model-trained using a data-parallel zero redundancy optimizer. In one embodiment, as shown in Table 2 below, the speech recognition model training of the embodiment of the present application can reduce video memory occupation by more than 400 MiB. Specifically, taking English training data as an example, in single-machine eight-GPU distributed parallel training, the maximum video memory occupation in the embodiment of the present application is 9453 MiB, compared with a maximum of 10203 MiB in the prior art.
Table 2: Maximum video memory occupation comparison (the table is provided as an image in the original publication)
With the method provided by this embodiment, when training is interrupted unexpectedly, multi-process management is more reasonable and efficient: once training stops, all processes respond immediately, the training is interrupted and resources are released quickly, which improves usability. In the prior art, some processes keep waiting for data transmission and are interrupted only after waiting until a time limit, so the user has to clear them manually in time before the devices can be reused.
The following are embodiments of the apparatus of the present application that may be used to perform the above-described method embodiments of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method described above in the present application.
Fig. 5 is a schematic structural diagram of a speech recognition model training apparatus provided in an embodiment of the present application, and as shown in fig. 5, the apparatus in this embodiment may include: an acquisition module 11, a processing module 12, a loading module 13, a training module 14 and a determination module 15, wherein,
the acquisition module 11 is configured to acquire training data and the number N of devices to be used in distributed parallel training, where N is a positive integer;
the processing module 12 is configured to use a deep learning training optimization library starter to encapsulate a speech recognition model training task;
the loading module 13 is configured to initialize the distributed training initiator, and load multi-process information for performing distributed parallel training, where the multi-process information includes a number of processes for the distributed parallel training, and the number of processes for the distributed parallel training is N;
the training module 14 is configured to perform distributed parallel training on the speech recognition models of the N processes by using a deep learning training optimization library according to the training data and the multi-process information;
the determining module 15 is configured to determine a speech recognition model of any one process as a trained speech recognition model.
Optionally, the training module 14 is configured to:
dividing the training data into the N parts of sub-training data by using a preset segmentation algorithm;
and performing distributed parallel training of the voice recognition models of the N processes by using a deep learning training optimization library based on the N parts of sub-training data, wherein one part of sub-training data is used for training the voice recognition model of one process.
Optionally, the speech recognition model of each of the N processes is model-trained by using a zero-redundancy optimizer based on data parallelism.
Optionally, the speech recognition model of one of the N processes is obtained by training, based on one of the N sub-training data, in the following manner:
initializing an interface of an engine of the deep learning training optimization library;
constructing a voice recognition model, an optimizer and a scheduler, loading an engine of the deep learning training optimization library, and encapsulating the constructed voice recognition model, the optimizer and the scheduler to obtain an encapsulated voice recognition model, an encapsulated optimizer and an encapsulated scheduler;
loading a pre-training model and loading the sub-training data;
loading a check point, and storing first state information of the check point in a distributed manner, wherein the first state information comprises state information of a packaged voice recognition model;
training and iterating the packaged voice recognition model by using a zero redundancy optimizer based on data parallelism according to the sub-training data to obtain a trained voice recognition model of a process and second state information of the check point;
and distributively storing the second state information, wherein the second state information comprises the state information of the trained speech recognition model of the process.
Optionally, the state information of the encapsulated speech recognition model includes optimization process information, or the state information of the encapsulated speech recognition model includes optimization process information and at least one of gradient information and parameter information.
Optionally, the storing the first state information of the checkpoint in a distributed manner includes:
segmenting the first state information into N pieces of sub-state information;
and respectively storing the N pieces of sub-state information on the N devices (an illustrative sketch of this partitioning follows).
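How the first state information is split is left open by the embodiment. A minimal, hypothetical sketch of cutting a list of state entries into N shards, with each rank keeping only its own slice, might look like this:

```python
def shard_state(state_items, rank, world_size):
    # Keep only this rank's 1/N slice of the checkpoint state (hypothetical partitioning scheme).
    items = list(state_items)
    shard_size = (len(items) + world_size - 1) // world_size  # ceiling division
    return items[rank * shard_size:(rank + 1) * shard_size]
```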
Optionally, the training module 14 is configured to:
train the encapsulated speech recognition model according to the sub-training data;
in each iteration of training the encapsulated speech recognition model, perform forward propagation, backward propagation and parameter updating, and update the piece of sub-state information stored on the local device;
determine that the training of the encapsulated speech recognition model is finished, to obtain a trained speech recognition model of the process;
acquire the pieces of sub-state information respectively stored on the other N-1 devices to obtain N-1 pieces of sub-state information;
and aggregate the piece of sub-state information stored on the local device and the N-1 pieces of sub-state information to obtain the second state information of the checkpoint (an illustrative sketch of this aggregation follows).
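As a sketch of the aggregation step only, assuming each of the N ranks holds its shard as a dictionary of disjoint keys and that torch.distributed has already been initialized:

```python
import torch.distributed as dist

def gather_full_checkpoint_state(local_shard, world_size):
    # Collect the sub-state information held by all N ranks and merge it into one dictionary.
    gathered = [None] * world_size
    dist.all_gather_object(gathered, local_shard)  # every rank receives all N shards
    full_state = {}
    for shard in gathered:
        full_state.update(shard)  # assumes the shards hold disjoint keys
    return full_state
```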
Optionally, the storing the second state information in a distributed manner includes:
segmenting the second state information into N pieces of sub-state information;
and respectively storing the N pieces of sub-state information on the N devices.
Optionally, the processing module 12 is configured to use the deep learning training optimization library starter to rewrite a training entry of the speech recognition model training task.
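If the starter were, for example, the DeepSpeed launcher, rewriting the training entry would roughly amount to having the script's entry point accept the arguments the launcher injects; the script name, flags and defaults below are illustrative assumptions.

```python
# Hypothetical training entry rewritten for a DeepSpeed-style launcher.
# Launched e.g. as: deepspeed --num_gpus=N train_asr.py --deepspeed --deepspeed_config ds_config.json
import argparse
import deepspeed

def parse_args():
    parser = argparse.ArgumentParser(description="speech recognition model training entry")
    parser.add_argument("--local_rank", type=int, default=-1,
                        help="local device index, injected by the launcher")
    parser = deepspeed.add_config_arguments(parser)  # adds --deepspeed, --deepspeed_config, ...
    return parser.parse_args()
```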
Optionally, the multi-process information further includes: at least one of an identifier of each process of the N processes, an identifier of each device of the N devices, and an association relationship between the identifier of each process and the identifier of each device.
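For illustration, such multi-process information could be read from the environment variables a distributed launcher typically sets; the variable names follow the common torch.distributed convention and are assumptions here, not part of the embodiment.

```python
import os
import torch

def load_multiprocess_info():
    # Number of processes N, the identifier of this process, and the device it is bound to.
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    rank = int(os.environ.get("RANK", "0"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")
    return {"world_size": world_size, "rank": rank, "local_rank": local_rank, "device": device}
```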
The apparatus provided in this embodiment of the present application may implement the above method embodiments; for its specific implementation principles and technical effects, reference may be made to the method embodiments, and details are not described herein again.
Fig. 6 is a schematic block diagram of an electronic device 700 provided in an embodiment of the present application.
As shown in fig. 6, the electronic device 700 may include:
a memory 710 and a processor 720, the memory 710 being configured to store a computer program and to transfer the program code to the processor 720. In other words, the processor 720 may call and run a computer program from the memory 710 to implement the method in the embodiment of the present application.
For example, the processor 720 may be configured to perform the above-described method embodiments according to instructions in the computer program.
In some embodiments of the present application, the processor 720 may include, but is not limited to:
general purpose processors, digital Signal Processors (DSPs), application Specific Integrated Circuits (ASICs), field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, and the like.
In some embodiments of the present application, the memory 710 includes, but is not limited to:
volatile memory and/or non-volatile memory. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which serves as an external cache. By way of example, but not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced Synchronous DRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).
In some embodiments of the present application, the computer program may be partitioned into one or more modules, which are stored in the memory 710 and executed by the processor 720 to perform the methods provided herein. The one or more modules may be a series of computer program instruction segments capable of performing certain functions, the instruction segments being used to describe the execution of the computer program in the electronic device.
As shown in fig. 6, the electronic device may further include:
a transceiver 730, the transceiver 730 being connectable to the processor 720 or the memory 710.
The processor 720 may control the transceiver 730 to communicate with other devices; specifically, it may transmit information or data to other devices, or receive information or data transmitted by other devices. The transceiver 730 may include a transmitter and a receiver, and may further include one or more antennas.
It should be understood that the various components in the electronic device are connected by a bus system that includes a power bus, a control bus, and a status signal bus in addition to a data bus.
The present application also provides a computer storage medium having stored thereon a computer program which, when executed by a computer, enables the computer to perform the method of the above-described method embodiments. The present application further provides a computer program product containing instructions which, when executed by a computer, cause the computer to perform the method of the above-described method embodiments.
When implemented in software, the above embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of the present application are generated in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device. The computer instructions may be stored on a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a Digital Video Disc (DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the module is merely a logical division, and other divisions may be realized in practice, for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed coupling or direct coupling or communication connection between each other may be through some interfaces, indirect coupling or communication connection between devices or modules, and may be in an electrical, mechanical or other form.
Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. For example, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module.
The above description covers only specific embodiments of the present application, but the scope of the present application is not limited thereto. Any changes or substitutions that a person skilled in the art can readily conceive of within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (13)

1. A method for training a speech recognition model, comprising:
acquiring training data and the number N of devices to be used in distributed parallel training, wherein N is a positive integer;
packaging a speech recognition model training task by using a deep learning training optimization library starter;
initializing a distributed training starter, and loading multi-process information for distributed parallel training, wherein the multi-process information comprises the number of processes of the distributed parallel training, and the number of the processes of the distributed parallel training is N;
performing distributed parallel training of the speech recognition models of the N processes by using a deep learning training optimization library according to the training data and the multi-process information;
and determining the speech recognition model of any process as the speech recognition model obtained by training.
2. The method of claim 1, wherein the performing distributed parallel training of the N-process speech recognition models using a deep learning training optimization library based on the training data and the multi-process information comprises:
dividing the training data into N pieces of sub-training data by using a preset segmentation algorithm;
and performing distributed parallel training of the speech recognition models of the N processes by using the deep learning training optimization library based on the N pieces of sub-training data, wherein one piece of sub-training data is used for training the speech recognition model of one process.
3. The method of claim 2, wherein the speech recognition model for each of the N processes is model trained using a data-parallel based zero-redundancy optimizer.
4. The method of claim 3, wherein the speech recognition model for one of the N processes is trained based on one of the N pieces of sub-training data by:
initializing an interface of an engine of the deep learning training optimization library;
constructing a speech recognition model, an optimizer and a scheduler, loading an engine of the deep learning training optimization library, and encapsulating the constructed speech recognition model, the optimizer and the scheduler to obtain an encapsulated speech recognition model, an encapsulated optimizer and an encapsulated scheduler;
loading a pre-training model and loading the sub-training data;
loading a checkpoint, and storing first state information of the checkpoint in a distributed manner, wherein the first state information comprises state information of the encapsulated speech recognition model;
performing training iterations on the encapsulated speech recognition model by using a zero redundancy optimizer based on data parallelism according to the sub-training data, to obtain a trained speech recognition model of a process and second state information of the checkpoint;
and storing the second state information in a distributed manner, wherein the second state information comprises the state information of the trained speech recognition model of the process.
5. The method according to claim 4, wherein the state information of the encapsulated speech recognition model comprises optimization process information, or wherein the state information of the encapsulated speech recognition model comprises optimization process information and at least one of gradient information and parameter information.
6. The method of claim 4, wherein the storing the first state information of the checkpoint in a distributed manner comprises:
segmenting the first state information into N pieces of sub-state information;
and respectively storing the N pieces of sub-state information on the N devices.
7. The method of claim 6, wherein the performing training iterations on the encapsulated speech recognition model by using the zero redundancy optimizer based on data parallelism according to the sub-training data to obtain a trained speech recognition model of a process and second state information of the checkpoint comprises:
training the encapsulated speech recognition model according to the sub-training data;
in each iteration of training the encapsulated speech recognition model, performing forward propagation, backward propagation and parameter updating, and updating the piece of sub-state information stored on the local device;
determining that the training of the encapsulated speech recognition model is finished, to obtain the trained speech recognition model of the process;
acquiring the pieces of sub-state information respectively stored on the other N-1 devices to obtain N-1 pieces of sub-state information;
and aggregating the piece of sub-state information stored on the local device and the N-1 pieces of sub-state information to obtain the second state information of the checkpoint.
8. The method of claim 4, wherein the distributively storing the second state information comprises:
segmenting the second state information into N pieces of sub-state information;
and respectively storing the N pieces of sub-state information on the N devices.
9. The method of claim 1, wherein the packaging a speech recognition model training task by using a deep learning training optimization library starter comprises:
rewriting a training entry of the speech recognition model training task by using the deep learning training optimization library starter.
10. The method of claim 1, wherein
the multi-process information further comprises: at least one of an identifier of each process of the N processes, an identifier of each device of the N devices, and an association relationship between the identifier of each process and the identifier of each device.
11. A speech recognition model training apparatus, comprising:
the acquisition module is used for acquiring training data and the number N of devices to be used in distributed parallel training, wherein N is a positive integer;
the processing module is used for encapsulating the speech recognition model training task by using a deep learning training optimization library starter;
the loading module is used for initializing the distributed training starter and loading multi-process information for performing distributed parallel training, wherein the multi-process information comprises the number of processes of the distributed parallel training, and the number of the processes of the distributed parallel training is N;
the training module is used for performing distributed parallel training of the speech recognition models of the N processes by using a deep learning training optimization library according to the training data and the multi-process information;
and the determining module is used for determining the speech recognition model of any process as the trained speech recognition model.
12. An electronic device, comprising:
a processor and a memory, the memory for storing a computer program, the processor for invoking and executing the computer program stored in the memory to perform the method of any of claims 1-10.
13. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1-10.
CN202210843040.7A 2022-07-18 2022-07-18 Speech recognition model training method, device, equipment and storage medium Pending CN115240654A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210843040.7A CN115240654A (en) 2022-07-18 2022-07-18 Speech recognition model training method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210843040.7A CN115240654A (en) 2022-07-18 2022-07-18 Speech recognition model training method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115240654A true CN115240654A (en) 2022-10-25

Family

ID=83672994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210843040.7A Pending CN115240654A (en) 2022-07-18 2022-07-18 Speech recognition model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115240654A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115563508A (en) * 2022-11-08 2023-01-03 北京百度网讯科技有限公司 Model training method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination