CN115132186A - End-to-end speech recognition model training method, speech decoding method and related device - Google Patents

Info

Publication number
CN115132186A
CN115132186A (application CN202210893335.5A)
Authority
CN
China
Prior art keywords
sample
file
voice
recognition model
file block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210893335.5A
Other languages
Chinese (zh)
Inventor
黄宇鑫
周羊
张辉
陈晓杰
陈泽裕
文灿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210893335.5A priority Critical patent/CN115132186A/en
Publication of CN115132186A publication Critical patent/CN115132186A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/26 - Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The disclosure provides an end-to-end speech recognition model training method, a speech decoding method, corresponding apparatuses, an electronic device and a computer-readable storage medium, relating to artificial intelligence fields such as speech recognition, natural language processing and deep learning. One embodiment of the method comprises: obtaining a plurality of sample voice files and packing each sample voice file into a sample file block; generating address information of the sample file blocks; reading the address information with a data loader to generate a batch data set; and training an initial end-to-end speech recognition model based on the batch data set to obtain an end-to-end speech recognition model. Applying the method improves training efficiency when the initial end-to-end speech recognition model is trained and improves the quality of the trained end-to-end speech recognition model.

Description

End-to-end speech recognition model training method, speech decoding method and related device
Technical Field
The present disclosure relates to artificial intelligence fields such as speech recognition, natural language processing and deep learning, and in particular to methods for training an end-to-end speech recognition model and for decoding speech, together with the corresponding apparatuses, an electronic device and a computer-readable storage medium.
Background
Benefiting from breakthroughs in artificial intelligence and machine learning, improvements in algorithms and hardware/software capability, and the availability of diverse, massive voice databases for training large-scale, multi-parameter speech recognition and synthesis models, speech processing technology has made leaps of progress.
With the development of end-to-end neural networks in machine translation, speech generation and the like, end-to-end speech recognition has also achieved performance comparable to that of conventional approaches. Unlike the traditional method of decomposing a speech recognition task into several subtasks (lexicon model, acoustic model and language model), an end-to-end speech recognition model is usually built on neural-network-based Connectionist Temporal Classification (CTC for short) and, taking a Mel spectrogram as input, can directly generate the corresponding natural language text. This greatly simplifies the training process of the model and has drawn growing attention from both academia and industry.
Disclosure of Invention
The embodiment of the disclosure provides a method and a device for training an end-to-end speech recognition model and decoding speech, an electronic device, a computer readable storage medium and a computer program product.
In a first aspect, an embodiment of the present disclosure provides an end-to-end speech recognition model training method, including: acquiring a plurality of sample voice files and packing each sample voice file into a sample file block; generating address information of the sample file block; reading the address information with a data loader to generate a batch data set; and training an initial end-to-end speech recognition model based on the batch data set to obtain an end-to-end speech recognition model.
In a second aspect, an embodiment of the present disclosure provides an end-to-end speech recognition model training apparatus, including: a sample obtaining and packing unit comprising a sample obtaining subunit configured to obtain a plurality of sample voice files and a sample packing subunit configured to pack each sample voice file into a sample file block; an address information generating unit configured to generate address information of the sample file block; a batch data set generating unit configured to read the address information using a data loader and generate a batch data set; and a model training unit configured to train an initial end-to-end speech recognition model based on the batch data set to obtain an end-to-end speech recognition model.
In a third aspect, an embodiment of the present disclosure provides a speech decoding method, including: reading a voice file in a streaming manner; and in response to the reading duration of the voice file meeting a preset time threshold, inputting the read target voice file into an end-to-end speech recognition model for processing and generating a decoding result corresponding to the target voice file, wherein the end-to-end speech recognition model is obtained by the end-to-end speech recognition model training method described in any implementation of the first aspect.
In a fourth aspect, an embodiment of the present disclosure provides a speech decoding apparatus, including: a voice file stream reading unit configured to read a voice file in a streaming manner; and a voice decoding unit configured to, in response to the reading duration of the voice file meeting a preset time threshold, input the read target voice file into an end-to-end speech recognition model for processing and generate a decoding result corresponding to the target voice file, wherein the end-to-end speech recognition model is obtained by the end-to-end speech recognition model training apparatus described in any implementation of the second aspect.
In a fifth aspect, an embodiment of the present disclosure provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the end-to-end speech recognition model training method described in any implementation of the first aspect or the speech decoding method described in any implementation of the third aspect.
In a sixth aspect, the disclosed embodiments provide a non-transitory computer-readable storage medium storing computer instructions for enabling a computer to implement an end-to-end speech recognition model training method as described in any implementation manner of the first aspect or a speech decoding method as described in any implementation manner of the third aspect when executed.
In a seventh aspect, the disclosed embodiments provide a computer program product comprising a computer program, which when executed by a processor is capable of implementing the end-to-end speech recognition model training method as described in any of the implementations of the first aspect or the speech decoding method as described in any of the implementations of the third aspect.
The method for training and decoding the end-to-end voice recognition model provided by the embodiment of the disclosure can reduce the use of cache resources when the initial end-to-end voice recognition model is trained, and can improve the training efficiency when the initial end-to-end voice recognition model is trained, so that the training of the initial end-to-end model by using a large-scale sample voice file becomes possible, and the quality of the trained end-to-end voice recognition model is improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture to which the present disclosure may be applied;
FIG. 2 is a flowchart of an end-to-end speech recognition model training method provided by an embodiment of the present disclosure;
fig. 3 is a flowchart of an implementation manner for obtaining a sample file block in another end-to-end speech recognition model training method provided by the embodiment of the present disclosure;
fig. 4a and 4b are flowcharts of an end-to-end speech recognition model training method and a speech decoding method in a specific application scenario provided by the embodiment of the present disclosure;
FIG. 5 is a block diagram of an end-to-end speech recognition model training apparatus according to an embodiment of the present disclosure;
fig. 6 is a block diagram of a speech decoding apparatus according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an electronic device adapted to perform an end-to-end speech recognition model training method and/or a speech decoding method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness. It should be noted that, in the present disclosure, the embodiments and the features of the embodiments may be combined with each other without conflict.
In the technical scheme of the present disclosure, the acquisition, storage and application of the personal information of the users involved all comply with relevant laws and regulations, necessary security measures are taken, and public order and good customs are not violated.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the present methods, apparatuses, electronic devices and computer-readable storage media for training an end-to-end speech recognition model and decoding speech may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 and the server 105 may be installed with various applications for communicating information therebetween, such as a speech recognition application, an online translation application, and the like.
The terminal apparatuses 101, 102, 103 and the server 105 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices with display screens, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like; when the terminal devices 101, 102, and 103 are software, they may be installed in the electronic devices listed above, and they may be implemented as multiple software or software modules, or may be implemented as a single software or software module, and are not limited in this respect. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or may be implemented as a single server; when the server is software, the server may be implemented as a plurality of software or software modules, or may be implemented as a single software or software module, which is not limited herein.
The server 105 can provide various services through built-in applications. Taking a speech recognition application that provides speech recognition services to users as an example, the server 105 can achieve the following effects when running it: the server 105 reads the voice file uploaded by the user in a streaming manner; whenever the reading duration of the voice file meets the preset time threshold, it inputs the read target voice file into an end-to-end speech recognition model for processing, continuously generating a decoding result for each target voice file segment; the decoding results are then collected and returned to the user as the speech recognition result for the content of the voice file.
The end-to-end speech recognition model can be obtained by the end-to-end speech recognition model training application built in the server 105 according to the following steps: the server 105 obtains a plurality of sample voice files, packs each sample voice file into a sample file block, then the server 105 generates address information of the sample file block, next, the server 105 reads the address information by using a data loader, generates a batch data set, and finally, the server 105 trains an initial end-to-end voice recognition model based on the batch data set to obtain an end-to-end voice recognition model.
Since the end-to-end speech recognition model obtained by training needs to occupy more computation resources and stronger computation capability, the end-to-end speech recognition model training method provided in the following embodiments of the present application is generally executed by the server 105 having stronger computation capability and more computation resources, and accordingly, the end-to-end speech recognition model training apparatus is generally also disposed in the server 105. However, it should be noted that when the terminal devices 101, 102, and 103 also have computing capabilities and computing resources meeting the requirements, the terminal devices 101, 102, and 103 may also complete the above-mentioned operations performed by the server 105 through the training application of the end-to-end speech recognition model installed thereon, and then output the same result as the server 105. Correspondingly, the end-to-end speech recognition model training device may also be disposed in the terminal equipment 101, 102, 103. In such a case, the exemplary system architecture 100 may also not include the server 105 and the network 104.
Of course, the server used to train the derived end-to-end speech recognition model may be different from the server used to invoke the trained end-to-end speech recognition model. Specifically, the end-to-end speech recognition model obtained through the training of the server 105 may also obtain a lightweight end-to-end speech recognition model suitable for being embedded in the terminal devices 101, 102, and 103 in a model distillation manner, that is, the lightweight end-to-end speech recognition model in the terminal devices 101, 102, and 103 may be flexibly selected and used according to the recognition accuracy of the actual requirement, or a more complex end-to-end speech recognition model in the server 105 may be selected and used.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring to fig. 2, fig. 2 is a flowchart of an end-to-end speech recognition model training method according to an embodiment of the present disclosure, wherein the process 200 includes the following steps:
step 201, obtaining a plurality of sample voice files, and packaging each sample voice file into a sample file block.
In this embodiment, an execution body of the end-to-end speech recognition model training method (for example, the server 105 shown in fig. 1) obtains a plurality of sample voice files. A sample voice file generally contains audio and the text data corresponding to that audio, so that the audio can serve as input and the corresponding text as output when training the initial end-to-end speech recognition model. After obtaining the sample voice files, the execution body packs each of them into a sample file block; the sample file block may be configured in advance, or may be created by packing and encapsulating at least two sample voice files once they have been obtained.
It should be understood that, in the case where there is a difference in the acquisition time, the lot, and the like of the sample voice file, a new sample file block may be formed by packing a sample voice file acquired later into a sample file block generated based on a sample voice file acquired earlier, or may be formed only by packing and encapsulating a sample voice file acquired later.
In some embodiments, the sample file block includes at least one of: a Tar-format file, an Npz-format file, a Hierarchical Data Format version 5 file and a Pickle-format file. The Tar format was originally designed to back up files onto tape (Tape Archive, hence the name Tar); it can combine multiple files into a single file whose suffix is also ".tar", and a sample file block in this format runs stably, can be used directly under systems such as Linux, and is fast and reliable.
An Npz-format file is NumPy's compressed archive format for offline datasets that can be imported locally and directly; the handwritten-digit dataset used as a deep learning starter project, for instance, is distributed this way, with training data, test data and the corresponding digit labels already split, and every digit image can be rendered in Python.
A Hierarchical Data Format version 5 (Hdf5) file is a cross-platform data storage file that can store images and data of different types, can be transferred between different kinds of machines, and has a function library for handling the format uniformly.
A Pickle-format file is a serialized storage file that can hold temporary variables produced during a Python project, or data such as strings, lists and dictionaries that need to be extracted later and stored temporarily.
It should be noted that the sample voice file may be directly obtained from a local storage device by the execution subject, or may be obtained from a non-local storage device (for example, terminal devices 101, 102, and 103 shown in fig. 1). The local storage device may be a data storage module arranged in the execution main body, such as a server hard disk, in which case the sample voice file can be quickly read locally; the non-local storage device may also be any other electronic device configured to store data, such as some user terminals, in which case the executing entity may obtain the required sample voice file by sending a obtaining command to the electronic device.
Step 202, generating address information of the sample file block.
In this embodiment, after the sample file blocks are generated based on the step 201, the sample file blocks are stored respectively, and corresponding address information is generated based on the storage location of each sample file block and then recorded, so that the corresponding sample file blocks can be called directly based on the address information.
In practice, after generating the address information of each sample file block, a directory file for recording the address information, such as Manifest, may also be generated so as to read the address information of each sample file block using the directory file.
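For illustration, the following is a minimal sketch of steps 201 and 202 in Python, assuming the sample voice files exist as WAV/transcript pairs on local disk; the file layout, shard size and manifest format are illustrative assumptions, not part of the disclosure.
```python
import json
import tarfile
from pathlib import Path

SHARD_SIZE = 1000  # preset number of sample voice files per sample file block

def pack_shards(samples, out_dir):
    """Pack (wav_path, txt_path) pairs into Tar-format sample file blocks
    and write a Manifest recording each block's address information."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = []
    for start in range(0, len(samples), SHARD_SIZE):
        group = samples[start:start + SHARD_SIZE]
        tar_path = out_dir / f"shard_{start // SHARD_SIZE:05d}.tar"
        with tarfile.open(tar_path, "w") as tar:
            for wav_path, txt_path in group:
                tar.add(wav_path, arcname=Path(wav_path).name)
                tar.add(txt_path, arcname=Path(txt_path).name)
        manifest.append({"block": str(tar_path), "num_samples": len(group)})
    with open(out_dir / "manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)  # directory file of address information
```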
In step 203, the address information is read by the data loader to generate a batch of data sets.
In this embodiment, after the sample file blocks to be used for training are determined, a data loader (Dataloader) obtains the address information of each sample file block, retrieves the corresponding sample file block based on that address information, reads the sample voice files the block contains, and joins the sample voice files into a batch data set (Batch).
In practice, when there are multiple dataloaders and multiple sample file blocks, in order to improve reading efficiency, multiple dataloaders are used to process the sample file blocks, that is, different sample file blocks are sent to each Dataloader in parallel to be read, and the reading results of each Dataloader are summarized to generate the Batch.
In some embodiments, the Dataloader may also be configured to pre-load the address information of a preset number of sample file blocks, so as to avoid stalls at the start of reading caused by inefficient address-information reading in the initial stage.
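A sketch of step 203 follows, using PyTorch-style data loading as an assumed framework (the disclosure does not name one): each loader worker reads a disjoint subset of the sample file blocks listed in the Manifest, the read results are joined into batches, and prefetch_factor approximates the pre-loading described above.
```python
import json
import tarfile
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class TarShardDataset(IterableDataset):
    """Yields (audio_bytes, transcript) pairs from Tar-format sample file blocks."""
    def __init__(self, manifest_path):
        with open(manifest_path) as f:
            self.blocks = [entry["block"] for entry in json.load(f)]

    def __iter__(self):
        info = get_worker_info()
        # Send different sample file blocks to each loader worker in parallel.
        blocks = self.blocks if info is None else self.blocks[info.id::info.num_workers]
        for block in blocks:
            with tarfile.open(block) as tar:
                members = tar.getmembers()
                # Pair each .wav with its .txt transcript (naming scheme assumed).
                wavs = {m.name[:-4]: m for m in members if m.name.endswith(".wav")}
                txts = {m.name[:-4]: m for m in members if m.name.endswith(".txt")}
                for key in wavs.keys() & txts.keys():
                    audio = tar.extractfile(wavs[key]).read()
                    text = tar.extractfile(txts[key]).read().decode("utf-8")
                    yield audio, text

loader = DataLoader(TarShardDataset("shards/manifest.json"), batch_size=16,
                    num_workers=4, prefetch_factor=2,  # pre-load ahead of use
                    collate_fn=lambda batch: batch)    # the batch data set (Batch)
```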
And step 204, training the initial end-to-end voice recognition model based on the batch data set to obtain an end-to-end voice recognition model.
In this embodiment, the batch data set obtained in step 203 is utilized, and the audio in the sample voice file included in the batch data set is continuously used as an input and the corresponding text is used as an output, so as to train the initial end-to-end voice recognition model, and obtain the end-to-end voice recognition model.
In practice, the initial end-to-end speech recognition model may be constructed from one of DeepSpeech2, Transformer and Conformer. DeepSpeech2 is a non-autoregressive model, while Transformer and Conformer are autoregressive models whose forward inference also supports non-autoregressive operation, and each of them can be decoded with Prefix Beam Search. The Transformer, as a deep learning model, has been widely verified to perform better in the speech model field than models based on Recurrent Neural Networks (RNN), and the Conformer is a model obtained by combining the Transformer with convolutional neural networks.
DeepSpeech2 extracts audio features by the linear spectrogram method, uses a 2-layer convolutional neural network as the model's downsampling module, and attaches a multi-layer RNN as the encoder module (Encoder); the RNN uses a unidirectional structure, so streaming recognition can be supported. The output of the Encoder is passed through a softmax layer to obtain output probabilities, and the output probabilities are decoded by a decoding layer to obtain the final decoding result.
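The following is a simplified DeepSpeech2-style model sketch in PyTorch (an assumed framework; layer sizes and kernel shapes are illustrative): a 2-layer convolutional downsampling module, a stack of unidirectional RNN layers as the Encoder, and a (log-)softmax over the output tokens suitable for CTC training.
```python
import torch.nn as nn

def _out(n, k, s, p):
    """Output length of a convolution (standard formula)."""
    return (n + 2 * p - k) // s + 1

class DeepSpeech2Like(nn.Module):
    def __init__(self, n_freq=161, n_rnn_layers=5, hidden=512, vocab_size=4000):
        super().__init__()
        self.conv = nn.Sequential(  # 2-layer CNN downsampling module
            nn.Conv2d(1, 32, (11, 41), stride=(2, 2), padding=(5, 20)), nn.ReLU(),
            nn.Conv2d(32, 32, (11, 21), stride=(1, 2), padding=(5, 10)), nn.ReLU(),
        )
        freq = _out(_out(n_freq, 41, 2, 20), 21, 2, 10)
        # Unidirectional RNN keeps the Encoder usable for streaming recognition.
        self.rnn = nn.GRU(32 * freq, hidden, num_layers=n_rnn_layers, batch_first=True)
        self.fc = nn.Linear(hidden, vocab_size + 1)  # +1 for the CTC blank token

    def forward(self, spectrogram):              # (batch, time, freq)
        x = self.conv(spectrogram.unsqueeze(1))  # (batch, 32, time', freq')
        x = x.permute(0, 2, 1, 3).flatten(2)     # (batch, time', 32 * freq')
        x, _ = self.rnn(x)
        return self.fc(x).log_softmax(-1)        # frame-wise token probabilities
```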
The structure of the Conformer can basically be divided into an Encoder and a Decoder module (Decoder). During training, both the Encoder and the Decoder carry loss functions (Loss): the Encoder uses the Connectionist Temporal Classification loss (CTC Loss), and the Decoder uses a Label Smoothing Loss. The Conformer's Encoder comprises a subsampling layer, a positional encoding layer and multiple Conformer layers, each layer consisting mainly of four parts, namely a feed-forward (Feed Forward) layer, a multi-head attention (Multi-head Attention) layer, a convolution module and another feed-forward layer, and each part avoids vanishing gradients by means of a residual connection. The Conformer's Decoder uses the Transformer's decoder module, which mainly comprises a decoder input layer (Output Embedding), a position embedding (Position Embedding) layer and multiple Transformer layers, each containing a masked multi-head attention (Masked Multi-head Attention) layer, a multi-head attention layer and a feed-forward layer.
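A sketch of one such Conformer encoder layer follows: feed-forward, multi-head attention, convolution module and feed-forward, each sub-block added back through a residual connection. The half-step feed-forward scaling, dimensions and activation choices are assumptions for illustration.
```python
import torch.nn as nn

class ConformerLayer(nn.Module):
    """One Conformer encoder layer: FF, MHA, Conv module, FF,
    each sub-block added back through a residual connection."""
    def __init__(self, d_model=256, n_heads=4, conv_kernel=15, ff_dim=1024):
        super().__init__()
        self.ff1 = self._feed_forward(d_model, ff_dim)
        self.norm_att = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_conv = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(  # depthwise convolution module over time
            nn.Conv1d(d_model, d_model, conv_kernel, padding=conv_kernel // 2,
                      groups=d_model),
            nn.BatchNorm1d(d_model), nn.SiLU(),
            nn.Conv1d(d_model, d_model, 1),
        )
        self.ff2 = self._feed_forward(d_model, ff_dim)
        self.norm_out = nn.LayerNorm(d_model)

    @staticmethod
    def _feed_forward(d_model, ff_dim):
        return nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, ff_dim),
                             nn.SiLU(), nn.Linear(ff_dim, d_model))

    def forward(self, x):                      # x: (batch, time, d_model)
        x = x + 0.5 * self.ff1(x)              # residual connections throughout
        a = self.norm_att(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        c = self.norm_conv(x).transpose(1, 2)  # (batch, d_model, time) for Conv1d
        x = x + self.conv(c).transpose(1, 2)
        x = x + 0.5 * self.ff2(x)
        return self.norm_out(x)
```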
The end-to-end speech recognition model training method provided by the embodiment of the disclosure can reduce the use of cache resources when the initial end-to-end speech recognition model is trained, and can improve the training efficiency when the initial end-to-end speech recognition model is trained, so that the training of the initial end-to-end model by using a large-scale sample speech file becomes possible, and the quality of the trained end-to-end speech recognition model is improved.
In some optional implementation manners of this embodiment, the obtaining a plurality of sample voice files and packaging each sample voice file into a sample file block includes: acquiring a plurality of sample voice files and generating a plurality of sample voice file sets, wherein the sample voice file sets comprise a preset number of sample voice files; and respectively packaging and processing each sample voice file set to generate a sample file block corresponding to each sample voice file set.
Specifically, after a plurality of sample voice files are obtained, they are grouped by a preset number into sample voice file sets, each set containing the preset number of sample voice files. Sample file blocks are then packed from the sets, so that the blocks can be generated quickly and uniformly from the sample voice files, and differences in the number of sample voice files per block do not affect how the blocks can be used.
In practice, in the process of generating the sample voice file set based on the sample voice files, whether the same sample voice file can be reused or not, that is, whether the same sample voice file can be included in different sample voice file sets or not, can be set according to actual requirements.
In some optional implementations of this embodiment, the method further includes: pre-configuring at least one sample file block; and correspondingly adding a type list to the sample file block, wherein the type list is used for marking the type of the sample voice file existing in the corresponding sample file block.
Specifically, a sample file block built from sample voice files in historical data can be configured or obtained in advance, and a type list is added to it that records the types of the sample voice files it contains. The type list makes it possible to find sample file blocks of the same type as the training purpose and to adjust the storage strategy for sample voice files, improving the quality of the sample file blocks.
In practice, the type of the sample voice file may be based on the language scene, the language type, the timbre type, etc. corresponding to the sample voice file.
In some optional implementations of this embodiment, the method further includes: obtaining a first number of data loaders used to generate the batch data set and a second number of the sample file blocks; in response to the second number not being an integer multiple of the first number, obtaining a third number and a fourth number that are both integer multiples of the first number, the third number being the nearest such number below the second number on the number axis and the fourth number the nearest such number above it; acquiring a first number difference between the second number and the third number and a second number difference between the fourth number and the second number; and in response to the first number difference being less than the second number difference, deleting the first number difference of sample file blocks.
Specifically, when multiple Dataloaders are used to process the sample file blocks, in order to ensure that the blocks can be distributed evenly across the Dataloaders, the first number of Dataloaders and the second number of sample file blocks are obtained. When the second number is not an integer multiple of the first number, the nearest integer multiple of the first number below the second number (the third number) and the nearest one above it (the fourth number) are obtained, along with the first number difference between the second and third numbers and the second number difference between the fourth and second numbers. When the first number difference is smaller, that many sample file blocks are deleted, shrinking the existing blocks so that every Dataloader receives an equal share.
In some optional implementations of this embodiment, the method further includes: in response to the first number difference being larger than the second number difference, selecting the second number difference of base sample file blocks from the sample file blocks and copying each base sample file block a single time.
Specifically, when the first number difference is larger than the second number difference, the second-number-difference of sample file blocks are randomly selected from the existing blocks as base sample file blocks, and each base sample file block is copied a single time. The existing sample file blocks are thus adjusted by padding, again ensuring that the Dataloaders can distribute the sample file blocks evenly.
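The two adjustment rules above amount to rounding the block count to the nearest multiple of the loader count, as in this sketch (a tie between the two differences is resolved toward duplication here, a detail the description leaves open):
```python
import random

def balance_blocks(blocks, num_loaders):
    second = len(blocks)                      # second number: block count
    if second % num_loaders == 0:             # already an integer multiple
        return blocks
    third = (second // num_loaders) * num_loaders  # nearest multiple below
    fourth = third + num_loaders                   # nearest multiple above
    diff_down = second - third                     # first number difference
    diff_up = fourth - second                      # second number difference
    if diff_down < diff_up:
        return blocks[:third]                 # delete diff_down blocks
    base = random.sample(blocks, diff_up)     # base sample file blocks
    return blocks + base                      # copy each base block a single time
```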
Referring to fig. 3, fig. 3 is a flow 300 of a specific implementation of step 201 in the flow 200 of fig. 2, for the case where sample file blocks and their corresponding type lists are configured in advance. The other steps of the flow 200 are unchanged; replacing step 201 with the implementation provided here yields a new complete embodiment. The flow 300 comprises the following steps:
step 301, a plurality of sample voice files and a first type of each sample voice file are obtained.
In this embodiment, the first type of each sample voice file is obtained while obtaining a plurality of sample voice files.
Step 302, determining the second type missing in the sample file block based on the type list.
In the present embodiment, the second type missing in each sample file block is determined based on the type list of each sample file block, that is, the sample voice file of the second type does not exist in the current sample file block.
Step 303, in response to the first type of the sample voice file matching the second type, packaging the sample voice file into a sample file block.
In this embodiment, when the first type of the sample voice file matches the second type missing in the sample file block, the sample voice file is correspondingly stored in the sample file block.
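A sketch of the flow 300 follows, assuming each pre-configured sample file block carries a type list of the sample voice file types it already contains and that the universe of types is known in advance; all names are illustrative.
```python
ALL_TYPES = {"mandarin", "cantonese", "english"}   # assumed type universe

def pack_by_missing_type(samples, blocks):
    """samples: (voice_file, first_type) pairs (step 301);
    blocks: dicts with a 'files' list and a 'type_list' set."""
    for voice_file, first_type in samples:
        for block in blocks:
            missing = ALL_TYPES - block["type_list"]   # step 302: second types
            if first_type in missing:                  # step 303: types match
                block["files"].append(voice_file)
                block["type_list"].add(first_type)
                break
```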
Based on the embodiment shown in fig. 2, this embodiment further determines, from the type list, which type of sample voice file each sample file block lacks and stores sample voice files of the missing type into that block, broadening the categories of sample voice files stored in a sample file block and increasing its usefulness.
In practice, if the intention is to train the initial end-to-end speech recognition model with sample voice files of the same type, files are instead stored according to whether the similarity between a sample voice file's first type and the types recorded in a block's type list meets a preset threshold.
On the basis of any one of the above embodiments, the method further comprises the following steps: obtaining an intermediate decoding result in the training process of the initial end-to-end voice recognition model; processing the intermediate decoding result by using a scene semantic model to generate a score of the intermediate decoding result, wherein the scene semantic model is used for generating a score of semantic quality of the input content under a language scene corresponding to the scene semantic model; and responding to the condition that the score sum of the intermediate decoding results exceeds a preset score threshold value, and adding an available scene label of the language scene corresponding to the scene semantic model for the end-to-end voice recognition model.
Specifically, during the training of the initial end-to-end speech recognition model (taking a single training pass as an example), the intermediate decoding results generated inside the model while audio from a sample voice file serves as input and the corresponding text data as output are collected. Each intermediate result is then processed by a scene semantic model, which scores the semantic quality of input content in the language scene corresponding to that model, producing a score for the intermediate decoding result. After training finishes and the end-to-end speech recognition model is obtained, the scores of the intermediate decoding results are summed; when the sum exceeds the score threshold, an available scene label corresponding to the scene semantic model is added to the end-to-end speech recognition model, so that the label indicates the language scenes in which the model can be used.
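For illustration, a sketch of this scene-labelling step, assuming each scene semantic model exposes a score(text) method returning the semantic-quality score of the input text; the interface is an assumption.
```python
def label_available_scenes(intermediate_results, scene_models, score_threshold):
    """intermediate_results: decoding texts collected during training;
    scene_models: {scene_name: model with a score(text) method}."""
    labels = []
    for scene, model in scene_models.items():
        total = sum(model.score(text) for text in intermediate_results)
        if total > score_threshold:      # score sum exceeds the preset threshold
            labels.append(scene)         # available scene label for this scene
    return labels
```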
The foregoing embodiments describe from various aspects how the end-to-end speech recognition model is trained. To highlight as much as possible the effect of the trained model in an actual usage scenario, the present disclosure also provides a scheme that applies the trained end-to-end speech recognition model to a concrete problem. A speech decoding method includes the following steps:
reading a voice file in a streaming manner; and, whenever the reading duration of the voice file meets a preset time threshold, inputting the read target voice file into an end-to-end speech recognition model for processing and generating a decoding result corresponding to the target voice file, where the end-to-end speech recognition model can be trained by the end-to-end speech recognition model training method of any of the embodiments above.
When the voice file is read in a streaming manner and target voice files are generated, if there are multiple target voice files, the starting point of the next target voice file may be the end point of the previous target voice file, or, preferably, a point a preset duration before that end point, so that recognition quality at the junction of two target voice file segments is not degraded by streaming-read delays and the like.
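A sketch of this streaming decode loop, assuming raw PCM samples and a model exposing a decode(chunk) method; window and overlap lengths are illustrative, standing in for the preset time threshold and preset duration:
```python
def stream_decode(pcm, model, sample_rate=16000, window_s=2.0, overlap_s=0.2):
    """pcm: raw audio samples; model.decode(chunk) is an assumed interface."""
    window = int(window_s * sample_rate)  # stands in for the preset time threshold
    step = window - int(overlap_s * sample_rate)  # next segment starts early
    results, start = [], 0
    while start < len(pcm):
        target = pcm[start:start + window]        # target voice file 1, 2, 3, ...
        results.append(model.decode(target))
        start += step
    return "".join(results)
```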
In order to deepen understanding, the disclosure also provides a specific implementation scheme by combining a specific application scenario:
As shown in fig. 4a, a plurality of sample voice files (sample voice file 1, sample voice file 2, sample voice file 3, ...) are acquired, each sample voice file is packed into a sample file block (Tar1, Tar2, Tar3, ...), address information of each sample file block is generated, and the location of each sample file block is then recorded using a Manifest.
Then, after the Dataloader locates the various sample file blocks using the Manifest, a batch data set is generated to enable training of an initial end-to-end speech recognition model constructed based on DeepSpeech2.
After the training is completed and an end-to-end speech recognition model is obtained, a voice file is read in a streaming manner. Whenever the reading duration meets the preset time threshold, the portion read so far is emitted and recorded as target voice file 1; reading then continues from a point a preset duration before the end of target voice file 1, sequentially producing target voice file 2, target voice file 3, ..., each with the same duration as target voice file 1.
Target voice files 1, 2, 3, ... are input in turn into the end-to-end speech recognition model and processed sequentially. As shown in fig. 4b, a feature extraction layer built on linear spectrograms extracts features from the raw audio in the voice files, a 2-layer convolutional neural network serves as the model's downsampling module, and multiple RNN layers are attached as the Encoder, using a unidirectional RNN structure to support streaming recognition.
Finally, the output of the Encoder passes through a Softmax layer to obtain output probabilities, which the decoding layer decodes into the final decoding result.
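For illustration, the simplest way to turn the per-frame probabilities from the Softmax layer into text is a greedy CTC collapse (the disclosure also mentions Prefix Beam Search; greedy decoding is shown here only as a minimal stand-in):
```python
import torch

def greedy_ctc_decode(log_probs, vocab, blank_id=0):
    """log_probs: (time, vocab) frame-wise log-probabilities from the model."""
    ids = log_probs.argmax(dim=-1).tolist()
    out, prev = [], blank_id
    for i in ids:
        if i != prev and i != blank_id:   # collapse repeats, then drop blanks
            out.append(vocab[i])
        prev = i
    return "".join(out)
```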
With further reference to fig. 5 and fig. 6, as implementations of the methods shown in the above-mentioned figures, the present disclosure provides an embodiment of an end-to-end speech recognition model training apparatus and an embodiment of a speech decoding apparatus, respectively, where the embodiment of the end-to-end speech recognition model training apparatus corresponds to the embodiment of the end-to-end speech recognition model training method shown in fig. 2, and the embodiment of the speech decoding apparatus corresponds to the embodiment of the speech decoding method. The device can be applied to various electronic equipment.
As shown in fig. 5, the end-to-end speech recognition model training apparatus 500 of the present embodiment may include: a sample acquiring and packing unit 501, an address information generating unit 502, a batch data set generating unit 503, and a model training unit 504. The sample acquiring and packing unit 501 includes a sample acquiring subunit configured to acquire a plurality of sample voice files and a sample packing subunit configured to pack each sample voice file into a sample file block; the address information generating unit 502 is configured to generate address information of the sample file block; the batch data set generating unit 503 is configured to read the address information using a data loader and generate a batch data set; and the model training unit 504 is configured to train an initial end-to-end speech recognition model based on the batch data set to obtain an end-to-end speech recognition model.
In the present embodiment, in the end-to-end speech recognition model training apparatus 500: the detailed processing and the technical effects of the sample obtaining and packing unit 501, the address information generating unit 502, the batch data set generating unit 503 and the model training unit 504 can refer to the related descriptions of step 201 and step 204 in the corresponding embodiment of fig. 2, and are not described herein again.
In some optional implementations of this embodiment, the sample obtaining and packing unit 501 includes: the sample acquiring subunit is further configured to acquire a plurality of sample voice files, and generate a plurality of sample voice file sets, wherein the sample voice file sets include a preset number of sample voice files; the sample packing subunit is further configured to separately pack and process each sample voice file set, and generate a sample file block corresponding to each sample voice file set.
In some optional implementations of this embodiment, the end-to-end speech recognition model training apparatus 500 may further include: a file block configuration unit configured to pre-configure at least one of the sample file blocks; and the type list adding unit is configured to correspondingly add a type list for the sample file block, wherein the type list is used for marking the type of the sample voice file existing in the corresponding sample file block.
In some optional implementations of this embodiment, the sample obtaining and packing unit 501 includes: the sample acquiring subunit is further configured to acquire a plurality of sample voice files and a first type of each sample voice file; the sample packing subunit is further configured to determine, based on the list of types, a second type missing in the sample file block; in response to the first type of the sample voice file matching the second type, the sample voice file is packaged into sample file blocks.
In some optional implementations of this embodiment, the end-to-end speech recognition model training apparatus 500 may further include: an intermediate result obtaining unit configured to obtain an intermediate decoding result in the initial end-to-end speech recognition model training process; an intermediate result scoring unit configured to process the intermediate decoding result using a scene semantic model for generating a score of semantic quality of the input content in a language scene corresponding to the scene semantic model to generate a score of the intermediate decoding result; and the scene label adding unit is configured to add an available scene label of the language scene corresponding to the scene semantic model to the end-to-end voice recognition model in response to the sum of the scores of the intermediate decoding results exceeding a preset score threshold value.
In some optional implementations of this embodiment, the end-to-end speech recognition model training apparatus 500 may further include: a loader number and file block number obtaining unit including a loader number obtaining subunit configured to obtain a first number of data loaders used to generate the batch data set and a file block number obtaining subunit configured to obtain a second number of the sample file blocks; a multiple number acquisition unit configured to, in response to the second number not being an integer multiple of the first number, acquire a third number and a fourth number that are both integer multiples of the first number, the third number being the nearest such number below the second number on the number axis and the fourth number the nearest such number above it; a number difference acquisition unit configured to acquire a first number difference between the second number and the third number and a second number difference between the fourth number and the second number; and a first sample file block adjustment unit configured to delete the first number difference of sample file blocks in response to the first number difference being smaller than the second number difference.
In some optional implementations of this embodiment, the end-to-end speech recognition model training apparatus 500 may further include: a second sample file block adjustment unit configured to, in response to the first number difference being greater than the second number difference, select the second number difference of base sample file blocks from the sample file blocks and copy each base sample file block a single time.
In some optional implementations of this embodiment, the sample file block includes at least one of: a Tar-format file, an Npz-format file, a Hierarchical Data Format version 5 file and a Pickle-format file.
As shown in fig. 6, the speech decoding apparatus 600 of the present embodiment may include: a voice file stream reading unit 601 and a voice decoding unit 602. The voice file stream reading unit 601 is configured to read a voice file in a stream mode; a voice decoding unit 602, configured to, in response to that a reading time length for reading the voice file meets a preset time threshold requirement, input the read target voice file into an end-to-end voice recognition model for processing, and generate a decoding result corresponding to the target voice file; wherein, the end-to-end speech recognition model is obtained according to the end-to-end speech recognition model training device 500.
In the present embodiment, speech decoding apparatus 600: the specific processing of the voice file stream reading unit 601 and the voice decoding unit 602 and the technical effects brought by the processing can correspond to the related descriptions in the method embodiments, which are not described herein again.
The present embodiment exists as an apparatus embodiment corresponding to the above method embodiment, and the end-to-end speech recognition model training apparatus and the speech decoding apparatus provided in the present embodiment can reduce the use of cache resources when an initial end-to-end speech recognition model is trained, and can improve the training efficiency when the initial end-to-end speech recognition model is trained, so that it is possible to train the initial end-to-end model using a large-scale sample speech file, and further improve the quality of the end-to-end speech recognition model obtained by training.
According to an embodiment of the present disclosure, the present disclosure also provides an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method for training an end-to-end speech recognition model and/or the method for speech decoding described in any of the above embodiments when executed by the at least one processor.
According to an embodiment of the present disclosure, there is also provided a readable storage medium storing computer instructions for enabling a computer to implement the end-to-end speech recognition model training method and/or the speech decoding method described in any of the above embodiments when executed.
The embodiments of the present disclosure provide a computer program product, which when executed by a processor can implement the end-to-end speech recognition model training method and/or the speech decoding method described in any of the above embodiments.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
Computing unit 701 may be a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 performs the various methods and processes described above, such as an end-to-end speech recognition model training method and/or a speech decoding method. For example, in some embodiments, the end-to-end speech recognition model training method and/or the speech decoding method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 708. In some embodiments, part or all of a computer program may be loaded onto and/or installed onto device 700 via ROM 702 and/or communications unit 709. When loaded into RAM 703 and executed by the computing unit 701, may perform one or more steps of the end-to-end speech recognition model training method and/or the speech decoding method described above. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the end-to-end speech recognition model training method and/or the speech decoding method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system and overcomes the drawbacks of difficult management and weak service scalability found in conventional physical hosts and Virtual Private Server (VPS) services.
According to the technical solution of the present disclosure, the use of cache resources during training of the initial end-to-end speech recognition model can be reduced, training efficiency can be improved, training the initial end-to-end model with large-scale sets of sample voice files becomes feasible, and the quality of the trained end-to-end speech recognition model is improved.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, so long as the desired results of the technical solutions disclosed herein can be achieved; no limitation is imposed in this regard.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (20)

1. An end-to-end speech recognition model training method, comprising:
obtaining a plurality of sample voice files, and packaging each sample voice file into a sample file block;
generating address information of the sample file block;
reading the address information by using a data loader to generate a batch data set;
and training an initial end-to-end speech recognition model based on the batch data set to obtain an end-to-end speech recognition model.
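By way of non-limiting illustration of claims 1 and 2, the following minimal Python sketch packs sample voice files into Tar-format sample file blocks, records their address information, and then streams samples from those addresses; the shard size, file names, and the PyTorch-style loader are assumptions made here for illustration, not the claimed implementation.

```python
import json
import tarfile
from torch.utils.data import IterableDataset

SHARD_SIZE = 1000  # preset number of sample voice files per block (assumed)

def pack_shards(wav_paths, out_prefix):
    """Package every SHARD_SIZE sample voice files into one tar file block
    and write the address information of all blocks to a JSON file."""
    addresses = []
    for i in range(0, len(wav_paths), SHARD_SIZE):
        shard = f"{out_prefix}-{i // SHARD_SIZE:05d}.tar"
        with tarfile.open(shard, "w") as tar:
            for wav in wav_paths[i:i + SHARD_SIZE]:
                tar.add(wav)
        addresses.append(shard)
    with open(f"{out_prefix}.addr.json", "w") as f:
        json.dump(addresses, f)
    return addresses

class ShardDataset(IterableDataset):
    """A data loader backend that reads only the address information up
    front, then opens one file block at a time while yielding samples."""
    def __init__(self, address_file):
        with open(address_file) as f:
            self.addresses = json.load(f)

    def __iter__(self):
        for shard in self.addresses:
            with tarfile.open(shard, "r") as tar:
                for member in tar.getmembers():
                    yield tar.extractfile(member).read()  # raw audio bytes
```

Because the loader holds only the short address list rather than millions of individual file paths, and only one file block is resident at a time, cache pressure during training stays low, which is the effect described in the disclosure.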
2. The method of claim 1, wherein the obtaining a plurality of sample voice files and packaging each of the sample voice files into a sample file block comprises:
acquiring a plurality of sample voice files and generating a plurality of sample voice file sets, wherein each sample voice file set comprises a preset number of sample voice files;
and packaging each sample voice file set separately to generate a sample file block corresponding to each sample voice file set.
3. The method of claim 1, further comprising:
pre-configuring at least one sample file block;
and correspondingly adding a type list to each sample file block, wherein the type list is used for marking the types of the sample voice files already present in the corresponding sample file block.
4. The method of claim 3, wherein the obtaining a plurality of sample voice files and packaging each of the sample voice files into a sample file block comprises:
obtaining a plurality of sample voice files and a first type of each sample voice file;
determining a second type missing in the sample file block based on the type list;
and in response to the first type of the sample voice file matching the second type, packaging the sample voice file into the sample file block.
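As an illustrative sketch only (the type labels, capacity, and helper names below are assumptions, not part of the claims), claims 3 and 4 can be read as maintaining a per-block type list and routing each incoming sample to a block that is still missing that sample's type:

```python
ALL_TYPES = {"near_field", "far_field", "noisy", "accented"}  # assumed type labels

class TypedBlock:
    """A pre-configured sample file block with a type list marking the
    types of sample voice files already present in the block."""
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.samples = []
        self.type_list = set()

    def missing_types(self):
        return ALL_TYPES - self.type_list  # the "second types"

    def try_pack(self, sample_path, first_type):
        # Pack only when the sample's first type matches a missing second type.
        if first_type in self.missing_types() and len(self.samples) < self.capacity:
            self.samples.append(sample_path)
            self.type_list.add(first_type)
            return True
        return False

def pack_by_type(samples, blocks):
    """samples: iterable of (path, first_type) pairs; blocks: TypedBlocks."""
    for path, first_type in samples:
        for block in blocks:
            if block.try_pack(path, first_type):
                break
```

Routing by missing type keeps the type distribution inside each block balanced, so every batch drawn from a block exposes the model to a mix of sample types.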
5. The method of claim 1, further comprising:
obtaining a first number of data loaders used to generate the batch data set and a second number of the sample file blocks;
in response to the second number not being evenly divisible by the first number, obtaining a third number and a fourth number, wherein the third number is the number closest to the second number on its left on the number axis that is divisible by the first number, and the fourth number is the number closest to the second number on its right on the number axis that is divisible by the first number;
acquiring a first number difference between the third number and the second number and a second number difference between the fourth number and the second number;
and in response to the first number difference being less than the second number difference, deleting a first number difference of the sample file blocks.
6. The method of claim 5, further comprising:
and in response to the first number difference being greater than the second number difference, selecting a second number difference of base sample file blocks from the sample file blocks and copying each base sample file block a single time.
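The adjustment in claims 5 and 6 amounts to rounding the block count to the nearest multiple of the loader count: delete blocks when the lower multiple is closer, duplicate blocks when the upper multiple is closer. A hedged sketch follows (the tie-breaking behavior and the random choice of base blocks are our assumptions):

```python
import random

def adjust_block_count(blocks, num_loaders):
    second = len(blocks)                        # second number: sample file blocks
    if second % num_loaders == 0:               # already evenly divisible
        return blocks
    third = (second // num_loaders) * num_loaders   # nearest multiple below
    fourth = third + num_loaders                    # nearest multiple above
    first_diff = second - third
    second_diff = fourth - second
    if first_diff < second_diff:
        return blocks[:third]                   # delete first_diff blocks
    base = random.sample(blocks, second_diff)   # base sample file blocks
    return blocks + base                        # copy each base block a single time
```

For example, 10 blocks with 4 loaders gives third = 8 and fourth = 12; since both differences equal 2, this sketch duplicates 2 blocks to reach 12, after which each loader receives exactly 3 blocks.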
7. The method of claim 1, further comprising:
obtaining intermediate decoding results during training of the initial end-to-end speech recognition model;
processing the intermediate decoding results by using a scene semantic model to generate scores of the intermediate decoding results, wherein the scene semantic model is used for generating a score of the semantic quality of input content in a language scene corresponding to the scene semantic model;
and in response to the sum of the scores of the intermediate decoding results exceeding a preset score threshold, adding an available scene label of the language scene corresponding to the scene semantic model to the end-to-end speech recognition model.
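A minimal sketch of claim 7, assuming one scorer object per language scene; `scene_model.score` and `available_scene_labels` are illustrative stand-ins, not a real API:

```python
def tag_available_scenes(asr_model, intermediate_results, scene_models, threshold):
    """Add an available-scene label for each language scene whose summed
    semantic-quality scores over the intermediate decoding results
    exceed the preset score threshold."""
    labels = []
    for scene_name, scene_model in scene_models.items():
        total = sum(scene_model.score(text) for text in intermediate_results)
        if total > threshold:
            labels.append(scene_name)
    asr_model.available_scene_labels = labels  # assumed attribute
    return labels
```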
8. The method of any of claims 1-7, wherein the sample file block comprises at least one of the following file formats: a Tar format file, an Npz format file, a Hierarchical Data Format version 5 (HDF5) file, and a Pickle format file.
9. A method of speech decoding, comprising:
reading a voice file in a streaming mode;
in response to the reading duration of the read voice file meeting a preset time threshold requirement, inputting the read target voice file into an end-to-end speech recognition model for processing, and generating a decoding result corresponding to the target voice file, wherein the end-to-end speech recognition model is obtained based on the end-to-end speech recognition model training method of any one of claims 1-8.
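An illustrative reading of claim 9, assuming 16 kHz 16-bit PCM input and a hypothetical `model.transcribe` call; the chunk source and the threshold value are placeholders:

```python
SAMPLE_RATE = 16000        # assumed sampling rate
TIME_THRESHOLD_S = 0.5     # preset time threshold (assumed value)

def streaming_decode(read_chunks, model):
    """Stream a voice file chunk by chunk and decode whenever the
    accumulated read duration reaches the preset time threshold."""
    buffered, buffered_samples = [], 0
    for chunk in read_chunks():                 # e.g. raw 16-bit PCM bytes
        buffered.append(chunk)
        buffered_samples += len(chunk) // 2     # 2 bytes per 16-bit sample
        if buffered_samples / SAMPLE_RATE >= TIME_THRESHOLD_S:
            target = b"".join(buffered)         # the target voice file so far
            yield model.transcribe(target)      # decoding result for this span
            buffered, buffered_samples = [], 0
```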
10. An end-to-end speech recognition model training apparatus, comprising:
a sample acquiring and packing unit comprising a sample acquiring subunit and a sample packing subunit, wherein the sample acquiring subunit is configured to acquire a plurality of sample voice files, and the sample packing subunit is configured to pack each of the sample voice files into a sample file block;
an address information generating unit configured to generate address information of the sample file block;
a batch data set generating unit configured to read the address information using a data loader to generate a batch data set;
a model training unit configured to train an initial end-to-end speech recognition model based on the batch data set, resulting in an end-to-end speech recognition model.
11. The apparatus of claim 10, wherein the sample acquisition and packing unit comprises:
the sample acquiring subunit is further configured to acquire a plurality of sample voice files and generate a plurality of sample voice file sets, wherein each sample voice file set includes a preset number of sample voice files;
and the sample packing subunit is further configured to pack each sample voice file set separately and generate a sample file block corresponding to each sample voice file set.
12. The apparatus of claim 10, further comprising:
a file block configuration unit configured to pre-configure at least one sample file block;
and a type list adding unit configured to correspondingly add a type list to each sample file block, wherein the type list is used for marking the types of the sample voice files already present in the corresponding sample file block.
13. The apparatus of claim 12, wherein the sample acquisition and packing unit comprises:
the sample acquiring subunit is further configured to acquire a plurality of sample voice files and a first type of each of the sample voice files;
the sample packing subunit is further configured to determine a second type missing in the sample file block based on the type list, and, in response to the first type of the sample voice file matching the second type, pack the sample voice file into the sample file block.
14. The apparatus of claim 10, further comprising:
an intermediate result obtaining unit configured to obtain intermediate decoding results during training of the initial end-to-end speech recognition model;
an intermediate result scoring unit configured to process the intermediate decoding results using a scene semantic model to generate scores of the intermediate decoding results, wherein the scene semantic model is used for generating a score of the semantic quality of input content in a language scene corresponding to the scene semantic model;
and a scene label adding unit configured to add an available scene label of the language scene corresponding to the scene semantic model to the end-to-end speech recognition model in response to the sum of the scores of the intermediate decoding results exceeding a preset score threshold.
15. The apparatus of claim 10, further comprising:
a loader number and file block number obtaining unit including a loader number obtaining subunit and a file block number obtaining subunit, wherein the loader number obtaining subunit is configured to obtain a first number of data loaders used to generate the batch data set, and the file block number obtaining subunit is configured to obtain a second number of the sample file blocks;
an integer division number acquisition unit configured to, in response to the second number not being evenly divisible by the first number, acquire a third number and a fourth number, wherein the third number is the number closest to the second number on its left on the number axis that is divisible by the first number, and the fourth number is the number closest to the second number on its right on the number axis that is divisible by the first number;
a number difference acquisition unit configured to acquire a first number difference between the third number and the second number and a second number difference between the fourth number and the second number;
and a first sample file block adjustment unit configured to delete a first number difference of the sample file blocks in response to the first number difference being less than the second number difference.
16. The apparatus of claim 15, further comprising:
a second sample file block adjusting unit configured to, in response to the first number difference being greater than the second number difference, select a second number difference of base sample file blocks from the sample file blocks and copy each base sample file block a single time.
17. The apparatus of any of claims 10-16, wherein the sample file block comprises at least one of the following file formats: a Tar format file, an Npz format file, a Hierarchical Data Format version 5 (HDF5) file, and a Pickle format file.
18. A speech decoding apparatus comprising:
a voice file streaming reading unit configured to read a voice file in a streaming manner;
and a voice decoding unit configured to, in response to the reading duration of the read voice file meeting a preset time threshold requirement, input the read target voice file into an end-to-end speech recognition model for processing and generate a decoding result corresponding to the target voice file, wherein the end-to-end speech recognition model is obtained based on the end-to-end speech recognition model training apparatus of any one of claims 10-17.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the end-to-end speech recognition model training method of any of claims 1-8 and/or the speech decoding method of claim 9.
20. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the end-to-end speech recognition model training method of any one of claims 1-8 and/or the speech decoding method of claim 9.
CN202210893335.5A 2022-07-27 2022-07-27 End-to-end speech recognition model training method, speech decoding method and related device Pending CN115132186A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210893335.5A CN115132186A (en) 2022-07-27 2022-07-27 End-to-end speech recognition model training method, speech decoding method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210893335.5A CN115132186A (en) 2022-07-27 2022-07-27 End-to-end speech recognition model training method, speech decoding method and related device

Publications (1)

Publication Number Publication Date
CN115132186A true CN115132186A (en) 2022-09-30

Family

ID=83385895

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210893335.5A Pending CN115132186A (en) 2022-07-27 2022-07-27 End-to-end speech recognition model training method, speech decoding method and related device

Country Status (1)

Country Link
CN (1) CN115132186A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116737607A (en) * 2023-08-16 2023-09-12 之江实验室 Sample data caching method, system, computer device and storage medium
CN116737607B (en) * 2023-08-16 2023-11-21 之江实验室 Sample data caching method, system, computer device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination