CN113763937A - Method, device and equipment for generating a speech processing model, and storage medium

Method, device and equipment for generating a speech processing model, and storage medium

Info

Publication number
CN113763937A
Authority
CN
China
Prior art keywords
source language, text, module, word vector, initial network
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111255342.4A
Other languages
Chinese (zh)
Inventor
赵情恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/26 - Speech to text systems
    • G10L 2015/0631 - Creating reference templates; Clustering
    • G10L 2015/0633 - Creating reference templates; Clustering using lexical or orthographic knowledge sources

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure provides a method for generating a speech processing model, relates to the field of artificial intelligence, and further relates to the technical fields of deep learning, speech recognition and machine translation. The specific implementation scheme is as follows: training a first initial network by using non-source language data and first source language data to obtain a speech coding module, wherein the speech coding module is used for outputting a corresponding source language word vector according to input source language audio; training a second initial network by using second source language data to obtain a text prediction module, wherein the text prediction module is used for outputting a corresponding associated word vector according to an input source language word vector; and generating a speech processing model according to the speech coding module, the text prediction module and a text translation module, wherein the speech processing model is used for outputting a corresponding target language text according to input source language audio. The disclosed technique can effectively reduce the training cost of the model and improve its training efficiency.

Description

Method, device and equipment for generating a speech processing model, and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, further to the technical fields of deep learning, speech recognition and machine translation, and in particular to a method, an apparatus, a device and a storage medium for generating a speech processing model.
Background
The principal technical obstacle for the end-to-end speech translation technology in the related art is the limitation of training data. For the speech translation task, the training input is source language speech and the training target is the corresponding target language text; such source-speech/target-text data pairs are expensive to acquire, so it is difficult to assemble training data at scale. Yet end-to-end speech translation models typically have a huge number of parameters and require large-scale training data.
Disclosure of Invention
The disclosure provides a method, an apparatus, a device and a storage medium for generating a speech processing model.
According to an aspect of the present disclosure, there is provided a method for generating a speech processing model, including:
training a first initial network by using non-source language data and first source language data to obtain a speech coding module, wherein the speech coding module is used for outputting a corresponding source language word vector according to input source language audio;
training a second initial network by using second source language data to obtain a text prediction module, wherein the text prediction module is used for outputting a corresponding associated word vector according to an input source language word vector;
and generating a speech processing model according to the speech coding module, the text prediction module and a text translation module, wherein the speech processing model is used for outputting a corresponding target language text according to input source language audio.
According to another aspect of the present disclosure, there is provided a speech processing method, including:
inputting source language audio to be processed into a speech processing model to obtain a target language text corresponding to the audio to be processed, wherein the speech processing model is obtained by the method for generating a speech processing model according to the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided an apparatus for generating a speech processing model, including:
a speech coding module generation module, used for training a first initial network by using non-source language data and first source language data to obtain a speech coding module, wherein the speech coding module is used for outputting a corresponding source language word vector according to input source language audio;
a text prediction module generation module, used for training a second initial network by using second source language data to obtain a text prediction module, wherein the text prediction module is used for outputting a corresponding associated word vector according to an input source language word vector;
and a speech processing model generation module, used for generating a speech processing model according to the speech coding module, the text prediction module and a text translation module, wherein the speech processing model is used for outputting a corresponding target language text according to input source language audio.
According to another aspect of the present disclosure, there is provided a speech processing apparatus, including:
an input module, used for inputting source language audio to be processed into a speech processing model;
and a receiving module, used for receiving, from the speech processing model, a target language text corresponding to the audio to be processed, wherein the speech processing model is obtained by the apparatus for generating a speech processing model according to the above embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method according to any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method in any of the embodiments of the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method in any of the embodiments of the present disclosure.
According to the disclosed technique, the limitation that training data volume imposes on end-to-end speech translation models is overcome: existing stock training data can be used effectively, the training value of the non-source language data, the first source language data and the second source language data is fully exploited, and the time and economic cost of acquiring training data are reduced, thereby lowering the training cost of the model and improving its training efficiency.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 shows a flowchart of a method for generating a speech processing model according to an embodiment of the present disclosure;
FIG. 2 shows a detailed flowchart of generating the speech processing model in the method according to an embodiment of the present disclosure;
FIG. 3 shows a detailed flowchart of jointly training the speech coding module, the text prediction module and the connection layer in the method according to an embodiment of the present disclosure;
FIG. 4 shows a detailed flowchart of training the third initial network in the method according to an embodiment of the present disclosure;
FIG. 5 shows a detailed flowchart of training the first initial network in the method according to an embodiment of the present disclosure;
FIG. 6 shows a detailed flowchart of inputting non-source language audio samples into the first initial network in the method according to an embodiment of the present disclosure;
FIG. 7 shows a detailed flowchart of training the second initial network in the method according to an embodiment of the present disclosure;
FIG. 8 shows a detailed flowchart of preprocessing a source language word vector sequence in the method according to an embodiment of the present disclosure;
FIG. 9 shows a detailed flowchart of generating the text translation module in the method according to an embodiment of the present disclosure;
FIG. 10 shows a detailed flowchart of preprocessing source language text samples in the method according to an embodiment of the present disclosure;
FIG. 11 is a diagram of an application example of the method for generating a speech processing model according to an embodiment of the present disclosure;
FIG. 12 shows a flowchart of a speech processing method according to an embodiment of the present disclosure;
FIG. 13 shows a block diagram of an apparatus for generating a speech processing model according to an embodiment of the present disclosure;
FIG. 14 shows a block diagram of a speech processing apparatus according to an embodiment of the present disclosure;
FIG. 15 is a block diagram of an electronic device for implementing the method for generating a speech processing model and/or the speech processing method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
A method of generating a speech processing model according to an embodiment of the present disclosure is described below with reference to fig. 1 to 11.
As shown in fig. 1, the method for generating a speech processing model according to the embodiment of the present disclosure specifically includes the following steps:
S101: training a first initial network by using non-source language data and first source language data to obtain a speech coding module, wherein the speech coding module is used for outputting a corresponding source language word vector according to input source language audio;
S102: training a second initial network by using second source language data to obtain a text prediction module, wherein the text prediction module is used for outputting a corresponding associated word vector according to an input source language word vector;
S103: generating a speech processing model according to the speech coding module, the text prediction module and a text translation module, wherein the speech processing model is used for outputting a corresponding target language text according to input source language audio.
In the following description of the present disclosure, the first to fifth preset conditions may be various convergence conditions preset according to actual situations, and the present disclosure does not specifically limit this.
Illustratively, in step S101, the non-source language data may include non-source language audio samples and corresponding non-source language annotated word vectors, and the first source language data may include source language audio samples and corresponding source language annotated word vectors. The non-source language may be any language other than the source language and the target language; preferably, it is a language for which a pre-established data set with a large amount of labeled data exists. For example, the non-source language may be English, and the non-source language data may come from LibriSpeech (a public large-scale English speech database); where the source language is Chinese, AiShell (a public large-scale Chinese speech database) may supply the first source language data.
In one example, step S101 may pre-train the first initial network based on a transfer learning strategy. It can be understood that the speech recognition performed by the speech coding module targets human voice; because human vocal physiology is the same across speakers, a speech encoder has certain generic characteristics for audio recognition across different languages.
Specifically, the first initial network is first trained with the non-source language data, which is available in large quantity. A non-source language audio sample is input into the first initial network, the parameters of the first initial network are updated backward according to the difference between the non-source language predicted word vector output by the first initial network and the non-source language annotated word vector, and a candidate speech coding module meeting a first preset condition is obtained through multiple iterations. The candidate speech coding module can output, according to input non-source language audio, the non-source language word vectors corresponding to that audio. Here, the first initial network may be an initial speech coding module.
Then, the output layer of the candidate speech coding module is adjusted so that it can output source language word vectors, and the candidate module is further trained and tuned with the smaller amount of source language data. Specifically, a source language audio sample is input into the candidate speech coding module, the parameters of the candidate module are updated backward according to the difference between the source language predicted word vector output by the module and the source language annotated word vector, and a speech coding module meeting a second preset condition is obtained through multiple iterations. It can be understood that the speech coding module can output, according to input source language audio, the source language word vector corresponding to that audio.
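By way of illustration only, the following PyTorch sketch shows one possible realization of this pre-train-then-tune strategy; the SpeechEncoder class, all dimensions, the CTC criterion and the dummy batch are assumptions rather than the disclosed implementation.

```python
# Illustrative two-stage transfer-learning sketch (assumed PyTorch realization).
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Hypothetical stand-in for the first initial network (stacked BLSTM)."""
    def __init__(self, feat_dim=80, hidden=128, vocab=30, layers=2):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                             bidirectional=True, batch_first=True)
        self.output = nn.Linear(2 * hidden, vocab)

    def forward(self, feats):                      # feats: (B, T, feat_dim)
        h, _ = self.blstm(feats)
        return self.output(h)                      # per-frame word-vector logits

def train_step(model, feats, targets, feat_lens, tgt_lens, opt):
    """Forward, measure the difference to the annotated word vectors, update backward."""
    log_probs = model(feats).log_softmax(-1).transpose(0, 1)   # (T, B, C)
    loss = nn.CTCLoss(blank=0)(log_probs, targets, feat_lens, tgt_lens)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Stage 1: pre-train on abundant non-source-language data (dummy batch here).
enc = SpeechEncoder(vocab=30)                      # e.g. English output units
opt = torch.optim.SGD(enc.parameters(), lr=1e-3)
feats = torch.randn(4, 100, 80)
targets = torch.randint(1, 30, (4, 20))
lens = torch.full((4,), 100, dtype=torch.long)
tgt_lens = torch.full((4,), 20, dtype=torch.long)
train_step(enc, feats, targets, lens, tgt_lens, opt)

# Stage 2: adjust the output layer for the source language, then keep tuning
# on the smaller source-language data with a lower learning rate.
enc.output = nn.Linear(256, 4000)                  # 4000: assumed Chinese vocab
opt = torch.optim.SGD(enc.parameters(), lr=1e-4)
```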
For example, in step S102, the second source language data may include source language text samples and corresponding source language annotated associated word vectors. Specifically, a word vector coding table is used to preprocess a source language text sample to obtain a corresponding word vector sequence sample; the word vector sequence sample is then input into the second initial network, and the parameters of the second initial network are adjusted according to the difference between the source language predicted associated word vector output by the second initial network and the source language annotated associated word vector, until a text prediction module meeting a fourth preset condition is obtained.
For example, in step S103, the output of the speech coding module and the output of the text prediction module may be combined through a connection layer to obtain a speech recognition model. It can be understood that, given input source language audio, the speech recognition model uses the connection layer to combine the word vector output by the speech coding module with the associated word vector output by the text prediction module, and finally outputs the source language text corresponding to the source language audio. Then, the output of the speech recognition model, i.e. the output of the connection layer, is connected to a pre-trained text translation module, and joint training is performed with source language audio samples and target language annotated texts, finally yielding the speech processing model.
The text translation module is used for outputting a target language text corresponding to a source language text according to the input source language text.
The following describes the method for generating a speech processing model according to an embodiment of the present disclosure with an application example. The generated speech processing model outputs Spanish (target language) text from input Chinese (source language) audio, and the non-source language may be English.
Specifically, first, the non-source language data and the first source language data are obtained. English audio samples and corresponding annotated word vectors can be obtained from LibriSpeech as the non-source language data; Chinese audio samples and corresponding annotated word vectors can be obtained from AiShell as the first source language data. Following the transfer learning strategy, the first initial network is trained with the non-source language data to obtain a candidate speech coding module, which outputs corresponding English word vectors according to input English audio. Then, the output layer of the candidate speech coding module is adjusted so that it can output Chinese word vectors, and the candidate module is trained with the first source language data until a speech coding module meeting the convergence condition is obtained. The speech coding module can output corresponding Chinese word vectors according to input Chinese audio.
Second, the second source language data is obtained: Chinese text samples and corresponding Chinese annotated associated word vectors from AiShell. The second initial network is trained with the second source language data until a text prediction module meeting the convergence condition is obtained.
Third, the connection layer is used to combine the output of the speech coding module and the output of the text prediction module, yielding a speech recognition model.
Finally, the output of the speech recognition model is connected to the input of a pre-trained text translation module, and the speech recognition model and the text translation module are jointly trained with Chinese audio samples and corresponding Spanish annotated texts to obtain the speech processing model.
According to the method for generating a speech processing model of the embodiments of the present disclosure, each module of the speech processing model is pre-trained and the resulting modules are then combined, which reduces the need for large-scale source-language-audio/target-language-annotated-text training pairs. The limitation that training data volume imposes on end-to-end speech translation models is thus overcome, an end-to-end speech translation model can be trained even when training data are sparse, existing stock training data can be used effectively, the training value of the non-source language data, the first source language data and the second source language data is fully exploited, and the time and economic cost of acquiring training data are reduced, thereby lowering the training cost of the model and improving its training efficiency.
As shown in fig. 2, in one embodiment, step S103 includes:
S201: connecting the output of the speech coding module and the output of the text prediction module to the input of a connection layer, and connecting the output of the connection layer to the input of the text translation module to obtain a third initial network, wherein the input of the third initial network includes the input of the speech coding module, and the output of the third initial network includes the output of the text translation module;
S202: training the third initial network to obtain the speech processing model.
Illustratively, in step S201, the connection layer may be a connection transformation layer (Joint), used to output a source language word vector sequence, i.e. a source language text, from the source language word vector output by the speech coding module and the source language associated word vector output by the text prediction module.
The speech coding module, the text prediction module and the connection layer together form a speech recognition model. The input of the speech recognition model includes the input of the speech coding module, and its output includes the output of the connection layer; that is, the speech recognition model can output the corresponding source language text according to input source language audio. A third initial network is then obtained by connecting the output of the speech recognition model to the input of a pre-trained text translation module.
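A minimal sketch of such a connection (Joint) layer, assuming an RNN-T-style additive fusion of the two module outputs; all dimensions are illustrative, not taken from the disclosure.

```python
# Hedged sketch: the Joint layer fuses the speech coding module's output for
# frame t with the text prediction module's output for step u, and maps the
# fusion to source-language word-vector logits.
import torch
import torch.nn as nn

class JointLayer(nn.Module):
    def __init__(self, enc_dim=256, pred_dim=128, joint_dim=256, vocab=4000):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab)

    def forward(self, enc_out, pred_out):
        # enc_out: (B, T, enc_dim); pred_out: (B, U, pred_dim)
        joint = torch.tanh(self.enc_proj(enc_out).unsqueeze(2) +
                           self.pred_proj(pred_out).unsqueeze(1))
        return self.out(joint)        # (B, T, U, vocab): one score per (t, u)

joint = JointLayer()
logits = joint(torch.randn(2, 50, 256), torch.randn(2, 10, 128))
print(logits.shape)                   # torch.Size([2, 50, 10, 4000])
```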
It can be understood that although the speech coding module, the text prediction module and the text translation module are pre-trained before being combined, the third initial network still needs to be trained further to fine-tune the parameters of each module, so as to ensure that the inputs and outputs of the modules are well adapted to each other and to obtain a speech processing model meeting the convergence condition.
According to this embodiment, by providing the connection layer, the outputs of the pre-trained speech coding module and text prediction module can be combined, and the source language text is output through the connection layer, yielding the speech recognition model. Training the third initial network obtained by combining the modules then allows the module parameters to be adjusted further, so that the output of the resulting speech processing model meets the convergence condition.
As shown in fig. 3, in an embodiment, before connecting the output of the connection layer to the input of the text translation module in step S201, the method further includes:
S301: connecting the output of the connection layer to the input of the text prediction module;
S302: jointly training the speech coding module, the text prediction module and the connection layer by using source language audio samples and corresponding source language annotated texts, until the output of the connection layer meets a third preset condition.
It can be understood that, after the output of the connection layer is connected to the input of the text prediction module, the text prediction module predicts, from the source language word vector sequence output by the connection layer, the next associated word vector for that sequence; the connection layer then performs joint mapping on the source language word vector output by the speech coding module and the source language associated word vector output by the text prediction module, and outputs a source language word vector sequence, i.e. a source language text.
Illustratively, in step S302, a source language audio sample is first input into the speech coding module, which outputs a corresponding source language predicted word vector. The connection layer performs joint mapping on the source language predicted word vector output by the speech coding module and a zero vector or random word vector output by the text prediction module, and outputs a first source language word vector sequence. The text prediction module then outputs the corresponding source language associated word vector to the connection layer according to the first source language word vector sequence, and the connection layer performs joint mapping again on this associated word vector and the source language word vector output by the speech coding module, outputting a second source language word vector sequence.
The weight parameters of the connection layer and the parameters of the speech coding module and text prediction module are adjusted according to the difference between the second source language word vector sequence and the word vector sequence corresponding to the source language annotated text, until the difference meets a threshold condition. If it does not, the output of the connection layer is fed into the text prediction module again for another iteration, so that the weight parameters of the connection layer and the module parameters continue to be adjusted. Notably, during the first several iterations, only the weight parameters of the connection layer are adjusted; the parameters of the speech coding module and the text prediction module are left unchanged.
Further, the difference between the second source language word vector sequence and the word vector sequence corresponding to the source language annotated text may be computed with CTC (Connectionist Temporal Classification, an algorithm for aligning an input sequence with an output sequence) or RNN-T (Recurrent Neural Network Transducer, an algorithm improved on the basis of CTC), and the weight parameters of the connection layer are updated backward by stochastic gradient descent according to the computed loss value.
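As a hedged sketch of this loss computation (CTC shown; an RNN-T transducer loss would take its place in the RNN-T case), with all shapes and the blank index chosen for illustration:

```python
# Hedged sketch: measure the difference between the predicted word-vector
# sequence and the annotated sequence with CTC, then back-propagate.
import torch
import torch.nn as nn

criterion = nn.CTCLoss(blank=0)                       # blank index assumed 0
log_probs = torch.randn(50, 2, 4000,
                        requires_grad=True).log_softmax(-1)   # (T, B, C)
targets = torch.randint(1, 4000, (2, 12))             # annotated word-vector ids
input_lens = torch.full((2,), 50, dtype=torch.long)
target_lens = torch.full((2,), 12, dtype=torch.long)

loss = criterion(log_probs, targets, input_lens, target_lens)
loss.backward()   # gradients would flow back to the connection layer and modules
```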
Through this implementation, the parameters of the speech coding module, the text prediction module and the connection layer can be adjusted jointly, improving the recognition accuracy of the speech recognition model.
As shown in fig. 4, in one embodiment, step S202 includes:
S401: training the third initial network by using source language audio samples and corresponding target language texts to obtain the speech processing model.
Illustratively, the source language audio samples may be those included in the first source language data, and the target language texts may be obtained by machine annotation or manual annotation of the source language audio samples.
More specifically, a source language audio sample is input into the speech coding module of the third initial network, and a target language predicted text is output through the text translation module of the third initial network. The difference between the target language predicted text and the target language text is computed with a loss function, and the parameters of each module in the third initial network are adjusted by stochastic gradient descent according to this difference until it meets a threshold condition, yielding the speech processing model.
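A hedged sketch of this end-to-end fine-tuning step follows; the stand-in network only mimics the input/output contract of the third initial network (audio features in, target-language logits out), and the one-to-one frame/token alignment is a simplification for illustration.

```python
import torch
import torch.nn as nn

class StandInSpeechTranslator(nn.Module):
    """Placeholder with the same I/O contract as the third initial network."""
    def __init__(self, feat_dim=80, hidden=128, tgt_vocab=5000):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, feats):                  # feats: (B, T, feat_dim)
        h, _ = self.rnn(feats)
        return self.out(h)                     # (B, T, tgt_vocab)

net = StandInSpeechTranslator()
opt = torch.optim.SGD(net.parameters(), lr=1e-4)    # stochastic gradient descent

feats = torch.randn(2, 50, 80)                      # source-language audio feats
target_ids = torch.randint(0, 5000, (2, 50))        # target-language text ids

logits = net(feats)
loss = nn.functional.cross_entropy(logits.reshape(-1, 5000),
                                   target_ids.reshape(-1))
opt.zero_grad(); loss.backward(); opt.step()        # adjust all module parameters
```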
Through the above embodiment, on the basis of combining the modules to obtain the third initial network, the parameters of the modules in the third initial network are further adjusted, so that the output of the obtained speech processing model meets the convergence condition.
As shown in fig. 5, in one embodiment, the non-source language data includes non-source language audio samples and corresponding non-source language annotated word vectors, the first source language data includes source language audio samples and corresponding source language annotated word vectors, and step S101 includes:
S501: inputting a non-source language audio sample into the first initial network, and adjusting the parameters of the first initial network according to the non-source language predicted word vector output by the first initial network and the non-source language annotated word vector, until a candidate speech coding module meeting a first preset condition is obtained;
S502: adjusting the parameters of the output layer of the candidate speech coding module, inputting a source language audio sample into the candidate speech coding module, and adjusting the parameters of the candidate speech coding module according to the source language predicted word vector output by it and the source language annotated word vector, until a speech coding module meeting a second preset condition is obtained.
Illustratively, the first initial network may include a plurality of BLSTM (Bidirectional Long Short-Term Memory) layers; for example, it may consist of 8 BLSTM layers.
In step S501, a non-source language audio sample is input into the first initial network, the non-source language predicted word vector output by the first initial network is obtained, and the parameters of the first initial network are adjusted according to the difference between the non-source language predicted word vector and the non-source language annotated word vector.
This difference may be computed with CTC or RNN-T. The parameters of the first initial network are adjusted and updated by stochastic gradient descent according to the computed loss value, and the candidate speech coding module meeting the first preset condition is obtained through multiple iterations.
In step S502, the output layer of the candidate speech coding module may be adjusted so that the module can output source language word vectors.
It can be understood that the candidate speech coding module may include an input layer, a plurality of hidden layers and an output layer. For a candidate speech coding module whose non-source language is English, the output layer contains output nodes in one-to-one correspondence with output objects such as the 26 English letters, digits and symbols, and a transformation matrix, for example a 128×26 matrix, sits between the last hidden layer and the output layer. To obtain a speech coding module whose source language is Chinese, the output nodes of the output layer are replaced with output nodes corresponding to Chinese, and the transformation matrix between the last hidden layer and the output layer is replaced accordingly, so that the adjusted candidate speech coding module gains the ability to output Chinese word vectors.
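The following minimal sketch illustrates this output-layer replacement under assumed sizes (128 hidden units, 26 English output nodes, a 4000-entry Chinese vocabulary); the Sequential stack merely stands in for the pre-trained hidden layers.

```python
import torch.nn as nn

candidate = nn.Sequential(
    nn.Linear(80, 128),            # stand-in for the pre-trained hidden layers
    nn.Tanh(),
    nn.Linear(128, 26),            # English head: the 128x26 transformation matrix
)
# Swap only the final transformation matrix for Chinese output nodes; the
# pre-trained hidden layers are kept and merely fine-tuned afterwards.
candidate[-1] = nn.Linear(128, 4000)   # 4000 is an assumed Chinese vocabulary size
```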
Further, the source language audio sample is input into the adjusted candidate speech coding module, the source language predicted word vector output by the candidate module is obtained, and the parameters of the candidate speech coding module are adjusted according to the difference between the source language predicted word vector and the source language annotated word vector.
This difference may be computed with CTC or RNN-T. The parameters of the candidate speech coding module are adjusted and updated by stochastic gradient descent according to the computed loss value, and the speech coding module meeting the second preset condition is obtained through multiple iterations.
According to this embodiment, based on the transfer learning strategy, the first initial network is trained with non-source language data to obtain the candidate speech coding module, and the candidate module is then further trained with source language data to obtain the speech coding module. The first initial network can thus be preliminarily trained with existing large-scale non-source language data into a candidate speech coding module that already has a certain speech coding capability, so that the value of the non-source language data is fully exploited, the demand for source language data is reduced, and the training cost of the speech coding module is lowered.
As shown in fig. 6, in one embodiment, inputting the non-source language audio samples into the first initial network in step S501 includes:
S601: performing framing processing on the non-source language audio samples to obtain frame audio data;
S602: performing feature extraction processing on the frame audio data to obtain audio features;
S603: normalizing the audio features, and inputting the normalized audio features into the first initial network.
In step S601, the non-source language audio samples may be framed using 25 ms frames with a 10 ms frame shift.
In step S602, Mel-frequency cepstral coefficients may be extracted from each frame of audio data to obtain the audio feature corresponding to that frame. In other examples of the present disclosure, frequency-domain features (Filter Bank, FBank) or perceptual linear prediction features (PLP) may be extracted from each frame instead.
In step S603, the audio features may be normalized by subtracting the mean and dividing by the standard deviation.
It can be understood that after the non-source language audio sample is framed into multiple frames of audio data, and feature extraction and normalization are applied to each frame in turn, an audio feature sequence corresponding to the non-source language audio is obtained. The first initial network outputs a word vector corresponding to each audio feature in the input sequence.
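A hedged preprocessing sketch using torchaudio's Kaldi-compatible feature extraction; the 25 ms frame length and 10 ms shift come from the description above, while the file name and remaining parameters are assumptions.

```python
import torchaudio

waveform, sample_rate = torchaudio.load("sample.wav")       # hypothetical file
# Framing (25 ms frames, 10 ms shift) and MFCC extraction in one call.
mfcc = torchaudio.compliance.kaldi.mfcc(
    waveform, sample_frequency=sample_rate,
    frame_length=25.0, frame_shift=10.0)                    # (frames, n_ceps)
# Normalization as described: subtract the mean, divide by the std deviation.
feats = (mfcc - mfcc.mean(dim=0)) / (mfcc.std(dim=0) + 1e-8)
```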
It should be noted that, in step S502, when the source language audio sample is input into the candidate speech coding module, an audio feature sequence corresponding to the source language audio sample may be obtained by processing similar or identical to steps S601 to S603 and then input into the module.
Through this embodiment, the non-source language audio samples are preprocessed before being input into the first initial network, so that the input objects of the first initial network are serialized and the corresponding output objects can be serialized as well, making the output of the finally trained speech coding module more accurate.
As shown in fig. 7, in one embodiment, the second source language data includes a source language word vector sequence and corresponding source language annotated associated word vectors, and step S102 includes:
S701: preprocessing the source language word vector sequence to obtain a word vector sequence sample;
S702: inputting the word vector sequence sample into the second initial network, and adjusting the parameters of the second initial network according to the difference between the source language predicted associated word vector output by the second initial network and the source language annotated associated word vector, until a text prediction module meeting a fourth preset condition is obtained.
The second initial network may include a plurality of BLSTM layers; for example, it may consist of two BLSTM layers.
For example, a certain amount (for example, more than 100,000 sentences) of source language text corpora may be collected in advance to build a source language text corpus. The source language texts are subjected to symbol cleaning and normalization, including removing special symbols (for example, "@", "-", "%"), normalizing numbers and unit symbols (for example, "2010", "kg"), and word segmentation, to obtain clean text corpora in a unified format; word vectors corresponding to the source language words are then obtained by training with a word vector training tool, such as Word2Vec (Word to Vector, a family of models for generating word vectors).
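As a hedged sketch, such word vectors could be trained with gensim's Word2Vec as follows; the toy corpus, vector size and skip-gram choice are illustrative assumptions.

```python
from gensim.models import Word2Vec

corpus = [["今天", "天气", "很", "好"],          # cleaned, word-segmented lines
          ["模型", "训练", "效率", "高"]]
model = Word2Vec(sentences=corpus, vector_size=128, window=5,
                 min_count=1, sg=1)               # sg=1: skip-gram (assumed)
vector = model.wv["模型"]                         # a 128-dimensional word vector
```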
In step S701, a source language word vector sequence may be obtained from the source language text corpus, and random simulated sampling is applied to it to generate a word vector sequence sample. The sample may contain repeated word vectors, zero vectors and other semantically irrelevant vectors.
In step S702, the parameters of the second initial network may be adjusted according to the difference between the source language predicted associated word vector output by the second initial network and the source language annotated associated word vector.
This difference may be computed with a cross entropy loss function.
Through this embodiment, training of the second initial network is realized, and the word vector sequence sample obtained by preprocessing the source language word vector sequence is closer to actual speech (for example, it contains repeated word vectors or zero vectors), so that the text prediction module can predict associated word vectors more accurately in a speech recognition scenario.
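A minimal sketch of such a text prediction module, assuming the two-BLSTM-layer configuration mentioned above; the vocabulary size, dimensions and dummy batch are illustrative, and the targets are cast as word-vector indices so that cross entropy applies.

```python
import torch
import torch.nn as nn

class TextPredictor(nn.Module):
    """Hypothetical second initial network: two BLSTM layers plus a projection."""
    def __init__(self, emb_dim=128, hidden=128, vocab=4000):
        super().__init__()
        self.blstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab)

    def forward(self, word_vecs):                 # (B, U, emb_dim)
        h, _ = self.blstm(word_vecs)
        return self.out(h)                        # logits of associated word vectors

pred = TextPredictor()
samples = torch.randn(4, 20, 128)                 # preprocessed word-vector samples
labels = torch.randint(0, 4000, (4, 20))          # annotated associated word ids
loss = nn.functional.cross_entropy(pred(samples).reshape(-1, 4000),
                                   labels.reshape(-1))
loss.backward()                                   # adjust second-network parameters
```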
As shown in fig. 8, in one embodiment, step S701 includes:
S801: randomly inserting repeated word vectors and/or blank marks into the source language word vector sequence.
It should be noted that the training of the speech coding module usually uses CTC or RNN-T to compute the loss of its output, and these algorithms map audio without semantic content, such as noise or silence, onto a blank mark. Deriving the word vector sequence sample through the above operation lets the output of the text prediction module match the output of the speech coding module better, so that the output of the speech recognition model in turn matches the input of the text translation module better.
Therefore, through this embodiment, the adaptability between the output of the text prediction module and the other modules can be improved.
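A hedged sketch of this random insertion over token ids (the same idea applies to the vectors themselves); the probabilities and blank id are assumptions.

```python
import random

BLANK = 0   # assumed id of the blank mark

def insert_repeats_and_blanks(ids, p_repeat=0.3, p_blank=0.2, max_rep=3):
    """Randomly repeat tokens and insert blanks, mimicking CTC-style frame output."""
    out = []
    for tok in ids:
        reps = random.randint(2, max_rep) if random.random() < p_repeat else 1
        out.extend([tok] * reps)
        if random.random() < p_blank:
            out.append(BLANK)
    return out

print(insert_repeats_and_blanks([7, 12, 5]))   # e.g. [7, 7, 0, 12, 5, 0]
```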
As shown in fig. 9, in one embodiment, the method for generating the text translation module includes:
S901: obtaining a source language text sample and a corresponding target language annotated text;
S902: preprocessing the source language text sample to obtain a source language word vector sequence sample;
S903: inputting the source language word vector sequence sample into an initial text translation module, and adjusting the parameters of the initial text translation module according to the difference between the target language predicted text output by the initial text translation module and the target language annotated text, until a text translation module meeting a fifth preset condition is obtained.
Illustratively, the initial text translation module may include a text encoder, which may consist of 5 BLSTM layers, and a text decoder, which may consist of 2 LSTM (Long Short-Term Memory) layers and one attention layer.
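A hedged sketch of this encoder/decoder layout; beyond the 5-BLSTM/2-LSTM-plus-attention description, the attention module, dimensions and dummy inputs are assumptions.

```python
import torch
import torch.nn as nn

class TextTranslator(nn.Module):
    """Hypothetical initial text translation module (5 BLSTM + 2 LSTM + attention)."""
    def __init__(self, emb_dim=128, hidden=128, tgt_vocab=5000):
        super().__init__()
        self.encoder = nn.LSTM(emb_dim, hidden, num_layers=5,
                               bidirectional=True, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, 2 * hidden, num_layers=2,
                               batch_first=True)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=1,
                                          batch_first=True)
        self.out = nn.Linear(2 * hidden, tgt_vocab)

    def forward(self, src_vecs, tgt_vecs):
        memory, _ = self.encoder(src_vecs)         # (B, S, 2*hidden)
        dec, _ = self.decoder(tgt_vecs)            # (B, U, 2*hidden)
        ctx, _ = self.attn(dec, memory, memory)    # attend over encoder states
        return self.out(ctx)                       # (B, U, tgt_vocab)

model = TextTranslator()
logits = model(torch.randn(2, 30, 128), torch.randn(2, 15, 128))
print(logits.shape)                                # torch.Size([2, 15, 5000])
```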
The source language text samples can be obtained from the pre-established source language text corpus, and the target language texts corresponding to them can be obtained by machine annotation or manual annotation.
After the source language word vector sequence sample is input into the initial text translation module and the target language predicted text output by the module is obtained, the difference between the target language predicted text and the target language annotated text can be computed with a cross entropy loss function; the parameters of the initial text translation module are updated backward according to this difference, and the text translation module meeting the fifth preset condition is obtained after multiple iterations, once the difference meets the convergence condition.
Through this embodiment, the second source language data can be fully reused in the training of the initial text translation module, improving the utilization rate of the source language data.
As shown in fig. 10, in one embodiment, step S902 includes:
S1001: obtaining a word vector sequence corresponding to the source language text sample;
S1002: randomly inserting repeated word vectors and/or blank marks into the word vector sequence.
For example, in step S1001, the source language text sample may be processed with a pre-established word vector coding table to obtain the word vector sequence corresponding to it.
In step S1002, the word vector sequence may be subjected to simulated sampling in the same or a similar manner as in step S801, which is not repeated here.
According to this embodiment, the word vector sequence corresponding to the source language text sample is randomly processed so that the input of the trained text translation module matches the output of the speech recognition model, improving the effect of combining the speech recognition model with the text translation module.
A method for generating a speech processing model according to an embodiment of the present disclosure is described below with reference to fig. 11 as a specific application example.
As shown in fig. 11, the method for generating the speech processing model specifically includes the following steps:
Step 1: generate source language word vectors. A certain amount (for example, more than 100,000 sentences) of source language text corpora is collected in advance to build a source language text corpus; the source language texts are subjected to symbol cleaning and normalization, including removing special symbols (for example, "@", "-", "%"), normalizing numbers and unit symbols (for example, "2010", "kg"), and word segmentation, to obtain clean text corpora in a unified format. Word vectors corresponding to each source language word are then obtained by training with a word vector training tool, such as Word2Vec (Word to Vector, a family of models for generating word vectors).
Step 2: pre-train the text predictor (pred). After the clean source language text corpus has been represented as word vectors, repetitions and blanks are inserted by simulated sampling, and the predictor (pred) of the speech recognition module is trained: it takes the current word vector as input and predicts the next word vector. The loss between the prediction result and the label is computed with a cross entropy loss function, and the predictor parameters are adjusted according to the loss value until a predictor meeting the convergence condition is obtained.
Step 3: pre-train the speech encoder. A certain amount of non-source language audio data is collected (for example, the open English data set LibriSpeech); the audio is framed, features are extracted, and normalization is performed by subtracting the mean and dividing by the standard deviation. The resulting audio features are input into the speech encoder, and the parameters of the speech encoder are updated backward by stochastic gradient descent according to the loss between the prediction result and the label under the CTC (or RNN-T) criterion, until a speech encoder meeting the convergence condition is obtained.
Step 4: fine-tune the speech encoder. A certain amount of source language audio and corresponding text annotations are collected, and training continues on the basis of the English speech encoder obtained in the previous step to obtain a source language speech encoder.
Step 5: jointly train the speech encoder and the predictor. The text predictor obtained in step 2 and the speech encoder obtained in step 4 are connected through a connection layer (Joint) and jointly trained with the training data from step 4. During the first several training iterations, only the weight parameters of the connection layer are adjusted.
Step 6: collect a certain amount (more than 100,000) of source-text/target-text translation data pairs and train a machine translation model, where repetition and blank marks are inserted into the input objects by simulated sampling.
Step 7: remove the input layer of the machine translation model obtained in the previous step, connect the output of the speech recognition model obtained in step 5 to the machine translation model, and train the speech translation model with the source-language-audio/source-text/target-text data set.
Step 8: test and inference. During training, the text predictor takes the corresponding annotated vectors as input; during test inference, its input is the history vector obtained by computation.
According to another aspect of the embodiment of the present disclosure, a speech processing method is also provided.
As shown in fig. 12, the speech processing method includes:
S1201: inputting source language audio to be processed into a speech processing model to obtain a target language text corresponding to the audio to be processed, wherein the speech processing model is obtained by the method for generating a speech processing model according to the embodiments of the present disclosure.
According to the speech processing method of the embodiments of the present disclosure, the speech processing model obtained by the above generation method can translate the source language audio to be processed into target language text, with relatively accurate translation results.
According to another aspect of the embodiment of the present disclosure, an apparatus for generating a speech processing model is also provided.
As shown in fig. 13, the apparatus for generating a speech processing model includes:
the speech coding module generation module 1301 is used for training the first initial network by using the non-source language data and the first source language data to obtain a speech coding module, wherein the speech coding module is used for outputting a corresponding source language word vector according to input source language audio;
the text prediction module generation module 1302 is used for training the second initial network by using the second source language data to obtain a text prediction module, wherein the text prediction module is used for outputting a corresponding associated word vector according to an input source language word vector;
and the speech processing model generation module 1303 is used for generating a speech processing model according to the speech coding module, the text prediction module and the text translation module, wherein the speech processing model is used for outputting a corresponding target language text according to input source language audio.
In one embodiment, the speech processing model generation module 1303 includes:
the third initial network generation submodule is used for connecting the output of the voice coding module and the output of the text prediction module to the input of the connection layer and connecting the output of the connection layer to the input of the text translation module to obtain a third initial network, wherein the input of the third initial network comprises the input of the voice coding module, and the output of the third initial network comprises the output of the text translation module;
and the third initial network training submodule is used for training the third initial network to obtain a voice processing model.
In one embodiment, the third initial network generation submodule, prior to connecting the output of the connection layer to the input of the text translation module, is further configured to:
connecting an output of the connection layer to an input of a text prediction module;
and jointly training the speech coding module, the text prediction module and the connection layer by using source language audio samples and corresponding source language annotated texts, until the output of the connection layer meets a third preset condition.
In one embodiment, the third initial network training sub-module is further configured to:
and training the third initial network by using the source language audio samples and the corresponding target language texts to obtain the speech processing model.
In one embodiment, the non-source language data includes non-source language audio samples and corresponding non-source language annotated word vectors, the first source language data includes source language audio samples and corresponding source language annotated word vectors, and the speech coding module generation module 1301 includes:
the candidate speech coding module generating unit is used for inputting the non-source language audio samples into the first initial network, and adjusting the parameters of the first initial network according to the non-source language predicted word vector output by the first initial network and the non-source language annotated word vector, until a candidate speech coding module meeting a first preset condition is obtained;
and the speech coding module generating unit is used for adjusting the parameters of the output layer of the candidate speech coding module, inputting the source language audio samples into the candidate speech coding module, and adjusting the parameters of the candidate speech coding module according to the source language predicted word vector output by the candidate speech coding module and the source language annotated word vector, until a speech coding module meeting a second preset condition is obtained.
In one embodiment, the candidate speech coding module generating unit is further configured to:
performing framing processing on the non-source language audio samples to obtain frame audio data;
carrying out feature extraction processing on the frame audio data to obtain audio features;
and normalizing the audio features, and inputting the audio features subjected to normalization into the first initial network.
In one embodiment, the second source language data includes a source language word vector sequence and corresponding source language annotated associated word vectors, and the text prediction module generation module 1302 includes:
the word vector sequence sample generation submodule is used for preprocessing the source language word vector sequence to obtain a word vector sequence sample;
and the text prediction module generation submodule is used for inputting the word vector sequence sample into the second initial network, and adjusting the parameters of the second initial network according to the difference between the source language predicted associated word vector output by the second initial network and the source language annotated associated word vector, until a text prediction module meeting a fourth preset condition is obtained.
In one embodiment, the word vector sequence sample generation sub-module is further configured to:
and randomly inserting repeated word vectors and/or blank marks into the source language word vector sequence.
In one embodiment, the generation module for generating the text translation module includes:
the obtaining submodule is used for obtaining a source language text sample and a corresponding target language annotated text;
the preprocessing submodule is used for preprocessing the source language text sample to obtain a source language word vector sequence sample;
and the text translation module generation submodule is used for inputting the source language word vector sequence sample into the initial text translation module, and adjusting the parameters of the initial text translation module according to the difference between the target language predicted text output by the initial text translation module and the target language annotated text, until a text translation module meeting a fifth preset condition is obtained.
In one embodiment, the pre-processing sub-module is further configured to:
obtaining a word vector sequence corresponding to a source language text sample;
repeated word vectors and/or blank marks are randomly inserted into the word vector sequence.
According to another aspect of the embodiment of the present disclosure, a speech processing apparatus is also provided.
As shown in fig. 14, the speech processing apparatus includes:
the input module 1401 is used for inputting source language audio to be processed into the speech processing model;
and the receiving module 1402 is used for receiving, from the speech processing model, the target language text corresponding to the audio to be processed, wherein the speech processing model is obtained by the apparatus for generating a speech processing model according to the above embodiments of the present disclosure.
In the technical solutions of the present disclosure, the collection, storage, and use of the personal information of the users involved comply with the provisions of relevant laws and regulations and do not violate public order or good morals.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 15 shows a schematic block diagram of an example electronic device 1500 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in FIG. 15, the device 1500 includes a computing unit 1501, which can perform various appropriate actions and processes according to a computer program stored in a Read-Only Memory (ROM) 1502 or a computer program loaded from a storage unit 1508 into a Random Access Memory (RAM) 1503. In the RAM 1503, various programs and data necessary for the operation of the device 1500 can also be stored. The computing unit 1501, the ROM 1502, and the RAM 1503 are connected to each other through a bus 1504. An input/output (I/O) interface 1505 is also connected to the bus 1504.
Various components in the device 1500 are connected to the I/O interface 1505, including: an input unit 1506 such as a keyboard, a mouse, and the like; an output unit 1507 such as various types of displays, speakers, and the like; a storage unit 1508, such as a magnetic disk, optical disk, or the like; and a communication unit 1509 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 1509 allows the device 1500 to exchange information/data with other devices via a computer network such as the Internet and/or various telecommunication networks.
The computing unit 1501 may be various general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The computing unit 1501 executes the respective methods and processes described above, such as the method of generating a speech processing model and/or the speech processing method. For example, in some embodiments, the method of generating a speech processing model and/or the speech processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1500 via the ROM 1502 and/or the communication unit 1509. When the computer program is loaded into the RAM 1503 and executed by the computing unit 1501, one or more steps of the method of generating a speech processing model and/or the speech processing method described above may be performed. Alternatively, in other embodiments, the computing unit 1501 may be configured to perform the method of generating a speech processing model and/or the speech processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server with a combined blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (25)

1. A method of generating a speech processing model, comprising:
training a first initial network by using non-source language data and first source language data to obtain a speech coding module, wherein the speech coding module is used for outputting a corresponding source language word vector according to input source language audio;
training a second initial network by using second source language data to obtain a text prediction module, wherein the text prediction module is used for outputting a corresponding associated word vector according to an input source language word vector;
and generating a speech processing model according to the speech coding module, the text prediction module and a text translation module, wherein the speech processing model is used for outputting a corresponding target language text according to the input source language audio.
2. The method of claim 1, wherein generating a speech processing model from the speech coding module, the text prediction module, and the text translation module comprises:
connecting the output of the speech coding module and the output of the text prediction module to the input of a connection layer, and connecting the output of the connection layer to the input of the text translation module to obtain a third initial network, wherein the input of the third initial network comprises the input of the speech coding module, and the output of the third initial network comprises the output of the text translation module;
and training the third initial network to obtain the speech processing model.
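Purely as an illustration of the wiring recited in claim 2, the following sketch composes stub modules through a connection layer. The concatenation-plus-linear fusion and all dimensions are assumptions, since the claim only specifies how inputs and outputs are connected, not how the connection layer fuses them.

```python
import torch
import torch.nn as nn

FEAT, D = 80, 256  # illustrative feature and model dimensions

class ThirdInitialNetwork(nn.Module):
    """Encoder and predictor outputs meet in a connection layer whose
    output drives the text translation module."""
    def __init__(self, encoder, predictor, translator):
        super().__init__()
        self.encoder, self.predictor, self.translator = encoder, predictor, translator
        self.connection = nn.Linear(2 * D, D)  # assumed: concatenation + linear

    def forward(self, source_audio_feats):      # (batch, time, FEAT)
        enc = self.encoder(source_audio_feats)  # (batch, time, D) word vectors
        pred = self.predictor(enc)               # (batch, time, D) associated vectors
        fused = self.connection(torch.cat([enc, pred], dim=-1))
        return self.translator(fused)            # target language logits

net = ThirdInitialNetwork(
    nn.Sequential(nn.Linear(FEAT, D), nn.ReLU()),  # stub speech coding module
    nn.Sequential(nn.Linear(D, D), nn.ReLU()),     # stub text prediction module
    nn.Linear(D, 8000),                            # stub text translation module
)
out = net(torch.randn(2, 100, FEAT))  # (2, 100, 8000)
```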
3. The method of claim 2, wherein prior to connecting the output of the connection layer to the input of the text translation module, further comprising:
connecting an output of the connection layer to an input of the text prediction module;
and performing joint training on the speech coding module, the text prediction module and the connection layer by using source language audio samples and corresponding source language tagged texts, until the output of the connection layer meets a third preset condition.
4. The method of claim 2, wherein training the third initial network to obtain the speech processing model comprises:
and training the third initial network by using the source language audio sample and the corresponding target language text to obtain the speech processing model.
5. The method of claim 1, wherein the non-source language data comprises non-source language audio samples and corresponding non-source language tagged word vectors, and the first source language data comprises source language audio samples and corresponding source language tagged word vectors;
training the first initial network by using the non-source language data and the first source language data to obtain a speech coding module, comprising:
inputting the non-source language audio sample into the first initial network, and adjusting parameters of the first initial network according to the non-source language predicted word vector and the non-source language tagged word vector output by the first initial network, until a candidate speech coding module meeting a first preset condition is obtained;
and adjusting output layer parameters of the candidate speech coding module, inputting the source language audio sample into the candidate speech coding module, and adjusting parameters of the candidate speech coding module according to the source language predicted word vector and the source language tagged word vector output by the candidate speech coding module, until the speech coding module meeting a second preset condition is obtained.
6. The method of claim 5, wherein inputting the non-source language audio sample into the first initial network comprises:
performing framing processing on the non-source language audio sample to obtain frame audio data;
performing feature extraction on the frame audio data to obtain audio features;
and normalizing the audio features and inputting the normalized audio features into the first initial network.
7. The method of claim 1, wherein the second source language data comprises a sequence of source language word vectors and corresponding source language tagged associated word vectors;
training the second initial network by using the second source language data to obtain a text prediction module, comprising:
preprocessing the source language word vector sequence to obtain a word vector sequence sample;
and inputting the word vector sequence sample into the second initial network, and adjusting parameters of the second initial network according to the source language predicted associated word vector and the source language tagged associated word vector output by the second initial network, until a text prediction module meeting a fourth preset condition is obtained.
8. The method of claim 7, wherein preprocessing the sequence of source language word vectors comprises:
and randomly inserting repeated word vectors and/or blank marks into the source language word vector sequence.
9. The method of claim 1, wherein the method of generating the text translation module comprises:
obtaining a source language text sample and a corresponding target language tagged text;
preprocessing the source language text sample to obtain a source language word vector sequence sample;
and inputting the source language word vector sequence sample into an initial text translation module, and adjusting parameters of the initial text translation module according to a target language predicted text and the target language tagged text output by the initial text translation module until a text translation module meeting a fifth preset condition is obtained.
10. The method of claim 9, wherein preprocessing the source language text sample comprises:
obtaining a word vector sequence corresponding to the source language text sample;
and randomly inserting repeated word vectors and/or blank marks into the word vector sequence.
11. A method of speech processing comprising:
inputting source language audio to be processed into a speech processing model to obtain a target language text corresponding to the audio to be processed; wherein the speech processing model is obtained by the method for generating a speech processing model according to any one of claims 1 to 10.
12. An apparatus for generating a speech processing model, comprising:
a speech coding module generation module, used for training a first initial network by using non-source language data and first source language data to obtain a speech coding module, wherein the speech coding module is used for outputting corresponding source language word vectors according to input source language audio;
a text prediction module generation module, used for training a second initial network by using second source language data to obtain a text prediction module, wherein the text prediction module is used for outputting a corresponding associated word vector according to an input source language word vector;
and a speech processing model generation module, used for generating a speech processing model according to the speech coding module, the text prediction module and a text translation module, wherein the speech processing model is used for outputting a corresponding target language text according to the input source language audio.
13. The apparatus of claim 12, wherein the speech processing model generation module comprises:
a third initial network generation sub-module, configured to connect an output of the speech coding module and an output of the text prediction module to an input of a connection layer, and connect an output of the connection layer to an input of the text translation module, so as to obtain a third initial network, where the input of the third initial network includes the input of the speech coding module, and the output of the third initial network includes the output of the text translation module;
and the third initial network training submodule is used for training the third initial network to obtain the speech processing model.
14. The apparatus of claim 13, wherein the third initial network generation submodule, prior to connecting the output of the connection layer to the input of the text translation module, is further to:
connecting an output of the connection layer to an input of the text prediction module;
and performing joint training on the speech coding module, the text prediction module and the connection layer by using source language audio samples and corresponding source language tagged texts, until the output of the connection layer meets a third preset condition.
15. The apparatus of claim 13, wherein the third initial network training submodule is further configured to:
and training the third initial network by using the source language audio sample and the corresponding target language text to obtain the speech processing model.
16. The apparatus of claim 12, wherein the non-source language data comprises non-source language audio samples and corresponding non-source language tagged word vectors, and the first source language data comprises source language audio samples and corresponding source language tagged word vectors;
the speech coding module generation module comprises:
a candidate speech coding module generating unit, configured to input the non-source language audio sample into the first initial network, and adjust parameters of the first initial network according to the non-source language predicted word vector and the non-source language tagged word vector output by the first initial network until a candidate speech coding module meeting a first preset condition is obtained;
and a speech coding module generating unit, used for adjusting the output layer parameters of the candidate speech coding module, inputting the source language audio sample into the candidate speech coding module, and adjusting the parameters of the candidate speech coding module according to the source language predicted word vector and the source language tagged word vector output by the candidate speech coding module, until the speech coding module meeting a second preset condition is obtained.
17. The apparatus of claim 16, wherein the candidate speech coding module generating unit is further configured to:
performing framing processing on the non-source language audio sample to obtain frame audio data;
performing feature extraction on the frame audio data to obtain audio features;
and normalizing the audio features and inputting the normalized audio features into the first initial network.
18. The apparatus of claim 12, wherein said second source language data comprises a sequence of source language word vectors and corresponding source language tagged associated word vectors;
the text prediction module generation module comprises:
the word vector sequence sample generation submodule is used for preprocessing the source language word vector sequence to obtain a word vector sequence sample;
and the text prediction module generation submodule is used for inputting the word vector sequence sample into the second initial network, and adjusting parameters of the second initial network according to the source language predicted associated word vector and the source language tagged associated word vector output by the second initial network, until a text prediction module meeting a fourth preset condition is obtained.
19. The apparatus of claim 18, wherein the word vector sequence sample generation sub-module is further configured to:
and randomly inserting repeated word vectors and/or blank marks into the source language word vector sequence.
20. The apparatus of claim 12, wherein the means for generating the text translation module comprises:
the obtaining submodule is used for obtaining a source language text sample and a corresponding target language tagged text;
the preprocessing submodule is used for preprocessing the source language text sample to obtain a source language word vector sequence sample;
and the text translation module generation submodule is used for inputting the source language word vector sequence sample into an initial text translation module, and adjusting the parameters of the initial text translation module according to a target language predicted text and the target language tagged text output by the initial text translation module until a text translation module meeting a fifth preset condition is obtained.
21. The apparatus of claim 20, wherein the pre-processing sub-module is further to:
obtaining a word vector sequence corresponding to the source language text sample;
and randomly inserting repeated word vectors and/or blank marks into the word vector sequence.
22. A speech processing apparatus comprising:
an input module, used for inputting source language audio to be processed into a speech processing model;
and a receiving module, used for receiving, from the speech processing model, a target language text corresponding to the audio to be processed; wherein the speech processing model is obtained by using the apparatus for generating a speech processing model according to any one of claims 12 to 21.
23. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 11.
24. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1 to 11.
25. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 11.
CN202111255342.4A 2021-10-27 2021-10-27 Method, device and equipment for generating voice processing model and storage medium Pending CN113763937A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111255342.4A CN113763937A (en) 2021-10-27 2021-10-27 Method, device and equipment for generating voice processing model and storage medium

Publications (1)

Publication Number Publication Date
CN113763937A (en) 2021-12-07

Family

ID=78784535

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111255342.4A Pending CN113763937A (en) 2021-10-27 2021-10-27 Method, device and equipment for generating voice processing model and storage medium

Country Status (1)

Country Link
CN (1) CN113763937A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190179908A1 (en) * 2016-09-23 2019-06-13 Panasonic Intellectual Property Management Co., Ltd. Translation device and translation system
CN109582982A (en) * 2018-12-17 2019-04-05 北京百度网讯科技有限公司 Method and apparatus for translated speech
US20200192986A1 (en) * 2018-12-17 2020-06-18 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for translating speech
CN109785824A (en) * 2019-03-15 2019-05-21 科大讯飞股份有限公司 A kind of training method and device of voiced translation model
US20210027784A1 (en) * 2019-07-24 2021-01-28 Alibaba Group Holding Limited Translation and speech recognition method, apparatus, and device
CN111326157A (en) * 2020-01-20 2020-06-23 北京字节跳动网络技术有限公司 Text generation method and device, electronic equipment and computer readable medium
CN112800782A (en) * 2021-01-29 2021-05-14 中国科学院自动化研究所 Text semantic feature fused voice translation method, system and equipment
CN113362810A (en) * 2021-05-28 2021-09-07 平安科技(深圳)有限公司 Training method, device and equipment of voice processing model and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG NINGNING ET AL.: "End-to-End Mongolian-Chinese Speech Translation Based on a Multi-Level Pre-training Strategy and Multi-Task Learning", Proceedings of the 20th China National Conference on Computational Linguistics (CCL 2021), pages 1-10 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114398952A (en) * 2021-12-14 2022-04-26 北京百度网讯科技有限公司 Training text generation method and device, electronic equipment and storage medium
CN114398952B (en) * 2021-12-14 2023-05-05 北京百度网讯科技有限公司 Training text generation method and device, electronic equipment and storage medium
CN114896993A (en) * 2022-05-06 2022-08-12 北京百度网讯科技有限公司 Translation model generation method and device, electronic equipment and storage medium
CN114896993B (en) * 2022-05-06 2023-03-24 北京百度网讯科技有限公司 Translation model generation method and device, electronic equipment and storage medium
CN116450771A (en) * 2022-12-16 2023-07-18 镁佳(北京)科技有限公司 Multilingual speech translation model construction method and device

Similar Documents

Publication Publication Date Title
CN113705187B (en) Method and device for generating pre-training language model, electronic equipment and storage medium
CN112466288B (en) Voice recognition method and device, electronic equipment and storage medium
CN113763937A (en) Method, device and equipment for generating voice processing model and storage medium
JP7346788B2 (en) Speech recognition model training methods, devices, equipment, and storage media
EP4131076A1 (en) Serialized data processing method and device, and text processing method and device
CN114416943B (en) Training method and device for dialogue model, electronic equipment and storage medium
CN113553412B (en) Question-answering processing method, question-answering processing device, electronic equipment and storage medium
CN115309877A (en) Dialog generation method, dialog model training method and device
CN112559885A (en) Method and device for determining training model of map interest point and electronic equipment
CN116152833B (en) Training method of form restoration model based on image and form restoration method
EP4170542A2 (en) Method for sample augmentation
CN115688920A (en) Knowledge extraction method, model training method, device, equipment and medium
CN112632227A (en) Resume matching method, resume matching device, electronic equipment, storage medium and program product
CN113743101A (en) Text error correction method and device, electronic equipment and computer storage medium
CN114218940B (en) Text information processing and model training method, device, equipment and storage medium
CN113160820A (en) Speech recognition method, and training method, device and equipment of speech recognition model
CN116502649A (en) Training method and device for text generation model, electronic equipment and storage medium
EP4024393A2 (en) Training a speech recognition model
CN114328956B (en) Text information determination method and device, electronic equipment and storage medium
CN113361522B (en) Method and device for determining character sequence and electronic equipment
CN115860003A (en) Semantic role analysis method and device, electronic equipment and storage medium
CN115357710A (en) Training method and device for table description text generation model and electronic equipment
CN115359323A (en) Image text information generation method and deep learning model training method
CN114841172A (en) Knowledge distillation method, apparatus and program product for text matching double tower model
CN114201953A (en) Keyword extraction and model training method, device, equipment and storage medium

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
WD01: Invention patent application deemed withdrawn after publication (application publication date: 20211207)