CN114373480A - Training method of voice alignment network, voice alignment method and electronic equipment - Google Patents


Info

Publication number: CN114373480A
Application number: CN202111550130.9A
Authority: CN (China)
Prior art keywords: state, time frame, sequence, alignment, current time
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 张斌 (Zhang Bin)
Current and original assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd, with priority to CN202111550130.9A

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00–G10L 21/00
    • G10L 25/48 — Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/27 — Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 — Speech or voice analysis techniques using neural networks
    • G10H — ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H 2210/00 — Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H 2210/031 — Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H 2210/051 — Musical analysis for extraction or detection of onsets of musical sounds or notes, i.e. note attack timings

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a training method for a speech alignment network, comprising the following steps: acquiring a labeled text sequence and an audio feature sequence; inputting the audio feature sequence into an encoder network to obtain an audio feature encoding sequence; obtaining a value of an alignment loss function based on the audio feature encoding sequence and the labeled text sequence; inputting the labeled text sequence and the audio feature encoding sequence into an attention-based decoder network to obtain an audio feature decoding sequence; obtaining a value of an attention loss function based on the audio feature decoding sequence and the labeled text sequence; and cyclically training the encoder network and the decoder network based on the values of the two loss functions until the stopping conditions are met, then outputting the encoder network. The trained network model can be deployed on a mobile phone to realize speech alignment with good timeliness. A speech alignment method and a corresponding electronic device and computer-readable storage medium are also provided.

Description

Training method of voice alignment network, voice alignment method and electronic equipment
Technical Field
The present application relates to the field of speech processing technology, and in particular to a training method for a speech alignment network and a speech alignment method. The present application further relates to a corresponding electronic device and computer-readable storage medium.
Background
In the field of digital music, lyric timestamp technology marks the start and end time of each word and of each line of the lyrics, finally yielding a QRC (a lyric file type) timestamped lyric text file. In the prior art, an acoustic alignment model based on music-genre perception modifies the pronunciation dictionary according to the genre information of a song, since sung and spoken pronunciations differ (mainly in speech rate, which determines the length of each phoneme), and aligns the lyric content directly to the accompanied music audio to obtain the start and end time of each character. Because of its numerous modules, complex training and inference, and high computing-resource demands, the existing acoustic alignment model is difficult to deploy on a mobile phone client and is mostly deployed on the server back end, so alignment accuracy is high but latency is long. The user must upload the audio and lyric files to the background server; the alignment file is then generated by the server's alignment service and returned asynchronously. This processing mode is accurate but has poor timeliness and limited scene applicability.
This background description is provided to facilitate understanding of the relevant art in the field and is not to be construed as an admission that it constitutes prior art.
Disclosure of Invention
Therefore, embodiments of the invention provide a scheme for quickly realizing speech alignment on a mobile phone. A lightweight end-to-end neural network is used that can be deployed directly on the mobile phone: the user obtains the timestamped text of the corresponding aligned lyrics in real time simply by selecting the audio and the corresponding lyrics, a processing mode that saves traffic and time costs.
The embodiment of the invention provides a training method for a speech alignment network, comprising the following steps:
acquiring a labeled text sequence and an audio feature sequence;
inputting the audio feature sequence into an encoder network to obtain an audio feature encoding sequence;
acquiring a value of an alignment loss function based on the audio feature encoding sequence and the labeled text sequence;
inputting the labeled text sequence and the audio feature encoding sequence into an attention-based decoder network to obtain an audio feature decoding sequence;
acquiring a value of an attention loss function based on the audio feature decoding sequence and the labeled text sequence;
if the value of the alignment loss function is greater than a first loss threshold or the value of the attention loss function is greater than a second loss threshold, iteratively updating the encoder network and the decoder network based on the values of the two loss functions, and returning to the step of acquiring the labeled text sequence and the audio feature sequence, until the value of the alignment loss function is less than or equal to the first loss threshold and the value of the attention loss function is less than or equal to the second loss threshold;
and taking the last-updated encoder network as the speech alignment network.
In some embodiments of the invention, obtaining the value of the alignment loss function includes:
summing the probabilities of all alignment distributions between the audio feature encoding sequence and the labeled text sequence to obtain a probability sum;
and taking the negative logarithm of the probability sum as the value of the alignment loss function.
In some embodiments of the invention, the encoder network is a lightweight end-to-end network.
In some embodiments of the invention, the labeled text sequence comprises a sequence of labeled text in units of words or a sequence of labeled text in units of phonemes.
The embodiment of the invention also provides a speech alignment method, comprising the following steps:
acquiring a target text sequence and a target audio, where the target text sequence is the lyric text sequence of the target audio;
inputting the audio feature sequence of the target audio into the speech alignment network generated by any one of the above training methods to obtain a target audio feature encoding sequence;
and aligning the target text sequence with the target audio based on the target audio feature encoding sequence.
In some embodiments of the invention, the target audio feature encoding sequence is the posterior probability distribution of each word or phoneme of the target text sequence at each time instant.
In some embodiments of the present invention, aligning the target text sequence with the target audio based on the target audio feature encoding sequence comprises:
generating, based on the posterior probability distribution, an optimal time distribution of each word or phoneme of the target text sequence over the time frame sequence of the target audio, so as to align the target text sequence with the target audio.
In some embodiments of the present invention, generating the optimal time distribution of each word or phoneme of the target text sequence over the time frame sequence of the target audio based on the posterior probability distribution includes:
distributing each word or phoneme of the target text sequence over the time frame sequence of the target audio in all possible ways to generate the full set of distribution sequences;
calculating, for each distribution sequence in the full set, the sum of the posterior probabilities of each word or phoneme at its assigned times;
and taking the alignment distribution sequence with the maximum posterior probability sum as the optimal time distribution.
In some embodiments of the present invention, generating the optimal time distribution of each word or phoneme of the target text sequence over the time frame sequence of the target audio based on the posterior probability distribution includes:
initializing the time frame sequence of an alignment distribution according to the time frame sequence of the target audio;
inserting a placeholder before the first character or first phoneme of the target text sequence, after the last character or last phoneme, and between adjacent characters or phonemes, to form a state sequence;
setting the state of the first time frame in the alignment-distribution time frame sequence to a placeholder, or to the first character or first phoneme, of the state sequence;
setting the state of the final time frame in the alignment-distribution time frame sequence to a placeholder, or to the last character or last phoneme, of the state sequence;
obtaining alignment distribution sequences based on the state sequence and on the states of the first and final time frames in the alignment-distribution time frame sequence;
calculating the posterior probability sum of each alignment distribution based on the posterior probabilities of the states in the alignment distribution sequence;
and taking the alignment distribution with the maximum posterior probability sum as the optimal time distribution.
In some embodiments of the present invention, obtaining the alignment distribution sequences based on the state sequence and on the states of the first and final time frames comprises:
based on the state sequence, recursively setting the state of the previous time frame in the alignment distribution according to the state of the current time frame, until the state of the first time frame is set to a placeholder or to the first character or first phoneme of the state sequence, thereby obtaining the alignment distribution sequences.
In some embodiments of the present invention, this recursion includes:
if the state of the current time frame is a placeholder, setting the state of the previous time frame, based on the state sequence, to be the same as the current state or to be the state immediately preceding the current state;
if the state of the current time frame is a non-placeholder and is the same as the state two positions earlier in the state sequence, setting the state of the previous time frame to be the same as the current state or to be the state immediately preceding it;
if the state of the current time frame is a non-placeholder and differs from the state two positions earlier, setting the state of the previous time frame to be the same as the current state, to be the state immediately preceding it, or to be the state two positions earlier;
and recursively executing the above steps until the state of the first time frame is set to a placeholder or to the first character or first phoneme of the state sequence, thereby obtaining the alignment distribution sequences.
In some embodiments of the present invention, generating the optimal time distribution of each word or phoneme of the target text sequence over the time frame sequence of the target audio based on the target audio feature encoding sequence includes:
initializing the time frame sequence of an alignment distribution according to the time frame sequence of the target audio;
inserting a placeholder before the first character or first phoneme of the target text sequence, after the last character or last phoneme, and between adjacent characters or phonemes, to form a state sequence;
setting the state of the first time frame in the alignment-distribution time frame sequence to a placeholder, or to the first character or first phoneme, of the state sequence;
setting the state of the final time frame in the alignment-distribution time frame sequence to a placeholder, or to the last character or last phoneme, of the state sequence;
based on the state sequence and the current time frame of the alignment distribution, recursively setting the state of the current time frame according to the state of the previous time frame, until the state set for the final time frame is a placeholder or the last character or last phoneme of the state sequence, thereby obtaining the alignment distribution sequences;
calculating the posterior probability sum of each alignment distribution based on the posterior probabilities of the states in that alignment distribution;
and taking the alignment distribution with the maximum posterior probability sum as the optimal time distribution.
In some embodiments of the present invention, recursively setting the state of the current time frame according to the state of the previous time frame, based on the state sequence, until the state set for the final time frame is a placeholder or the last character or last phoneme of the state sequence, includes:
arbitrarily setting a candidate state for the current time frame based on the state sequence;
if the candidate state is a placeholder, verifying that the state of the previous time frame is the same as the candidate state or is the state immediately preceding it in the state sequence; if so, the candidate state is within the target range;
if the candidate state is a non-placeholder and is the same as the state two positions earlier in the state sequence, verifying that the state of the previous time frame is the same as the candidate state or is the state immediately preceding it; if so, the candidate state is within the target range;
if the candidate state is a non-placeholder and differs from the state two positions earlier, verifying that the state of the previous time frame is the same as the candidate state, is the state immediately preceding it, or is the state two positions earlier; if so, the candidate state is within the target range;
acquiring, based on the state of the previous time frame, the state with the highest posterior probability within the target range as the target state, and setting the state of the current time frame according to the target state;
and recursively executing the above steps until the state set for the final time frame is a placeholder or the last character or last phoneme of the state sequence, thereby obtaining the alignment distribution sequences.
In an embodiment of the present invention, a computer-readable storage medium is provided on which a computer program is stored; when executed by a processor, the program implements the training method for a speech alignment network or the speech alignment method of any embodiment of the present invention.
In an embodiment of the present invention, an electronic device is provided, comprising a processor and a memory storing a computer program, the processor being configured to perform the training method for a speech alignment network or the speech alignment method of any embodiment of the present invention when the computer program is run.
The trained end-to-end speech alignment network used in the embodiments of the invention is particularly suitable for deployment on a mobile phone. In addition, in the speech alignment method of the embodiments of the present invention, the audio feature sequence and the corresponding lyric text sequence are input, the target audio feature encoding sequence is obtained with the speech alignment network, and a CTC (Connectionist Temporal Classification) alignment algorithm may optionally be used to solve for the optimal path. This yields the modeling unit (word or phoneme) to which each frame belongs, hence the start frame of each word, and finally the start time of each word, from which the word timestamps are generated, saving traffic and time costs.
Additional optional features and technical effects of embodiments of the invention are set forth, in part, in the description which follows and, in part, will be apparent from the description.
Drawings
Embodiments of the invention are described in detail below with reference to the accompanying drawings, in which the elements shown are not to scale and like or similar reference numerals denote like or similar elements:
FIG. 1 shows a schematic diagram of an alignment example;
FIG. 2a shows an example flow chart of a training method for a speech alignment network according to an embodiment of the present invention;
FIG. 2b shows an example data flow diagram of a training method for a speech alignment network according to an embodiment of the present invention;
FIG. 3a shows an example flow chart of a speech alignment method according to an embodiment of the invention;
FIG. 3b shows an example data flow diagram of a speech alignment method according to an embodiment of the present invention;
FIG. 4a shows an example flow chart of the alignment step in a speech alignment method according to an embodiment of the present invention;
FIG. 4b shows another example flow chart of the alignment step in a speech alignment method according to an embodiment of the present invention;
FIG. 4c shows another example flow chart of the alignment step in a speech alignment method according to an embodiment of the present invention;
FIG. 5a shows an example flow chart of a recursive method for the alignment step in the speech alignment method according to an embodiment of the present invention;
FIG. 5b shows an example diagram of the recursive process of the alignment step in the speech alignment method according to an embodiment of the present invention;
FIG. 5c shows an example diagram of the recursive process of the alignment step in the speech alignment method according to an embodiment of the present invention;
FIG. 6a shows an example flow chart of another recursive method for the alignment step in a speech alignment method according to an embodiment of the present invention;
FIG. 6b shows another example diagram of the recursive process of the alignment step in the speech alignment method according to an embodiment of the present invention;
FIG. 7a shows an example graph of word-based and phoneme-based modeling output;
FIG. 7b shows an example graph of word-based and phoneme-based modeling output;
FIG. 7c shows an example graph of word-based and phoneme-based modeling output;
FIG. 8 shows an example block diagram of a training apparatus for a speech alignment network according to an embodiment of the present invention;
FIG. 9 shows an example block diagram of a speech alignment apparatus according to an embodiment of the present invention;
FIG. 10 shows an example structural schematic diagram of an electronic device capable of implementing a method according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following detailed description and accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
In the embodiments of the invention, a speech recognition model may be determined that takes a speech sequence as input and outputs the corresponding recognized text. In the embodiments of the present invention, "alignment" means: given a speech recognition model, inputting a speech sequence and the correct labeled text (which may be a character sequence, a phoneme sequence, etc.) and giving the correspondence between speech frames and text; for example, frames 0 to 10 correspond to the phoneme 'm', as shown in fig. 1.
An acoustic alignment model comprises an acoustic model, a pronunciation dictionary, and an alignment algorithm for obtaining the alignment sequence. Some known schemes adopt an HMM-based hybrid acoustic alignment model, which has numerous modules, complex training and inference processes, and high computing-resource requirements; it is therefore difficult to deploy on a mobile phone client and is mostly deployed on the server back end.
The embodiment of the invention provides an end-to-end lyrics-to-audio alignment method mainly applied to clients such as mobile phones. The method trains a pure neural network and solves for the optimal path with the CTC Viterbi algorithm to obtain phoneme-level alignment (by training a phoneme-level acoustic model) or word-level alignment (by training a word-level acoustic model); the neural network model for audio alignment obtained by training can thus be conveniently and quickly deployed on the mobile phone to align the lyrics to the audio. The neural network model is a Transformer based on the Attention Encoder-Decoder (AED) mechanism, trained with joint CTC/AED learning, and the alignment function is finally completed by the CTC decoder part: for a trained model, given a piece of audio and the corresponding labeled lyric text, the alignment of speech and text is obtained from the scoring information in the CTC decoder part. Specifically, an end-to-end speech recognition network is trained with a neural network; then only the encoder module is used to output the probability scores of the modeling units for each frame, and the CTC Viterbi alignment algorithm yields the start and end times of the modeling units, and hence their timestamps, realizing the alignment function, where the modeling units are words or phonemes.
The speech alignment network in the embodiments of the invention can use a Transformer network, trained with a hybrid CTC/attention joint training mechanism; the network output can use various modeling units. The model here uses a word-level end-to-end method, so word-level alignment is obtained; if a phoneme-level alignment function is required, a phoneme-level model must be trained.
Specifically, as shown in figs. 2a and 2b, an embodiment of the present invention provides a method for training a speech alignment network, comprising the following steps:
S110: acquire a labeled text sequence and an audio feature sequence.
The labeled text is converted through a dictionary to generate the labeled text sequence. Audio data are collected, time-domain or frequency-domain features are computed, and one audio feature is generated for each time frame; arranging the audio features in time-frame order yields the audio feature sequence.
S120: input the audio feature sequence into an encoder network to obtain an audio feature encoding sequence.
In the embodiment of the invention, an encoder-decoder network architecture is adopted, and the encoder network includes a Transformer network. In some embodiments of the present invention, the encoder network is a lightweight end-to-end network, which facilitates subsequent deployment of the network on a mobile phone, realizes localized recognition and alignment, and reduces network transmission time.
In the embodiment of the present invention, the audio feature encoding sequence may include an encoding of audio feature representations such as fbank (filter bank) features or MFCCs (Mel-frequency cepstral coefficients). In a more specific embodiment, the audio feature encoding sequence may be the posterior probability (distribution), over each modeling unit such as a word or phoneme (depending, as before, on the level of the modeling unit), of the audio features (frames) such as fbank or MFCC obtained after audio processing.
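As an illustration only, a minimal sketch of frame-level feature extraction in Python (the file name and the 80-dimensional fbank with 25 ms windows and a 10 ms shift are assumptions, not values from this application):

```python
import torchaudio

# Hypothetical input file; any mono waveform is handled the same way.
waveform, sample_rate = torchaudio.load("song.wav")

# One fbank feature vector per 10 ms time frame (parameter values assumed).
feats = torchaudio.compliance.kaldi.fbank(
    waveform,
    sample_frequency=sample_rate,
    num_mel_bins=80,     # feature dimension per frame
    frame_length=25.0,   # analysis window, ms
    frame_shift=10.0,    # hop, ms -> defines the time frame sequence
)
print(feats.shape)       # (T, 80): the audio feature sequence over T frames
```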
S130: acquire the value of the alignment loss function based on the audio feature encoding sequence and the labeled text sequence.
The value of the alignment loss function is generated from the audio feature encoding sequence and the labeled text sequence; the alignment loss function in the embodiment of the invention may be the CTC loss, defined as L(S) = -ln Σ_{(x,z)∈S} p(z|x), where p(z|x) represents the probability of outputting the sequence z given the input x, and S is the training set. The loss function can be interpreted as follows: the sum of the probabilities of outputting the correct sequence (i.e., the labeled text sequence) for a given sample, with the negative logarithm taken, is the CTC loss. In some embodiments of the invention, obtaining the value of the alignment loss function comprises: summing the probabilities of all alignment distributions between the audio feature encoding sequence and the labeled text sequence to obtain a probability sum, and taking the negative logarithm of the probability sum as the value of the alignment loss function.
S140: input the labeled text sequence and the audio feature encoding sequence into an attention-based decoder network to obtain an audio feature decoding sequence.
The invention trains the encoder network and the decoder network with an attention mechanism, and training converges faster. The attention mechanism imitates human attention to solve the problem of quickly screening high-value information out of a large amount of information. It addresses the difficulty of obtaining a reasonable final vector representation when the input sequence of the decoder network is long: the output of the encoder network is retained, and the model containing the attention mechanism learns from that output and screens the information.
S150: acquire the value of the attention loss function based on the audio feature decoding sequence and the labeled text sequence.
The value of the attention loss function is calculated from the audio feature decoding sequence and the labeled text sequence; the attention loss function may be the softmax loss or a variant of it. In the embodiment of the invention, the decoder network output is the speech recognition result, and the attention loss function is introduced to help train the encoder network to have speech recognition characteristics, thereby ensuring the alignment effect.
S160: if the value of the alignment loss function is greater than the first loss threshold and/or the value of the attention loss function is greater than the second loss threshold, iteratively update the encoder network and the decoder network based on the values of the two loss functions, and return to step S110, until the value of the alignment loss function is less than or equal to the first loss threshold and the value of the attention loss function is less than or equal to the second loss threshold.
When updating the encoder network and the decoder network, the network parameters are updated by computing the gradient of the loss.
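A sketch of one joint update step under these two losses, reusing the `ctc` criterion above; `encoder`, `decoder`, and the batch tensors are hypothetical stand-ins, and the weighted sum with weight `lam` is one common hybrid CTC/attention formulation, whereas the method above gates each loss against its own threshold:

```python
import torch
import torch.nn.functional as F

lam = 0.3   # assumed CTC weight in the joint objective
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)

enc_out = encoder(audio_feats)                 # audio feature encoding seq
align_loss = ctc(enc_out.log_softmax(-1).transpose(0, 1),
                 targets, input_lengths, target_lengths)

dec_out = decoder(targets_in, enc_out)         # attention-based decoder
att_loss = F.cross_entropy(dec_out.reshape(-1, dec_out.size(-1)),
                           targets_out.reshape(-1))

loss = lam * align_loss + (1 - lam) * att_loss
optimizer.zero_grad()
loss.backward()      # gradient of the joint loss
optimizer.step()     # update the network parameters
```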
S170: take the last-updated encoder network as the speech alignment network.
The training method in the embodiment of the invention trains with both the alignment loss function and the attention loss function, so that the speech alignment network acquires both alignment and speech recognition characteristics.
In addition, although both loss functions are used during training, only the encoder network is retained as the speech alignment network. The trained speech alignment network is therefore a lightweight end-to-end network that can be deployed on the front end of a mobile phone, realizing localized alignment, reducing network transmission time and latency, and enabling richer application scenarios.
In some embodiments, the trained decoder network may also be used to output speech recognition results, e.g., for other usage scenarios.
In some embodiments of the invention, the labeled text sequence comprises a sequence of labeled text in units of words or in units of phonemes. Correspondingly, the trained speech alignment network can align word-level sequences or phoneme-level sequences.
Based on the speech recognition network obtained by the training method of the present invention, an embodiment of the present invention further provides a speech alignment method, which on the one hand uses the speech alignment network trained according to the embodiments of the present invention, and on the other hand may optionally apply CTC alignment processing to the features output by the network, selecting an alignment result by scoring.
Specifically, as shown in figs. 3a and 3b, an embodiment of the present invention provides a speech alignment method, comprising the following steps:
S210: acquire a target text sequence and a target audio, where the target text sequence is the lyric text sequence of the target audio.
The target text sequence is generated by converting the target text through a dictionary.
S220: input the audio feature sequence of the target audio into the speech alignment network generated by any of the above training methods to obtain the target audio feature encoding sequence.
The corresponding audio feature sequence is generated from the target audio, for example via an FFT (fast Fourier transform).
S230: align the target text sequence with the target audio based on the target audio feature encoding sequence.
The alignment procedure maps the individual characters or phonemes of the target text sequence to the individual time frames of the target audio.
Denote the target audio feature encoding sequence X = [x1, x2, x3, ..., xt, ..., xT] (usually characterized by audio features such as fbank or MFCC, with T the number of time frames of the target audio) and the target text sequence Y = [y1, y2, y3, ..., yu, ..., yU]. The alignment task is to map the words or phonemes of the target text sequence to the time frames of the target audio, where the length of X is generally greater than the length of Y. If the correspondence between yu and xt were known, the task would become a classification task at the speech-frame level: classify xt at each time instant to obtain yu.
In some preferred embodiments of the present invention, the units of the target text sequence are duplicated and placeholder (blank) symbols are inserted to handle tasks, such as speech recognition, where the input sequence is longer than the output sequence.
Specifically, in the CTC model of an embodiment of the present invention, the following algorithm is used to generate all possible extended sequences of the target text, where N is the output sequence length and T is the input sequence length.
The expansion proceeds as follows:
generate b0 blank placeholders;
for n = 1 to N:
generate the nth token tn times;
generate bn blank placeholders;
where the tn and bn satisfy the constraint b0 + t1 + b1 + t2 + b2 + ... + tN + bN = T.
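A minimal sketch of this expansion as a Python generator; it assumes, as the alignment of a given text requires, that each token is generated at least once (tn >= 1), while every bn may be zero:

```python
def expansions(tokens, T, blank="<b>"):
    """Yield every frame-level expansion of `tokens` (length N) of total
    length T: b0 blanks, token 1 repeated t1 times, b1 blanks, ...,
    subject to b0 + t1 + b1 + ... + tN + bN = T."""
    N = len(tokens)

    def rec(i, remaining, prefix):
        if i == N:                          # whatever remains is bN blanks
            yield prefix + [blank] * remaining
            return
        tokens_left = N - i
        for b in range(remaining - tokens_left + 1):            # bn >= 0
            for t in range(1, remaining - b - (tokens_left - 1) + 1):  # tn >= 1
                yield from rec(i + 1, remaining - b - t,
                               prefix + [blank] * b + [tokens[i]] * t)

    yield from rec(0, T, [])

# list(expansions(list("cat"), 5)) enumerates all 5-frame expansions of c-a-t
```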
Different modeling units have different dictionaries in embodiments of the invention; a modeling unit may be used in place of a token.
In the embodiment of the present invention, the speech alignment relationship is obtained by finding the CTC alignment distribution with the largest probability: a target audio feature encoding sequence is generated with the trained model, all CTC alignment distributions obtainable from it are scored, and the highest-scoring distribution is selected. The present inventors note that exhaustive search has exponential time complexity; in further embodiments this is avoided by reducing the time complexity with the Viterbi algorithm, as described further below.
In some embodiments, each word or phoneme of the target text sequence can correspond to the time frames in different ways, forming different word/phoneme-to-time-frame orderings; the number of orderings can be large, and not every ordering is optimal. In such embodiments the score of each ordering can be computed exhaustively, one by one, the highest-scoring ordering being optimal. For example, with time frames t1, t2, t3, t4, t5 and text sequence c-a-t, one ordering maps c to t1, t2; a to t3, t4; and t to t5. Another maps c to t1; a to t2, t3, t4; and t to t5; and so on. If the ordering mapping c to t1; a to t2, t3; and t to t4, t5 has the highest score, it is the optimal ordering.
As before, in some embodiments of the invention the CTC Viterbi method can quickly select the ordering, reducing alignment time. It can be implemented with the known Viterbi optimal-path algorithm: for the CTC speech recognition task, the path choice at each step corresponds to the states selectable at each time frame, i.e., a word or the placeholder blank under word-level modeling, or a phoneme or the placeholder blank under phoneme-level modeling.
In some embodiments of the present invention, as before, the target audio feature encoding sequence may be the posterior probability distribution of each word or phoneme of the target text sequence at each time instant, and the computed score may be related to this posterior probability; thus, in some embodiments, the score of a word or phoneme at a given time frame is expressed as a posterior probability.
In some embodiments of the present invention, aligning the target text sequence with the target audio based on the target audio feature encoding sequence comprises: generating, based on the posterior probability distribution, the optimal time distribution of each word or phoneme of the target text sequence over the time frame sequence of the target audio. The optimal time distribution is the sequence formed by distributing each word or phoneme of the target text sequence over the time frames such that the total score, or total posterior probability, is highest. More specifically, as shown in fig. 4a, generating the optimal time distribution of each word or phoneme of the target text sequence over the time frame sequence of the target audio includes:
S231a: distribute each word or phoneme of the target text sequence over the time frame sequence of the target audio in all possible ways to generate the full set of distribution sequences;
S232a: for each distribution sequence in the full set, calculate from the target audio feature encoding sequence the sum of the posterior probabilities of each word or phoneme at its assigned times;
S233a: take the distribution sequence with the maximum posterior probability sum as the optimal time distribution.
In some examples, the time frame sequence includes t1, t2, t3, t4, t5 and the text sequence is c-a-t. One distribution maps c to t1, t2 (posterior probability 0.3), a to t3, t4 (posterior probability 0.1), and t to t5 (posterior probability 0.1), giving a posterior probability sum of 0.5. Another maps c to t1 (posterior probability 0.8), a to t2, t3, t4 (posterior probability 0.1), and t to t5 (posterior probability 0.1), giving a sum of 1.0. In some embodiments a third distribution maps c to t1 (0.8), a to t2, t3 (0.8), and t to t4, t5 (0.8); this distribution has the maximum posterior probability sum, so it is the optimal time distribution.
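Continuing the c-a-t example, a brute-force scorer over the expansions generated above; the posterior matrix values are illustrative only, and the summed-posterior score follows the description here (standard CTC instead multiplies frame probabilities):

```python
import numpy as np

# Illustrative posteriors for T=5 frames over states <b>, c, a, t.
post = np.array([
    [0.1, 0.80, 0.05, 0.05],  # t1: 'c' most likely
    [0.1, 0.10, 0.70, 0.10],  # t2: 'a'
    [0.1, 0.10, 0.70, 0.10],  # t3: 'a'
    [0.1, 0.05, 0.05, 0.80],  # t4: 't'
    [0.1, 0.05, 0.05, 0.80],  # t5: 't'
])
idx = {"<b>": 0, "c": 1, "a": 2, "t": 3}

best_path, best_score = None, float("-inf")
for path in expansions(list("cat"), 5):
    score = sum(post[f, idx[s]] for f, s in enumerate(path))
    if score > best_score:
        best_path, best_score = path, score

print(best_path)  # ['c', 'a', 'a', 't', 't'] for these made-up posteriors
```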
As mentioned above, in other embodiments of the present invention the alignment may instead be selected with the CTC Viterbi method. Specifically, as shown in fig. 4b, generating the optimal time distribution of each word or phoneme of the target text sequence over the time frame sequence of the target audio based on the target audio feature encoding sequence includes:
S231b: initialize the time frame sequence of an alignment distribution according to the time frame sequence of the target audio.
S232b: insert a placeholder before the first character or first phoneme of the target text sequence, after the last character or last phoneme, and between adjacent characters or phonemes, to form a state sequence.
S233b: set the state of the first time frame in the alignment distribution to a placeholder, or to the first character or first phoneme, of the state sequence.
S234b: set the state of the final time frame in the alignment distribution to a placeholder, or to the last character or last phoneme, of the state sequence.
S235b: based on the state sequence, recursively set the state of the previous time frame in the alignment distribution according to the state of the current time frame, until the state of the first time frame is set to a placeholder or to the first character or first phoneme of the state sequence, and obtain the alignment distribution sequences.
Here the current time frame is any time frame after the first in the alignment distribution; the states are set recursively from the last time frame back to the first, after which the state corresponding to each time frame is output.
In some embodiments of the present invention, as shown in fig. 5a, recursively setting the state of the previous time frame according to the state of the current time frame, based on the state sequence, until the state of the first time frame is set to a placeholder or to the first character or first phoneme of the state sequence, includes:
S2351b: if the state S of the current time frame is a placeholder, set the state of the previous time frame, based on the state sequence, to be the same as S, or to be S-1, the state immediately preceding S in the state sequence;
S2352b: if S is a non-placeholder and is the same as S-2, the state two positions earlier in the state sequence, set the state of the previous time frame to be the same as S or to be S-1;
S2353b: if S is a non-placeholder and differs from S-2, set the state of the previous time frame to be the same as S, to be S-1, or to be S-2;
S2354b: recursively execute the above steps until the state of the first time frame is set to a placeholder or to the first character or first phoneme of the state sequence, and obtain the alignment distribution sequences.
As shown in fig. 5b, assume the labeled text sequence y is 'cat'. After placeholders e are inserted, the state sequence is e c e a e t e; the state of the first time frame x1 is e or c, and the state of the final time frame x6 is t or e. According to the state sequence, when the state S of the current time frame (x4) is a (i.e., not a placeholder) and differs from S-2 = c, the state of the previous time frame (x3) may be the same as S (i.e., a), or S-1 (i.e., e), or S-2 (i.e., c). In some embodiments the state of the previous frame is selected by posterior probability, e.g., the state with the highest posterior probability among a, c, e; in other embodiments the selection is deferred to subsequent calculations.
In the subsequent recursion, as shown in fig. 5c, the state S of the current time frame (x3) is the placeholder e, so the state of the previous time frame (x2) is e or c; the recursion continues until the state of the first time frame matches the initial setting, at which point it ends, and all possible alignment distributions are output. The recursive process thus sets the state of the previous time frame step by step; at each setting, the state-setting step is repeated according to the state of the current time frame until the previous time frame is the first time frame and its state matches the initial setting, after which the states from the first time frame to the current one are output in reverse.
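The three rules S2351b–S2353b can be condensed into a single helper that, given a state index s in the blank-interleaved state sequence, returns the state indices allowed for the previous time frame; a sketch using the example's placeholder symbol e:

```python
def allowed_prev(states, s, blank="e"):
    """State indices the previous time frame may take when the current
    time frame is in state index s of `states` (e.g. e c e a e t e)."""
    prev = [s, s - 1]                      # same state, or one state back
    if s >= 2 and states[s] != blank and states[s] != states[s - 2]:
        prev.append(s - 2)                 # may skip the blank between tokens
    return [p for p in prev if p >= 0]

states = ["e", "c", "e", "a", "e", "t", "e"]
print(allowed_prev(states, 3))             # state 'a' -> [3, 2, 1]: a, e, c
```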
S236b: calculate the posterior probability sum of each alignment distribution based on the posterior probabilities of its states;
S237b: take the alignment distribution with the maximum posterior probability sum as the optimal time distribution.
In some preferred embodiments, step S235b and its substeps may be implemented in combination with S236b and S237b. For example, as mentioned above, during the recursion the state of the previous frame may be selected by posterior probability, e.g., the state with the maximum posterior probability among a, c, e, so that when the recursion completes, the alignment distribution with the maximum posterior probability sum is obtained directly as the optimal time distribution.
In the embodiment of the present invention, the alignment distribution may be computed in the reverse recursive manner above, or in a forward recursive manner, in which the state of the current time frame is selected according to the state of the previous time frame. As shown in fig. 4c, generating the optimal time distribution of each word or phoneme of the target text sequence over the time frame sequence of the target audio based on the target audio feature encoding sequence includes:
S231c: initialize the time frame sequence of an alignment distribution according to the time frame sequence of the target audio.
S232c: insert a placeholder before the first character or first phoneme of the target text sequence, after the last character or last phoneme, and between adjacent characters or phonemes, to form a state sequence.
S233c: set the state of the first time frame in the alignment distribution to a placeholder, or to the first character or first phoneme, of the state sequence.
S234c: set the state of the final time frame in the alignment distribution to a placeholder, or to the last character or last phoneme, of the state sequence.
S235c: based on the state sequence and the current time frame of the alignment distribution, recursively set the state of the current time frame according to the state of the previous time frame, until the state set for the final time frame is a placeholder or the last character or last phoneme of the state sequence, and obtain the alignment distribution sequences.
In some embodiments of the present invention, as shown in fig. 6a, recursively setting the state of the current time frame according to the state of the previous time frame, based on the state sequence and the current time frame of the alignment distribution, until the state set for the final time frame is a placeholder or the last character or last phoneme of the state sequence, includes:
S2351c: arbitrarily set a candidate state S for the current time frame based on the state sequence.
S2352c: if S is a placeholder, verify that the state of the previous time frame is the same as S or is S-1, the state immediately preceding S in the state sequence; if so, S is a state within the target range.
S2353c: if S is a non-placeholder and is the same as S-2, the state two positions earlier in the state sequence, verify that the state of the previous time frame is the same as S or is S-1; if so, S is a state within the target range.
S2354c: if S is a non-placeholder and differs from S-2, verify that the state of the previous time frame is the same as S, is S-1, or is S-2; if so, S is a state within the target range.
S2355c: based on the state of the previous time frame, acquire the state with the highest posterior probability among the states within the target range as the target state, and set the state of the current time frame according to the target state.
S2356c: recursively execute the above steps until the state set for the final time frame is a placeholder or the last character or last phoneme of the state sequence, and obtain the alignment distribution sequences.
As shown in fig. 6b, the state of the previous time frame (x1) is e or c. Assume the state S of the current time frame x2 is set to the second e in the state sequence e c e a e t e: verifying that the state of x1 is e passes, and verifying that the state of x1 is c, the state S-1 immediately preceding the current state, also passes; so the state S of the current time frame may be e.
Assume the state S of the current time frame x2 is set to c in the state sequence e c e a e t e: the state S-2 two positions earlier does not exist, so S does not equal S-2, but the state of x1 may be c, the same as the current state, and the verification passes; so the state S of x2 may be c.
Assume the state S of the current time frame x2 is set to a in the state sequence e c e a e t e: S-2 is c, and S differs from S-2; the state of x1 may be c, which is S-2, so the verification passes.
Assume the state S of the current time frame x2 is set to t in the state sequence e c e a e t e: S-2 is a, so the state of x1 would have to be S, S-1, or S-2, i.e., t, e (the e between a and t), or a; but the state of x1 is only c or the first e, so the verification fails and the state of the current frame cannot be t. The state with the maximum posterior probability is then selected from the verified states (c, e, a) as the state of the current time frame, and the recursion proceeds until the final state is consistent with the initial setting.
S236c: calculate the posterior probability sum of each alignment distribution based on the posterior probabilities of its states.
Since the final state may be either a word (phoneme) or a placeholder, the score of the alignment distribution ending with a placeholder and the score of the alignment distribution ending with the last word (phoneme) are both computed at the end, and the alignment distribution with the largest score is taken as the optimal distribution.
S237c: take the alignment distribution with the maximum posterior probability sum as the optimal time distribution.
Similarly, in some preferred embodiments, step S235c and its substeps may be implemented in conjunction with S236c and S237c.
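Taken together, S231c–S237c amount to a CTC Viterbi forced alignment over the blank-interleaved state sequence. A minimal self-contained sketch, assuming frame-level log posteriors from the encoder and integer token ids (names and shapes are illustrative, not taken from this application); the predecessor rules mirror allowed_prev above:

```python
import numpy as np

def ctc_viterbi_align(log_post, labels, blank=0):
    """log_post: (T, C) frame-level log posteriors from the encoder.
    labels: token ids of the target text. Returns one state per frame."""
    states = [blank]
    for l in labels:
        states += [l, blank]              # e.g. e c e a e t e
    S, T = len(states), log_post.shape[0]

    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_post[0, states[0]]  # start in the leading blank
    score[0, 1] = log_post[0, states[1]]  # or in the first token

    for t in range(1, T):
        for s in range(S):
            prev = [s, s - 1]             # same rules as allowed_prev above
            if s >= 2 and states[s] != blank and states[s] != states[s - 2]:
                prev.append(s - 2)
            prev = [p for p in prev if p >= 0]
            best = max(prev, key=lambda p: score[t - 1, p])
            score[t, s] = score[t - 1, best] + log_post[t, states[s]]
            back[t, s] = best

    # End in the final blank or the final token, whichever scores higher.
    s = S - 1 if score[T - 1, S - 1] >= score[T - 1, S - 2] else S - 2
    path = [s]
    for t in range(T - 1, 0, -1):         # backtrack over stored choices
        s = back[t, s]
        path.append(s)
    path.reverse()
    return [states[i] for i in path]
```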
In the embodiment of the present invention, after the optimal distribution is determined, backtracking may be performed: the state of each time frame is determined from the optimal distribution and the time frame sequence, i.e., each time frame is bound to a state, so that the timestamp of each word or phoneme in the labeled text can be output.
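A sketch of this backtracking output step, collapsing the frame-level state path into per-token timestamps; the 10 ms frame shift is an assumption matching the feature-extraction sketch earlier:

```python
def states_to_timestamps(frame_states, hop_ms=10.0, blank=0):
    """Collapse a frame-level state path into (token, start_ms, end_ms)."""
    spans, cur, start = [], None, 0
    for i, s in enumerate(list(frame_states) + [blank]):  # sentinel flush
        if s != cur:
            if cur is not None and cur != blank:
                spans.append((cur, start * hop_ms, i * hop_ms))
            cur, start = s, i
    return spans

# states_to_timestamps(['c', 'a', 'a', 't', 't'], blank='e')
# -> [('c', 0.0, 10.0), ('a', 10.0, 30.0), ('t', 30.0, 50.0)]
```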
Figs. 7a, 7b and 7c list examples of outputs based on word and phoneme modeling. As shown in fig. 7a, the text sequence is "let me see you not drop away"; each character or phoneme is aligned with the time frame sequence, and in the output optimal distribution, "let" corresponds to more time frames than "me".
The invention provides an end-to-end lyric-to-audio speech alignment method that is optimized mainly for the mobile phone side: it is deployed directly on the mobile phone without any background server, and a user can align lyrics to audio simply by selecting the audio and the corresponding lyrics. With this scheme, the time stamps of the lyrics can be generated automatically and locally, which saves labor and time costs and improves user experience. The alignment method of the embodiment of the invention adopts a CTC Viterbi alignment method, which effectively improves alignment efficiency.
In an embodiment of the present invention, as shown in fig. 8, a training apparatus 500 for a voice aligned network is shown, including:
an obtaining module 510 configured to obtain an annotation text sequence and an audio feature sequence;
an encoding module 520 configured to input the audio feature sequence into an encoder network, and obtain an audio feature encoding sequence;
an alignment loss function calculation module 530 configured to obtain a value of an alignment loss function based on the audio feature coding sequence and the annotation text sequence;
a decoding module 540 configured to input the annotation text sequence and the audio feature coding sequence into a decoder network based on an attention mechanism, and obtain an audio feature decoding sequence;
an attention loss function calculating module 550 configured to obtain a value of an attention loss function based on the audio feature decoding sequence and the annotation text sequence;
an updating module 560 configured to iteratively update the encoder network and the decoder network based on the value of the alignment loss function and the value of the attention loss function if the value of the alignment loss function is greater than a first loss threshold and/or the value of the attention loss function is greater than a second loss threshold;
a cyclic training module 570 configured to return to the step of obtaining the annotation text sequence and the audio feature sequence until the value of the alignment loss function is less than or equal to the first loss threshold and the value of the attention loss function is less than or equal to the second loss threshold;
an output module 580 configured to take the last-updated encoder network as the voice alignment network. A sketch of one training iteration of this apparatus is given below.
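The following PyTorch sketch illustrates one iteration of the apparatus above; the encoder/decoder modules, tensor shapes, the cross-entropy choice for the attention branch, and the threshold values are all assumptions for illustration, not the embodiment's implementation:

    import torch.nn as nn

    ctc_loss_fn = nn.CTCLoss(blank=0, zero_infinity=True)   # alignment loss
    att_loss_fn = nn.CrossEntropyLoss()                      # attention loss

    def train_step(encoder, decoder, feats, feat_lens, texts, text_lens,
                   optimizer, thr1=0.1, thr2=0.1):
        enc = encoder(feats)                          # assumed shape (T, batch, vocab)
        loss_align = ctc_loss_fn(enc.log_softmax(-1), texts, feat_lens, text_lens)
        dec = decoder(texts, enc)                     # assumed shape (batch, len, vocab)
        loss_att = att_loss_fn(dec.transpose(1, 2), texts)
        done = loss_align.item() <= thr1 and loss_att.item() <= thr2
        if not done:                                  # update both networks jointly
            optimizer.zero_grad()
            (loss_align + loss_att).backward()
            optimizer.step()
        return done                                   # loop over batches until True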
In some embodiments, the alignment loss function calculation module 530 is specifically configured to sum the probabilities of each alignment distribution of the audio feature encoding sequence and the annotation text sequence to obtain a probability sum value; and taking a negative logarithm of the probability sum value, and taking the result as the value of the alignment loss function.
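A toy numeric sketch of this sum-and-negative-logarithm calculation follows; the two-frame posteriors and the one-character text are invented for illustration, and a real implementation would use the CTC forward algorithm rather than enumerating paths:

    import math

    # Per-frame posteriors for states ε (placeholder) and 'c' (invented numbers).
    post = [{"ε": 0.6, "c": 0.4},   # frame 1
            {"ε": 0.3, "c": 0.7}]   # frame 2

    # All alignment distributions of the one-character text "c" over two frames.
    paths = [("c", "c"), ("ε", "c"), ("c", "ε")]

    prob_sum = sum(post[0][a] * post[1][b] for a, b in paths)   # 0.28+0.42+0.12
    loss = -math.log(prob_sum)          # negative logarithm of the probability sum
    print(round(prob_sum, 2), round(loss, 3))   # 0.82 0.198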
In some embodiments, the encoder network is a lightweight end-to-end network.
In some embodiments, the sequence of tagged text comprises a sequence of tagged text in words or a sequence of tagged text in phonemes.
In some embodiments, the training apparatus may combine features of the training method of the voice alignment network of any embodiment, and vice versa; details are not repeated here.
In an embodiment of the present invention, as shown in fig. 9, a speech alignment apparatus 600 is shown, including:
an obtaining module 610 configured to obtain a target text sequence and a target audio, where the target text sequence is a lyric text sequence of the target audio;
an encoding module 620 configured to input the audio feature sequence of the target audio into the voice alignment network generated by any one of the above training methods, and obtain a target audio feature encoding sequence;
an alignment module 630 configured to align the target text sequence with the target audio based on the target audio feature encoding sequence.
In some embodiments, the target audio feature encoding sequence is a posterior probability distribution of each word or each phoneme in the target text sequence at each time instant.
In some embodiments, the alignment module 630 is specifically configured to generate an optimal time distribution of each word or each phoneme in the target text sequence based on the target audio feature encoding sequence.
In some embodiments, the alignment module 630 is further specifically configured to:
randomly distributing each word or each phoneme in the target text sequence into the time frame sequence of the target audio to generate a full distribution sequence; calculating, based on the target audio feature encoding sequence, the posterior probability sum of each word or each phoneme distributed to its corresponding moments in each distribution sequence of the full distribution sequence;
and taking the distribution sequence with the maximum posterior probability sum as the optimal time distribution (a brute-force sketch of this search is given below).
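The brute-force search above can be sketched on a tiny invented example (two tokens, four frames; log-probabilities are used for numerical convenience). On realistic inputs this enumeration is intractable, which motivates the recursive procedures described next:

    from itertools import combinations
    import math

    # Every monotonic distribution of n_tokens over n_frames corresponds to
    # choosing n_tokens-1 boundaries among the n_frames-1 gaps between frames.
    def full_distributions(n_tokens, n_frames):
        for cuts in combinations(range(1, n_frames), n_tokens - 1):
            bounds = (0,) + cuts + (n_frames,)
            yield [range(bounds[i], bounds[i + 1]) for i in range(n_tokens)]

    # Invented posteriors: post[t][k] = P(token k | frame t).
    post = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.1, 0.9]]

    def score(dist):
        # Sum of log posteriors of each token over the frames assigned to it.
        return sum(math.log(post[t][k])
                   for k, frames in enumerate(dist) for t in frames)

    best = max(full_distributions(2, 4), key=score)
    print([list(r) for r in best])   # [[0, 1], [2, 3]] for these posteriors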
In some embodiments, the alignment module 630 is further specifically configured to:
initializing a time frame sequence of an alignment distribution according to the time frame sequence of the target audio;
inserting a placeholder before a first character or a first phoneme in the target text sequence, inserting a placeholder after a tail character or a tail phoneme in the target text sequence, and inserting a placeholder between adjacent characters or phonemes, to form a state sequence (this construction is sketched after this list);
setting the state of a first time frame in the alignment distribution as a placeholder or a first character or a first phoneme in the state sequence;
setting a state of a last time frame in an alignment distribution as a placeholder or a tail character or a tail phoneme in the state sequence;
based on the state sequence, recursively setting the state of the previous time frame in the alignment distribution according to the state of the current time frame, until the state of the first time frame is set to a placeholder or a first character or a first phoneme in the state sequence, and acquiring the alignment distribution sequence;
calculating the posterior probability sum of each alignment distribution based on the posterior probability of each state in each alignment distribution in the alignment distribution sequence;
and taking the alignment distribution with the maximum posterior probability sum as the optimal time distribution.
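The state-sequence construction referenced in the list above can be sketched as follows; the ε symbol and the function name are illustrative:

    # Insert a placeholder before the first unit, after the last unit,
    # and between adjacent units (characters or phonemes).
    def make_state_sequence(units, blank="ε"):
        states = [blank]
        for u in units:
            states += [u, blank]
        return states

    print(make_state_sequence(list("cat")))
    # ['ε', 'c', 'ε', 'a', 'ε', 't', 'ε']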
In some embodiments, the alignment module 630 is further specifically configured to:
if the state S of the current time frame in the distribution is a placeholder, setting, based on the state sequence, the state of the previous time frame of the current time frame to be the same as the state S of the current time frame or to be the previous state S-1 of the state of the current time frame;
if the state S of the current time frame is a non-placeholder and the state S is the same as the state two positions before it (S-2) in the state sequence, setting, based on the state sequence, the state of the previous time frame to be the same as the state S or to be the previous state S-1 of the state of the current time frame;
if the state S of the current time frame is a non-placeholder and the state S differs from the state two positions before it (S-2), setting, based on the state sequence, the state of the previous time frame to be the same as the state S, or to be the previous state S-1, or to be the state S-2;
and recursively executing the above steps until the state of the first time frame is set to a placeholder or a first character or a first phoneme in the state sequence, and acquiring the alignment distribution sequence.
In some embodiments, the alignment module 630 is further specifically configured to:
initializing a time frame sequence of an alignment distribution according to the time frame sequence of the target audio;
inserting a placeholder before a first character or before an initial phoneme in the target text sequence, inserting a placeholder after a tail character or a tail phoneme in the target text sequence, and inserting a placeholder between adjacent characters or phonemes to form a state sequence;
setting the state of a first time frame in the alignment distribution as a placeholder or a first character or a first phoneme in the state sequence;
setting a state of a last time frame in an alignment distribution as a placeholder or a tail character or a tail phoneme in the state sequence;
based on the state sequence, recursively setting the state of the current time frame in the alignment distribution according to the state of the previous time frame, until the state set for the last time frame is a placeholder or a tail character or a tail phoneme in the state sequence, and acquiring the alignment distribution sequence;
calculating the posterior probability sum of each alignment distribution based on the posterior probability of each state in each alignment distribution in the alignment distribution sequence;
and taking the alignment distribution with the maximum posterior probability sum as the optimal time distribution.
In some embodiments, the alignment module 630 is further specifically configured to:
arbitrarily setting a state S of a current time frame based on the state sequence;
if the state S of the current time frame is a placeholder, verifying that the state of the previous time frame is the same as the state S of the current time frame or is the previous state S-1 of the state of the current time frame, and determining that the state S of the current time frame is a state in the target range;
if the state S of the current time frame is a non-placeholder and the state S is the same as the state two positions before it (S-2) in the state sequence, verifying that the state of the previous time frame is the same as the state S or is the previous state S-1 of the state of the current time frame, and determining that the state S of the current time frame is a state in the target range;
if the state S of the current time frame is a non-placeholder and the state S differs from the state two positions before it (S-2), verifying that the state of the previous time frame is the same as the state S, or is the previous state S-1, or is the state S-2, and determining that the state S of the current time frame is a state in the target range;
based on the state of the previous time frame, acquiring the state with the highest posterior probability among the states in the target range as a target state, and setting the state of the current time frame according to the target state;
and recursively executing the above steps until the state set for the last time frame is a placeholder or a tail character or a tail phoneme in the state sequence, and acquiring the alignment distribution sequence.
In some embodiments, the apparatus 600 may combine features of the voice alignment method of any embodiment, and vice versa; details are not repeated here.
In an embodiment of the present invention, there is provided an electronic apparatus including: a processor and a memory storing a computer program, the processor being configured to perform the method of training a speech alignment network or the method of speech alignment of any of the embodiments of the present invention when the computer program is run.
Fig. 10 shows a schematic diagram of an electronic device 1000 in which a method or apparatus of an embodiment of the invention may be implemented; in some embodiments it may include more or fewer components than those shown. In some embodiments, a single electronic device or multiple electronic devices may be used; in some embodiments, cloud or distributed electronic devices may be used.
As shown in fig. 10, the electronic apparatus 1000 includes a Central Processing Unit (CPU) 1001 that can perform various appropriate operations and processes according to programs and/or data stored in a Read Only Memory (ROM) 1002 or loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. The CPU 1001 may be a single multi-core processor or may include a plurality of processors. In some embodiments, the CPU 1001 may include a general-purpose host processor and one or more special coprocessors such as a Graphics Processing Unit (GPU), a Neural Processing Unit (NPU), a Digital Signal Processor (DSP), and the like. In the RAM 1003, various programs and data necessary for the operation of the electronic apparatus 1000 are also stored. The CPU 1001, the ROM 1002, and the RAM 1003 are connected to each other via a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
The processor and the memory together execute a program stored in the memory; when executed, the program can realize the steps and functions of the training method of the voice alignment network and the voice alignment method described in the above embodiments.
The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, a touch screen, and the like; an output section 1007 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 1008 including a hard disk and the like; and a communication section 1009 including a network interface card such as a LAN card or a modem. The communication section 1009 performs communication processing via a network such as the Internet. A drive 1010 is also connected to the I/O interface 1005 as necessary. A removable medium 1011 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is mounted on the drive 1010 as necessary, so that a computer program read out therefrom is installed into the storage section 1008 as necessary. Only some of the components are schematically illustrated in fig. 10; this does not mean that the electronic device 1000 includes only the components shown.
In some embodiments, the electronic device 1000 may be a mobile terminal such as a mobile phone, a vehicle-mounted terminal, or a smart television. Taking the mobile phone as an example, the electronic device 1000 further includes a display screen with a touch function, an external speaker, a gyroscope, a camera, a 4G/5G antenna, and other device modules.
The systems, devices, modules or units illustrated in the above embodiments can be implemented by a computer or its associated components. The computer may be, for example, a mobile terminal, a smart phone, a personal computer, a laptop computer, a vehicle-mounted human interaction device, a personal digital assistant, a media player, a navigation device, a game console, a tablet, a wearable device, a smart television, an internet of things system, a smart home, an industrial computer, a server, or a combination thereof.
Although not shown, an embodiment of the present invention provides a storage medium storing a computer program which, when executed, performs the training method of the voice alignment network or the voice alignment method of any embodiment of the present invention.
Storage media in embodiments of the invention include permanent and non-permanent, removable and non-removable media that can accomplish information storage by any method or technology. Examples of storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
The methods, programs, systems, apparatuses, etc. in embodiments of the present invention may be performed or implemented on a single computer or on multiple networked computers, or may be practiced in distributed computing environments. In such distributed computing environments, tasks may be performed by remote processing devices that are linked through a communications network.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, a system, or a computer program product. It will likewise be apparent to one skilled in the art that the functional modules/units or controllers and the associated method steps set forth in the above embodiments may be implemented in software, hardware, or a combination of software and hardware.
Unless specifically stated otherwise, the actions or steps of a method, program or process described in accordance with an embodiment of the present invention need not be performed in the particular order described in order to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or advantageous.
While various embodiments of the invention have been described herein, the description of the various embodiments is not intended to be exhaustive or to limit the invention to the precise forms disclosed, and features and components that are the same or similar to one another may be omitted for clarity and conciseness. As used herein, "one embodiment," "some embodiments," "examples," "specific examples," or "some examples" means that the described feature applies to at least one embodiment or example of the present invention, but not necessarily to all embodiments; the above terms do not necessarily refer to the same embodiment or example. The various embodiments or examples described in this specification, and the features thereof, can be combined by one skilled in the art without contradiction.
Exemplary systems and methods of the present invention have been particularly shown and described with reference to the foregoing embodiments, which are merely illustrative of the best modes for carrying out the systems and methods. It will be appreciated by those skilled in the art that various changes in the embodiments of the systems and methods described herein may be made in practicing the systems and/or methods without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (15)

1. A training method of a voice alignment network is characterized by comprising the following steps:
acquiring a label text sequence and an audio characteristic sequence;
inputting the audio characteristic sequence into an encoder network to obtain an audio characteristic coding sequence;
acquiring a value of an alignment loss function based on the audio feature coding sequence and the label text sequence;
inputting the label text sequence and the audio feature coding sequence into a decoder network based on an attention mechanism to obtain an audio feature decoding sequence;
acquiring a value of an attention loss function based on the audio feature decoding sequence and the labeled text sequence;
if the value of the alignment loss function is larger than a first loss threshold value or the value of the attention loss function is larger than a second loss threshold value, iteratively updating the encoder network and the decoder network based on the value of the alignment loss function and the value of the attention loss function, and returning to the step of acquiring the annotation text sequence and the audio feature sequence until the value of the alignment loss function is smaller than or equal to the first loss threshold value and the value of the attention loss function is smaller than or equal to the second loss threshold value;
and taking the encoder network updated for the last time as the voice alignment network.
2. The training method of claim 1, wherein the obtaining the value of the alignment loss function comprises:
summing the probability of each alignment distribution of the audio feature coding sequence and the annotation text sequence to obtain a probability sum value;
and taking a negative logarithm of the probability sum value, and taking the result as the value of the alignment loss function.
3. The training method of claim 1, wherein the encoder network is a lightweight end-to-end network.
4. The training method of claim 1, wherein the sequence of tagged text comprises a sequence of tagged text in words or a sequence of tagged text in phonemes.
5. A method of speech alignment, comprising:
acquiring a target text sequence and a target audio frequency, wherein the target text sequence is a lyric text sequence of the target audio frequency;
inputting the audio feature sequence of the target audio into the voice alignment network generated by the training method of any one of claims 1-4 to obtain a target audio feature coding sequence;
aligning the target text sequence with the target audio based on the target audio feature encoding sequence.
6. The method of claim 5, wherein the target audio feature coding sequence is a posterior probability distribution of each word or each phoneme in the target text sequence at each time instant.
7. The method of claim 6, wherein aligning the target text sequence with the target audio based on the target audio feature encoding sequence comprises:
and generating an optimal time distribution of each word or each phoneme in the target text sequence to the time frame sequence of the target audio based on the posterior probability distribution so as to align the target text sequence with the target audio.
8. The method of claim 7, wherein generating an optimal time distribution of each word or each phoneme in the target text sequence into the temporal frame sequence of the target audio based on the posterior probability distribution comprises:
randomly distributing each word or each phoneme in the target text sequence into the time frame sequence of the target audio to generate a full distribution sequence;
calculating the posterior probability sum of each word or each phoneme distributed to the corresponding moment in each distribution sequence in the full-quantity distribution sequence;
and taking the distribution sequence with the maximum posterior probability sum as the optimal time distribution.
9. The method of claim 7, wherein generating an optimal time distribution of each word or each phoneme in the target text sequence into the temporal frame sequence of the target audio based on the posterior probability distribution comprises:
initializing a time frame sequence of an alignment distribution according to the time frame sequence of the target audio;
inserting a placeholder before a first character or before an initial phoneme in the target text sequence, inserting a placeholder after a tail character or after a tail phoneme in the target text sequence and inserting a placeholder between adjacent characters or phonemes to form a state sequence;
setting the state of a first time frame in the time frame sequence of the alignment distribution as a placeholder or a first character or a first phoneme in the state sequence;
setting the state of a last time frame in the time frame sequence of the alignment distribution as a placeholder or a tail character or a tail phoneme in the state sequence;
obtaining an alignment distribution sequence based on the state sequence, the state of the first time frame and the state of the last time frame in the alignment distribution time frame sequence;
calculating the posterior probability sum of each alignment distribution based on the posterior probability of each state in each alignment distribution in the alignment distribution sequence;
and taking the alignment distribution with the maximum posterior probability sum as the optimal time distribution.
10. The method of claim 9, wherein obtaining the alignment distribution sequence based on the state sequence, the state of the first time frame and the state of the last time frame in the alignment distribution time frame sequence comprises:
based on the state sequence, recursively setting the state of the previous time frame in the alignment distribution according to the state of the current time frame in the alignment distribution, until the state of the first time frame set in the alignment distribution is a placeholder or a first character or a first phoneme in the state sequence, and acquiring the alignment distribution sequence.
11. The method according to claim 10, wherein the recursively setting the state of the previous time frame in the alignment distribution according to the state of the current time frame in the alignment distribution, based on the state sequence, until the state of the first time frame set in the alignment distribution is a placeholder or a first character or a first phoneme in the state sequence, and acquiring the alignment distribution sequence, comprises:
if the state of the current time frame in the alignment distribution is a placeholder, setting, based on the state sequence, the state of the previous time frame of the current time frame to the state of the current time frame or to the previous state of the state of the current time frame;
if the state of the current time frame is a non-placeholder and the state of the current time frame is the same as the state two positions before it in the state sequence, setting the state of the previous time frame of the current time frame to the state of the current time frame or to the previous state of the state of the current time frame;
if the state of the current time frame is a non-placeholder and the state of the current time frame differs from the state two positions before it in the state sequence, setting the state of the previous time frame to the state of the current time frame, or to the previous state of the state of the current time frame, or to the state two positions before the state of the current time frame;
and recursively executing the above steps until the state of the first time frame is set to a placeholder or a first character or a first phoneme in the state sequence, and acquiring the alignment distribution sequence.
12. The method of claim 9, wherein obtaining the alignment distribution sequence based on the state sequence, the state of the first time frame and the state of the last time frame in the alignment distribution time frame sequence comprises:
based on the state sequence and the current time frame of the alignment distribution, recursively setting the state of the current time frame in the alignment distribution according to the state of the previous time frame in the alignment distribution, until the state set for the last time frame is a placeholder or a tail character or a tail phoneme in the state sequence, and acquiring the alignment distribution sequence.
13. The method according to claim 12, wherein the recursively setting the state of the current time frame in the alignment distribution according to the state of the previous time frame in the alignment distribution, based on the state sequence and the current time frame of the alignment distribution, until the state set for the last time frame is a placeholder or a tail character or a tail phoneme in the state sequence, and acquiring the alignment distribution sequence, comprises:
arbitrarily setting a state of a current time frame based on the state sequence;
if the state of the current time frame is a placeholder, verifying that the state of the previous time frame is the same as the state of the current time frame or is the previous state of the state of the current time frame, and determining that the state of the current time frame is a state in the target range;
if the state of the current time frame is a non-placeholder and the state of the current time frame is the same as the state two positions before it in the state sequence, verifying that the state of the previous time frame is the same as the state of the current time frame or is the previous state of the state of the current time frame, and determining that the state of the current time frame is a state in the target range;
if the state of the current time frame is a non-placeholder and the state of the current time frame differs from the state two positions before it in the state sequence, verifying that the state of the previous time frame is the same as the state of the current time frame, or is the previous state of the state of the current time frame, or is the state two positions before the state of the current time frame, and determining that the state of the current time frame is a state in the target range;
based on the state of the previous time frame, acquiring the state with the highest posterior probability among the states in the target range as a target state, and setting the state of the current time frame according to the target state;
and recursively executing the above steps until the state set for the last time frame is a placeholder or a tail character or a tail phoneme in the state sequence, and acquiring the alignment distribution sequence.
14. A computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-13.
15. An electronic device, comprising: a processor and a memory storing a computer program, the processor being configured to perform the method of any of claims 1-13 when the computer program is run.
CN202111550130.9A 2021-12-17 2021-12-17 Training method of voice alignment network, voice alignment method and electronic equipment Pending CN114373480A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111550130.9A CN114373480A (en) 2021-12-17 2021-12-17 Training method of voice alignment network, voice alignment method and electronic equipment

Publications (1)

Publication Number Publication Date
CN114373480A true CN114373480A (en) 2022-04-19

Family

ID=81140392

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111550130.9A Pending CN114373480A (en) 2021-12-17 2021-12-17 Training method of voice alignment network, voice alignment method and electronic equipment

Country Status (1)

Country Link
CN (1) CN114373480A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114781377A (en) * 2022-06-20 2022-07-22 联通(广东)产业互联网有限公司 Error correction model, training and error correction method for non-aligned text
CN116364063A (en) * 2023-06-01 2023-06-30 蔚来汽车科技(安徽)有限公司 Phoneme alignment method, apparatus, driving apparatus, and medium
CN116364063B (en) * 2023-06-01 2023-09-05 蔚来汽车科技(安徽)有限公司 Phoneme alignment method, apparatus, driving apparatus, and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination