CN113345414A - Film restoration method, device, equipment and medium based on voice synthesis - Google Patents

Film restoration method, device, equipment and medium based on voice synthesis

Info

Publication number
CN113345414A
CN113345414A CN202110605270.5A CN202110605270A
Authority
CN
China
Prior art keywords
target
actor
audio
film
splicing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110605270.5A
Other languages
Chinese (zh)
Other versions
CN113345414B (en)
Inventor
张旭龙
王健宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110605270.5A priority Critical patent/CN113345414B/en
Publication of CN113345414A publication Critical patent/CN113345414A/en
Application granted granted Critical
Publication of CN113345414B publication Critical patent/CN113345414B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033: Voice editing, e.g. manipulating the voice of the synthesiser

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention relates to the technical field of artificial intelligence, and in particular to a film restoration method, device, equipment and medium based on voice synthesis. The film restoration method based on voice synthesis comprises: obtaining an audio missing segment in a film to be repaired, wherein the audio missing segment corresponds to at least one target actor and each target actor corresponds to a target audio text and an actor identifier; inputting the target audio text and the actor identifier of each target actor into a pre-trained multi-speaker voice synthesis model for voice synthesis to obtain a synthesized voice corresponding to each target actor; and repairing the audio missing segment in the film to be repaired according to the synthesized voice corresponding to each target actor. The method can automatically repair audio missing segments in a film based on a multi-speaker voice synthesis model.

Description

Film restoration method, device, equipment and medium based on voice synthesis
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a film restoration method, device, equipment and medium based on voice synthesis.
Background
In the film restoration process, image quality restoration technology is mature, and techniques combining it with artificial intelligence have also developed greatly. However, sound restoration has remained a difficult problem, especially for some precious early movies: over time, film carriers have suffered different kinds of damage, so many movies retain only image segments and lack the corresponding sound segments.
The traditional technology for repairing film sound segments mainly repairs films through physical and chemical means, such as removing glue joints on the soundtrack and repairing spots and broken perforations. Existing digital restoration technology can easily remove defective recordings such as scratching and clicking sounds through filters or virtual sound waveforms. However, there is still no good way to deal with the problem of missing segments of film sound.
Disclosure of Invention
The embodiments of the invention provide a method, a device, equipment and a medium for repairing a film based on voice synthesis, which aim to solve the problem that missing sound segments of a film currently cannot be repaired.
A method for film restoration based on speech synthesis, comprising:
acquiring audio missing segments in a film to be repaired; wherein the audio missing segment corresponds to at least one target actor; each target actor corresponds to a target audio text and an actor identifier;
inputting the target audio text of each target actor and the actor identification into a pre-trained multi-speaker voice synthesis model for voice synthesis to obtain a synthetic voice corresponding to each target actor;
and repairing audio missing segments in the film to be repaired according to the synthesized voice corresponding to each target actor.
A movie repair apparatus based on speech synthesis, comprising:
the data acquisition module is used for acquiring audio missing segments in the film to be repaired; wherein the audio missing segment corresponds to at least one target actor; each target actor corresponds to a target audio text and an actor identifier;
a voice synthesis module, configured to input the target audio text of each target actor and the actor identifier into a pre-trained multi-speaker voice synthesis model for voice synthesis, so as to obtain a synthesized voice corresponding to each target actor;
and the audio repairing module is used for repairing audio missing segments in the film to be repaired according to the synthesized voice corresponding to each target actor.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above-mentioned film repair method based on speech synthesis when executing the computer program.
A computer storage medium, which stores a computer program that, when executed by a processor, implements the steps of the above-described movie repair method based on speech synthesis.
In the method, device, equipment and medium for repairing a film based on voice synthesis, the audio missing segment in the film to be repaired is obtained, so that for the multiple actor identifiers corresponding to the audio missing segment and the target audio text corresponding to each actor identifier, a synthesized voice conforming to the voice characteristics of the target actor corresponding to that identifier can be produced: the target audio text and the actor identifier of each target actor are input into the pre-trained multi-speaker voice synthesis model for voice synthesis to obtain the synthesized voice corresponding to each target actor, realizing end-to-end multi-speaker voice synthesis. Finally, the audio missing segment in the film to be repaired is repaired according to the synthesized voice corresponding to each target actor, so that automatic repair of audio missing segments in the film is realized.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
Fig. 1 is a schematic diagram of an application environment of a film restoration method based on speech synthesis according to an embodiment of the present invention;
fig. 2 is a flowchart of a method for repairing a film based on speech synthesis according to an embodiment of the present invention;
fig. 3 is a flowchart of a method for repairing a film based on speech synthesis according to an embodiment of the present invention;
FIG. 4 is a detailed flowchart of step S304 in FIG. 3;
FIG. 5 is a detailed flowchart of step S301 in FIG. 3;
fig. 6 is a schematic diagram of a movie repair apparatus based on speech synthesis according to an embodiment of the present invention.
FIG. 7 is a schematic diagram of a computer device according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The method for repairing a film based on speech synthesis can be applied to the application environment as shown in fig. 1, wherein a computer device communicates with a server through a network. The computer device may be, but is not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices. The server may be implemented as a stand-alone server.
In an embodiment, as shown in fig. 2, a method for repairing a movie based on speech synthesis is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
s201: acquiring audio missing segments in a film to be repaired; wherein the audio missing segment corresponds to at least one target actor; each target actor corresponds to a target audio text.
The film to be repaired is a film with partial or complete audio loss. The missing audio segment contained in the film to be repaired may be performed by one or more different actors (i.e., target actors), and the timbre of each target actor is different, so it is necessary to synthesize voices that conform to the voice characteristics of each actor. Further, the lines of different actors in the film are different, so the audio text of the audio missing sub-segment corresponding to each target actor is different; that is, each target actor corresponds to a target audio text. The target audio text is the line text corresponding to the part of the audio performed by the corresponding target actor in the audio missing sub-segment.
S202: and inputting the target audio text of the target actor and the actor identification into a pre-trained multi-speaker voice synthesis model for voice synthesis to obtain the synthetic voice corresponding to each target actor.
The multi-speaker voice synthesis model comprises an identity feature extraction network and a multi-speaker voice synthesis network and is used for realizing end-to-end voice synthesis for multiple speakers; the multi-speaker speech synthesis network is obtained by training based on a Tacotron2 model, and performing speech synthesis through the Tacotron2 model allows the synthesized speech to be closer to the original sound of a film. The Tacotron2 model includes an encoder, a first splicing module, an attention-based decoder, and a second splicing module; the identity feature extraction network is connected to the first splicing module and the second splicing module respectively. The encoder is used for extracting text features; the first splicing module is used for splicing the text features with the identity features extracted through the identity feature extraction network; the attention-based decoder is configured to predict an output Mel frequency spectrum frame sequence; the second splicing module is used for splicing the Mel frequency spectrum frame sequence with the identity features.
It can be understood that, in order to implement multi-speaker speech synthesis, i.e. to overcome the limitation that the conventional Tacotron2 model can only be applied to single-speaker speech synthesis, the multi-speaker speech synthesis model in this embodiment introduces, on the basis of the conventional Tacotron2 model, a branch network, namely the identity feature extraction network, together with a first splicing module and a second splicing module connected to it. The Tacotron2 model in this embodiment is otherwise consistent with the conventional Tacotron2 model. The identity feature extraction network is the feature extraction network of a speaker recognition model trained in advance on the actor identifiers of different actors and the corresponding audio, and it encodes an actor identifier (Speaker ID) into an identity feature (Speaker encoding). The first splicing module is connected to the encoder and splices the output of the encoder with the identity features. The second splicing module is connected to the attention-based decoder and splices the output of the decoder with the identity features.
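As a concrete illustration of the two splicing modules, here is a minimal sketch, assuming PyTorch and made-up tensor dimensions (the 512-dimensional text features, 80-dimensional Mel frames and 64-dimensional identity feature are placeholders, not values fixed by this embodiment):

```python
import torch

# Made-up shapes: batch of 2 utterances, 50 encoder time steps, 512-dim text
# features, 80-dim Mel frames, 64-dim identity (speaker) feature.
text_features = torch.randn(2, 50, 512)    # encoder output
identity_feature = torch.randn(2, 64)      # output of the identity feature extraction network
mel_frame = torch.randn(2, 80)             # one Mel spectrum frame predicted by the decoder

# First splicing module: broadcast the identity feature over the time axis and
# concatenate it with the text features before attention and decoding.
expanded_id = identity_feature.unsqueeze(1).expand(-1, text_features.size(1), -1)
first_splice = torch.cat([text_features, expanded_id], dim=-1)    # shape (2, 50, 576)

# Second splicing module: concatenate the predicted Mel frame with the identity
# feature; this becomes the decoder input for the next step.
second_splice = torch.cat([mel_frame, identity_feature], dim=-1)  # shape (2, 144)
```

In this way the identity feature conditions both the encoder side and the decoder side of the model, which is what allows a single network to serve multiple speakers.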
Specifically, the target audio text and the actor identifier of each target actor are input into the pre-trained multi-speaker speech synthesis model for speech synthesis to obtain the synthesized speech corresponding to each target actor, realizing end-to-end multi-speaker speech synthesis; in this way, a synthesized speech conforming to the timbre of each target actor in the audio missing segment is obtained, providing the technical basis for the sound restoration of the film.
Furthermore, to further ensure the sound quality of the synthesized voice, the synthesized voice can be fine-tuned with a voice editing tool so as to output synthesized voice with better sound quality.
S203: and repairing the audio missing segment in the film to be repaired according to the synthesized voice corresponding to each target actor.
Specifically, the synthesized voices of the different target actors are used to replace the corresponding audio missing sub-segments in the film to be repaired, using a video editing tool (such as Premiere), so as to repair the audio missing segment in the film to be repaired.
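The same replacement can also be scripted instead of being done in an interactive editor. A minimal sketch using pydub, with placeholder file names and timestamps (the embodiment itself only names a video editing tool such as Premiere):

```python
from pydub import AudioSegment

# Placeholder file names and timestamps, for illustration only.
film_audio = AudioSegment.from_file("film_audio_track.wav")
synthesized = AudioSegment.from_file("synthesized_actor_a.wav")

# Suppose the missing sub-segment for this actor runs from 62.0 s to 68.5 s.
start_ms, end_ms = 62_000, 68_500

# Keep the audio before and after the gap and drop the synthesized speech in between.
repaired = film_audio[:start_ms] + synthesized + film_audio[end_ms:]
repaired.export("film_audio_track_repaired.wav", format="wav")
```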
In this embodiment, the audio missing segment in the film to be repaired is obtained, and a synthesized voice conforming to the voice characteristics of the corresponding target actor is produced for each actor identifier and its target audio text; that is, the target audio text and the actor identifier of each target actor are input into a pre-trained multi-speaker voice synthesis model for voice synthesis to obtain the synthesized voice corresponding to each target actor, thereby realizing end-to-end multi-speaker voice synthesis. Finally, the audio missing segment in the film to be repaired is repaired according to the synthesized voice corresponding to each target actor, realizing automatic repair of audio missing segments in the film.
In an embodiment, as shown in fig. 3, the method for repairing a film based on speech synthesis further includes the following steps:
s301: acquiring target audio samples corresponding to target actors in the same or different films, and converting the target audio samples into files in a compressed format; wherein the target audio sample corresponds to a text sequence.
Because a synthesized voice conforming to the timbre of the target actor needs to be synthesized subsequently, audio samples corresponding to the target actor need to be used for supervised training. Specifically, multiple acoustic samples of the different target actors appearing in the audio missing segment are collected, from the film to be repaired or from other films, as target audio samples. In this embodiment, the sum of the audio durations of the acoustic samples needs to be greater than a preset threshold, for example 5 minutes, to facilitate subsequent sample feature extraction and to avoid accidental spectral features caused by too short a duration. In this embodiment, each acoustic sample may be a simple utterance of about 10 seconds, and about 30 acoustic samples are collected for each target actor. It should be noted that the acquisition duration and the number of acoustic samples may be adjusted according to actual needs and are not limited here.
Furthermore, when audio samples of the target actor are collected from different films, several films from the same or similar years can be selected, so as to avoid changes in the actor's timbre over time and to ensure sample quality.
Specifically, in this embodiment, to ensure the sound quality of the synthesized speech after the film is repaired, the audio samples need to be converted into a compressed format file, which may be an mp3 file or a WAV file. Further, since WAV is a lossless audio format, whereas mp3 encoding discards parts of the audio to save space, the audio samples are converted into lossless WAV files in this embodiment.
S302: and converting the compressed format file into a Mel frequency spectrum sequence as a real label.
Specifically, the Tacotron2 model mainly predicts the Mel frequency spectrum frames corresponding to a text sequence, and a vocoder then generates the time-domain waveform from those Mel frequency spectrum frames, i.e. generates the synthesized speech, thereby achieving end-to-end speech synthesis.
In this embodiment, the audio sample is converted into a Mel frequency spectrum sequence through a Fourier transform algorithm, which serves as the real label for subsequently training the Tacotron2 model, i.e. it characterizes the Mel frequency spectrum sequence of the real voice corresponding to a given text sequence. The Fourier transform algorithm may be the Short-Time Fourier Transform (STFT) or another Fourier transform algorithm, which is not limited here.
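As an illustration of this step, a minimal sketch using librosa with typical Tacotron2-style parameters (22.05 kHz sampling rate, 80 Mel bands, 1024-point FFT); the exact parameter values are assumptions, since the embodiment does not specify them:

```python
import librosa

# Load one (noise-reduced) WAV sample; 22.05 kHz is a common TTS sampling rate.
waveform, sample_rate = librosa.load("target_actor_sample.wav", sr=22050)

# Short-time Fourier transform followed by a Mel filter bank, giving the
# Mel spectrum sequence used as the real label during fine-tuning.
mel_spectrogram = librosa.feature.melspectrogram(
    y=waveform, sr=sample_rate, n_fft=1024, hop_length=256, n_mels=80
)
mel_db = librosa.power_to_db(mel_spectrogram)   # log-scaled Mel frames
print(mel_db.shape)                             # (80, number_of_frames)
```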
S303: and inputting the actor identification of the target actor into a pre-trained identity feature extraction network, and extracting the identity feature of the target actor.
The identity feature extraction network is trained in advance using audio samples of some actors, i.e. the actor identifiers and the corresponding actor voices are used as training samples. The identity features of the target actor are then extracted by inputting the actor identifier of the target actor into the pre-trained identity feature extraction network, providing the basis for subsequently synthesizing the target actor's voice end to end.
S304: and fine-tuning the pre-trained Tacotron2 model based on the real label, the text sequence and the identity characteristic to obtain the multi-speaker voice synthesis network.
The Tacotron2 model comprises an encoder, a first splicing module, an attention-based decoder, and a second splicing module. Specifically, the Tacotron2 model is first trained on an existing large corpus to obtain a pre-trained Tacotron2 model; the pre-trained model is then fine-tuned and re-learned according to the real label, the text sequence and the identity features of the target actor to obtain the multi-speaker voice synthesis network. The multi-speaker voice synthesis model is then constructed from the identity feature extraction network and the fine-tuned multi-speaker voice synthesis network, so that voices corresponding to actor identifiers can subsequently be synthesized directly from the actor identifiers and corresponding text sequences of different target actors, achieving end-to-end multi-speaker voice synthesis.
Further, after the audio samples of the target actors in the same or different films are acquired, in order to ensure sample quality, this embodiment adopts an adaptive filter noise reduction algorithm to perform noise reduction on the target audio samples and obtain noise-reduced target audio samples, which are used to update the target audio samples.
In an embodiment, the adaptive filter noise reduction algorithm may be the Least Mean Square (LMS) algorithm, a common adaptive filter noise reduction algorithm. The LMS algorithm uses the steepest descent method to iterate with a certain step size on the error between the expected signal and the actual signal, updating the filter weight parameters and thereby achieving noise reduction.
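For illustration, a minimal NumPy sketch of an LMS noise canceller along these lines; it assumes a noise reference signal is available and uses placeholder values for the number of taps and the step size, neither of which is specified by the embodiment:

```python
import numpy as np

def lms_denoise(noisy, noise_ref, num_taps=32, mu=0.01):
    """Noise-cancelling LMS filter: iterate on the error between the expected
    signal (noisy speech) and the filter's noise estimate, updating the filter
    weights by steepest descent. The returned error signal approximates the
    clean speech."""
    noisy = np.asarray(noisy, dtype=float)
    noise_ref = np.asarray(noise_ref, dtype=float)
    weights = np.zeros(num_taps)
    output = np.zeros(len(noisy))
    for n in range(num_taps, len(noisy)):
        window = noise_ref[n - num_taps:n][::-1]   # most recent reference samples first
        noise_estimate = np.dot(weights, window)
        error = noisy[n] - noise_estimate          # expected minus estimated signal
        weights += 2 * mu * error * window         # steepest-descent weight update
        output[n] = error
    return output
```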
In an embodiment, as shown in fig. 4, the method for repairing a film based on speech synthesis further includes the following steps:
s401: text features of an audio text sequence are extracted by an encoder.
S402: and splicing the text features and the identity features through a first splicing module to obtain first splicing features.
S403: and predicting and outputting, by the attention-based decoder, a first Mel frequency spectrum frame sequence corresponding to the audio text sequence based on the first splicing features and the second splicing features output in the previous round; wherein the second splicing features are obtained by splicing the second Mel frequency spectrum frame sequence predicted and output by the decoder in the previous round with the identity features.
S404: and splicing the first Mel frequency spectrum frame sequence with the identity features through the second splicing module to obtain the second splicing features, which are used as the input of the decoder in the next round.
In this embodiment, the Tacotron2 model architecture, which by itself only applies to single-speaker speech synthesis, is improved so that it is applicable to multi-speaker speech synthesis scenarios. The Tacotron2 model typically includes an encoder and an attention-based decoder. In Tacotron2, the encoder is a feature encoder for extracting text features; the attention-based decoder is an autoregressive recurrent neural network that predicts the output Mel frequency spectrum frame sequence from the output sequence of the encoder (i.e., the first splicing features). To realize multi-speaker speech synthesis, a first splicing module is introduced at the output of the encoder to splice the text features output by the encoder with the identity features, and a second splicing module is introduced at the output of the decoder to splice the first Mel frequency spectrum frame sequence predicted by the current decoder step with the identity features. Furthermore, the Tacotron2 model also includes a vocoder that generates the corresponding time-domain waveform from the Mel frequency spectrum frame sequence predicted by the decoder, and this waveform serves as the synthesized speech.
Specifically, for ease of understanding, the process of using the first splicing module and the second splicing module in the Tacotron2 model is explained below, taking part of the Tacotron2 structure as an example.
First, the text features of the audio text sequence, i.e. the hidden state variables of the encoder, are extracted through the encoder. The encoder comprises a character embedding layer (Character Embedding), a group of convolution layers formed by 3 layers of one-dimensional convolution, a batch normalization and activation layer, and a bidirectional LSTM layer.
Specifically, the encoder extracts the text features of the audio text sequence as follows: the input audio text sequence is first encoded into 512-dimensional character vectors by the character embedding layer; three layers of one-dimensional convolution are then applied (each convolution layer comprising 512 convolution kernels of size 5x1); the output of the convolution layers is batch-normalized and activated with ReLU by the batch normalization and activation layer; finally, the activated output is fed into the bidirectional LSTM layer, whose hidden state variables are the text features.
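A minimal PyTorch sketch of an encoder with this structure (the character vocabulary size is a placeholder, and training details such as dropout are omitted):

```python
import torch
from torch import nn

class TextEncoder(nn.Module):
    """Character embedding -> three 1-D convolutions with batch norm and ReLU
    -> bidirectional LSTM, mirroring the encoder structure described above."""

    def __init__(self, num_chars=100, embed_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(num_chars, embed_dim)   # 512-dim character vectors
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(embed_dim, embed_dim, kernel_size=5, padding=2),  # 512 kernels of size 5x1
                nn.BatchNorm1d(embed_dim),
                nn.ReLU(),
            )
            for _ in range(3)
        ])
        # Bidirectional LSTM: 256 units per direction -> 512-dim text features.
        self.lstm = nn.LSTM(embed_dim, embed_dim // 2, batch_first=True, bidirectional=True)

    def forward(self, char_ids):                       # (batch, text_length)
        x = self.embedding(char_ids).transpose(1, 2)   # (batch, 512, text_length)
        for conv in self.convs:
            x = conv(x)
        x = x.transpose(1, 2)                          # back to (batch, text_length, 512)
        text_features, _ = self.lstm(x)                # hidden states are the text features
        return text_features

encoder = TextEncoder()
dummy_text = torch.randint(0, 100, (1, 20))            # 20 hypothetical character IDs
print(encoder(dummy_text).shape)                       # torch.Size([1, 20, 512])
```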
The encoder then passes the text features to an attention-based decoder, which includes a "pre-net" layer (a two-layer fully-connected network with 256 hidden ReLU units per layer), an attention network, an LSTM layer, a linear projection layer, and a post-processing network.
Specifically, the attention-based decoder is implemented as follows: the second Mel frequency spectrum frame sequence predicted in the previous round (i.e. the predicted output of the decoder in the previous round) is first passed into the "pre-net" layer, and the output of the pre-net layer is concatenated with the attention context vector and fed into a two-layer stacked unidirectional LSTM of N (e.g. 1024) units. The output of the LSTM layer is spliced again with the attention context vector, and a spectrum frame sequence is then predicted by linear projection; this sequence is passed through PostNet (a post-processing network formed by 5 convolution layers), whose output is added to the output of the linear projection layer (i.e. a residual connection) to give the first Mel frequency spectrum frame sequence of the predicted output (i.e. the predicted output of the current round of the decoder). The first Mel frequency spectrum frame sequence predicted by the decoder is spliced with the identity features by the second splicing module and used as the input of the decoder in the next round. The calculation of the attention context vector is consistent with that in the conventional Tacotron2 model, i.e. a mixed (location-sensitive) attention mechanism is used, with the attention energy
$e_{i,j} = v^{\top}\tanh(W s_i + V h_j + U f_{i,j} + b)$, where $f_i = F * ca_{i-1}$.
Here W, U, V, F and b are training parameters, $s_i$ is the hidden state variable of the current decoder step, $ca_{i-1}$ is the cumulative attention weight of the previous decoding process, $f_{i,j}$ is the position feature obtained by convolving the previous attention weights with F, and $h_j$ is the hidden state of the j-th encoder step. The attention weight corresponding to each encoder step is obtained by applying a softmax to these energies, and the attention context vector is the weighted average of the encoder hidden states under these attention weights.
S405: based on the first sequence of mel-frequency spectrum frames and the real label, a network loss is calculated.
S406: and updating the model parameters according to the network loss, and repeatedly executing the step of extracting the text features of the audio text sequence through the encoder until the network converges to obtain the multi-speaker speech synthesis network.
The network loss in this embodiment is constructed in the same way as the network loss of the conventional Tacotron2 model, i.e. it comprises the spectral loss, the loss of the post-processing network, and a regularization term over the model parameters. Specifically, after the predicted output of the decoder is obtained, the spectral loss can be calculated from the first Mel frequency spectrum sequence of the predicted output and the real label; the loss of the post-processing network can be calculated from the mean square error between the data before and after the post-processing network; and the regularization term over the model parameters is
$\lambda \sum_{p=1}^{P} W_p^{2}$,
where W denotes the model parameters, P the total number of model parameters, and λ the regularization coefficient. The network loss is obtained by summing these terms. The model parameters to be trained are then updated according to the network loss, and the step of extracting the text features of the audio text sequence through the encoder is repeated until the network converges, yielding the multi-speaker speech synthesis network. It should be noted that the network update process is consistent with the update process of the conventional Tacotron2 model and is not described again here.
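A minimal sketch of this combined objective, assuming the standard Tacotron2 convention that both the pre-PostNet and post-PostNet Mel predictions are compared against the real label by mean squared error (the function and argument names are placeholders):

```python
import torch
import torch.nn.functional as F

def network_loss(mel_before_postnet, mel_after_postnet, real_label, params, reg_lambda=1e-6):
    # Spectral losses against the real (ground-truth) Mel spectrum sequence.
    spectral_loss = F.mse_loss(mel_before_postnet, real_label)
    postnet_loss = F.mse_loss(mel_after_postnet, real_label)
    # L2 regularization term: lambda * sum of squared model parameters.
    reg_term = reg_lambda * sum((w ** 2).sum() for w in params)
    return spectral_loss + postnet_loss + reg_term

# Example call with dummy tensors standing in for one training batch.
pred_pre = torch.randn(4, 80, 100)
pred_post = torch.randn(4, 80, 100)
target = torch.randn(4, 80, 100)
dummy_params = [torch.randn(10, 10, requires_grad=True)]
print(network_loss(pred_pre, pred_post, target, dummy_params))
```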
In one embodiment, as shown in fig. 5, the step S301 of acquiring a target audio sample of the target actor in the same or different movie includes:
s501: adopting a pre-trained voiceprint recognition model to recognize a plurality of original audio samples in a film, and obtaining a recognition result; wherein the recognition result is used for indicating the actor identification corresponding to each original audio sample.
S502: and if the actor identifier indicated by the identification result is the target actor, taking the original audio sample corresponding to the actor identifier as the target audio sample.
Each original audio sample may be a line of dialogue or a single sentence spoken by an actor. Each original audio sample is input into the pre-trained voiceprint recognition model for recognition, so that the original audio samples are classified according to actor identifiers; the target actor corresponding to each original audio sample can thus be recognized automatically, realizing automatic collection of the audio samples of different target actors.
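One way such a voiceprint-based grouping could look in code is a cosine-similarity comparison of speaker embeddings; the sketch below is hypothetical, and the embedding model, reference embeddings and threshold are assumptions rather than details given by the embodiment:

```python
import numpy as np

def assign_to_target_actor(sample_embedding, actor_embeddings, threshold=0.75):
    """Return the actor identifier whose reference voiceprint embedding is most
    similar (cosine similarity) to the sample's embedding, or None if nothing
    exceeds the threshold. All names here are hypothetical."""
    best_actor, best_score = None, threshold
    for actor_id, ref in actor_embeddings.items():
        score = float(np.dot(sample_embedding, ref) /
                      (np.linalg.norm(sample_embedding) * np.linalg.norm(ref)))
        if score > best_score:
            best_actor, best_score = actor_id, score
    return best_actor

# Toy usage with random vectors in place of real voiceprint embeddings.
rng = np.random.default_rng(0)
refs = {"actor_a": rng.normal(size=128), "actor_b": rng.normal(size=128)}
print(assign_to_target_actor(refs["actor_a"] + 0.05 * rng.normal(size=128), refs))
```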
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In an embodiment, a movie repair apparatus based on speech synthesis is provided, and the movie repair apparatus based on speech synthesis corresponds to the movie repair method based on speech synthesis in the above embodiment one to one. As shown in fig. 6, the film restoration apparatus based on speech synthesis includes a data acquisition module 10, a speech synthesis module 20, and an audio restoration module 30. The functional modules are explained in detail as follows:
the data acquisition module 10 is configured to acquire an audio missing segment in a movie to be repaired; wherein the audio missing segment corresponds to at least one target actor; each target actor corresponds to a target audio text and an actor identifier.
A speech synthesis module 20, configured to input the target audio text of each target actor and the actor identifier into a pre-trained multi-speaker speech synthesis model for speech synthesis, so as to obtain a synthesized speech corresponding to each target actor.
And the audio repairing module 30 is configured to repair the audio missing segment in the to-be-repaired video according to the synthesized speech corresponding to each target actor.
Specifically, the multi-speaker speech synthesis model comprises an identity feature extraction network and a multi-speaker speech synthesis network obtained by training based on a Tacotron2 model; the Tacotron2 model comprises an encoder, a first splicing module, a decoder based on an attention mechanism and a second splicing module which are connected in sequence; the identity feature extraction network is respectively connected with the first splicing module and the second splicing module; the encoder is used for extracting text features; the first splicing module is used for splicing the text features and the identity features extracted through the identity feature extraction network; the attention-based decoder is configured to predict a sequence of output mel-frequency spectrum frames; the second splicing module is used for splicing the Mel frequency spectrum frame sequence and the identity feature.
Specifically, the film restoration device based on speech synthesis further comprises a sample acquisition module, a spectrum conversion module, an identity feature extraction module and a model fine-tuning module.
The sample acquisition module is used for acquiring a target audio sample corresponding to the target actor in the same or different films and converting the target audio sample into a compressed format file; wherein the target audio sample corresponds to a text sequence.
And the spectrum conversion module is used for converting the compressed format file into a Mel spectrum sequence as a real label.
And the identity characteristic extraction module is used for inputting the actor identification of the target actor into a pre-trained identity characteristic extraction network and extracting the identity characteristic of the target actor.
And the model fine-tuning module is used for fine-tuning the pre-trained Tacotron2 model based on the real label, the text sequence and the identity characteristic so as to obtain the multi-speaker voice synthesis network.
Specifically, the model fine-tuning module comprises a text feature extraction module, a first splicing module, a decoding module, a second splicing module, a network loss calculation module and a network updating module.
The text feature extraction module is used for extracting text features of the text sequence through the encoder;
and the first splicing module is used for splicing the text features and the identity features through the first splicing module to obtain first splicing features.
A decoding module, configured to predict and output, by an attention-based decoder, a first mel-frequency spectrum frame sequence corresponding to the audio text sequence based on the first splicing feature and a second splicing feature output in a previous round; wherein the second splicing characteristic is a splicing characteristic of a second Mel frequency spectrum frame sequence predicted and output by a previous round of decoder and the identity characteristic.
And the second splicing module is used for splicing the first Mel-spectrum frame sequence and the identity characteristic through the second splicing module to obtain a second splicing characteristic, and the second splicing characteristic is used as the input of a next decoder.
A network loss calculation module for calculating a network loss based on the first sequence of mel-frequency spectrum frames and the real tag.
And the network updating module is used for updating the model parameters according to the network loss and repeatedly executing the step of extracting the text features of the audio text sequence through the encoder until the network converges to obtain the multi-speaker speech synthesis network.
Specifically, the sample collection module comprises an identification unit and an automatic sample collection unit.
The identification unit is used for identifying a plurality of original audio samples in the film by adopting a pre-trained voiceprint identification model to obtain an identification result; wherein the identification result is used for indicating the actor identification corresponding to each original audio sample.
And the sample automatic acquisition unit is used for taking the original audio sample corresponding to the actor identifier as a target audio sample if the actor identifier indicated by the identification result is the target actor.
Specifically, after the target audio samples corresponding to the target actors in the same or different films are collected, the film restoration device based on speech synthesis further comprises a noise reduction unit, configured to perform noise reduction processing on the target audio samples by using an adaptive filter noise reduction algorithm and obtain noise-reduced target audio samples.
For specific limitations of the film restoration device based on speech synthesis, reference may be made to the above limitations of the film restoration method based on speech synthesis, which are not described herein again. The various modules in the above-described movie repair apparatus based on speech synthesis may be implemented in whole or in part by software, hardware, and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 7. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a computer storage medium and an internal memory. The computer storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the computer storage media. The database of the computer device is used for storing data generated or acquired during the execution of a method for film restoration based on speech synthesis, such as a multi-speaker speech synthesis model. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method for film repair based on speech synthesis.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
acquiring audio missing segments in a film to be repaired; wherein the audio missing segment corresponds to at least one target actor; each target actor corresponds to a target audio text and an actor identifier;
inputting the target audio text of each target actor and the actor identification into a pre-trained multi-speaker voice synthesis model for voice synthesis to obtain a synthetic voice corresponding to each target actor;
and repairing audio missing segments in the film to be repaired according to the synthesized voice corresponding to each target actor.
Alternatively, the processor implements the functions of each module/unit in the movie repair apparatus based on speech synthesis when executing the computer program, for example, the functions of each module/unit shown in fig. 6, and are not described herein again to avoid repetition.
In one embodiment, a computer storage medium is provided having a computer program stored thereon, the computer program when executed by a processor implementing the steps of:
acquiring audio missing segments in a film to be repaired; wherein the audio missing segment corresponds to at least one target actor; each target actor corresponds to a target audio text and an actor identifier;
inputting the target audio text of each target actor and the actor identification into a pre-trained multi-speaker voice synthesis model for voice synthesis to obtain a synthetic voice corresponding to each target actor;
and repairing audio missing segments in the film to be repaired according to the synthesized voice corresponding to each target actor.
Alternatively, the computer program, when executed by the processor, implements the functions of the modules/units in the above-mentioned movie repair apparatus based on speech synthesis, for example, the functions of the modules/units shown in fig. 6, and are not described herein again to avoid repetition.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware related to instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules, so as to perform all or part of the functions described above.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A method for repairing a film based on speech synthesis, comprising:
acquiring audio missing segments in a film to be repaired; wherein the audio missing segment corresponds to at least one target actor; each target actor corresponds to a target audio text and an actor identifier;
inputting the target audio text of each target actor and the actor identification into a pre-trained multi-speaker voice synthesis model for voice synthesis to obtain a synthetic voice corresponding to each target actor;
and repairing audio missing segments in the film to be repaired according to the synthesized voice corresponding to each target actor.
2. A method for repairing a film based on speech synthesis as claimed in claim 1, wherein the multi-speaker speech synthesis model comprises an identity feature extraction network and a multi-speaker speech synthesis network trained based on a Tacotron2 model; the Tacotron2 model comprises an encoder, a first splicing module, a decoder based on an attention mechanism and a second splicing module which are connected in sequence; the identity feature extraction network is respectively connected with the first splicing module and the second splicing module; the encoder is used for extracting text features; the first splicing module is used for splicing the text features and the identity features extracted through the identity feature extraction network; the attention-based decoder is configured to predict a sequence of output mel-frequency spectrum frames; the second splicing module is used for splicing the Mel frequency spectrum frame sequence and the identity feature.
3. The method for repairing a film based on speech synthesis according to claim 2, wherein said training based on Tacotron2 model obtains a multi-speaker speech synthesis network, comprising:
acquiring target audio samples corresponding to the target actors in the same or different films, and converting the target audio samples into files in a compressed format; wherein the target audio sample corresponds to a text sequence;
converting the compressed format file into a Mel frequency spectrum sequence as a real label;
inputting the actor identification of the target actor into a pre-trained identity feature extraction network, and extracting the identity feature of the target actor;
and fine-tuning the pre-trained Tacotron2 model based on the real label, the text sequence and the identity characteristic to obtain a multi-speaker voice synthesis network.
4. A method for repairing a film based on speech synthesis as claimed in claim 3, wherein said fine-tuning the pre-trained Tacotron2 model based on said real label, said text sequence and said identity feature to obtain a multi-speaker speech synthesis network comprises:
extracting, by the encoder, text features of the text sequence;
splicing the text feature and the identity feature through the first splicing module to obtain a first splicing feature;
predicting and outputting a first Mel frequency spectrum frame sequence corresponding to the audio text sequence based on the first splicing characteristics and a second splicing characteristics output in the previous round by an attention-based decoder; wherein the second splicing feature is a splicing feature of a second Mel frequency spectrum frame sequence predicted and output by a previous round of decoder and the identity feature;
splicing the first Mel-spectrum frame sequence and the identity characteristics through the second splicing module to obtain second splicing characteristics, and using the second splicing characteristics as the input of a decoder in the next round;
calculating a network loss based on the first sequence of mel-frequency spectrum frames and the real label;
and updating the model parameters according to the network loss, and repeatedly executing the step of extracting the text features of the audio text sequence through the encoder until the network converges to obtain the multi-speaker speech synthesis network.
5. A method as claimed in claim 3, wherein the acquiring the corresponding target audio samples of the target actors in the same or different movies comprises:
recognizing a plurality of original audio samples in the film by adopting a pre-trained voiceprint recognition model to obtain a recognition result; wherein the identification result is used for indicating the actor identification corresponding to each original audio sample;
and if the actor identifier indicated by the identification result is the target actor, taking the original audio sample corresponding to the actor identifier as a target audio sample.
6. A method for film restoration based on speech synthesis as defined in claim 3, wherein after the acquiring of the target audio sample corresponding to the target actor in the same or different film, the method for film restoration further comprises:
and carrying out noise reduction processing on the target audio sample by adopting a self-adaptive filter noise reduction algorithm to obtain a noise-reduced target audio sample.
7. A film restoration apparatus based on speech synthesis, comprising:
the data acquisition module is used for acquiring audio missing segments in the film to be repaired; wherein the audio missing segment corresponds to at least one target actor; each target actor corresponds to a target audio text and an actor identifier;
a voice synthesis module, configured to input the target audio text of each target actor and the actor identifier into a pre-trained multi-speaker voice synthesis model for voice synthesis, so as to obtain a synthesized voice corresponding to each target actor;
and the audio repairing module is used for repairing audio missing segments in the film to be repaired according to the synthesized voice corresponding to each target actor.
8. The voice synthesis-based film restoration apparatus according to claim 7, further comprising:
the sample acquisition module is used for acquiring a target audio sample corresponding to the target actor in the same or different films and converting the audio sample into a compressed format file; wherein the audio samples correspond to a text sequence;
the frequency spectrum conversion module is used for converting the compressed format file into a Mel frequency spectrum sequence as a real label;
the identity characteristic extraction module is used for inputting the actor identification of the target actor into a pre-trained identity characteristic extraction network and extracting the identity characteristic of the target actor;
and the model fine-tuning module is used for fine-tuning the pre-trained Tacotron2 model based on the real label, the text sequence and the identity characteristics so as to obtain the multi-speaker voice synthesis network.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method for repairing a film based on speech synthesis according to any one of claims 1 to 6 when executing the computer program.
10. A computer storage medium storing a computer program, wherein the computer program when executed by a processor implements the steps of the method for repairing a film based on speech synthesis according to any one of claims 1 to 6.
CN202110605270.5A 2021-05-31 2021-05-31 Film restoration method, device, equipment and medium based on voice synthesis Active CN113345414B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110605270.5A CN113345414B (en) 2021-05-31 2021-05-31 Film restoration method, device, equipment and medium based on voice synthesis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110605270.5A CN113345414B (en) 2021-05-31 2021-05-31 Film restoration method, device, equipment and medium based on voice synthesis

Publications (2)

Publication Number Publication Date
CN113345414A true CN113345414A (en) 2021-09-03
CN113345414B CN113345414B (en) 2022-12-27

Family

ID=77473624

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110605270.5A Active CN113345414B (en) 2021-05-31 2021-05-31 Film restoration method, device, equipment and medium based on voice synthesis

Country Status (1)

Country Link
CN (1) CN113345414B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109697974A (en) * 2017-10-19 2019-04-30 百度(美国)有限责任公司 Use the system and method for the neural text-to-speech that convolution sequence learns
CN110808027A (en) * 2019-11-05 2020-02-18 腾讯科技(深圳)有限公司 Voice synthesis method and device and news broadcasting method and system
CN112349273A (en) * 2020-11-05 2021-02-09 携程计算机技术(上海)有限公司 Speech synthesis method based on speaker, model training method and related equipment
CN112509550A (en) * 2020-11-13 2021-03-16 中信银行股份有限公司 Speech synthesis model training method, speech synthesis device and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JONATHAN SHEN et al.: "NATURAL TTS SYNTHESIS BY CONDITIONING WAVENET ON MEL SPECTROGRAM PREDICTIONS", 《ARXIV》 *
王峥: 语音合成技术在声音修复上的尝试 (An attempt at speech synthesis technology for sound restoration), 《现代电影技术》 (Modern Film Technology) *

Also Published As

Publication number Publication date
CN113345414B (en) 2022-12-27

Similar Documents

Publication Publication Date Title
CN110223705B (en) Voice conversion method, device, equipment and readable storage medium
CN111247585B (en) Voice conversion method, device, equipment and storage medium
CN106683677B (en) Voice recognition method and device
CN110148400B (en) Pronunciation type recognition method, model training method, device and equipment
JP7243760B2 (en) Audio feature compensator, method and program
CN108520741A (en) A kind of whispering voice restoration methods, device, equipment and readable storage medium storing program for executing
CN112037766A (en) Voice tone conversion method and related equipment
CN112037754A (en) Method for generating speech synthesis training data and related equipment
CN110136696B (en) Audio data monitoring processing method and system
WO2023116660A2 (en) Model training and tone conversion method and apparatus, device, and medium
CN112562634A (en) Multi-style audio synthesis method, device, equipment and storage medium
CN114627856A (en) Voice recognition method, voice recognition device, storage medium and electronic equipment
CN110837758A (en) Keyword input method and device and electronic equipment
CN112802444A (en) Speech synthesis method, apparatus, device and storage medium
CN113362804B (en) Method, device, terminal and storage medium for synthesizing voice
CN110176243B (en) Speech enhancement method, model training method, device and computer equipment
Mandel et al. Audio super-resolution using concatenative resynthesis
KR102319753B1 (en) Method and apparatus for producing video contents based on deep learning
CN113611281A (en) Voice synthesis method and device, electronic equipment and storage medium
CN113345414B (en) Film restoration method, device, equipment and medium based on voice synthesis
Gref et al. Multi-Staged Cross-Lingual Acoustic Model Adaption for Robust Speech Recognition in Real-World Applications--A Case Study on German Oral History Interviews
CN113516964B (en) Speech synthesis method and readable storage medium
CN115472174A (en) Sound noise reduction method and device, electronic equipment and storage medium
CN114822497A (en) Method, apparatus, device and medium for training speech synthesis model and speech synthesis
CN111798849A (en) Robot instruction identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant