CN112652309A - Dialect speech conversion method, apparatus, device and storage medium - Google Patents

Dialect speech conversion method, apparatus, device and storage medium

Info

Publication number
CN112652309A
Authority
CN
China
Prior art keywords
dialect
target
voice
speaker
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011518758.6A
Other languages
Chinese (zh)
Inventor
Bao Xiao (鲍晓)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202011518758.6A
Publication of CN112652309A
Legal status: Pending

Classifications

    • G10L 15/26: Speech recognition; speech-to-text systems
    • G06F 16/65: Information retrieval of audio data; clustering; classification
    • G06N 3/045: Neural network architectures; combinations of networks
    • G06N 3/08: Neural network learning methods
    • G10L 15/005: Speech recognition; language recognition
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 17/00: Speaker identification or verification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a dialect speech conversion method, apparatus, device, and storage medium. The method comprises: acquiring source dialect speech of a target speaker; converting the source dialect speech into target dialect text, and extracting speaker information of the target speaker from the source dialect speech; and synthesizing, from the target dialect text and the speaker information of the target speaker, target dialect speech that matches the speaking characteristics of the target speaker. The method can thus convert a target speaker's source dialect speech into target dialect speech in that speaker's own voice.

Description

Dialect speech conversion method, apparatus, device and storage medium
Technical Field
The present application relates to the field of speech synthesis technologies, and in particular to a dialect speech conversion method, apparatus, device, and storage medium.
Background
Chinese dialects are regional branches of the Chinese language; China's territory is vast and its dialects are numerous. Repeated cycles of divergence and unification over the course of Chinese history gradually gave rise to these dialects. Modern Chinese dialects differ in pronunciation, vocabulary, and grammar, with pronunciation differences being especially prominent.
As population mobility increases, the distinct pronunciations of different regions mean that people from different areas often cannot understand one another, so the communication barrier caused by mutually unintelligible dialects is a problem in urgent need of a solution. Existing translation products all convert between different languages and do not address conversion between dialect speech, so how to convert between dialect speech is a problem that currently needs to be solved.
Disclosure of Invention
In view of the above, the present application provides a dialect speech conversion method, apparatus, device, and storage medium for converting source dialect speech into target dialect speech. The technical solution is as follows:
A dialect speech conversion method, comprising:
acquiring source dialect speech of a target speaker;
converting the source dialect speech into target dialect text, and extracting speaker information of the target speaker from the source dialect speech;
and synthesizing, from the target dialect text and the speaker information of the target speaker, target dialect speech that matches the speaking characteristics of the target speaker.
Optionally, converting the source dialect speech into target dialect text includes:
performing speech recognition on the source dialect speech to obtain source dialect text;
converting the source dialect text into the target dialect text.
Optionally, synthesizing, from the target dialect text and the speaker information of the target speaker, target dialect speech that matches the speaking characteristics of the target speaker includes:
synthesizing the target dialect speech with a pre-established speech synthesis model, based on the target dialect text and the speaker information of the target speaker;
wherein the speech synthesis model is trained on training samples each comprising target dialect text and speaker information obtained from a source dialect training utterance, and the training targets of the speech synthesis model include: making the speaker information extracted from the target dialect synthesized speech corresponding to a source dialect training utterance consistent with the speaker information extracted from that source dialect training utterance.
Optionally, the speech synthesis model is the generator network of a generative adversarial network (GAN);
and the training targets of the speech synthesis model further include: making the discriminator network of the generative adversarial network unable to tell whether the target dialect synthesized speech corresponding to a source dialect training utterance is synthesized speech or real speech.
Optionally, synthesizing the target dialect speech with the pre-established speech synthesis model based on the target dialect text and the speaker information of the target speaker includes:
encoding the target dialect text and the speaker information of the target speaker separately to obtain a representation vector of the target dialect text and a representation vector of the speaker information of the target speaker;
inputting the two representation vectors into the speech synthesis model to obtain a frequency-domain signal that characterizes the synthesized target dialect speech;
and converting the frequency-domain signal to the time domain to obtain the synthesized target dialect speech.
Optionally, the process of establishing the speech synthesis model includes:
constructing a training sample set from a source dialect training speech set, wherein each training sample in the set comprises target dialect text and speaker information obtained from one source dialect training utterance in the set;
and training the generator network of a generative adversarial network, which serves as the speech synthesis model, using the training sample set together with the discriminator network of the generative adversarial network.
Optionally, training the generator network of the generative adversarial network as the speech synthesis model using the training sample set and the discriminator network includes:
obtaining a training sample from the training sample set;
performing speech synthesis with the generator network based on the obtained training sample, to obtain a frequency-domain signal characterizing the target dialect synthesized speech corresponding to the training sample;
extracting speaker information from the frequency-domain signal, and determining a speaker information consistency value from the extracted speaker information and the speaker information in the training sample;
classifying the target dialect synthesized speech corresponding to the training sample as real speech or synthesized speech, using the discriminator network and the frequency-domain signal, to obtain a classification result;
and updating the parameters of the generator network according to the speaker information consistency value and the classification result.
Optionally, classifying the target dialect synthesized speech corresponding to the training sample as real speech or synthesized speech using the discriminator network and the frequency-domain signal includes:
acquiring the frequency-domain signal of the target dialect synthesized speech corresponding to the training sample;
and inputting the frequency-domain signal into a dialect speech classification model serving as the discriminator network for classification, to obtain the classification result of the target dialect synthesized speech corresponding to the training sample.
Optionally, determining the speaker information consistency value from the speaker information extracted from the frequency-domain signal and the speaker information in the training sample includes:
calculating the error between the speaker information extracted from the frequency-domain signal and the speaker information in the training sample as the speaker information consistency value.
A dialect speech conversion apparatus comprising: a speech acquisition module, a speech conversion module, a speaker information extraction module, and a speech synthesis module;
wherein the speech acquisition module is configured to acquire source dialect speech of a target speaker;
the speech conversion module is configured to convert the source dialect speech into target dialect text;
the speaker information extraction module is configured to extract speaker information of the target speaker from the source dialect speech;
and the speech synthesis module is configured to synthesize, from the target dialect text and the speaker information of the target speaker, target dialect speech that matches the speaking characteristics of the target speaker.
Optionally, the speech synthesis module is specifically configured to synthesize the target dialect speech matching the speaking characteristics of the target speaker with a pre-established speech synthesis model, based on the target dialect text and the speaker information of the target speaker;
wherein the speech synthesis model is trained on training samples each comprising target dialect text and speaker information obtained from a source dialect training utterance, and the training targets of the speech synthesis model include: making the speaker information extracted from the target dialect synthesized speech corresponding to a source dialect training utterance consistent with the speaker information extracted from that source dialect training utterance.
Optionally, the speech synthesis model is the generator network of a generative adversarial network;
and the training targets of the speech synthesis model further include: making the discriminator network of the generative adversarial network unable to tell whether the target dialect synthesized speech corresponding to a source dialect training utterance is synthesized speech or real speech.
A dialect speech conversion device, comprising: a memory and a processor;
the memory is configured to store a program;
and the processor is configured to execute the program to implement the steps of any of the dialect speech conversion methods described above.
A readable storage medium having a computer program stored thereon which, when executed by a processor, implements the steps of any of the dialect speech conversion methods described above.
It can be seen from the above scheme that, with the dialect speech conversion method provided by this application, after the source dialect speech of the target speaker is acquired, the source dialect speech is converted into target dialect text on one hand, and the speaker information of the target speaker is extracted from the source dialect speech on the other; target dialect speech that matches the speaking characteristics of the target speaker is then synthesized from the target dialect text and the speaker information of the target speaker.
Drawings
To illustrate the embodiments of the present invention or prior-art solutions more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only embodiments of the present invention, and a person skilled in the art could derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a dialect speech conversion method according to an embodiment of the present application;
Fig. 2 is a schematic flowchart of establishing a speech synthesis model according to an embodiment of the present application;
Fig. 3 is a schematic flowchart of training the generator network serving as the speech synthesis model in a generative adversarial network, using a training sample set and the discriminator network of that network, according to an embodiment of the present application;
Fig. 4 is a schematic flowchart of classifying target dialect synthesized speech with the discriminator network of the generative adversarial network according to an embodiment of the present application;
Fig. 5 is a schematic diagram of the distributions of real dialect speech and synthesized dialect speech before and after adversarial training according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of a dialect speech conversion apparatus according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a dialect speech conversion device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below clearly and completely with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments derived by a person skilled in the art from the given embodiments without creative effort fall within the protection scope of the present invention.
To realize conversion between dialect speech, the inventor of the present application conducted research; the initial idea was:
first, perform speech recognition on the source dialect speech to obtain source dialect text; then translate the source dialect text into target dialect text; and finally, perform speech synthesis on the target dialect text with a speech synthesis model corresponding to the target dialect (a model trained on target dialect training text and the real target dialect speech corresponding to that text), thereby obtaining the target dialect speech.
Through research on this dialect speech conversion scheme, however, the inventor found that it comprises three parts, namely dialect speech recognition, dialect text translation, and dialect speech synthesis. Dialect speech recognition and dialect text translation are relatively mature technologies, but dialect speech synthesis is not, and performing speech synthesis on the target dialect text with a speech synthesis model corresponding to the target dialect has the following problems:
first, the speech synthesized by the model is not natural and stable enough, especially for long utterances; second, because the synthesis is based only on the target dialect text, the model can only synthesize speech in a fixed voice, meaning the synthesized speech does not match the speaking characteristics of the speaker of the source dialect speech.
In view of the problems of the above scheme, the inventor continued the research and finally arrived at a dialect speech conversion method with better results: it converts the source dialect speech of a target speaker into target dialect speech that matches that speaker's speaking characteristics, and the converted speech is more natural and stable. The basic idea of the method is:
after the source dialect speech of the target speaker is acquired, the source dialect speech is converted into target dialect text on one hand, and the speaker information of the target speaker is extracted from the source dialect speech on the other; speech synthesis is then performed from the target dialect text and the speaker information of the target speaker.
The dialect speech conversion method provided by this application can be applied to any electronic device with data processing capability, which may be a network-side server or a user-side terminal such as a mobile phone, a personal computer (PC), or a tablet. The electronic device acquires the source dialect speech of a target speaker and, using the method provided by this application, converts it into target dialect speech that matches the speaking characteristics of the target speaker.
The dialect speech conversion method provided by the present application is described next by the following embodiments.
First embodiment
Referring to fig. 1, which shows a schematic flowchart of the dialect speech conversion method provided by an embodiment of the present application, the method may include:
Step S101: acquiring the source dialect speech of a target speaker.
The source dialect speech is the dialect speech to be converted; it may be speech in any dialect and may be acquired through any channel.
Step S102a: converting the source dialect speech into target dialect text.
Specifically, converting the source dialect speech into target dialect text may include:
Step S102a-1: performing speech recognition on the source dialect speech to obtain source dialect text.
Specifically, the source dialect speech may be input into a speech recognition system, which outputs the source dialect text.
Step S102a-2: converting the source dialect text into the target dialect text.
Specifically, the source dialect text may be input into a dialect text translation system, which outputs the target dialect text.
Step S102b: extracting the speaker information of the target speaker from the source dialect speech.
The speaker information of the target speaker embodies the speaking characteristics of the target speaker.
Specifically, extracting the speaker information of the target speaker from the source dialect speech includes: first converting the source dialect speech to the frequency domain to obtain a frequency-domain signal, then inputting the frequency-domain signal into a voiceprint extraction model, and taking the voiceprint extracted by the model as the speaker information of the target speaker.
The voiceprint extraction model is trained with speaker classification as the optimization target; on this basis it can extract speaker-discriminative voiceprints from the frequency-domain signal of the source dialect speech.
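For concreteness, the following PyTorch sketch illustrates this voiceprint extraction step. It is a minimal illustration rather than the model of this application: the network layout, feature dimensions, and truncation of the spectrogram to 80 bins (where a mel filterbank would normally be used) are all assumptions of the sketch.

```python
import torch
import torch.nn as nn

class VoiceprintExtractor(nn.Module):
    """Toy voiceprint model: an encoder pooled over time yields the speaker
    embedding; the classification head is only used during training, where
    speaker classification is the optimization target."""
    def __init__(self, n_feats: int = 80, emb_dim: int = 128, n_speakers: int = 100):
        super().__init__()
        self.encoder = nn.LSTM(n_feats, emb_dim, batch_first=True)
        self.classifier = nn.Linear(emb_dim, n_speakers)  # training-time head

    def forward(self, freq_feats: torch.Tensor) -> torch.Tensor:
        # freq_feats: (batch, frames, n_feats), frequency-domain features
        hidden, _ = self.encoder(freq_feats)
        return hidden.mean(dim=1)  # temporal pooling -> speaker embedding

# Convert the source dialect speech to the frequency domain, then extract.
waveform = torch.randn(1, 16000)  # placeholder for 1 s of source dialect speech
spec = torch.stft(waveform, n_fft=512, hop_length=160,
                  window=torch.hann_window(512), return_complex=True).abs()
feats = spec.transpose(1, 2)[..., :80]  # keep 80 bins (a mel filterbank would normally be used)
speaker_info = VoiceprintExtractor()(feats)  # speaker information of the target speaker
```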
Step S103: synthesizing, from the target dialect text and the speaker information of the target speaker, target dialect speech that matches the speaking characteristics of the target speaker.
Specifically, this may include: synthesizing the target dialect speech with a pre-established speech synthesis model, based on the target dialect text and the speaker information of the target speaker.
In one possible implementation, to enable the speech synthesis model to synthesize target dialect speech that matches the speaking characteristics of the target speaker, the model may be trained on training samples each comprising target dialect text and speaker information obtained from a source dialect training utterance, with the training target of making the speaker information extracted from the corresponding target dialect synthesized speech consistent with the speaker information extracted from the source dialect training utterance.
In another possible implementation, to enable the model to synthesize speech that both matches the speaking characteristics of the target speaker and is closer to real target dialect speech, the model is trained on the same kind of training samples with two targets: first, making the speaker information extracted from the target dialect synthesized speech corresponding to a source dialect training utterance consistent with the speaker information extracted from that utterance; and second, making the discriminator network of a generative adversarial network unable to tell whether the target dialect synthesized speech is synthesized speech or real speech.
More specifically, synthesizing the target dialect speech with the pre-established speech synthesis model may include: first, encoding the target dialect text and the speaker information of the target speaker separately to obtain a representation vector of each; then, inputting both representation vectors into the speech synthesis model to obtain the frequency-domain signal, output by the model, that characterizes the synthesized target dialect speech; and finally, converting the frequency-domain signal to the time domain to obtain the synthesized target dialect speech.
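The encode-synthesize-invert pipeline just described can be sketched as follows. The architecture is a toy stand-in under assumed dimensions, and Griffin-Lim phase reconstruction is used here only as one possible way to convert the frequency-domain signal to the time domain; the application does not specify the conversion method.

```python
import torch
import torch.nn as nn
import torchaudio

class ToySynthesizer(nn.Module):
    """Maps encoded target dialect text plus a speaker vector to a magnitude
    spectrogram, i.e. the frequency-domain signal of the synthesized speech."""
    def __init__(self, text_dim: int = 256, spk_dim: int = 128, n_freq: int = 257):
        super().__init__()
        self.rnn = nn.GRU(text_dim + spk_dim, 256, batch_first=True)
        self.out = nn.Linear(256, n_freq)

    def forward(self, text_vecs: torch.Tensor, spk_vec: torch.Tensor) -> torch.Tensor:
        # text_vecs: (batch, steps, text_dim); spk_vec: (batch, spk_dim)
        spk = spk_vec.unsqueeze(1).expand(-1, text_vecs.size(1), -1)
        hidden, _ = self.rnn(torch.cat([text_vecs, spk], dim=-1))
        return self.out(hidden).transpose(1, 2)  # (batch, n_freq, frames)

text_vecs = torch.randn(1, 120, 256)  # placeholder: representation vectors of the target dialect text
spk_vec = torch.randn(1, 128)         # placeholder: representation vector of the speaker information
freq_signal = ToySynthesizer()(text_vecs, spk_vec).clamp(min=0.0)
# Frequency domain -> time domain, here via Griffin-Lim phase reconstruction.
wav = torchaudio.transforms.GriffinLim(n_fft=512, hop_length=160)(freq_signal)
```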
With the dialect speech conversion method provided by this embodiment, after the source dialect speech of the target speaker is acquired, the source dialect speech is converted into target dialect text on one hand, and the speaker information of the target speaker is extracted from the source dialect speech on the other; target dialect speech that matches the speaking characteristics of the target speaker is then synthesized from the target dialect text and the speaker information.
Second embodiment
The above embodiment mentions that target dialect speech matching the speaking characteristics of the target speaker is synthesized with a pre-established speech synthesis model; this embodiment focuses on the process of establishing that speech synthesis model.
There are various ways to establish the speech synthesis model; this embodiment provides the following two alternative implementations:
the first implementation mode comprises the following steps:
step a 1: and constructing a training sample set according to the source dialect training voice set.
The source dialect training speech set includes at least one source dialect training speech, and generally speaking, the source dialect training speech set includes a plurality of source dialect training speech.
Specifically, the process of constructing the training sample set according to the source dialect training speech set includes: for each source dialect training voice in the source dialect training voice set, converting the source dialect training voice into a target dialect text, extracting speaker information from the source dialect voice, forming a training sample by the obtained target dialect text and the speaker information, and forming a training sample set by all the obtained training samples.
Wherein the process of converting the source dialect training speech into the target dialect text comprises: firstly, carrying out voice recognition on a source dialect training voice to obtain a source dialect text, and then converting the source dialect text into a target dialect text.
As can be seen from the above process, each training sample in the training sample set includes target dialect text and speaker information obtained from one of the source dialect training voices in the source dialect training voice set.
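A minimal sketch of step a1 follows. The recognizer, translator, and voiceprint extractor are passed in as callables because the application does not fix concrete systems for them; every name here is a placeholder.

```python
from typing import Callable, List, Tuple
import torch

def build_training_set(
    source_utterances: List[torch.Tensor],
    recognize: Callable[[torch.Tensor], str],       # dialect speech recognition system
    translate: Callable[[str], str],                # dialect text translation system
    extract_speaker: Callable[[torch.Tensor], torch.Tensor],  # voiceprint extraction model
) -> List[Tuple[str, torch.Tensor]]:
    """Step a1: one training sample per source dialect training utterance,
    each sample pairing target dialect text with speaker information."""
    samples = []
    for utterance in source_utterances:
        target_text = translate(recognize(utterance))  # speech -> source text -> target text
        speaker_info = extract_speaker(utterance)      # speaker info from the same utterance
        samples.append((target_text, speaker_info))
    return samples
```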
Step a2: training the speech synthesis model with the training sample set.
Specifically, training the speech synthesis model with the training sample set may include:
Step a21: obtaining a training sample (call it x) from the training sample set.
Step a22: performing speech synthesis with the speech synthesis model based on the obtained training sample x, to obtain a frequency-domain signal that characterizes the target dialect synthesized speech corresponding to x.
Note that if the training sample x was obtained from source dialect training utterance s1, the target dialect synthesized speech corresponding to x is the target dialect speech obtained by dialect speech conversion of s1.
Specifically, step a22 may include: first, encoding the target dialect text and the speaker information in training sample x separately to obtain a representation vector of each; then, inputting both representation vectors into the speech synthesis model for speech synthesis, to obtain the frequency-domain signal characterizing the target dialect synthesized speech corresponding to x.
Step a23: extracting speaker information from the frequency-domain signal characterizing the target dialect synthesized speech corresponding to x, and determining a speaker information consistency value from the extracted speaker information and the speaker information in x.
Specifically, the error between the speaker information extracted from the frequency-domain signal characterizing the target dialect synthesized speech corresponding to x and the speaker information in x may be calculated as the speaker information consistency value. Optionally, the mean squared error between the two pieces of speaker information may be computed as the consistency value according to the following formula:

$\mathrm{Error}_{\mathrm{spk}} = \frac{1}{N}\sum_{i=1}^{N}\left(\mathrm{Embedding}_{A,i} - \mathrm{Embedding}_{B,i}\right)^{2}$

where Embedding_A denotes the speaker information extracted from the frequency-domain signal characterizing the target dialect synthesized speech corresponding to training sample x, Embedding_B denotes the speaker information in x, and N is the embedding dimension. Error_spk, the mean squared error between Embedding_A and Embedding_B, characterizes whether the speaker information of the source dialect training utterance s1 is consistent with that of the target dialect synthesized speech corresponding to x.
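In code, the consistency value of step a23 is a one-liner; the 128-dimensional embeddings below are placeholders for the two pieces of speaker information.

```python
import torch
import torch.nn.functional as F

emb_a = torch.randn(1, 128)  # speaker info extracted from the target dialect synthesized speech
emb_b = torch.randn(1, 128)  # speaker info in training sample x (from the source utterance)
error_spk = F.mse_loss(emb_a, emb_b)  # speaker information consistency value
```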
Step a24: updating the parameters of the speech synthesis model based on the speaker information consistency value.
Steps a21 to a24 are iterated until a training end condition is met.
A speech synthesis model established through the above process can synthesize target dialect speech that matches the speaking characteristics of the target speaker.
The second implementation:
Referring to fig. 2, which shows a flow diagram of the second implementation of establishing the speech synthesis model, the process may include:
Step S201: constructing a training sample set from a source dialect training speech set.
The specific implementation of step S201 is the same as that of step a1 and is not repeated here.
Step S202: training the generator network of a generative adversarial network, which serves as the speech synthesis model, using the training sample set together with the discriminator network of the generative adversarial network.
The generative adversarial network comprises a generator network and a discriminator network; the generator network serves as the speech synthesis model, and the discriminator network assists in training it.
Referring to fig. 3, which shows a flowchart of training the generator network serving as the speech synthesis model using the training sample set and the discriminator network, the process may include:
Step S301: obtaining a training sample (call it x) from the training sample set.
Step S302: performing speech synthesis with the generator network of the generative adversarial network based on the obtained training sample x, to obtain a frequency-domain signal characterizing the target dialect synthesized speech corresponding to x.
The specific implementation of step S302 is the same as that of step a22 and is not repeated here.
Step S303a: extracting speaker information from the frequency-domain signal characterizing the target dialect synthesized speech corresponding to x, and determining a speaker information consistency value from the extracted speaker information and the speaker information in x.
The specific implementation of step S303a is the same as that of step a23 and is not repeated here.
Step S303b: classifying the target dialect synthesized speech corresponding to x as real speech or synthesized speech, using the discriminator network of the generative adversarial network and the frequency-domain signal characterizing that synthesized speech, to obtain a classification result.
The discriminator network is a dialect speech classification model, i.e. a binary classifier of real speech versus synthesized speech.
Specifically, as shown in fig. 4, the frequency-domain signal characterizing the target dialect synthesized speech corresponding to x is input into the discriminator network, and the discriminator network judges from the input frequency-domain signal whether that speech is real speech or synthesized speech.
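A toy version of such a discriminator might look as follows; the layer sizes are assumptions of the sketch, and the only essential property is a single real-versus-synthesized logit computed from the frequency-domain signal.

```python
import torch
import torch.nn as nn

class DialectSpeechDiscriminator(nn.Module):
    """Binary real-versus-synthesized classifier over frequency-domain signals."""
    def __init__(self, n_freq: int = 257):
        super().__init__()
        self.rnn = nn.GRU(n_freq, 128, batch_first=True)
        self.head = nn.Linear(128, 1)  # one logit: real (1) vs synthesized (0)

    def forward(self, freq_signal: torch.Tensor) -> torch.Tensor:
        # freq_signal: (batch, n_freq, frames) -> (batch, frames, n_freq)
        hidden, _ = self.rnn(freq_signal.transpose(1, 2))
        return self.head(hidden.mean(dim=1))  # (batch, 1) logit
```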
Step S304: updating the parameters of the generator network serving as the speech synthesis model according to the speaker information consistency value and the classification result.
On one hand, the parameters of the generator network are updated according to the speaker information consistency value, so that the target dialect speech it synthesizes carries the same speaker characteristics as the source dialect speech; on the other hand, based on the discriminator's classification result, the generator's parameters are adjusted through a gradient reversal strategy, so that the synthesized target dialect speech becomes closer to real target dialect speech and harder for the discriminator to distinguish.
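A sketch of step S304 under the gradient reversal strategy follows. It assumes a single optimizer over both the generator's and the discriminator's parameters, a frozen voiceprint extractor, and component shapes compatible with the earlier sketches; the unweighted loss sum and the companion update on real speech are simplifications, not the application's exact procedure.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity in the forward pass, negated
    gradient in the backward pass, so one backward pass trains the
    discriminator normally while pushing the generator the opposite way."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

def generator_step(generator, discriminator, extractor, optimizer, text_vecs, spk_vec):
    # `optimizer` is assumed to hold the parameters of both the generator and
    # the discriminator; the voiceprint extractor stays frozen.
    freq_signal = generator(text_vecs, spk_vec)  # step S302
    # First target (steps S303a/S304): speaker information consistency.
    # The extractor is assumed to accept (batch, frames, n_freq) features.
    consistency = F.mse_loss(extractor(freq_signal.transpose(1, 2)), spk_vec)
    # Second target (steps S303b/S304): the discriminator learns to label the
    # synthesized signal 0 ("synthesized"), while the reversed gradient drives
    # the generator to make it look real. A companion update on real target
    # dialect speech with label 1 would train the discriminator's "real" side.
    logits = discriminator(GradReverse.apply(freq_signal))
    adversarial = F.binary_cross_entropy_with_logits(logits, torch.zeros_like(logits))
    loss = consistency + adversarial  # unweighted sum, an assumption of this sketch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```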
Steps S301 to S304 are iterated until a training end condition is met; the generator network of the generative adversarial network obtained when training ends serves as the speech synthesis model.
As the above training process shows, to make the target dialect speech synthesized by the speech synthesis model closer to real target dialect speech, this application performs adversarial training based on a dialect speech classification model: the generator network serving as the speech synthesis model plays a game against the dialect speech classification model, implemented by introducing a gradient reversal layer, with the goal that after training the dialect speech classification model can no longer distinguish synthesized dialect speech from real dialect speech. Referring to fig. 5: fig. 5(a) shows the distributions of real and synthesized dialect speech before adversarial training, where the two are clearly separated, while fig. 5(b) shows the distributions after adversarial training, where the dialect classification model can hardly distinguish synthesized from real dialect speech. In addition, to make the synthesized target dialect speech carry the speaking characteristics of the speaker of the corresponding source dialect speech, this application adopts speaker consistency training: during training, the speaker information extracted from the synthesized target dialect speech is made consistent with the speaker information extracted from the corresponding source dialect speech.
Third embodiment
The following describes the dialect speech conversion apparatus provided by an embodiment of the present application; the apparatus described below and the method described above may be referred to in correspondence with each other.
Referring to fig. 6, which shows a schematic structural diagram of the dialect speech conversion apparatus, the apparatus may include: a speech acquisition module 601, a speech conversion module 602a, a speaker information extraction module 602b, and a speech synthesis module 603.
The speech acquisition module 601 is configured to acquire the source dialect speech of a target speaker;
the speech conversion module 602a is configured to convert the source dialect speech into target dialect text;
the speaker information extraction module 602b is configured to extract the speaker information of the target speaker from the source dialect speech;
and the speech synthesis module 603 is configured to synthesize, from the target dialect text and the speaker information of the target speaker, target dialect speech that matches the speaking characteristics of the target speaker.
Optionally, the speech conversion module 602a may include a speech recognition module and a text conversion module.
The speech recognition module is configured to perform speech recognition on the source dialect speech to obtain source dialect text.
The text conversion module is configured to convert the source dialect text into the target dialect text.
Optionally, the speech synthesis module is specifically configured to synthesize the target dialect speech matching the speaking characteristics of the target speaker with a pre-established speech synthesis model, based on the target dialect text and the speaker information of the target speaker.
The speech synthesis model is trained on training samples each comprising target dialect text and speaker information obtained from a source dialect training utterance; the training targets of the speech synthesis model include: making the speaker information extracted from the target dialect synthesized speech corresponding to a source dialect training utterance consistent with the speaker information extracted from that utterance.
Optionally, the speech synthesis model is the generator network of a generative adversarial network;
and the training targets of the speech synthesis model further include: making the discriminator network of the generative adversarial network unable to tell whether the target dialect synthesized speech corresponding to a source dialect training utterance is synthesized speech or real speech.
Optionally, the dialect speech conversion apparatus provided by this embodiment may further include a speech synthesis model building module.
The speech synthesis model building module comprises a training sample set acquisition module and a model training module.
The training sample set acquisition module is configured to construct a training sample set from a source dialect training speech set, wherein each training sample in the set comprises target dialect text and speaker information obtained from one source dialect training utterance in the set.
The model training module is configured to train the generator network of a generative adversarial network, which serves as the speech synthesis model, using the training sample set together with the discriminator network of the generative adversarial network.
Optionally, when training the generator network as the speech synthesis model using the training sample set and the discriminator network, the model training module is specifically configured to: obtain a training sample from the training sample set; perform speech synthesis with the generator network based on the obtained training sample, to obtain a frequency-domain signal characterizing the target dialect synthesized speech corresponding to the training sample; extract speaker information from the frequency-domain signal and determine a speaker information consistency value from the extracted speaker information and the speaker information in the training sample; classify the target dialect synthesized speech corresponding to the training sample as real speech or synthesized speech using the discriminator network and the frequency-domain signal, to obtain a classification result; and update the parameters of the generator network according to the speaker information consistency value and the classification result.
Optionally, when classifying the target dialect synthesized speech corresponding to the training sample using the discriminator network and the frequency-domain signal, the model training module is specifically configured to: acquire the frequency-domain signal of the target dialect synthesized speech corresponding to the training sample, and input it into a dialect speech classification model serving as the discriminator network for classification, to obtain the classification result of the target dialect synthesized speech corresponding to the training sample.
Optionally, when determining the speaker information consistency value from the speaker information extracted from the frequency-domain signal and the speaker information in the training sample, the model training module is specifically configured to calculate the error between the two as the speaker information consistency value.
With the dialect speech conversion apparatus provided by this embodiment, after the source dialect speech of the target speaker is acquired, the source dialect speech is converted into target dialect text on one hand, and the speaker information of the target speaker is extracted from the source dialect speech on the other; target dialect speech that matches the speaking characteristics of the target speaker is then synthesized from the target dialect text and the speaker information.
Fourth embodiment
An embodiment of the present application further provides a dialect speech conversion device. Referring to fig. 7, which shows a schematic structural diagram of the dialect speech conversion device, the device may include: at least one processor 701, at least one communication interface 702, at least one memory 703, and at least one communication bus 704;
In this embodiment of the application, there is at least one of each of the processor 701, the communication interface 702, the memory 703, and the communication bus 704, and the processor 701, the communication interface 702, and the memory 703 communicate with one another through the communication bus 704;
The processor 701 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), one or more integrated circuits configured to implement embodiments of the present invention, or the like;
The memory 703 may include high-speed RAM and may further include non-volatile memory, for example at least one disk memory;
The memory stores a program, and the processor may call the program stored in the memory, the program being configured to:
acquire source dialect speech of a target speaker;
convert the source dialect speech into target dialect text, and extract speaker information of the target speaker from the source dialect speech;
and synthesize, from the target dialect text and the speaker information of the target speaker, target dialect speech that matches the speaking characteristics of the target speaker.
Optionally, the detailed functions and extended functions of the program may be as described above.
Fifth embodiment
An embodiment of the present application further provides a readable storage medium storing a program suitable for execution by a processor, the program being configured to:
acquire source dialect speech of a target speaker;
convert the source dialect speech into target dialect text, and extract speaker information of the target speaker from the source dialect speech;
and synthesize, from the target dialect text and the speaker information of the target speaker, target dialect speech that matches the speaking characteristics of the target speaker.
Optionally, the detailed functions and extended functions of the program may be as described above.
Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between such entities or operations. Moreover, the terms "comprises", "comprising", and any other variation thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or apparatus comprising a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to it. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus comprising that element.
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that the embodiments have in common, the embodiments may be referred to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (14)

1. A dialect voice conversion method, comprising:
acquiring source dialect voice of a target speaker;
converting the source dialect voice into a target dialect text, and extracting speaker information of the target speaker from the source dialect voice;
and synthesizing the target dialect voice according with the speaking characteristics of the target speaker according to the target dialect text and the speaker information of the target speaker.
2. The dialect speech conversion method of claim 1, wherein said converting the source dialect speech into target dialect text comprises:
carrying out voice recognition on the source dialect voice to obtain a source dialect text;
converting the source dialect text to target dialect text.
3. The dialect speech conversion method of claim 1, wherein synthesizing a target dialect speech conforming to the speaking characteristic of the target speaker based on the target dialect text and the speaker information of the target speaker comprises:
synthesizing target dialect voice according with the speaking characteristics of the target speaker by using a pre-established voice synthesis model according to the target dialect text and the speaker information of the target speaker;
the voice synthesis model is obtained by training a target dialect text and speaker information which are obtained according to a source dialect training voice as training samples; the training targets of the speech synthesis model include: and enabling the speaker information extracted from the target language synthetic voice corresponding to the source dialect training voice to be consistent with the speaker information extracted from the source dialect training voice.
4. The dialect speech conversion method of claim 3, wherein the speech synthesis model employs a generation network in a countermeasure generation network;
the training targets of the speech synthesis model further comprise: and enabling a discrimination network in the countermeasure generation network to be unable to discriminate whether the target language synthesized voice corresponding to the source dialect training voice is synthesized voice or real voice.
5. The dialect speech conversion method of claim 3, wherein the synthesizing the target dialect speech according with the speaking characteristics of the target speaker by using the pre-established speech synthesis model based on the target dialect text and the speaker information of the target speaker comprises:
coding the target dialect text and the speaker information of the target speaker respectively to obtain a representation vector of the target dialect text and a representation vector of the speaker information of the target speaker;
inputting the representation vector of the target dialect text and the representation vector of the speaker information of the target speaker into the voice synthesis model to obtain a frequency domain signal of the target dialect voice capable of being represented and synthesized;
and converting the frequency domain signal capable of representing the synthesized target dialect voice into a time domain to obtain the synthesized target dialect voice.
6. The dialect speech conversion method of claim 4, wherein the establishing process of the speech synthesis model comprises:
constructing a training sample set according to a source dialect training voice set, wherein each training sample in the training sample set comprises a target dialect text and speaker information obtained according to one source dialect training voice in the source dialect training voice set;
and training a generation network serving as a voice synthesis model in the countermeasure generation network by using the training sample set and a discriminant network in the countermeasure generation network.
7. The dialect speech conversion method of claim 6, wherein the training a generation network as a speech synthesis model in the countermeasure generation network using the training sample set and a discriminant network in the countermeasure generation network comprises:
obtaining training samples from the set of training samples;
performing voice synthesis by using a generation network serving as a voice synthesis model in a countermeasure generation network based on the obtained training sample to obtain a frequency domain signal capable of representing a target dialect synthesized voice corresponding to the training sample;
extracting speaker information from the frequency domain signal, and determining a speaker information consistency characterization value according to the speaker information extracted from the frequency domain signal and the speaker information in the training sample;
classifying real voice and synthesized voice of the target dialect synthesized voice corresponding to the training sample by using a discrimination network in the confrontation generating network and the frequency domain signal to obtain a classification result;
and updating parameters of a generation network used as a speech synthesis model according to the speaker information consistency representation value and the classification result.
8. The dialect speech conversion method of claim 7, wherein the classifying the target dialect synthesized speech corresponding to the training sample into real speech and synthesized speech by using the discrimination network in the countermeasure generation network and the frequency domain signal to obtain a classification result comprises:
acquiring a frequency domain signal of target dialect synthesized voice corresponding to the training sample;
and inputting the frequency domain signal of the target dialect synthesized voice corresponding to the training sample as the dialect voice classification model of the discrimination network for classification to obtain the classification result of the target dialect synthesized voice corresponding to the training sample.
9. The dialect speech conversion method of claim 7, wherein determining a speaker information consistency characterization value from the speaker information extracted from the frequency domain signal and the speaker information in the training sample comprises:
calculating the error between the speaker information extracted from the frequency domain signal and the speaker information in the training sample as the speaker information consistency characterization value.
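Claim 9 leaves the error metric open; mean squared error between the two speaker information vectors is one natural reading:

```python
import torch

def speaker_consistency_characterization_value(extracted, reference):
    # Mean squared error between the two speaker information vectors;
    # lower values mean the synthesized speech better preserves the
    # target speaker's characteristics.
    return torch.mean((extracted - reference) ** 2)
```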
10. A dialect speech conversion apparatus, comprising: a speech acquisition module, a speech conversion module, a speaker information extraction module and a speech synthesis module;
the speech acquisition module is configured to acquire source dialect speech of a target speaker;
the speech conversion module is configured to convert the source dialect speech into target dialect text;
the speaker information extraction module is configured to extract speaker information of the target speaker from the source dialect speech;
and the speech synthesis module is configured to synthesize target dialect speech that matches the speaking characteristics of the target speaker according to the target dialect text and the speaker information of the target speaker.
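A structural sketch of claim 10's four modules wired together; the injected callables are assumptions, as the claim fixes only the module boundaries, not their implementations.

```python
class DialectSpeechConversionApparatus:
    def __init__(self, acquire, recognize, extract, synthesize):
        self.acquire = acquire        # speech acquisition module
        self.recognize = recognize    # speech conversion module
        self.extract = extract        # speaker information extraction module
        self.synthesize = synthesize  # speech synthesis module

    def convert(self, source):
        speech = self.acquire(source)          # source dialect speech
        target_text = self.recognize(speech)   # -> target dialect text
        speaker_info = self.extract(speech)    # -> speaker information
        # -> target dialect speech in the target speaker's voice
        return self.synthesize(target_text, speaker_info)
```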
11. The dialect speech conversion apparatus of claim 10, wherein the speech synthesis module is specifically configured to synthesize, by using a pre-established speech synthesis model, target dialect speech that matches the speaking characteristics of the target speaker based on the target dialect text and the speaker information of the target speaker;
the speech synthesis model is trained by using, as training samples, target dialect text and speaker information obtained from source dialect training speech; the training targets of the speech synthesis model include: making the speaker information extracted from the synthesized target dialect speech corresponding to the source dialect training speech consistent with the speaker information extracted from the source dialect training speech.
12. The dialect speech conversion apparatus of claim 11, wherein the speech synthesis model employs the generator network in a generative adversarial network;
the training targets of the speech synthesis model further include: making the discriminator network in the generative adversarial network unable to distinguish whether the synthesized target dialect speech corresponding to the source dialect training speech is synthesized speech or real speech.
13. A dialect speech conversion device, comprising: a memory and a processor;
the memory is configured to store a program;
the processor is configured to execute the program to implement the steps of the dialect speech conversion method according to any one of claims 1 to 9.
14. A readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the dialect speech conversion method according to any one of claims 1 to 9.
CN202011518758.6A 2020-12-21 2020-12-21 Dialect voice conversion method, device, equipment and storage medium Pending CN112652309A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011518758.6A CN112652309A (en) 2020-12-21 2020-12-21 Dialect voice conversion method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112652309A true CN112652309A (en) 2021-04-13

Family

ID=75358488

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011518758.6A Pending CN112652309A (en) 2020-12-21 2020-12-21 Dialect voice conversion method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112652309A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6598021B1 (en) * 2000-07-13 2003-07-22 Craig R. Shambaugh Method of modifying speech to provide a user selectable dialect
US20020188449A1 (en) * 2001-06-11 2002-12-12 Nobuo Nukaga Voice synthesizing method and voice synthesizer performing the same
US20170309271A1 (en) * 2016-04-21 2017-10-26 National Taipei University Speaking-rate normalized prosodic parameter builder, speaking-rate dependent prosodic model builder, speaking-rate controlled prosodic-information generation device and prosodic-information generation method able to learn different languages and mimic various speakers' speaking styles
CN108682420A (en) * 2018-05-14 2018-10-19 平安科技(深圳)有限公司 A kind of voice and video telephone accent recognition method and terminal device
CN109147758A (en) * 2018-09-12 2019-01-04 科大讯飞股份有限公司 A kind of speaker's sound converting method and device
US20200169591A1 (en) * 2019-02-01 2020-05-28 Ben Avi Ingel Systems and methods for artificial dubbing
CN111402842A (en) * 2020-03-20 2020-07-10 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating audio
CN111737998A (en) * 2020-06-23 2020-10-02 北京字节跳动网络技术有限公司 Dialect text generation method and device, storage medium and electronic equipment
CN111899719A (en) * 2020-07-30 2020-11-06 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating audio

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113611293A (en) * 2021-08-19 2021-11-05 内蒙古工业大学 Mongolian data set expansion method
CN116844523A (en) * 2023-08-31 2023-10-03 深圳市声扬科技有限公司 Voice data generation method and device, electronic equipment and readable storage medium
CN116844523B (en) * 2023-08-31 2023-11-10 深圳市声扬科技有限公司 Voice data generation method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination