CN113327583A - Optimal mapping cross-language tone conversion method and system based on PPG consistency - Google Patents


Info

Publication number: CN113327583A
Authority: CN (China)
Prior art keywords: ppg, voice, speech, sequence, waveform
Legal status: Pending (an assumption by Google Patents, not a legal conclusion)
Application number: CN202110567496.0A
Other languages: Chinese (zh)
Inventors: 吴志勇, 户建坤, 陈学源
Current and Original Assignee: Shenzhen International Graduate School of Tsinghua University (listed assignee may be inaccurate; Google has not performed a legal analysis)
Events: application filed by Shenzhen International Graduate School of Tsinghua University; priority to CN202110567496.0A; publication of CN113327583A; legal status: pending

Classifications

    • G: Physics
    • G10: Musical instruments; acoustics
    • G10L: Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L 15/00: Speech recognition
    • G10L 15/005: Language recognition
    • G10L 21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003: Changing voice quality, e.g. pitch or formants


Abstract

The invention discloses an optimal-mapping cross-language timbre conversion method, system and electronic device based on PPG consistency. Frame-level acoustic features are first extracted from the speech to be converted by speech signal processing, and the frame-level representation of the speech content, the PPG, is computed with an automatic speech recognition (ASR) model. Meanwhile, an optimal search is carried out in the PPG set of the target speaker, derived from a preset large corpus of that speaker, to obtain a mapping sequence that both accurately represents the speech content of the converted speech and conforms to the characteristics of the target speaker. Finally, a neural network acoustic model and a vocoder convert this sequence into a natural speech waveform. Because the invention models the relation between the converted speech and the target speaker's corpus through frame-level PPGs of the speech content, no language-specific constraint is involved, and cross-language timbre conversion is realized.

Description

Optimal mapping cross-language tone conversion method and system based on PPG consistency
Technical Field
The invention relates to the technical field of speech processing, and in particular to an optimal-mapping cross-language timbre conversion method and system based on PPG consistency, and an electronic device.
Background
Voice conversion aims to modify the speech of one speaker so that it sounds as if it were uttered by another specific speaker. Voice conversion has wide application in many fields, including customized feedback in computer-aided pronunciation training systems, personalized speech aids for people with speech disorders, and film dubbing with a variety of voices.
Due to globalization, text and speech that alternate between different languages are now common in social media text, informal messages and voice navigation. When a human-computer spoken dialogue system synthesizes such sentences, the voice should remain consistent and the pronunciation accurate and natural. Cross-language timbre conversion is an important technique for accomplishing this task.
At present, conventional cross-language timbre conversion has the following main problems:
1) conventional methods do not effectively and thoroughly decouple the content, timbre and language characteristics of speech; that is, they ignore the difference between the phoneme set underlying the PPG features and the phoneme sets describing the other languages involved, so that across languages the PPG features capture pronunciation features inaccurately (harming intelligibility), and the corpora of different speakers in different languages cover the PPG space differently;
2) they do not consider that the PPG coverage of different speakers' corpora in different languages may differ; for example, an existing neural network acoustic model may receive, at the synthesis stage, input data never seen during training and is forced to produce a fitted output, making the results inaccurate and unstable;
3) no method has been proposed to objectively measure the consistency of the speech content after cross-language timbre conversion.
Disclosure of Invention
The invention aims to remedy the deficiencies of the prior art by providing an optimal-mapping cross-language timbre conversion method, system and electronic device based on PPG consistency, so as to solve the problems of inaccurate cross-language PPG description, differing distribution coverage, and inconsistency between the training-stage and synthesis-stage inputs of the neural network in the prior art.
In a first aspect, to solve the above technical problem, the present invention provides an optimal mapping cross-language timbre conversion method based on PPG consistency, including:
s1, acquiring the original voice waveform input by the user;
s2, determining a first PPG sequence corresponding to the original voice waveform and a target PPG set corresponding to a corpus of a preset target speaker based on a preset PPG determination strategy;
s3, starting from a first speech frame in the first PPG sequence, searching a frame of PPG in a target PPG set closest to a PPG corresponding to a current speech frame in a PPG sequence corresponding to the first speech frame from the target PPG set until the first PPG sequence is traversed, and forming an optimal mapping PPG sequence from the second speech posterior probabilities PPG searched for each speech frame in the first PPG sequence;
and S4, inputting the optimal mapping PPG sequence into a pre-trained neural network acoustic model to obtain the Mel spectrum of the target speaker, and converting the Mel spectrum of the target speaker into the voice waveform of the target speaker according to a preset voice code conversion strategy, thereby realizing the conversion of the original voice waveform input by the user into the voice waveform of the target speaker.
Optionally, the first PPG sequence is composed of the first speech posterior probabilities (PPGs) corresponding to each speech frame contained in the original speech waveform, and the target PPG set is the set of all frame-level PPGs of every sentence of speech in the preset target speaker's corpus.
Optionally, the step S2 of determining the first PPG sequence corresponding to the original speech waveform based on a preset PPG determination policy includes:
s2.1, extracting acoustic features corresponding to each voice frame contained in the original voice waveform from the original voice waveform according to a preset voice signal processing technology;
s2.2, obtaining a first speech posterior probability PPG corresponding to each speech frame contained in the original speech waveform by utilizing a pre-trained automatic speech recognition ASR model;
and S2.3, forming a first PPG sequence corresponding to the original voice waveform by using the first voice posterior probability PPG corresponding to each voice frame.
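Steps S2.1 to S2.3 can be sketched as follows. This is a minimal NumPy illustration, not the patent's implementation: the framing parameters, the number of phonemes and the stand-in `asr_logits_fn` (a fixed random projection replacing a real pre-trained ASR acoustic model) are all assumptions made for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: turns per-frame phoneme logits into a
    # posterior distribution over phonemes (one PPG per frame).
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def extract_ppg_sequence(waveform, frame_len=400, hop=160,
                         n_phonemes=60, asr_logits_fn=None):
    # S2.1: frame the waveform (a real system would compute MFCC/filterbank
    # features per frame; raw sample frames are used here to stay
    # self-contained; the waveform is assumed longer than one frame).
    n_frames = 1 + (len(waveform) - frame_len) // hop
    frames = np.stack([waveform[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # S2.2: the pre-trained ASR acoustic model would output phoneme logits
    # per frame; a fixed random projection stands in for it here.
    if asr_logits_fn is None:
        rng = np.random.default_rng(0)
        W = rng.standard_normal((frame_len, n_phonemes)) * 0.01
        asr_logits_fn = lambda f: f @ W
    # S2.3: stack the per-frame posteriors into the first PPG sequence.
    return softmax(asr_logits_fn(frames))  # shape: (n_frames, n_phonemes)
```

Each row of the returned array is one frame's PPG: a non-negative vector over the phoneme set that sums to one.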
Optionally, the original speech waveform is different from the speech waveform of the target speaker.
Optionally, the language of the original speech waveform may be the same as or different from the language of each sentence of speech in the preset target speaker's corpus.
Optionally, the method may further include:
and determining the distance between the first PPG sequence corresponding to the original speech waveform and a third PPG sequence extracted from the target speaker's speech waveform produced by the vocoder, and judging from the distance between the two PPG sequences whether the speech content of the converted waveform meets the consistency criterion.
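This consistency check can be sketched in a few lines, assuming the two PPG sequences are already time-aligned and of equal length (real sequences might first need alignment). The frame distance here is the KL divergence, matching the term defined later in the description; the `threshold` value is purely illustrative and not taken from the patent.

```python
import numpy as np

def frame_kl(p, q, eps=1e-10):
    # KL(P || Q) between two per-frame phoneme posteriors; eps avoids log(0).
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def ppg_consistency(seq_a, seq_b, threshold=0.5):
    # Mean frame-wise KL distance between two PPG sequences; a smaller value
    # means the phonetic content is better preserved by the conversion.
    dists = [frame_kl(p, q) for p, q in zip(seq_a, seq_b)]
    mean_dist = sum(dists) / len(dists)
    return mean_dist, mean_dist <= threshold
```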
In a second aspect, based on the above method, the present invention further provides an optimal-mapping cross-language timbre conversion system based on PPG consistency, where the system includes:
the acquisition module is used for acquiring an original voice waveform input by a user;
the PPG extraction module, used to determine, based on a preset PPG determination strategy, the first PPG sequence corresponding to the original speech waveform, where the first PPG sequence consists of the first speech posterior probability (PPG) of each speech frame contained in the original speech waveform; the frame-level PPGs obtained with the same strategy for every sentence of speech in the preset target speaker's corpus form the target PPG set corresponding to that corpus;
the optimal mapping module, used to search, starting from the first speech frame in the first PPG sequence, the target PPG set for the frame-level PPG closest to the PPG of the current speech frame, until the whole first PPG sequence has been traversed, and to compose the second speech posterior probabilities (PPGs) found for each speech frame of the first PPG sequence into an optimal-mapping PPG sequence;
the neural network acoustic model module is used for inputting the optimal mapping PPG sequence into a pre-trained neural network acoustic model to obtain a Mel spectrum of the target speaker;
and the vocoder module, used to convert the target speaker's Mel spectrum into the target speaker's speech waveform according to a preset vocoding strategy, so that the original speech waveform input by the user is converted into the target speaker's voice.
Optionally, the PPG extraction module includes:
the acoustic feature determination unit is used for extracting acoustic features corresponding to each voice frame contained in the original voice waveform from the original voice waveform according to a preset voice signal processing technology;
the first voice posterior probability PPG determining unit is used for obtaining a first voice posterior probability PPG corresponding to each voice frame contained in the original voice waveform by utilizing a pre-trained automatic voice recognition ASR model;
and the first PPG sequence forming unit is used for forming a first PPG sequence corresponding to the original voice waveform by using the first voice posterior probability PPG corresponding to each voice frame.
Optionally, the language of the original speech waveform may be the same as or different from the language of each sentence of speech in the preset target speaker's corpus.
Optionally, the system further includes: a PPG consistency evaluating module;
and the PPG consistency evaluation module is used for determining the distance between the first PPG sequence corresponding to the original speech waveform and a third PPG sequence extracted from the target speaker's speech waveform synthesized by the vocoder, and for judging from the distance between the two PPG sequences whether the speech content of the converted waveform meets the consistency criterion.
In a third aspect, based on the above optimal mapping cross-language tone conversion method based on PPG consistency, the present invention further provides an electronic device, which is characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
and the processor is used for realizing the optimal mapping cross-language tone conversion method based on PPG consistency when executing the program stored in the memory.
In a fourth aspect, to solve the above technical problem, the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the optimal mapping cross-language timbre conversion method based on PPG consistency as described in any one of the above.
Compared with the prior art, the technical scheme of the invention has at least one of the following beneficial effects:
the invention relates to an optimal mapping cross-language tone conversion method based on PPG consistency. Meanwhile, the optimal search is carried out in the PPG set of the target speaker by combining a preset large corpus of the target speaker, so that a mapping sequence which can accurately represent the voice content of the converted voice and conforms to the characteristics of the target speaker is obtained. And finally, converting the voice waveform into a natural voice waveform through a neural network acoustic model and a vocoder. The invention represents the relation between the converted voice and the corpus of the target speaker by the PPG modeling through the voice content at the frame level, and does not relate to the limitation of specific languages, thereby realizing the cross-language tone conversion.
Meanwhile, the PPG consistency evaluation criterion, together with the optimal-mapping algorithm built to satisfy it, ensures that the speech content before and after timbre conversion remains similar. The invention performs timbre conversion automatically and effectively without restricting the language of the input speech; applied in an intelligent voice interaction system, it helps the system convey information and intent better, enriches the choice of voices, and improves user satisfaction.
Drawings
Fig. 1 is a schematic flowchart of an optimal mapping cross-language timbre conversion method based on PPG consistency according to the present invention.
Fig. 2 is a graph of the optimal mapping PPG sequence correspondence between an original speech waveform and a target speaker speech waveform provided by the present invention;
FIG. 3 is a schematic structural diagram of an optimal mapping cross-language timbre conversion system based on PPG consistency according to the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following describes embodiments of the present invention in more detail with reference to the schematic drawings. The advantages and features of the invention will become more apparent from this description. It should be noted that the drawings are in a greatly simplified form and not to precise scale, and serve merely to describe the embodiments of the invention conveniently and clearly.
First, terms in the present application are explained:
PPG: when ASR is run as a frame-level phoneme classification task, the posterior probability that a given speech frame belongs to each of the candidate phonemes can be obtained; this distribution is called a phonetic posteriorgram (PPG). The PPG of each frame is a representation of the speech content of that frame. With a multi-speaker, general-purpose ASR model, PPGs can be extracted from any speaker's voice.
KL divergence distance: also known as relative entropy, information divergence or information gain, it is an asymmetric measure of the difference between two probability distributions P and Q. Since ASR (automatic speech recognition, the technology of converting human speech into text) models are trained with a KL-divergence-style loss, measuring the similarity between PPGs with the KL divergence distance is more consistent with the training objective of ASR than, for example, the Euclidean distance.
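The definition above can be made concrete in a few lines of NumPy; the example distributions `p` and `q` are hypothetical PPG-like vectors chosen only to show that the divergence is asymmetric:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    # D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)); eps avoids log(0).
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

# Asymmetry: D_KL(P||Q) and D_KL(Q||P) generally differ.
p = np.array([0.80, 0.15, 0.05])  # e.g. a confident PPG frame
q = np.array([1/3, 1/3, 1/3])     # e.g. a maximally uncertain frame
```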
PPG extraction module: takes an original speech waveform as input and outputs the frame-level representation of the speech content, the PPG.
PPG consistency evaluation module: evaluates how consistent the speech content of the converted speech is with that of the original speech.
Optimal mapping module: using the PPG as a bridge between the speech to be converted and the target speaker's corpus, searches at the frame level for the closest mapping pair in the high-dimensional space, and traverses and replaces all frames of the speech to be converted to realize cross-language timbre conversion.
Neural network acoustic model module: takes a PPG sequence as input and maps it to a Mel spectrum in the target speaker's timbre, with a natural contextual relationship between frames.
Vocoder module: restores spectral parameters, such as the Mel spectrum, to a speech waveform by signal-processing or machine-learning methods.
As described in the background, due to globalization, text and speech that alternate between different languages are now common in social media text, informal messages and voice navigation. When a human-computer spoken dialogue system synthesizes such sentences, the voice should remain consistent and the pronunciation accurate and natural. Cross-language timbre conversion is an important technique for accomplishing this task.
At present, conventional cross-language timbre conversion has the following main problems:
1) conventional methods do not effectively and thoroughly decouple the content, timbre and language characteristics of speech; that is, they ignore the difference between the phoneme set underlying the PPG features and the phoneme sets describing the other languages involved, so that across languages the PPG features capture pronunciation features inaccurately (harming intelligibility), and the corpora of different speakers in different languages cover the PPG space differently;
2) they do not consider that the PPG coverage of different speakers' corpora in different languages may differ; for example, an existing neural network acoustic model may receive, at the synthesis stage, input data never seen during training and is forced to produce a fitted output, making the results inaccurate and unstable;
3) no method has been proposed to objectively measure the consistency of the speech content after cross-language timbre conversion.
In view of the first problem, with the development of deep learning, more and more researchers have tried to decouple the features through different structural designs within a neural network, so that the features can be freely recombined during conversion. However, current research still cannot achieve complete decoupling between the different characteristics, and the converted speech exhibits wrong accents or unstable quality.
It should be noted that the PPG can represent content features across languages and speakers; but because the phoneme set underlying the PPG features is inaccurate across languages, and corpora of different speakers in different languages cover the PPG features unevenly, conventional PPG-based cross-language timbre conversion has poor expressiveness, and uncovered pronunciation content is mispronounced. Cross-language timbre conversion works better once this cross-language inaccuracy and uneven coverage of the PPG are taken into account.
Moreover, conventional work usually evaluates the timbre similarity and content accuracy of cross-language timbre conversion by subjective scoring. As for content accuracy, the ASR transcription roughly matches what listeners hear, but because of language-model decoding and the forced correspondence to discrete phonemes (or characters), the ASR result cannot accurately reflect the content characteristics of the speech in the timbre conversion task.
To this end, the invention proposes the PPG consistency principle: the degree to which the speech content is preserved is measured precisely by the distance between the PPG sequences of the speech before and after conversion. On the basis of this principle, an optimal-mapping scheme is proposed, so that the generated speech both guarantees timbre similarity and achieves PPG consistency.
Specifically, the invention provides an optimal-mapping cross-language timbre conversion method, system and electronic device based on PPG consistency, aiming to solve the problems of inaccurate cross-language PPG description, differing distribution coverage, and inconsistency between the training-stage and synthesis-stage inputs of the neural network in the prior art.
The present invention will be described in detail below with reference to specific examples.
As shown in fig. 1, the optimal mapping cross-language timbre conversion method based on PPG consistency provided by the present invention may include the following steps:
and S1, acquiring the original voice waveform input by the user.
In this embodiment, the original speech waveform input by the user is the initial input speech on which voice conversion is to be performed; it may also be referred to as the speech to be converted.
And S2, determining a first PPG sequence corresponding to the original voice waveform and a target PPG set corresponding to a corpus of a preset target speaker based on a preset PPG determination strategy.
Wherein the first PPG sequence consists of a first speech posterior probability PPG corresponding to each speech frame contained in the original speech waveform.
In this embodiment, according to a preset speech signal processing technique, acoustic features corresponding to each speech frame included in the original speech waveform may be extracted from the original speech waveform; then, a pre-trained automatic speech recognition ASR model is used to obtain a first speech posterior probability PPG corresponding to each speech frame contained in the original speech waveform; and then, forming a first PPG sequence corresponding to the original voice waveform by using the first voice posterior probability PPG corresponding to each voice frame.
Specifically, frame-level acoustic features can be extracted from the original speech waveform by speech signal processing, and the frame-level representation of the speech content, the PPG, corresponding to the waveform can be computed with a pre-trained automatic speech recognition (ASR) model. The ASR model may be trained on a Chinese corpus, on an English corpus, or on both together, in which case a bilingual PPG is formed.
S3, starting from the first speech frame in the first PPG sequence, search the target PPG set for the frame-level PPG closest to the PPG of the current speech frame, until the whole first PPG sequence has been traversed, and compose the second speech posterior probabilities (PPGs) found for each speech frame of the first PPG sequence into an optimal-mapping PPG sequence.
In this embodiment, an optimal search may be carried out in the target PPG set corresponding to the corpus, in combination with a preset large corpus of the target speaker, to obtain a mapping sequence that both accurately represents the speech content of the original speech waveform and conforms to the characteristics of the target speaker. Finally, a neural network acoustic model and a vocoder convert this sequence into a natural speech waveform. The invention characterizes the relation between the original speech waveform and the target speaker's corpus through the frame-level speech content, without any language-specific constraint, thereby realizing cross-language timbre conversion.
Specifically, the PPG serves as a bridge between the original speech waveform and the target speaker's corpus: at the frame level, the closest mapping pair in the high-dimensional space is searched for, and every frame of the speech to be converted is traversed and replaced, realizing cross-language timbre conversion.
For example, as shown in fig. 2, suppose the first PPG sequence obtained by passing the original speech waveform through the PPG extraction module (the PPG sequence of the speech to be converted in fig. 2) has length N. The m sentences of the whole preset target speaker corpus are then passed through the PPG extraction module in turn to obtain the target PPG set (the preset set in fig. 2). Next, for each frame-level PPG in the first PPG sequence (frame i in fig. 2), the closest frame (denoted frame i' in the figure) is searched for in the target speaker's set. Traversing all N frames of the original speech waveform yields the N nearest-distance PPGs from the target PPG set, i.e. the optimal mapping sequence. The distance used in the search is the same as the distance used for PPG consistency, for example the Euclidean distance or the KL divergence distance. Since the number of frame-level PPGs in the target speaker's corpus is very large, techniques such as a kd-tree data structure or parallel (batched) computation in PyTorch can be adopted to speed up the search.
It should be noted that, in the search for the optimal-mapping PPG sequence, other distance definitions and other search or clustering algorithms may also be adopted; the invention is not limited in this respect.
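A brute-force version of this frame-wise search can be sketched as follows, using the KL divergence as the distance so that the search matches the consistency criterion. The kd-tree and PyTorch accelerations mentioned above are deliberately omitted to keep the sketch self-contained; all shapes are assumptions for illustration.

```python
import numpy as np

def optimal_mapping(source_ppgs, target_ppgs, eps=1e-10):
    # For every frame PPG of the speech to be converted, pick the closest
    # frame PPG from the target speaker's set. This is the O(N*M) version;
    # for a large corpus a kd-tree or batched GPU search would replace it.
    S = np.asarray(source_ppgs, dtype=float) + eps   # (N, P) source frames
    T = np.asarray(target_ppgs, dtype=float) + eps   # (M, P) target set
    # D[i, j] = KL(S_i || T_j) = sum_k S_ik log S_ik - sum_k S_ik log T_jk,
    # computed for all (i, j) pairs at once.
    D = (S * np.log(S)).sum(axis=1, keepdims=True) - S @ np.log(T).T
    nearest = D.argmin(axis=1)                       # index i' for each i
    return np.asarray(target_ppgs)[nearest], nearest
```

Replacing each source frame with its nearest target-set frame yields the optimal-mapping PPG sequence described in step S3.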
And S4, input the optimal-mapping PPG sequence into a pre-trained neural network acoustic model to obtain the target speaker's Mel spectrum, and convert that Mel spectrum into the target speaker's speech waveform according to a preset vocoding strategy, thereby converting the original speech waveform input by the user into the target speaker's voice.
Wherein the original speech waveform is different from the speech waveform of the target speaker.
In this embodiment, the optimal-mapping PPG sequence is taken as input and mapped to a Mel spectrum in the target speaker's timbre with a natural contextual relationship. Within the overall timbre conversion pipeline, the module is used as follows. First, PPG sequences and Mel spectra are extracted from the target speaker's corpus, and a neural network acoustic model for the sequence-mapping task is trained, taking the PPG sequence as input and the Mel spectrum as output. Then, at conversion time, once the PPG extraction module and the optimal mapping module have produced an optimal-mapping PPG sequence that conforms to the target speaker's characteristics, this sequence is fed into the trained acoustic model to obtain the target speaker's Mel spectrum. Compared with the unmapped PPG sequence, the optimal-mapping PPG sequence is fully consistent with the data the network was trained on, which guarantees high PPG consistency and stability; and because the acoustic model is a neural network, natural-sounding speech can be obtained. Finally, the spectral parameters, such as the Mel spectrum, are restored to the target speaker's speech waveform by signal-processing or machine-learning methods.
It should be noted that the neural network acoustic model focuses on modeling the contextual relationship between frames of the Mel spectrum; the specific structure may be an equal-length sequence-mapping model such as an LSTM or CBHG, or a non-equal-length sequence mapping with an attention structure. Likewise, either an autoregressive or a non-autoregressive mapping may be used; the invention is not specifically limited in this respect.
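As a deliberately simplified stand-in for such an acoustic model, the PPG-to-Mel mapping can be illustrated with a frame-wise ridge regression in NumPy. Unlike the LSTM/CBHG or attention-based models named above, it ignores the context between frames, and all dimensions below are assumed values for the sketch.

```python
import numpy as np

def train_acoustic_model(ppg_frames, mel_frames, reg=1e-3):
    # Ridge regression from per-frame PPG to per-frame mel bins: a minimal
    # stand-in for the neural acoustic model trained on the target speaker's
    # corpus (which would additionally model inter-frame context).
    X = np.asarray(ppg_frames, dtype=float)   # (T, n_phonemes)
    Y = np.asarray(mel_frames, dtype=float)   # (T, n_mels)
    P = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(P), X.T @ Y)

def predict_mel(W, ppg_seq):
    # Map an (optimal-mapping) PPG sequence to a mel-spectrum sequence.
    return np.asarray(ppg_seq, dtype=float) @ W
```

In the pipeline, `train_acoustic_model` corresponds to the training step on the target speaker's corpus, and `predict_mel` to inference on the optimal-mapping PPG sequence.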
Further, the language of the original speech waveform may be the same as or different from the language of each sentence of speech in the preset target speaker's corpus.
The foregoing describes the invention in further detail with reference to specific preferred embodiments, but the invention is not limited to these details. Those skilled in the art may make several equivalent substitutions or obvious modifications without departing from the spirit of the invention, and all such variants are considered to fall within the scope of the invention.
In addition, corresponding to the method embodiment, an embodiment of the present invention further provides an optimal-mapping cross-language timbre conversion system based on PPG consistency. As shown in fig. 3, the system may include:
an obtaining module 310, configured to obtain an original voice waveform input by a user;
a PPG extraction module 320, configured to determine, based on a preset PPG determination strategy, the first PPG sequence corresponding to the original speech waveform, where the first PPG sequence consists of the first speech posterior probability (PPG) of each speech frame contained in the original speech waveform; the frame-level PPGs obtained with the same strategy for every sentence of speech in the preset target speaker's corpus form the target PPG set corresponding to that corpus;
an optimal mapping module 330, configured to search, starting from the first speech frame in the first PPG sequence, the target PPG set for the frame-level PPG closest to the PPG of the current speech frame, until the whole first PPG sequence has been traversed, and to compose the second speech posterior probabilities (PPGs) found for each speech frame of the first PPG sequence into an optimal-mapping PPG sequence;
the neural network acoustic model module 340 is configured to input the optimal mapping PPG sequence into a pre-trained neural network acoustic model to obtain a mel spectrum of the target speaker;
the vocoder module 350, configured to convert the Mel spectrum of the target speaker into the speech waveform of the target speaker according to a preset vocoder conversion strategy, thereby converting the original speech waveform input by the user into the speech waveform of the target speaker.
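The frame-wise search performed by the optimal mapping module can be sketched as follows. This is a minimal illustration, assuming PPGs are stored as NumPy vectors and using Euclidean distance; the patent does not fix the distance metric, and the function name is hypothetical:

```python
import numpy as np

def optimal_mapping(source_ppg: np.ndarray, target_ppg_set: np.ndarray) -> np.ndarray:
    """For each frame of the source PPG sequence, pick the closest
    frame from the target speaker's PPG set.

    source_ppg:     (T, D) array, one PPG vector per speech frame
    target_ppg_set: (N, D) array, all PPG frames of the target corpus
    Returns a (T, D) array: the optimal mapping PPG sequence.
    """
    mapped = []
    for frame in source_ppg:
        # Euclidean distance to every target frame; the metric is an
        # illustrative choice, not specified by the patent.
        dists = np.linalg.norm(target_ppg_set - frame, axis=1)
        mapped.append(target_ppg_set[np.argmin(dists)])
    return np.stack(mapped)
```

Since the target PPG set covers every frame of a large corpus, this brute-force scan can be slow in practice; an approximate nearest-neighbor index would typically replace the inner loop.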
Optionally, the PPG extraction module may include:
the acoustic feature determination unit, configured to extract, from the original speech waveform, the acoustic features of each speech frame it contains, according to a preset speech signal processing technique;
the first speech posterior probability (PPG) determination unit, configured to obtain the first speech posterior probability (PPG) of each speech frame by applying a pre-trained automatic speech recognition (ASR) model to those acoustic features;
and the first PPG sequence forming unit, configured to form the first PPG sequence corresponding to the original speech waveform from the first speech posterior probabilities (PPGs) of all speech frames.
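A minimal sketch of the extraction pipeline above, assuming a 16 kHz waveform, raw frames standing in for the acoustic features, and a caller-supplied `asr_logits` callable standing in for the pre-trained ASR model (the frame sizes, the feature choice, and the function names are illustrative assumptions, not taken from the patent):

```python
import numpy as np

def frame_signal(wav: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Slice a waveform into overlapping frames (25 ms window / 10 ms
    hop at 16 kHz -- a typical setting, not mandated by the patent)."""
    n_frames = 1 + (len(wav) - frame_len) // hop
    return np.stack([wav[i * hop : i * hop + frame_len] for i in range(n_frames)])

def extract_ppg_sequence(wav: np.ndarray, asr_logits) -> np.ndarray:
    """`asr_logits` maps one frame's features to phone-class logits;
    the softmax of those logits is that frame's posterior -- its PPG.
    Returns a (T, num_phone_classes) PPG sequence."""
    frames = frame_signal(wav)
    logits = np.stack([asr_logits(f) for f in frames])
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)  # each row sums to 1
```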
Optionally, the language of the original speech waveform may be the same as or different from the language of each utterance in the corpus of the preset target speaker.
Optionally, the system further includes a PPG consistency evaluation module;
the PPG consistency evaluation module is configured to determine the distance between the first PPG sequence corresponding to the original speech waveform and a third PPG sequence corresponding to the target speaker's speech waveform synthesized by the vocoder, and to judge, from the distance between the two PPG sequences, whether the speech content of the target speaker's waveform meets the consistency criterion.
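The evaluation step can be sketched as below. The patent requires only *a* distance between the two PPG sequences and a consistency criterion; the Euclidean metric, the 0.3 threshold, and the equal-length assumption (unequal sequences would first need alignment, e.g. by dynamic time warping) are illustrative choices:

```python
import numpy as np

def ppg_consistency(ppg_src: np.ndarray, ppg_conv: np.ndarray,
                    threshold: float = 0.3) -> tuple:
    """Mean frame-wise Euclidean distance between the first PPG
    sequence (from the input speech) and the third PPG sequence
    (re-extracted from the synthesized speech). Content is judged
    consistent when the distance falls below the threshold."""
    assert ppg_src.shape == ppg_conv.shape, "align the sequences first"
    dist = float(np.linalg.norm(ppg_src - ppg_conv, axis=1).mean())
    return dist, dist < threshold
```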
In summary, the invention relates to an optimal mapping cross-language timbre conversion method based on PPG consistency. First, frame-level acoustic features of the speech to be converted are extracted through speech signal processing, and the frame-level representation of the speech content corresponding to the waveform, the PPG, is computed by ASR. Then, in combination with a preset large corpus of the target speaker, an optimal search is performed in the target speaker's PPG set, yielding a mapping sequence that accurately represents the content of the speech to be converted while conforming to the target speaker's characteristics. Finally, this sequence is converted into a natural speech waveform through a neural network acoustic model and a vocoder. Because the relation between the speech to be converted and the target speaker's corpus is modeled through frame-level PPGs, no language-specific assumptions are involved, and cross-language timbre conversion is thereby achieved.
Meanwhile, the PPG consistency evaluation criterion, together with the optimal mapping algorithm that conforms to it, ensures that the speech content before and after timbre conversion remains similar. The invention performs timbre conversion automatically and effectively, without restricting the language of the speech to be converted; applied in an intelligent voice interaction system, it helps the system convey information and intent, enriches the choice of synthetic voices, and improves user satisfaction.
Moreover, an embodiment of the present invention further provides an electronic device, as shown in fig. 4, including a processor 401, a communication interface 402, a memory 403, and a communication bus 404, where the processor 401, the communication interface 402, and the memory 403 communicate with one another through the communication bus 404:
the memory 403 is configured to store a computer program;
the processor 401 is configured to execute the program stored in the memory 403, so as to implement the optimal mapping cross-language timbre conversion method based on PPG consistency described above.
Other implementation manners of the method realized by the processor 401 executing the program stored in the memory 403 are the same as those mentioned in the foregoing method embodiment and are not described again here.
The communication bus of the terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The memory may include a Random Access Memory (RAM), or may also include a non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In another embodiment of the present invention, there is also provided a computer-readable storage medium, having stored therein instructions, which when run on a computer, cause the computer to perform any one of the above-mentioned methods for optimal mapping cross-language timbre conversion based on PPG consistency.
In another embodiment, a computer program product containing instructions is provided, which when run on a computer causes the computer to perform any one of the above-mentioned optimal mapping cross-language timbre conversion methods based on PPG consistency.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or data center, integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (12)

1. An optimal mapping cross-language timbre conversion method based on PPG consistency, characterized in that the method comprises:
s1, acquiring the original voice waveform input by the user;
s2, determining a first PPG sequence corresponding to the original voice waveform and a target PPG set corresponding to a corpus of a preset target speaker based on a preset PPG determination strategy;
s3, searching the target PPG set, starting from the first speech frame of the first PPG sequence, for the frame of PPG closest to the PPG of the current speech frame, until the first PPG sequence has been traversed, and forming the optimal mapping PPG sequence from the second speech posterior probabilities (PPGs) found for all speech frames of the first PPG sequence;
and S4, inputting the optimal mapping PPG sequence into a pre-trained neural network acoustic model to obtain the Mel spectrum of the target speaker, and converting the Mel spectrum of the target speaker into the speech waveform of the target speaker according to a preset vocoder conversion strategy, thereby converting the original speech waveform input by the user into the speech waveform of the target speaker.
2. The method according to claim 1, wherein the first PPG sequence consists of a first speech posterior probability PPG corresponding to each speech frame contained in the original speech waveform; the target PPG set is a set formed by all frame PPG sequences corresponding to each sentence of voice in the corpus of the preset target speaker.
3. The method according to claim 1, wherein the step of determining the first PPG sequence corresponding to the original speech waveform in step S2 based on a preset PPG determination strategy comprises:
s2.1, extracting acoustic features corresponding to each voice frame contained in the original voice waveform from the original voice waveform according to a preset voice signal processing technology;
s2.2, obtaining a first speech posterior probability PPG corresponding to each speech frame contained in the original speech waveform by utilizing a pre-trained automatic speech recognition ASR model;
and S2.3, forming a first PPG sequence corresponding to the original voice waveform by using the first voice posterior probability PPG corresponding to each voice frame.
4. The method of claim 1, wherein the original speech waveform is different from a speech waveform of the target speaker.
5. The method according to claim 1, wherein the language of the original speech waveform may be the same as or different from the language of each utterance in the corpus of the preset target speaker.
6. The method of claim 1, further comprising:
determining the distance between the first PPG sequence corresponding to the original speech waveform and a third PPG sequence corresponding to the target speaker's speech waveform finally output by the vocoder, and judging, from the distance between the two PPG sequences, whether the speech content of the target speaker's waveform meets the consistency criterion.
7. An optimal mapping cross-language timbre conversion system based on PPG consistency, characterized by comprising:
the acquisition module is used for acquiring an original voice waveform input by a user;
the PPG extraction module, configured to determine, based on a preset PPG determination strategy, a first PPG sequence corresponding to the original speech waveform, where the first PPG sequence consists of the first speech posterior probability (PPG) of each speech frame contained in the original speech waveform, and the PPGs of all frames of each utterance in the corpus of the preset target speaker, obtained according to the same preset PPG determination strategy, form a target PPG set corresponding to that corpus;
the optimal mapping module, configured to search the target PPG set, starting from the first speech frame of the first PPG sequence, for the frame of PPG closest to the PPG of the current speech frame, until the first PPG sequence has been traversed, the second speech posterior probabilities (PPGs) found for all speech frames of the first PPG sequence forming the optimal mapping PPG sequence;
the neural network acoustic model module is used for inputting the optimal mapping PPG sequence into a pre-trained neural network acoustic model to obtain a Mel spectrum of the target speaker;
and the vocoder module, configured to convert the Mel spectrum of the target speaker into the speech waveform of the target speaker according to a preset vocoder conversion strategy, thereby converting the original speech waveform input by the user into the speech waveform of the target speaker.
8. The system of claim 7, wherein the PPG extraction module comprises:
the acoustic feature determination unit is used for extracting acoustic features corresponding to each voice frame contained in the original voice waveform from the original voice waveform according to a preset voice signal processing technology;
the first voice posterior probability PPG determining unit is used for obtaining a first voice posterior probability PPG corresponding to each voice frame contained in the original voice waveform by utilizing a pre-trained automatic voice recognition ASR model;
and the first PPG sequence forming unit is used for forming a first PPG sequence corresponding to the original voice waveform by using the first voice posterior probability PPG corresponding to each voice frame.
9. The system according to claim 7, wherein the language of the original speech waveform may be the same as or different from the language of each utterance in the corpus of the preset target speaker.
10. The system of claim 7, further comprising a PPG consistency evaluation module,
wherein the PPG consistency evaluation module is configured to determine the distance between the first PPG sequence corresponding to the original speech waveform and a third PPG sequence corresponding to the target speaker's speech waveform synthesized by the vocoder, and to judge, from the distance between the two PPG sequences, whether the speech content of the target speaker's waveform meets the consistency criterion.
11. An electronic device, characterized by comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the optimal mapping cross-language timbre conversion method based on PPG conformance of any of claims 1-6 when executing the program stored in the memory.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, implements the steps of the PPG-consistency-based optimal mapping cross-language timbre conversion method according to any of claims 1 to 6.
CN202110567496.0A 2021-05-24 2021-05-24 Optimal mapping cross-language tone conversion method and system based on PPG consistency Pending CN113327583A (en)

Publications (1)

Publication Number Publication Date
CN113327583A true CN113327583A (en) 2021-08-31

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20020032836A (en) * 2000-10-27 2002-05-04 오영환 Voice Conversion System for Speaker Modification
CN107610717A (en) * 2016-07-11 2018-01-19 香港中文大学 Many-one phonetics transfer method based on voice posterior probability
CN110930981A (en) * 2018-09-20 2020-03-27 深圳市声希科技有限公司 Many-to-one voice conversion system
CN112017644A (en) * 2020-10-21 2020-12-01 南京硅基智能科技有限公司 Sound transformation system, method and application
CN112071325A (en) * 2020-09-04 2020-12-11 中山大学 Many-to-many voice conversion method based on double-voiceprint feature vector and sequence-to-sequence modeling
CN112767958A (en) * 2021-02-26 2021-05-07 华南理工大学 Zero-learning-based cross-language tone conversion system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GUANLONG ZHAO ET AL.: "Using Phonetic Posteriorgram Based Frame Pairing for Segmental Accent Conversion", IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 10, pp. 1649-1660 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination