CN111108558B - Voice conversion method, device, computer equipment and computer readable storage medium


Info

Publication number
CN111108558B
CN111108558B
Authority
CN
China
Prior art keywords
target
feature
voice
converted
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201980003120.8A
Other languages
Chinese (zh)
Other versions
CN111108558A (en)
Inventor
刘洋
李柏
丁万
黄东延
熊友军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youbixuan Intelligent Robot Co ltd
Original Assignee
Ubtech Robotics Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ubtech Robotics Corp filed Critical Ubtech Robotics Corp
Publication of CN111108558A publication Critical patent/CN111108558A/en
Application granted granted Critical
Publication of CN111108558B publication Critical patent/CN111108558B/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiments of the invention disclose a voice conversion method and apparatus, a computer device, and a computer-readable storage medium. The voice conversion method comprises the following steps: acquiring a voice to be converted and an original conversion model, wherein the format of the original conversion model is an online format; performing format conversion on the original conversion model to obtain a target conversion model in an offline format; performing feature extraction on the voice to be converted to obtain features to be converted; inputting the features to be converted into the target conversion model to obtain target features output by the target conversion model; and obtaining a target voice according to the target features output by the target conversion model, wherein the speech content of the target voice is the same as that of the voice to be converted, and the voice of the target voice is different from that of the voice to be converted. This voice conversion method not only performs high-quality voice conversion in an offline state, but also runs fast enough to achieve real-time voice conversion.

Description

Voice conversion method, device, computer equipment and computer readable storage medium
Technical Field
The present invention relates to the field of audio processing technologies, and in particular, to a voice conversion method, a voice conversion device, a computer device, and a computer readable storage medium.
Background
Voice conversion is a technology that converts a source voice into a target voice while keeping the semantic content unchanged, where the source voice is speech uttered in a first voice and the target voice is speech uttered in a second voice. That is, voice conversion turns source speech uttered in the first voice into target speech with the same meaning uttered in the second voice.
With the rapid development of deep neural network technology, deep-learning-based voice conversion achieves high speaker similarity, good voice quality, and good fluency in the converted speech. Current deep-learning-based voice conversion mainly comprises two steps: first a conversion model is trained with a large amount of voice data, and then the trained model performs the conversion. Training places high demands on computing resources, while an offline device has few resources and low performance; training on an offline device easily exhausts its resources, and even where training is possible it is inefficient, with a time cost too high for practical use. Consequently, the deep-learning-based voice conversion function has only been realizable on high-performance online servers and cannot be used in an offline state.
Disclosure of Invention
In view of the foregoing, it is necessary to provide a voice conversion method, apparatus, computer device, and storage medium capable of performing high-quality voice conversion in an offline state.
A voice conversion method, the method comprising:
acquiring a voice to be converted and an original conversion model, wherein the format of the original conversion model is an online format;
performing format conversion on the original conversion model to obtain a target conversion model in an offline format;
performing feature extraction on the voice to be converted to obtain features to be converted;
inputting the features to be converted into the target conversion model to obtain target features output by the target conversion model;
and obtaining a target voice according to the target features output by the target conversion model, wherein the speech content of the target voice is the same as that of the voice to be converted, and the voice of the target voice is different from that of the voice to be converted.
A voice conversion apparatus, the apparatus comprising:
an acquisition module, used for acquiring a voice to be converted and an original conversion model, wherein the format of the original conversion model is an online format;
a format conversion module, used for performing format conversion on the original conversion model to obtain a target conversion model in an offline format;
a feature extraction module, used for performing feature extraction on the voice to be converted to obtain features to be converted;
a feature conversion module, used for inputting the features to be converted into the target conversion model to obtain target features output by the target conversion model;
and a result module, used for obtaining a target voice according to the target features output by the target conversion model, wherein the speech content of the target voice is the same as that of the voice to be converted, and the voice of the target voice is different from that of the voice to be converted.
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of:
acquiring a voice to be converted and an original conversion model, wherein the format of the original conversion model is an online format;
performing format conversion on the original conversion model to obtain a target conversion model in an offline format;
performing feature extraction on the voice to be converted to obtain features to be converted;
inputting the features to be converted into the target conversion model to obtain target features output by the target conversion model;
and obtaining a target voice according to the target features output by the target conversion model, wherein the speech content of the target voice is the same as that of the voice to be converted, and the voice of the target voice is different from that of the voice to be converted.
A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
acquiring a voice to be converted and an original conversion model, wherein the format of the original conversion model is an online format;
performing format conversion on the original conversion model to obtain a target conversion model in an offline format;
performing feature extraction on the voice to be converted to obtain features to be converted;
inputting the features to be converted into the target conversion model to obtain target features output by the target conversion model;
and obtaining a target voice according to the target features output by the target conversion model, wherein the speech content of the target voice is the same as that of the voice to be converted, and the voice of the target voice is different from that of the voice to be converted.
The embodiment of the invention has the following beneficial effects:
according to the voice conversion method, the voice conversion device, the computer equipment and the computer readable storage medium, the voice to be converted and the original conversion model are obtained, and the original conversion model cannot work in an offline state, so that the characteristics of the voice to be converted are extracted to obtain the characteristics to be converted, after the format of the original conversion model is converted into the offline format, the target characteristics can be obtained according to the characteristics to be converted and the target conversion model in the offline format, and then the target voice is obtained according to the target characteristics. The voice conversion method not only can carry out voice conversion with high quality in an off-line state, but also has high running speed and can realize real-time voice conversion.
Drawings
In order to illustrate the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the invention, and that a person skilled in the art may obtain other drawings from them without inventive effort.
Wherein:
FIG. 1 is a diagram of an application environment for a speech conversion method in one embodiment;
FIG. 2 is a flow chart of a method of speech conversion in one embodiment;
FIG. 3 is a flow chart of a method of speech conversion in one embodiment;
FIG. 4 is a schematic diagram of segmentation processing of speech to be converted in one embodiment;
FIG. 5 is a block diagram of a speech conversion device according to one embodiment;
FIG. 6 is a block diagram of a computer device in one embodiment.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on these embodiments without inventive effort fall within the scope of the present invention.
FIG. 1 is a diagram of an application environment for a speech conversion method in one embodiment. As shown in fig. 1, the voice conversion method is applied to a voice conversion system. The voice conversion system comprises a terminal, which can be a desktop terminal or a mobile terminal; the mobile terminal can be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The terminal comprises a microphone, a conversion unit, and a player: the microphone acquires the voice to be converted, the conversion unit converts the voice to be converted into a target voice with the same speech content but a different voice, and the player plays the target voice.
As shown in fig. 2, in one embodiment, a speech conversion method is provided. The method can be applied to a terminal, a server, or another voice conversion device; this embodiment takes a voice conversion device as an example. In an offline state, after the voice conversion device acquires the voice to be converted, it can obtain, through this voice conversion method, a target voice that has the same speech content as the voice to be converted but a different voice. The voice conversion method specifically comprises the following steps:
step 202: and acquiring the voice to be converted and an original conversion model, wherein the format of the original conversion model is an online format.
The voice to be converted refers to speech uttered in the source voice that is to be converted into the target voice.
The online format refers to a storage format of a file that can be opened or run normally only in a network-connected state.
The original conversion model is a model whose input is the features to be converted of the voice to be converted and whose output is the target features of the target voice; it is used, in a network-connected state, to obtain the target features of the target voice from the features to be converted of the voice to be converted.
Step 204: and carrying out format conversion on the original conversion model to obtain an offline format target conversion model.
The offline format refers to a storage format of a file that can still be opened or run normally when disconnected from the network.
The target conversion model is used for obtaining the target features of the target voice from the features to be converted of the voice to be converted in a network-disconnected state.
Format conversion is performed on the original conversion model to obtain the target conversion model in an offline format. Illustratively, the original conversion model is a model file trained with the TensorFlow framework (a machine learning library developed by Google and used from the Python language), and its storage format is the online format CheckPoint (abbreviated ckpt); this storage format can be converted into the offline format JetSoft Shield Now (abbreviated jsn) to obtain the target conversion model. The ckpt-format original conversion model records a large amount of extra information, such as parameters and data used while training it, which the offline voice conversion process does not need. Removing this redundant data when converting the storage format into the jsn format effectively simplifies and compresses the model file, which increases the running speed in the offline state, thereby increasing the voice conversion speed and making real-time voice conversion achievable.
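The patent does not disclose the ckpt-to-jsn converter itself. As a rough, hedged analogue of the idea (stripping training-only data so that only what inference needs remains), the following sketch freezes a TensorFlow v1 checkpoint into a single constant graph; the file paths and the output node name are assumptions, and the frozen .pb file stands in for, but is not, the jsn format described above.

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Load the graph structure and trained weights from the ckpt files
# (paths and node name are placeholders, not taken from the patent).
saver = tf.train.import_meta_graph("conversion_model.ckpt.meta")
with tf.Session() as sess:
    saver.restore(sess, "conversion_model.ckpt")
    # Fold variables into constants and drop training-only nodes
    # (optimizer state, gradients), keeping only what inference needs.
    frozen = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, output_node_names=["target_features"])

with tf.gfile.GFile("conversion_model_offline.pb", "wb") as f:
    f.write(frozen.SerializeToString())
```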
Step 206: and extracting the characteristics of the voice to be converted to obtain the characteristics to be converted.
The features to be converted are the input of the target conversion model and are used to obtain the target features corresponding to the voice to be converted.
Spectral features of the voice to be converted, such as its mel spectrum, are obtained from the voice to be converted; features are extracted from the voice to be converted on this basis, and the features to be converted are determined from the extracted features.
Step 208: and inputting the feature to be converted into the target conversion model to obtain the target feature output by the target conversion model.
The target features are used to obtain a target voice that has the same speech content as the voice to be converted but a different voice.
In an offline state, with the target conversion model in a running state, the features to be converted are input into the target conversion model, which directly outputs the target features corresponding to the features to be converted.
Step 210: and obtaining target voice according to the target characteristics output by the target conversion model, wherein the voice content of the target voice is the same as the voice to be converted, and the voice of the target voice is different from the voice to be converted.
The target voice refers to speech whose content is the same as that of the voice to be converted but whose voice is different.
Features of the target voice such as the fundamental frequency, spectral envelope, and aperiodicity can be obtained from the target features; the mel spectrum of the target voice is determined from them, and the target voice is obtained from its mel spectrum. Illustratively, the features to be converted are binary 130-dimensional serialized data, and the target features obtained from the target conversion model are likewise 130-dimensional serialized data. The lf0, mgc, and bap feature data of the target voice are obtained by inverse normalization; these are converted into the f0, sp, and ap features through SPTK; the mel spectrum of the target voice is determined from its f0, sp, and ap; and the target voice is obtained from its mel spectrum.
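A minimal sketch of this final synthesis step, assuming the WORLD vocoder exposed by the pyworld package as a stand-in for the patent's SPTK-based pipeline; the denormalized lf0, sp, and ap arrays, the sample rate, and the frame period are assumed inputs.

```python
import numpy as np
import pyworld

def synthesize_target(lf0, sp, ap, fs=16000, frame_period=5.0):
    """Reconstruct a waveform from (assumed) denormalized target features."""
    f0 = np.where(lf0 > 0, np.exp(lf0), 0.0).ravel()  # undo the log-f0 transform
    # WORLD expects float64: f0 of shape (T,), sp and ap of shape (T, fft//2 + 1).
    return pyworld.synthesize(f0.astype(np.float64),
                              sp.astype(np.float64),
                              ap.astype(np.float64),
                              fs, frame_period)
```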
According to the voice conversion method described above, the voice to be converted and the original conversion model are acquired; because the original conversion model cannot work in an offline state, its format is converted into an offline format, features are extracted from the voice to be converted to obtain the features to be converted, the target features are obtained from the features to be converted and the offline-format target conversion model, and the target voice is then obtained from the target features. This voice conversion method not only performs high-quality voice conversion in an offline state, but also runs fast enough to achieve real-time voice conversion.
In one embodiment, the feature extraction on the voice to be converted in step 206 to obtain the features to be converted includes: performing periodic feature extraction and non-periodic feature extraction on the voice to be converted to obtain the periodic features and non-periodic features corresponding to the voice to be converted, wherein the periodic features comprise the fundamental frequency and the spectral envelope; and obtaining the features to be converted according to the periodic features and the non-periodic features.
When a person speaks, several sound sources generate acoustic energy in the vocal tract. The non-periodic sources include breath, frication, and plosive sounds produced at the lips, teeth, throat, and vocal tract, while the periodic source is produced by vocal cord vibration at the glottis. The speech to be converted therefore contains periodic components and non-periodic components, and its spectral features correspondingly include periodic features and non-periodic features. In this embodiment, the mel spectrum of the voice to be converted is used as the spectral feature.
The fundamental frequency (f0) is defined by regarding the signal as a sum of sine waves: the sine wave with the lowest frequency is the fundamental, and the others are overtones. The spectral envelope (sp) is the envelope obtained by connecting the amplitude peaks at different frequencies with a smooth curve. The aperiodic parameter (ap) describes the non-periodic signal components of the speech.
Here, the periodic features refer to the fundamental frequency and the spectral envelope in the mel spectrum of the speech to be converted.
The non-periodic feature refers to the non-periodic sequence in the mel spectrum of the speech to be converted.
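As a sketch of this analysis step, assuming the WORLD analysis functions from the pyworld package (the patent itself names SPTK only for the later processing), the following extracts f0, the spectral envelope, and the aperiodicity from a waveform; `wav` and `fs` are assumed inputs.

```python
import numpy as np
import pyworld

def extract_world_features(wav: np.ndarray, fs: int, frame_period=5.0):
    """Split speech into periodic (f0, sp) and non-periodic (ap) features."""
    wav = wav.astype(np.float64)
    f0, t = pyworld.harvest(wav, fs, frame_period=frame_period)  # fundamental frequency
    sp = pyworld.cheaptrick(wav, f0, t, fs)  # smooth spectral envelope
    ap = pyworld.d4c(wav, f0, t, fs)         # per-band aperiodicity
    return f0, sp, ap
```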
Through processing, feature data that serve as the features to be converted, i.e., the input of the target conversion model, can be obtained from the periodic features and the non-periodic features. Illustratively, a set of feature data is obtained from the periodic and non-periodic features, and the features to be converted are obtained by performing calculations on these data and converting their format.
In one embodiment, obtaining the features to be converted according to the periodic features and the non-periodic features includes: obtaining a target dimension feature according to the periodic features and the non-periodic features, wherein the dimension of the target dimension feature is higher than the sum of the dimensions of the periodic features and the non-periodic features; and performing format conversion on the target dimension feature to obtain the features to be converted.
The target dimension feature is a feature obtained from the periodic features and the non-periodic features whose dimension is higher than theirs. Mapping the low-dimensional periodic and non-periodic features to a high-dimensional target dimension feature improves the quality of the synthesized speech.
Illustratively, the periodic features f0 and sp and the non-periodic feature ap are obtained from the mel spectrum of the voice to be converted and processed with the Speech Signal Processing Toolkit (SPTK) to obtain 1-dimensional lf0 (the logarithm of f0), 41-dimensional mel-generalized cepstral (mgc) data, 1-dimensional band aperiodicity (bap) data, and 1-dimensional voiced/unvoiced (vuv) data. The first and second derivatives of lf0, mgc, and bap are then computed, giving additional data of 1×2, 41×2, and 1×2 dimensions respectively. Finally, vuv, lf0 and its derivatives, mgc and its derivatives, and bap and its derivatives are normalized, yielding 130-dimensional serialized data in total (1 + 1×3 + 41×3 + 1×3 = 130). These 130-dimensional serialized data are used as the target dimension feature.
The target dimension feature is format-converted so that it meets the input format requirement of the target conversion model; the feature data obtained through this format conversion are the features to be converted. For example, when the target conversion model requires binary input, the target dimension feature is converted into binary data, and the resulting binary data are the features to be converted.
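A numpy sketch of assembling the 130-dimensional feature vector described above, assuming lf0, mgc, bap, and vuv have already been computed (e.g., with SPTK) and that per-dimension mean/std statistics are available; the derivative scheme and the final binary layout are assumptions the patent does not fix.

```python
import numpy as np

def deltas(x: np.ndarray) -> np.ndarray:
    """First derivative along the time axis (simple central difference)."""
    return np.gradient(x, axis=0)

def build_features(lf0, mgc, bap, vuv, mean, std) -> bytes:
    """Stack vuv + {lf0, mgc, bap} with 1st/2nd derivatives -> (T, 130).

    Shapes: lf0 (T, 1), mgc (T, 41), bap (T, 1), vuv (T, 1);
    1 + 3*1 + 3*41 + 3*1 = 130 dimensions per frame.
    """
    parts = [vuv]
    for feat in (lf0, mgc, bap):
        d1 = deltas(feat)
        parts += [feat, d1, deltas(d1)]
    stacked = np.concatenate(parts, axis=1)         # (T, 130)
    normalized = (stacked - mean) / std             # per-dimension normalization
    return normalized.astype(np.float32).tobytes()  # binary model input
```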
In one embodiment, the target conversion model runs on the Compute Unified Device Architecture Recurrent Neural Network Toolkit (CURRENNT) framework.
CURRENNT is an open-source parallel implementation of deep recurrent neural networks (RNNs) that supports graphics processing units (GPUs) through NVIDIA's Compute Unified Device Architecture (CUDA). CURRENNT supports unidirectional and bidirectional RNNs with long short-term memory (LSTM) cells, which overcome the vanishing-gradient problem.
The target conversion model is loaded into CURRENNT, where it is in a running state; the features to be converted are then placed in the same CURRENNT instance and input into the target conversion model, which outputs the target features corresponding to the features to be converted.
As shown in fig. 3, in one embodiment, the method further comprises:
step 306: and carrying out segmentation processing on the voice to be converted to obtain a plurality of segmented voices.
Because an offline device has limited computing resources, directly converting a voice to be converted with a long duration is slow and cannot achieve real-time conversion. The voice to be converted is therefore segmented to obtain several segmented voices; since each segment is short, it can be converted quickly, which greatly increases the running speed. Illustratively, when the duration of the voice to be converted exceeds a preset duration, the voice to be converted is segmented according to preset conditions. As shown in fig. 4, the speech 41 to be converted is divided equally by duration into 3 segments, giving 3 segmented voices 42.
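A sketch of this segmentation step, assuming a fixed number of equal-duration segments with a small overlap between neighbors (the overlap is used later when the converted segments are merged); `wav` is an assumed 1-D sample array, and the overlap length is illustrative.

```python
import numpy as np

def segment_speech(wav: np.ndarray, n_segments: int = 3, overlap: int = 800):
    """Split a waveform into equal-length segments sharing `overlap` samples."""
    seg_len = len(wav) // n_segments
    segments = []
    for i in range(n_segments):
        start = max(0, i * seg_len - overlap // 2)
        end = min(len(wav), (i + 1) * seg_len + overlap // 2)
        segments.append(wav[start:end])
    return segments
```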
Step 308: and extracting the characteristics of the segmented voices to obtain a plurality of segmented characteristics.
The segmented features are the features to be converted corresponding to each segmented voice.
Feature extraction is performed on each segmented voice separately, and the features to be converted corresponding to each segmented voice are obtained from the extracted features, giving the segmented feature of each segmented voice.
Step 310: and inputting each segment feature into the target conversion model in parallel to obtain a target segment feature corresponding to each segment feature.
The target segment features refer to target features corresponding to each segment feature.
After the several segmented features are obtained, several cores of the central processing unit (CPU) are invoked to convert them simultaneously: several processes are started, and each process independently inputs one segmented feature into the target conversion model to obtain the target segment feature corresponding to that segmented feature. Inputting the segmented features into the target conversion model in parallel is much faster than converting them one after another, which helps achieve real-time voice conversion.
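A sketch of the parallel conversion step using Python's multiprocessing module, assuming a `convert_segment` function that runs the target conversion model on one segment's features; the function body here is an identity placeholder and the pool size is illustrative.

```python
from multiprocessing import Pool
import numpy as np

def convert_segment(segment_features: np.ndarray) -> np.ndarray:
    """Stand-in for running the target conversion model on one segment;
    a real implementation would invoke the offline-format model here."""
    return segment_features  # identity placeholder

def convert_all(segment_features_list, workers: int = 4):
    # One worker process per segment; pool.map preserves segment order.
    # Call from under `if __name__ == "__main__":` on platforms that spawn processes.
    with Pool(processes=workers) as pool:
        return pool.map(convert_segment, segment_features_list)
```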
Step 312: and obtaining target voice according to the target segmentation characteristics corresponding to each segmentation characteristic.
The target segment features corresponding to the segmented features can be combined into the target features, from which the target voice is obtained; alternatively, the corresponding target segmented voice can be obtained from each target segment feature, and the target segmented voices synthesized into the target voice. Illustratively, the voice to be converted is segmented into 5 segmented voices, 5 corresponding segmented features are obtained from them and input into the target conversion model to obtain 5 corresponding target segment features, 5 corresponding target segmented voices are obtained from these, and the 5 target segmented voices are synthesized into the target voice.
In one embodiment, any two target segment features adjacent in time among the plurality of target segment features include an overlapping feature, and step 312, obtaining the target voice according to the target segment feature corresponding to each segmented feature, includes: obtaining the target voice according to the target segment feature corresponding to each segmented feature and the overlapping feature of any two target segment features adjacent in time among the plurality of target segment features.
As shown in fig. 4, to prevent some features from being corrupted or lost near the cut points when features are subsequently extracted from the speech 41 to be converted, any two segmented voices 42 adjacent in time among the plurality of segmented voices 42 may include an overlapping portion 421 in the segmentation process.
The overlapping feature refers to the target feature obtained by converting the overlapping portion 421 included in any two segmented voices 42 adjacent in time among the plurality of segmented voices 42.
The target segment features corresponding to the segmented features are combined to obtain a combined feature; the combined feature is adjusted according to the overlapping features of any two target segment features adjacent in time to obtain the target features, and the target voice is obtained from the target features. Illustratively, the voice to be converted is segmented into 2 segmented voices, which are converted into 2 target segment features: target segment feature I is (A + C_A) and target segment feature II is (C_B + B), where C is the overlapping feature of target segment feature I and target segment feature II. When the target features are formed, the first half of the overlapping feature C in target segment feature I (the front half of C_A) can be kept, together with the second half of the overlapping feature C in target segment feature II (the back half of C_B); the target features are then (A + front half of C_A + back half of C_B + B), and the target voice is obtained from the target features.
In one embodiment, the obtaining the target voice according to the target segment feature corresponding to each segmented feature and the overlapping feature of any two target segment features adjacent in time among the plurality of target segment features includes: acquiring a feature weight set, wherein the feature weight set comprises a first feature weight and a second feature weight, and the first feature weight and the second feature weight are the weights corresponding to the overlapping feature in any two target segment features adjacent in time; and obtaining the target voice according to the target segment feature corresponding to each segmented feature, the overlapping feature of any two target segment features adjacent in time among the plurality of target segment features, and the feature weight set.
The feature weight set is used to determine the respective weights of the overlapping feature of any two target segment features adjacent in time within those two target segment features.
Illustratively, the voice to be converted is segmented into 2 segmented voices, which are converted into 2 target segment features: target segment feature I is (A + C_A) and target segment feature II is (C_B + B), and the overlapping feature of target segment feature I and target segment feature II is C. The first feature weight in the feature weight set is m, which determines the weight of the overlapping feature C in target segment feature I, and the second feature weight is n, which determines the weight of the overlapping feature C in target segment feature II. The target features of the voice to be converted are then (A + m×C_A + n×C_B + B), and the target voice is obtained from the target features.
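A numpy sketch of this weighted overlap merge, reading the formula above as a weighted sum over the overlapping frames; the scalar weights m and n are assumed to satisfy m + n = 1 so the overlap keeps unit gain (per-frame crossfade ramps would be a natural refinement), and the variable names mirror the example. With m = n = 0.5 this approximates the half-and-half splice of the earlier embodiment.

```python
import numpy as np

def merge_segments(seg1: np.ndarray, seg2: np.ndarray,
                   overlap: int, m: float = 0.5, n: float = 0.5) -> np.ndarray:
    """Merge (A + C_A) and (C_B + B): the shared part becomes m*C_A + n*C_B.

    `overlap` is the number of shared frames and must be > 0.
    """
    a, c_a = seg1[:-overlap], seg1[-overlap:]
    c_b, b = seg2[:overlap], seg2[overlap:]
    c = m * c_a + n * c_b  # weighted combination of the overlapping feature C
    return np.concatenate([a, c, b], axis=0)
```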
As shown in fig. 5, in one embodiment, there is provided a voice conversion apparatus including:
the obtaining module 502 is configured to obtain a voice to be converted and an original conversion model, where the format of the original conversion model is an online format;
a format conversion module 504, configured to perform format conversion on the original conversion model to obtain an offline format target conversion model;
the feature extraction module 506 is configured to perform feature extraction on the speech to be converted to obtain features to be converted;
the feature conversion module 508 is configured to input the feature to be converted into the target conversion model, so as to obtain a target feature output by the target conversion model;
and a result module 510, configured to obtain a target voice according to the target features output by the target conversion model, wherein the speech content of the target voice is the same as that of the voice to be converted, and the voice of the target voice is different from that of the voice to be converted.
According to the voice conversion apparatus described above, the voice to be converted and the original conversion model are acquired; because the original conversion model cannot work in an offline state, its format is converted into an offline format, features are extracted from the voice to be converted to obtain the features to be converted, the target features are obtained from the features to be converted and the offline-format target conversion model, and the target voice is then obtained from the target features. This apparatus not only performs high-quality voice conversion in an offline state, but also runs fast enough to achieve real-time voice conversion.
In one embodiment, the feature extraction module 506 is configured to perform periodic feature extraction and non-periodic feature extraction on the speech to be converted to obtain the periodic features and non-periodic features corresponding to the speech to be converted, wherein the periodic features comprise the fundamental frequency and the spectral envelope; and to obtain the features to be converted according to the periodic features and the non-periodic features.
In one embodiment, the feature extraction module 506 is specifically configured to obtain a target dimension feature according to the periodic features and the non-periodic features, where the dimension of the target dimension feature is higher than the sum of the dimensions of the periodic features and the non-periodic features; and to perform format conversion on the target dimension feature to obtain the features to be converted.
In one embodiment, the target conversion model runs on the Compute Unified Device Architecture Recurrent Neural Network Toolkit framework.
In one embodiment, the feature extraction module 506 is configured to perform segmentation processing on the speech to be converted to obtain a plurality of segmented voices, and to perform feature extraction on the plurality of segmented voices to obtain a plurality of segmented features; the feature conversion module 508 is configured to input each segmented feature into the target conversion model in parallel to obtain a target segment feature corresponding to each segmented feature; and the result module 510 is configured to obtain the target voice according to the target segment feature corresponding to each segmented feature.
In one embodiment, any two target segment features adjacent in time among the plurality of target segment features include an overlapping feature, and the result module 510 is configured to obtain the target voice according to the target segment feature corresponding to each segmented feature and the overlapping feature of any two target segment features adjacent in time among the plurality of target segment features.
In one embodiment, the result module 510 is configured to acquire a feature weight set, wherein the feature weight set comprises a first feature weight and a second feature weight, and the first feature weight and the second feature weight are the weights corresponding to the overlapping feature in any two target segment features adjacent in time; and to obtain the target voice according to the target segment feature corresponding to each segmented feature, the overlapping feature of any two target segment features adjacent in time among the plurality of target segment features, and the feature weight set.
FIG. 6 illustrates an internal block diagram of a computer device in one embodiment. The computer device may be a terminal, a server, or a voice conversion device. As shown in fig. 6, the computer device includes a processor, a memory, and a network interface connected by a system bus. The memory includes a nonvolatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system, and may also store a computer program that, when executed by a processor, causes the processor to implement a speech conversion method. The internal memory may also store a computer program that, when executed by the processor, causes the processor to perform the speech conversion method. It will be appreciated by those skilled in the art that the structure shown in fig. 6 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is presented comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of:
acquiring a voice to be converted and an original conversion model, wherein the format of the original conversion model is an online format;
performing format conversion on the original conversion model to obtain a target conversion model in an offline format;
performing feature extraction on the voice to be converted to obtain features to be converted;
inputting the features to be converted into the target conversion model to obtain target features output by the target conversion model;
and obtaining a target voice according to the target features output by the target conversion model, wherein the speech content of the target voice is the same as that of the voice to be converted, and the voice of the target voice is different from that of the voice to be converted.
According to the computer device described above, the voice to be converted and the original conversion model are acquired; because the original conversion model cannot work in an offline state, its format is converted into an offline format, features are extracted from the voice to be converted to obtain the features to be converted, the target features are obtained from the features to be converted and the offline-format target conversion model, and the target voice is then obtained from the target features. The voice conversion it performs not only delivers high quality in an offline state, but also runs fast enough to achieve real-time voice conversion.
In one embodiment, the feature extraction of the speech to be converted to obtain the features to be converted includes: performing periodic feature extraction and non-periodic feature extraction on the voice to be converted to obtain the periodic features and non-periodic features corresponding to the voice to be converted, wherein the periodic features comprise the fundamental frequency and the spectral envelope; and obtaining the features to be converted according to the periodic features and the non-periodic features.
In one embodiment, the obtaining the features to be converted according to the periodic features and the non-periodic features includes: obtaining a target dimension feature according to the periodic features and the non-periodic features, wherein the dimension of the target dimension feature is higher than the sum of the dimensions of the periodic features and the non-periodic features; and performing format conversion on the target dimension feature to obtain the features to be converted.
In one embodiment, the target conversion model runs on the Compute Unified Device Architecture Recurrent Neural Network Toolkit framework.
In one embodiment, the feature extraction of the speech to be converted to obtain the features to be converted includes: performing segmentation processing on the speech to be converted to obtain a plurality of segmented voices; and performing feature extraction on the plurality of segmented voices to obtain a plurality of segmented features. The inputting of the features to be converted into the target conversion model to obtain the target features output by the target conversion model includes: inputting each segmented feature into the target conversion model in parallel to obtain a target segmented feature corresponding to each segmented feature. The obtaining of the target voice according to the target features output by the target conversion model includes: obtaining the target voice according to the target segmented feature corresponding to each segmented feature.
In one embodiment, any two target segment features adjacent in time among the plurality of target segment features comprise an overlapping feature; the obtaining the target voice according to the target segmented feature corresponding to each segmented feature includes: obtaining the target voice according to the target segment feature corresponding to each segmented feature and the overlapping feature of any two target segment features adjacent in time among the plurality of target segment features.
In one embodiment, the obtaining the target voice according to the target segment feature corresponding to each segmented feature and the overlapping feature of any two target segment features adjacent in time among the plurality of target segment features includes: acquiring a feature weight set, wherein the feature weight set comprises a first feature weight and a second feature weight, and the first feature weight and the second feature weight are the weights corresponding to the overlapping feature in any two target segment features adjacent in time; and obtaining the target voice according to the target segment feature corresponding to each segmented feature, the overlapping feature of any two target segment features adjacent in time among the plurality of target segment features, and the feature weight set.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
acquiring a voice to be converted and an original conversion model, wherein the format of the original conversion model is an online format;
performing format conversion on the original conversion model to obtain a target conversion model in an offline format;
performing feature extraction on the voice to be converted to obtain features to be converted;
inputting the features to be converted into the target conversion model to obtain target features output by the target conversion model;
and obtaining a target voice according to the target features output by the target conversion model, wherein the speech content of the target voice is the same as that of the voice to be converted, and the voice of the target voice is different from that of the voice to be converted.
With the above computer-readable storage medium, the voice to be converted and the original conversion model are acquired; because the original conversion model cannot work in an offline state, features are extracted from the voice to be converted to obtain the features to be converted, the format of the original conversion model is converted into an offline format, the target features are obtained from the features to be converted and the offline-format target conversion model, and the target voice is then obtained from the target features. The voice conversion method not only performs high-quality voice conversion in an offline state, but also runs fast enough to achieve real-time voice conversion.
In one embodiment, the feature extraction of the speech to be converted to obtain the features to be converted includes: performing periodic feature extraction and non-periodic feature extraction on the voice to be converted to obtain the periodic features and non-periodic features corresponding to the voice to be converted, wherein the periodic features comprise the fundamental frequency and the spectral envelope; and obtaining the features to be converted according to the periodic features and the non-periodic features.
In one embodiment, the obtaining the features to be converted according to the periodic features and the non-periodic features includes: obtaining a target dimension feature according to the periodic features and the non-periodic features, wherein the dimension of the target dimension feature is higher than the sum of the dimensions of the periodic features and the non-periodic features; and performing format conversion on the target dimension feature to obtain the features to be converted.
In one embodiment, the target conversion model runs on the Compute Unified Device Architecture Recurrent Neural Network Toolkit framework.
In one embodiment, the feature extraction of the speech to be converted to obtain the features to be converted includes: performing segmentation processing on the speech to be converted to obtain a plurality of segmented voices; and performing feature extraction on the plurality of segmented voices to obtain a plurality of segmented features. The inputting of the features to be converted into the target conversion model to obtain the target features output by the target conversion model includes: inputting each segmented feature into the target conversion model in parallel to obtain a target segmented feature corresponding to each segmented feature. The obtaining of the target voice according to the target features output by the target conversion model includes: obtaining the target voice according to the target segmented feature corresponding to each segmented feature.
In one embodiment, any two target segment features adjacent in time among the plurality of target segment features comprise an overlapping feature; the obtaining the target voice according to the target segmented feature corresponding to each segmented feature includes: obtaining the target voice according to the target segment feature corresponding to each segmented feature and the overlapping feature of any two target segment features adjacent in time among the plurality of target segment features.
In one embodiment, the obtaining the target voice according to the target segment feature corresponding to each segmented feature and the overlapping feature of any two target segment features adjacent in time among the plurality of target segment features includes: acquiring a feature weight set, wherein the feature weight set comprises a first feature weight and a second feature weight, and the first feature weight and the second feature weight are the weights corresponding to the overlapping feature in any two target segment features adjacent in time; and obtaining the target voice according to the target segment feature corresponding to each segmented feature, the overlapping feature of any two target segment features adjacent in time among the plurality of target segment features, and the feature weight set.
It should be noted that the above-mentioned voice conversion method, voice conversion apparatus, computer device, and computer-readable storage medium belong to one general inventive concept, and the contents in the embodiments of the voice conversion method, voice conversion apparatus, computer device, and computer-readable storage medium are mutually applicable.
Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program for instructing relevant hardware, where the program may be stored in a non-volatile computer readable storage medium, and where the program, when executed, may include processes in the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this description.
The above examples represent only a few embodiments of the present application; although they are described in considerable detail, they are not to be construed as limiting the scope of the present application. It should be noted that a person of ordinary skill in the art may make various modifications and improvements without departing from the spirit of the present application, and these all fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application is defined by the appended claims.

Claims (7)

1. A voice conversion method, the method comprising:
acquiring a voice to be converted and an original conversion model, wherein the format of the original conversion model is an online format;
performing format conversion on the original conversion model to obtain a target conversion model in an offline format;
performing feature extraction on the voice to be converted to obtain a feature to be converted;
inputting the feature to be converted into the target conversion model to obtain a target feature output by the target conversion model;
and obtaining a target voice according to the target feature output by the target conversion model, wherein the speech content of the target voice is the same as that of the voice to be converted, and the voice of the target voice is different from that of the voice to be converted;
the feature extraction of the voice to be converted to obtain the feature to be converted includes:
performing segmentation processing on the voice to be converted to obtain a plurality of segmented voices;
extracting features of the segmented voices to obtain segmented features;
the step of inputting the feature to be converted into the target conversion model to obtain the target feature output by the target conversion model comprises the following steps:
inputting each segmented feature into the target conversion model in parallel to obtain a target segmented feature corresponding to each segmented feature;
the obtaining the target voice according to the target feature output by the target conversion model comprises the following steps:
obtaining the target voice according to the target segmented feature corresponding to each segmented feature;
wherein any two target segmented features adjacent in time among the plurality of target segmented features comprise an overlapping feature; the obtaining the target voice according to the target segmented feature corresponding to each segmented feature comprises:
obtaining the target voice according to the target segmented feature corresponding to each segmented feature and the overlapping feature of any two target segmented features adjacent in time among the plurality of target segmented features;
the obtaining the target voice according to the target segmented feature corresponding to each segmented feature and the overlapping feature of any two target segmented features adjacent in time among the plurality of target segmented features comprises:
acquiring a feature weight set, wherein the feature weight set comprises a first feature weight and a second feature weight, and the first feature weight and the second feature weight are the weights corresponding to the overlapping feature in any two target segmented features adjacent in time;
and obtaining the target voice according to the target segmented feature corresponding to each segmented feature, the overlapping feature of any two target segmented features adjacent in time among the plurality of target segmented features, and the feature weight set.
2. The voice conversion method according to claim 1, wherein the feature extraction of the voice to be converted to obtain the feature to be converted includes:
performing periodic feature extraction and non-periodic feature extraction on the voice to be converted to obtain periodic features and non-periodic features corresponding to the voice to be converted, wherein the periodic features comprise the fundamental frequency and the spectral envelope;
And obtaining the feature to be converted according to the periodic feature and the non-periodic feature.
3. The voice conversion method according to claim 2, wherein the obtaining the feature to be converted according to the periodic feature and the non-periodic feature includes:
obtaining a target dimension characteristic according to the periodic characteristic and the non-periodic characteristic, wherein the dimension of the target dimension characteristic is higher than the sum of the dimensions of the periodic characteristic and the non-periodic characteristic;
and carrying out format conversion on the target dimension characteristics to obtain the characteristics to be converted.
4. The voice conversion method according to claim 1, wherein the target conversion model runs on a Compute Unified Device Architecture Recurrent Neural Network Toolkit framework.
5. A voice conversion apparatus, the apparatus comprising:
an acquisition module, configured to acquire a voice to be converted and an original conversion model, wherein the format of the original conversion model is an online format;
a format conversion module, configured to perform format conversion on the original conversion model to obtain a target conversion model in an offline format;
a feature extraction module, configured to perform feature extraction on the voice to be converted to obtain a feature to be converted;
a feature conversion module, configured to input the feature to be converted into the target conversion model to obtain a target feature output by the target conversion model; and
a result module, configured to obtain a target voice according to the target feature output by the target conversion model, wherein the voice content of the target voice is the same as that of the voice to be converted, and the voice (timbre) of the target voice is different from that of the voice to be converted;
wherein the feature extraction of the voice to be converted to obtain the feature to be converted comprises:
performing segmentation processing on the voice to be converted to obtain a plurality of segmented voices; and
performing feature extraction on each segmented voice to obtain segment features;
wherein inputting the feature to be converted into the target conversion model to obtain the target feature output by the target conversion model comprises:
inputting each segment feature into the target conversion model in parallel to obtain a target segment feature corresponding to each segment feature;
wherein obtaining the target voice according to the target feature output by the target conversion model comprises:
obtaining the target voice according to the target segment feature corresponding to each segment feature;
wherein any two temporally adjacent target segment features among the plurality of target segment features comprise overlapping features, and obtaining the target voice according to the target segment feature corresponding to each segment feature comprises:
obtaining the target voice according to the target segment feature corresponding to each segment feature and the overlapping features of any two temporally adjacent target segment features;
wherein obtaining the target voice according to the target segment feature corresponding to each segment feature and the overlapping features of any two temporally adjacent target segment features comprises:
acquiring a feature weight set, wherein the feature weight set comprises a first feature weight and a second feature weight, the first feature weight and the second feature weight being the weights applied to the overlapping features of any two temporally adjacent target segment features; and
obtaining the target voice according to the target segment feature corresponding to each segment feature, the overlapping features of any two temporally adjacent target segment features, and the feature weight set.
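The format conversion module above turns an online-format model into an offline-format one. As a loose analogy only, and not the patented mechanism, the sketch below serializes a PyTorch model to a self-contained TorchScript file that can later be loaded for inference without the defining Python code; the patent's "online" and "offline" formats are its own notions and need not correspond to these.

```python
import torch

# `ConversionRNN` is the stand-in model from the sketch under claim 4.
model_cpu = ConversionRNN(feat_dim=130).eval()

# Trace with a representative input and serialize; the resulting file is
# self-contained and loadable for inference without the Python class code.
example = torch.randn(1, 200, 130)
offline_model = torch.jit.trace(model_cpu, example)
offline_model.save("conversion_model_offline.pt")

# Later, on the offline side:
restored = torch.jit.load("conversion_model_offline.pt")
```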
6. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the voice conversion method according to any one of claims 1 to 4.
7. A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the voice conversion method according to any one of claims 1 to 4.
CN201980003120.8A 2019-12-20 2019-12-20 Voice conversion method, device, computer equipment and computer readable storage medium Active CN111108558B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/126865 WO2021120145A1 (en) 2019-12-20 2019-12-20 Voice conversion method and apparatus, computer device and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN111108558A CN111108558A (en) 2020-05-05
CN111108558B true CN111108558B (en) 2023-08-04

Family

ID=70427470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980003120.8A Active CN111108558B (en) 2019-12-20 2019-12-20 Voice conversion method, device, computer equipment and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN111108558B (en)
WO (1) WO2021120145A1 (en)

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100484666B1 (en) * 2002-12-31 2005-04-22 (주) 코아보이스 Voice Color Converter using Transforming Vocal Tract Characteristic and Method
CN1534595A (en) * 2003-03-28 2004-10-06 中颖电子(上海)有限公司 Speech sound change over synthesis device and its method
CN1645363A (en) * 2005-01-04 2005-07-27 华南理工大学 Portable real-time dialect inter-translation device and method thereof
JP4241736B2 (en) * 2006-01-19 2009-03-18 株式会社東芝 Speech processing apparatus and method
CN105023570B (en) * 2014-04-30 2018-11-27 科大讯飞股份有限公司 A kind of method and system for realizing sound conversion
US9613620B2 (en) * 2014-07-03 2017-04-04 Google Inc. Methods and systems for voice conversion
US9922138B2 (en) * 2015-05-27 2018-03-20 Google Llc Dynamically updatable offline grammar model for resource-constrained offline device
US10896669B2 (en) * 2017-05-19 2021-01-19 Baidu Usa Llc Systems and methods for multi-speaker neural text-to-speech
CN107767879A (en) * 2017-10-25 2018-03-06 北京奇虎科技有限公司 Audio conversion method and device based on tone color
CN109637551A (en) * 2018-12-26 2019-04-16 出门问问信息科技有限公司 Phonetics transfer method, device, equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103430234A (en) * 2011-03-17 2013-12-04 国际商业机器公司 Voice transformation with encoded information
CN107610717A (en) * 2016-07-11 2018-01-19 香港中文大学 Many-one phonetics transfer method based on voice posterior probability
CN107545903A (en) * 2017-07-19 2018-01-05 南京邮电大学 A kind of phonetics transfer method based on deep learning
CN107785030A (en) * 2017-10-18 2018-03-09 杭州电子科技大学 A kind of phonetics transfer method
CN110097890A (en) * 2019-04-16 2019-08-06 北京搜狗科技发展有限公司 A kind of method of speech processing, device and the device for speech processes

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ying Yaopeng et al., "Design and Development of a Cross-Software Text-to-Speech APP," Fujian Computer, 2019, Vol. 35, No. 4, pp. 115-116. *

Also Published As

Publication number Publication date
CN111108558A (en) 2020-05-05
WO2021120145A1 (en) 2021-06-24

Similar Documents

Publication Publication Date Title
US11848002B2 (en) Synthesis of speech from text in a voice of a target speaker using neural networks
WO2020215666A1 (en) Speech synthesis method and apparatus, computer device, and storage medium
US8571857B2 (en) System and method for generating models for use in automatic speech recognition
CN111433847B (en) Voice conversion method, training method, intelligent device and storage medium
US11741942B2 (en) Text-to-speech synthesis system and method
US11355097B2 (en) Sample-efficient adaptive text-to-speech
CN108492818B (en) Text-to-speech conversion method and device and computer equipment
US9009050B2 (en) System and method for cloud-based text-to-speech web services
US11049491B2 (en) System and method for prosodically modified unit selection databases
CN107240401B (en) Tone conversion method and computing device
WO2021134581A1 (en) Prosodic feature prediction-based speech synthesis method, apparatus, terminal, and medium
WO2023116660A2 (en) Model training and tone conversion method and apparatus, device, and medium
KR20210032809A (en) Real-time interpretation method and apparatus
US20240161727A1 (en) Training method for speech synthesis model and speech synthesis method and related apparatuses
CN113506586A (en) Method and system for recognizing emotion of user
CN113362804A (en) Method, device, terminal and storage medium for synthesizing voice
JP2023529699A (en) clear text echo
CN111108558B (en) Voice conversion method, device, computer equipment and computer readable storage medium
CN112712789A (en) Cross-language audio conversion method and device, computer equipment and storage medium
CN112201253A (en) Character marking method and device, electronic equipment and computer readable storage medium
WO2023116243A1 (en) Data conversion method and computer storage medium
WO2023197206A1 (en) Personalized and dynamic text to speech voice cloning using incompletely trained text to speech models
US11335321B2 (en) Building a text-to-speech system from a small amount of speech data
CN114299910B (en) Training method, using method, device, equipment and medium of speech synthesis model
WO2022140966A1 (en) Cross-language voice conversion method, computer device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231211

Address after: Room 601, 6th Floor, Building 13, No. 3 Jinghai Fifth Road, Beijing Economic and Technological Development Zone (Tongzhou), Tongzhou District, Beijing, 100176

Patentee after: Beijing Youbixuan Intelligent Robot Co.,Ltd.

Address before: 518000 16th and 22nd Floors, C1 Building, Nanshan Zhiyuan, 1001 Xueyuan Avenue, Nanshan District, Shenzhen City, Guangdong Province

Patentee before: Shenzhen UBTECH Technology Co.,Ltd.
