WO2021120145A1 - Voice conversion method and apparatus, computer device and computer-readable storage medium

Info

Publication number
WO2021120145A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
feature
voice
converted
features
Prior art date
Application number
PCT/CN2019/126865
Other languages
French (fr)
Chinese (zh)
Inventor
刘洋
李柏
丁万
黄东延
熊友军
Original Assignee
深圳市优必选科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市优必选科技股份有限公司 filed Critical 深圳市优必选科技股份有限公司
Priority to PCT/CN2019/126865 priority Critical patent/WO2021120145A1/en
Priority to CN201980003120.8A priority patent/CN111108558B/en
Publication of WO2021120145A1 publication Critical patent/WO2021120145A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Definitions

  • This application relates to the field of audio processing technology, and in particular to a voice conversion method, device, computer equipment, and computer-readable storage medium.
  • Voice conversion technology is a technology that converts the source voice into the target voice while keeping the semantic content unchanged.
  • The source voice is speech uttered by a first speaker.
  • The target voice is speech uttered by a second speaker. That is, voice conversion technology converts the source voice uttered by the first speaker into a target voice that carries the same semantics but sounds as if it were uttered by the second speaker.
  • Current deep learning-based voice conversion methods mainly include two steps: first, a large amount of speech data is used to train the conversion model; then the trained model is used for voice conversion. Training requires substantial computing resources, whereas offline devices have few resources and low performance, so they easily run out of resources when used for training. Even when training is possible, it is so inefficient and time-consuming that it is impractical. Therefore, current deep learning-based voice conversion can only be realized by relying on online high-performance servers and cannot be used offline.
  • a voice conversion method includes:
  • the target voice is obtained according to the target feature output by the target conversion model, the voice content of the target voice is the same as the voice to be converted, and the sound of the target voice is different from the voice to be converted.
  • a device for voice conversion includes:
  • An acquisition module for acquiring the voice to be converted and the original conversion model, the format of the original conversion model is an online format
  • a format conversion module for format conversion of the original conversion model to obtain a target conversion model in an offline format
  • the feature extraction module is used to perform feature extraction on the voice to be converted to obtain the feature to be converted;
  • the feature conversion module is configured to input the features to be converted into the target conversion model to obtain the target features output by the target conversion model;
  • the result module is configured to obtain a target voice according to the target feature output by the target conversion model, the voice content of the target voice is the same as the voice to be converted, and the sound of the target voice is different from the voice to be converted.
  • a computer device includes a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the following steps:
  • the target voice is obtained according to the target feature output by the target conversion model, the voice content of the target voice is the same as the voice to be converted, and the sound of the target voice is different from the voice to be converted.
  • a computer-readable storage medium that stores a computer program, and when the computer program is executed by a processor, the processor executes the following steps:
  • the target voice is obtained according to the target feature output by the target conversion model, the voice content of the target voice is the same as the voice to be converted, and the sound of the target voice is different from the voice to be converted.
  • In the above voice conversion method, device, computer equipment, and computer-readable storage medium, the voice to be converted and the original conversion model are acquired. Since the original conversion model cannot work in an offline state, features of the voice to be converted are extracted to obtain the features to be converted, and the original conversion model is converted to an offline format.
  • The target feature can then be obtained from the features to be converted and the target conversion model in the offline format, and the target voice can be obtained from the target feature.
  • This voice conversion method can not only perform high-quality voice conversion in an offline state, but also runs fast, and can realize real-time voice conversion.
  • Figure 1 is an application environment diagram of a voice conversion method in an embodiment
  • Figure 2 is a flowchart of a voice conversion method in an embodiment
  • Figure 3 is a flowchart of a voice conversion method in an embodiment
  • FIG. 4 is a schematic diagram of segmentation processing of the voice to be converted in an embodiment
  • Figure 5 is a structural block diagram of a voice conversion device in an embodiment
  • Fig. 6 is a structural block diagram of a computer device in an embodiment.
  • Fig. 1 is an application environment diagram of a voice conversion method in an embodiment.
  • the voice conversion method is applied to a voice conversion system.
  • the voice conversion system includes a terminal.
  • the terminal may be a desktop terminal or a mobile terminal, and the mobile terminal may be at least one of a mobile phone, a tablet computer, and a notebook computer.
  • the terminal includes a microphone, a conversion unit, and a player.
  • the microphone is used to obtain the voice to be converted.
  • the conversion unit is used to convert the voice to be converted into a target voice with the same voice content but a different voice.
  • the player is used to play the target voice.
  • a voice conversion method is provided.
  • the method can be applied to terminals, servers, and other voice conversion devices.
  • it is applied to a voice conversion device as an example.
  • In the offline state, after the voice conversion device obtains the voice to be converted, the following voice conversion method can obtain a target voice that has the same voice content as the voice to be converted but a different voice.
  • the voice conversion method specifically includes the following steps:
  • Step 202 Obtain the voice to be converted and the original conversion model, where the format of the original conversion model is an online format.
  • the voice to be converted refers to the voice that is emitted by the human voice to be converted and is to be converted into the target human voice.
  • the online format refers to the saving format of files that can be opened or work normally only when the network is connected.
  • the original conversion model is a model whose input is the feature to be converted of the voice to be converted and whose output is the target feature of the target voice; while the network is connected, it is used to obtain the target feature of the target voice from the feature of the voice to be converted.
  • Step 204 Perform format conversion on the original conversion model to obtain an offline format target conversion model.
  • the offline format refers to the saving format of files that can be opened or work normally when disconnected from the network.
  • the target conversion model is used to obtain the target characteristics of the target voice according to the characteristics of the voice to be converted when the network is disconnected.
  • Format conversion is performed on the original conversion model to obtain a target conversion model in an offline format.
  • the original conversion model is a model file trained with the TensorFlow framework (a machine learning library developed by Google, used with the Python language).
  • the original conversion model is saved in the online format CheckPoint (ckpt for short), and its save format can be converted to the offline format JetSoft Shield Now (jsn for short) to obtain the target conversion model.
  • the original conversion model in ckpt format records a lot of information, such as parameters and data used when training the original conversion model. This data is not needed for voice conversion in the offline state, so it is removed when the original conversion model is converted to the jsn save format. This is equivalent to simplifying and compressing the model file, which improves the running speed in the offline state, thereby increasing the speed of voice conversion and enabling real-time voice conversion.
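The idea of dropping training-only data during format conversion can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the dictionary layout and key names (`weights`, `optimizer_state`, `global_step`, `learning_rate`) are hypothetical and do not reflect the real ckpt or jsn file structures, and JSON is used only as a stand-in serialization.

```python
import json

# Hypothetical checkpoint layout: these key names are illustrative
# assumptions, not the actual TensorFlow ckpt structure.
checkpoint = {
    "weights": {
        "lstm_0/kernel": [[0.1, -0.2], [0.3, 0.4]],
        "dense/bias": [0.0, 0.5],
    },
    "optimizer_state": {"adam_m": [[0.01, 0.02], [0.03, 0.04]]},
    "global_step": 120000,
    "learning_rate": 1e-4,
}


def to_offline_model(ckpt: dict) -> str:
    """Keep only the data needed for inference and serialize it as text."""
    inference_only = {"weights": ckpt["weights"]}
    return json.dumps(inference_only, sort_keys=True)


offline_text = to_offline_model(checkpoint)
offline_model = json.loads(offline_text)
```

In this toy case the offline file is smaller than the full checkpoint because the optimizer state and training counters are omitted, which mirrors the simplification described above.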
  • Step 206 Perform feature extraction on the voice to be converted to obtain the feature to be converted.
  • the feature to be converted is used to input the target conversion model to obtain the target feature corresponding to the voice to be converted.
  • Step 208 Input the feature to be converted into the target conversion model to obtain the target feature output by the target conversion model.
  • the target feature is used to obtain a target voice with the same voice content and different voice as the voice to be converted.
  • In the offline state, when the target conversion model is running, the feature to be converted is input to the target conversion model, and the target conversion model directly outputs the target feature corresponding to the feature to be converted.
  • Step 210 Obtain a target voice according to the target feature output by the target conversion model, the voice content of the target voice is the same as the voice to be converted, and the sound of the target voice is different from the voice to be converted.
  • the target voice refers to a voice whose voice content is the same as the voice to be converted and whose voice is different from the voice to be converted.
  • the fundamental frequency, spectrum envelope, and non-periodical characteristics of the target voice can be obtained, and the Mel spectrum of the target voice can be determined, and the target voice can be obtained according to the Mel spectrum of the target voice.
  • the feature to be converted is binarized 130-dimensional serialized data, and the target feature obtained by inputting the target conversion model is also 130-dimensional serialized data.
  • From the target feature, the lf0, mgc, and bap features of the target voice are obtained, and SPTK is used to convert this data into the f0, sp, and ap features.
  • the mel spectrum of the target voice can be determined, and the mel spectrum of the target voice can be used to obtain the target voice.
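As a rough illustration of the first part of this decoding step, the sketch below slices one 130-dimensional frame into its streams and recovers f0 from lf0 by exponentiation (lf0 being the logarithm of f0). The stream order inside the frame, the 0.5 voicing threshold, and the assumption that the frame has already been de-normalized are all illustrative choices, not details confirmed by the source. Converting mgc to sp and bap to ap is left to SPTK and is not sketched here.

```python
import math


def split_streams(frame):
    """Slice one 130-dim frame; the stream order is an assumption."""
    if len(frame) != 130:
        raise ValueError("expected a 130-dimensional frame")
    return {
        "vuv": frame[0],
        "lf0": frame[1],      # static lf0; frame[2:4] hold its derivatives
        "mgc": frame[4:45],   # 41 static mgc values; frame[45:127] hold derivatives
        "bap": frame[127],    # static bap; frame[128:130] hold its derivatives
    }


def f0_from_frame(frame, voicing_threshold=0.5):
    """Return f0 in Hz for a voiced frame, or 0.0 for an unvoiced frame."""
    streams = split_streams(frame)
    if streams["vuv"] > voicing_threshold:
        return math.exp(streams["lf0"])
    return 0.0
```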
  • In the above voice conversion method, the voice to be converted and the original conversion model are acquired. Since the original conversion model cannot work in an offline state, features of the voice to be converted are extracted to obtain the features to be converted, and the original conversion model is converted to an offline format. The target feature can then be obtained from the features to be converted and the target conversion model in the offline format, and the target voice can be obtained from the target feature.
  • This voice conversion method can not only perform high-quality voice conversion in an offline state, but also runs fast, and can realize real-time voice conversion.
  • Step 206, performing feature extraction on the voice to be converted to obtain the feature to be converted, includes: performing periodic feature extraction and aperiodic feature extraction on the voice to be converted to obtain the periodic features and aperiodic features corresponding to the voice to be converted, where the periodic features include a fundamental frequency and a spectrum envelope; and obtaining the feature to be converted according to the periodic features and the aperiodic features.
  • Non-periodic sound sources include aspiration, friction, and plosive sounds generated at the lips, teeth, throat, and vocal tract.
  • Periodic sound sources are generated by the vibration of the vocal cords at the glottis. The voice to be converted therefore includes periodic components and non-periodic components, and the corresponding spectral features of the voice to be converted include periodic features and non-periodic features.
  • the Mel spectrum of the voice to be converted is used as the spectral feature for description.
  • the fundamental frequency (f0): among the set of sine waves that make up the original signal, the sine wave with the lowest frequency is the fundamental frequency, and the others are overtones.
  • the spectral envelope (spectral envelope, sp) refers to the envelope obtained by connecting the highest amplitude points of different frequencies through a smooth curve.
  • Aperiodic sequence (aperiodic parameter, ap) refers to aperiodic signal parameters of speech.
  • the periodic feature refers to the fundamental frequency and spectrum envelope in the Mel spectrum of the voice to be converted.
  • the aperiodic feature refers to the aperiodic sequence in the Mel spectrum of the voice to be converted.
  • the feature data as the input of the target conversion model can be obtained through processing, and the feature data is the feature to be converted.
  • a set of characteristic data is obtained according to the periodic characteristic and the aperiodic characteristic, and the characteristic data is calculated and formatted to obtain the characteristic to be converted.
  • obtaining the feature to be converted according to the periodic feature and the aperiodic feature includes: obtaining a target dimensional feature according to the periodic feature and the aperiodic feature, where the dimension of the target dimensional feature is higher than the sum of the dimensions of the periodic feature and the aperiodic feature; and performing format conversion on the target dimensional feature to obtain the feature to be converted.
  • the target dimensional feature refers to a feature, obtained according to the periodic feature and the aperiodic feature, whose dimension is higher than the dimensions of the periodic feature and the aperiodic feature.
  • the low-dimensional periodic features and the non-periodic features are mapped to obtain high-dimensional target dimensional features, which can improve the quality of synthesized speech.
  • The periodic features f0 and sp and the aperiodic feature ap are obtained from the Mel spectrum of the voice to be converted. The three features are processed with the Speech Signal Processing Toolkit (SPTK) to obtain 1-dimensional lf0 (the logarithm of f0), 41-dimensional mgc, and 1-dimensional band aperiodicity (bap). A 1-dimensional voiced/unvoiced flag (vuv) is calculated from lf0. The first and second derivatives of lf0, mgc, and bap are then computed, giving 1×2, 41×2, and 1×2 dimensions of data respectively. Finally, vuv, lf0 and its derivatives, mgc and its derivatives, and bap and its derivatives are normalized to obtain 130-dimensional serialized data in total (1 + 3 + 123 + 3 = 130). The 130-dimensional serialized data is used as the target dimensional feature.
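The dimension bookkeeping can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the 3-tap derivative windows (-0.5, 0, 0.5) and (1, -2, 1) and the edge handling are common choices in parametric speech synthesis but are not specified by the source, the function names are hypothetical, and the normalization step is omitted.

```python
def deltas(track, coeffs):
    """Apply a 3-tap window along time; `track` is a list of frames (lists of floats)."""
    n = len(track)
    out = []
    for t in range(n):
        prev = track[max(t - 1, 0)]       # repeat the edge frame at the boundaries
        nxt = track[min(t + 1, n - 1)]
        cur = track[t]
        out.append([coeffs[0] * p + coeffs[1] * c + coeffs[2] * x
                    for p, c, x in zip(prev, cur, nxt)])
    return out


def with_derivatives(track):
    """Append first and second derivatives to every frame (assumed windows)."""
    d1 = deltas(track, (-0.5, 0.0, 0.5))
    d2 = deltas(track, (1.0, -2.0, 1.0))
    return [s + a + b for s, a, b in zip(track, d1, d2)]


def assemble(vuv, lf0, mgc, bap):
    """Concatenate vuv, lf0(+derivs), mgc(+derivs), bap(+derivs) per frame."""
    streams = [vuv, with_derivatives(lf0), with_derivatives(mgc), with_derivatives(bap)]
    return [sum(parts, []) for parts in zip(*streams)]


# Toy input: 4 frames with the per-stream sizes given in the text.
vuv = [[1.0] for _ in range(4)]
lf0 = [[5.0], [5.1], [5.2], [5.3]]
mgc = [[0.0] * 41 for _ in range(4)]
bap = [[-1.0] for _ in range(4)]
features = assemble(vuv, lf0, mgc, bap)
```

Each assembled frame has 1 + 3 + 123 + 3 = 130 values, matching the total stated in the text.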
  • SPTK Speech Signal Processing Toolkit
  • the target dimension feature is formatted to meet the input format requirements of the target conversion model, and the feature data obtained by the format conversion is the feature to be converted.
  • the input format of the target conversion model is required to be binary data
  • binary conversion is performed on the target dimensional feature, and the obtained binary data is the feature to be converted.
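One way to picture the binary conversion is to pack each frame as raw floating-point values. The sketch below assumes little-endian 32-bit floats written frame by frame; the source does not state the actual byte layout required by the target conversion model, so this layout and the function names are illustrative only.

```python
import struct


def to_binary(frames):
    """Pack a list of frames (lists of floats) as little-endian float32 bytes."""
    flat = [value for frame in frames for value in frame]
    return struct.pack("<%df" % len(flat), *flat)


def from_binary(blob, dim):
    """Inverse of to_binary: rebuild frames of `dim` values each."""
    count = len(blob) // 4  # 4 bytes per float32
    flat = struct.unpack("<%df" % count, blob)
    return [list(flat[i:i + dim]) for i in range(0, count, dim)]
```

For two 130-dimensional frames this yields 2 × 130 × 4 = 1040 bytes, and values that are exactly representable in float32 survive the round trip unchanged.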
  • the target conversion model runs on the Compute Unified Device Architecture Recurrent Neural Network Toolkit (CURRENNT).
  • CURRENNT Compute Unified Device Architecture Recurrent Neural Network Toolkit
  • CURRENNT is an open-source parallel implementation of deep recurrent neural networks (Recurrent Neural Network, RNN). It supports Graphics Processing Units (GPUs) through NVIDIA's Compute Unified Device Architecture (CUDA). CURRENNT supports unidirectional and bidirectional RNNs with Long Short-Term Memory (LSTM) memory cells, thereby overcoming the vanishing gradient problem.
  • RNN Recurrent Neural Network
  • CUDA Compute Unified Device Architecture
  • When the target conversion model is running under CURRENNT and the features to be converted are supplied to the same CURRENNT, the features to be converted are input into the target conversion model, and the target conversion model outputs the target feature corresponding to the features to be converted.
  • the method further includes:
  • Step 306 Perform segmentation processing on the voice to be converted to obtain multiple segmented voices.
  • the voice to be converted is processed in segments to obtain multiple segmented voices. Due to the short duration of the segmented voices, the conversion can be performed quickly, thereby greatly improving the running speed.
  • the voice to be converted is segmented according to a preset condition. As shown in FIG. 4, the voice to be converted 41 is divided into 3 segments evenly according to the length of time, and 3 segmented voices 42 are obtained.
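Even segmentation by duration can be sketched as below. The function name and the rule for distributing leftover samples (earlier segments receive one extra sample) are assumptions made for illustration; the source only states that the voice is divided evenly according to the length of time.

```python
def split_evenly(samples, n_segments):
    """Split a sample list into n_segments contiguous pieces of near-equal length."""
    if n_segments < 1:
        raise ValueError("n_segments must be at least 1")
    base, extra = divmod(len(samples), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        # The first `extra` segments absorb the remainder, one sample each.
        end = start + base + (1 if i < extra else 0)
        segments.append(samples[start:end])
        start = end
    return segments
```

For example, nine samples split into three segments give three pieces of three samples each, and concatenating the pieces reproduces the original sequence.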
  • Step 308 Perform feature extraction on the multiple segmented voices to obtain multiple segmented features.
  • the segmented feature refers to the feature to be converted corresponding to each segmented voice.
  • the feature extraction is performed on each segmented voice respectively, and the feature to be converted corresponding to each segmented voice is obtained according to the extracted features, that is, the segmented feature of each segmented voice is obtained.
  • Step 310 Input each of the segmented features into the target conversion model in parallel to obtain a target segmented feature corresponding to each of the segmented features.
  • the target segment feature refers to the target feature corresponding to each segment feature.
  • Step 312 Obtain a target voice according to the target segment feature corresponding to each of the segment features.
  • The target segment features corresponding to each of the segmented features can be combined to obtain the target feature, and the target voice can be obtained according to the target feature. Alternatively, a target segmented voice can be obtained from each target segment feature, and the target segmented voices can be combined into the target voice.
  • For example, the voice to be converted is segmented into 5 segmented voices, and 5 corresponding segmented features are obtained from the 5 segmented voices. The 5 segmented features are input into the target conversion model to obtain 5 corresponding target segment features. From the 5 target segment features, 5 corresponding target segmented voices are obtained, and these can be combined to obtain the target voice.
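The parallel conversion of segments in steps 306 to 312 might be organized as in the following sketch. `convert_segment` is a hypothetical stand-in for running the target conversion model on one segmented feature (here it merely doubles each value), and the use of a thread pool is an assumption; the source does not say how the parallelism is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_segment(segment_feature):
    """Hypothetical stand-in for the target conversion model on one segment."""
    return [[2.0 * value for value in frame] for frame in segment_feature]


def convert_in_parallel(segment_features, max_workers=5):
    """Convert all segments concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so results line up with the segments.
        return list(pool.map(convert_segment, segment_features))


# Five toy segmented features, each with a single 1-dimensional frame.
segment_features = [[[float(i)]] for i in range(5)]
target_segment_features = convert_in_parallel(segment_features)
```

Keeping the results in segment order matters because the later merging step relies on temporal adjacency. Whether threads, processes, or GPU batching is appropriate depends on the model runtime.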
  • Any two temporally adjacent target segment features among the multiple target segment features include overlapping features. Step 312, obtaining the target voice according to the target segment feature corresponding to each of the segmented features, includes: obtaining the target voice according to the target segment feature corresponding to each of the segmented features and the overlapping features of any two temporally adjacent target segment features among the multiple target segment features.
  • As shown in FIG. 4, any two segmented voices 42 that are adjacent in time among the multiple segmented voices 42 include an overlapping portion 421.
  • the overlapping feature refers to the target feature obtained by converting the overlapping portion 421 shared by any two temporally adjacent segmented voices 42 among the multiple segmented voices 42.
  • The target segment features corresponding to each of the segmented features are merged to obtain a merged feature. The merged feature is then adjusted according to the overlapping features of any two temporally adjacent target segment features among the multiple target segment features to obtain the target feature, and the target voice is obtained according to the target feature.
  • the voice to be converted is segmented into 2 segmented voices, and 2 target segmented features are obtained after conversion.
  • Target segment feature I is (A + C_A), target segment feature II is (C_B + B), and the overlap feature of target segment feature I and target segment feature II is C, which appears as C_A in feature I and as C_B in feature II.
  • The first half of the overlap feature C can be kept from target segment feature I, that is, the first half of C_A, and the second half of the overlap feature C can be kept from target segment feature II, that is, the second half of C_B. The target feature is then (A + first half of C_A + second half of C_B + B), and the target voice is obtained according to the target feature.
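The half-and-half treatment of the overlap can be sketched with list slicing. This is an illustration under assumptions: features are represented as lists of frames, the overlap length in frames is known, and for an odd overlap the extra frame is taken from the second segment. The function name is hypothetical.

```python
def merge_half_overlap(seg_a, seg_b, overlap):
    """Merge (A + C_A) and (C_B + B), keeping half of the overlap from each.

    seg_a ends with `overlap` frames (C_A); seg_b starts with `overlap` frames (C_B).
    """
    half = overlap // 2
    keep_a = seg_a[:len(seg_a) - overlap + half]  # A plus the first half of C_A
    keep_b = seg_b[half:]                         # second half of C_B plus B
    return keep_a + keep_b


seg_a = ["a1", "a2", "ca1", "ca2", "ca3", "ca4"]
seg_b = ["cb1", "cb2", "cb3", "cb4", "b1", "b2"]
merged = merge_half_overlap(seg_a, seg_b, overlap=4)
```

The merged result contains each time position exactly once, so its length equals the length of A plus the overlap plus the length of B.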
  • Obtaining the target voice according to the target segment feature corresponding to each of the segmented features and the overlapping features includes: acquiring a feature weight set, where the feature weight set includes a first feature weight and a second feature weight, which are the weights corresponding to the overlapping features in any two temporally adjacent target segment features; and obtaining the target voice according to the target segment feature corresponding to each of the segmented features, the overlapping features of any two temporally adjacent target segment features among the multiple target segment features, and the feature weight set.
  • the feature weight set is used to determine the weight that the overlapping feature of any two temporally adjacent target segment features takes in each of those two target segment features.
  • the voice to be converted is segmented into 2 segmented voices, and 2 target segmented features are obtained after conversion.
  • Target segment feature I is (A + C_A), target segment feature II is (C_B + B), and the overlap feature of target segment feature I and target segment feature II is C. The first feature weight in the feature weight set is m, which determines the weight of the overlap feature C in target segment feature I, and the second feature weight is n, which determines the weight of the overlap feature C in target segment feature II.
  • The target feature of the voice to be converted is then (A + m × C_A + n × C_B + B), and the target voice is obtained according to the target feature.
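A possible reading of the weighted combination is sketched below. It interprets the weighted terms for the overlapping parts C_A and C_B as a frame-by-frame sum of the two overlapping regions, with A and B concatenated on either side; this interpretation, the function name, and the example weights m = n = 0.5 are assumptions, since the source does not spell out how the terms are combined.

```python
def merge_weighted(seg_a, seg_b, overlap, m, n):
    """Merge (A + C_A) and (C_B + B) into A + (m*C_A + n*C_B) + B.

    Each segment is a list of frames; each frame is a list of floats.
    """
    split = len(seg_a) - overlap
    a_only, c_a = seg_a[:split], seg_a[split:]
    c_b, b_only = seg_b[:overlap], seg_b[overlap:]
    blended = [[m * x + n * y for x, y in zip(frame_a, frame_b)]
               for frame_a, frame_b in zip(c_a, c_b)]
    return a_only + blended + b_only


seg_a = [[1.0], [2.0], [4.0]]   # A = first two frames, C_A = last frame
seg_b = [[6.0], [8.0], [9.0]]   # C_B = first frame, B = last two frames
target_feature = merge_weighted(seg_a, seg_b, overlap=1, m=0.5, n=0.5)
```

With equal weights the overlapping frame becomes the average of the two versions, which smooths the transition between adjacent segments.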
  • As shown in FIG. 5, in one embodiment, a voice conversion device is provided, and the device includes:
  • the obtaining module 502 is configured to obtain the voice to be converted and the original conversion model, and the format of the original conversion model is an online format;
  • the format conversion module 504 is used for format conversion of the original conversion model to obtain an offline format target conversion model
  • the feature extraction module 506 is configured to perform feature extraction on the voice to be converted to obtain the feature to be converted;
  • the feature conversion module 508 is configured to input the features to be converted into the target conversion model to obtain the target features output by the target conversion model;
  • the result module 510 is configured to obtain a target voice according to the target feature output by the target conversion model, the voice content of the target voice is the same as the voice to be converted, and the sound of the target voice is different from the voice to be converted.
  • The above voice conversion device obtains the voice to be converted and the original conversion model. Since the original conversion model cannot work in an offline state, features of the voice to be converted are extracted to obtain the features to be converted, and the original conversion model is converted to an offline format. The target feature can then be obtained from the features to be converted and the target conversion model in the offline format, and the target voice can be obtained from the target feature.
  • This voice conversion method can not only perform high-quality voice conversion in an offline state, but also runs fast, and can realize real-time voice conversion.
  • the feature extraction module 506 is configured to perform periodic feature extraction and aperiodic feature extraction on the voice to be converted to obtain periodic features and aperiodic features corresponding to the voice to be converted, where the periodic features include a fundamental frequency and a spectrum envelope; and to obtain the feature to be converted according to the periodic features and the aperiodic features.
  • the feature extraction module 506 is specifically configured to obtain a target dimensional feature according to the periodic feature and the aperiodic feature, and the target dimensional feature has a higher dimension than the periodic feature and the aperiodic feature The sum of the dimensions; format conversion of the target dimensional feature to obtain the feature to be converted.
  • the target conversion model runs based on a computer unified device architecture recurrent neural network toolkit framework.
  • the feature extraction module 506 is configured to perform segmentation processing on the voice to be converted to obtain multiple segmented voices, and perform feature extraction on the multiple segmented voices to obtain multiple segmented features
  • the feature conversion module 508 is configured to input each of the segmented features into the target conversion model in parallel to obtain the target segmented feature corresponding to each of the segmented features; the result module 510 is configured to The target segmentation feature corresponding to each of the segmentation features obtains the target voice.
  • Any two temporally adjacent target segment features among the multiple target segment features include overlapping features; the result module 510 is configured to obtain the target voice according to the target segment feature corresponding to each of the segmented features and the overlapping features of any two temporally adjacent target segment features among the multiple target segment features.
  • the result module 510 is configured to obtain a feature weight set, where the feature weight set includes a first feature weight and a second feature weight, which are the weights corresponding to the overlapping features in any two temporally adjacent target segment features; and to obtain the target voice according to the target segment feature corresponding to each of the segmented features, the overlapping features of any two temporally adjacent target segment features among the multiple target segment features, and the feature weight set.
  • Fig. 6 shows an internal structure diagram of a computer device in an embodiment.
  • the computer device can be a terminal, a server, or a voice conversion device.
  • the computer device includes a processor, a memory, and a network interface connected through a system bus.
  • the memory includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium of the computer device stores an operating system and may also store a computer program.
  • When the computer program is executed by the processor, the processor can implement the voice conversion method.
  • a computer program may also be stored in the internal memory, and when the computer program is executed by the processor, the processor can execute the voice conversion method.
  • FIG. 6 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • A specific computer device may include more or fewer parts than shown in the figure, combine some parts, or have a different arrangement of parts.
  • a computer device which includes a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the following steps:
  • the target voice is obtained according to the target feature output by the target conversion model, the voice content of the target voice is the same as the voice to be converted, and the sound of the target voice is different from the voice to be converted.
  • The above computer device obtains the voice to be converted and the original conversion model. Since the original conversion model cannot work in an offline state, features of the voice to be converted are extracted to obtain the features to be converted, and the original conversion model is converted to an offline format. The target feature can then be obtained from the features to be converted and the target conversion model in the offline format, and the target voice can be obtained from the target feature.
  • This voice conversion method can not only perform high-quality voice conversion in an offline state, but also runs fast, and can realize real-time voice conversion.
  • the performing feature extraction on the voice to be converted to obtain the features to be converted includes: performing periodic feature extraction and aperiodic feature extraction on the voice to be converted to obtain the periodic features and aperiodic features corresponding to the voice to be converted, where the periodic features include a fundamental frequency and a spectral envelope; and obtaining the features to be converted according to the periodic features and the aperiodic features.
  • the obtaining the features to be converted according to the periodic features and the aperiodic features includes: obtaining a target dimensional feature according to the periodic features and the aperiodic features, the dimension of the target dimensional feature being higher than the sum of the dimensions of the periodic features and the aperiodic features; and performing format conversion on the target dimensional feature to obtain the features to be converted.
  • the target conversion model runs on a recurrent neural network toolkit framework based on the Compute Unified Device Architecture (CUDA).
  • the performing feature extraction on the voice to be converted to obtain the features to be converted includes: segmenting the voice to be converted to obtain multiple segmented voices; and performing feature extraction on each segmented voice to obtain multiple segment features; the inputting the features to be converted into the target conversion model to obtain the target features output by the target conversion model includes: inputting the segment features into the target conversion model in parallel to obtain the target segment feature corresponding to each segment feature; the obtaining the target voice according to the target features output by the target conversion model includes: obtaining the target voice according to the target segment feature corresponding to each segment feature.
  • any two target segment features that are adjacent in time among the plurality of target segment features include overlapping features; the obtaining the target voice according to the target segment feature corresponding to each segment feature includes: obtaining the target voice according to the target segment feature corresponding to each segment feature and the overlapping features of any two target segment features that are adjacent in time among the plurality of target segment features.
  • the obtaining the target voice according to the target segment feature corresponding to each segment feature and the overlapping features of any two target segment features that are adjacent in time includes: acquiring a feature weight set, the feature weight set including a first feature weight and a second feature weight, the first feature weight and the second feature weight being the weights corresponding to the overlapping features in any two target segment features that are adjacent in time; and obtaining the target voice according to the target segment feature corresponding to each segment feature, the overlapping features of any two target segment features that are adjacent in time, and the feature weight set.
  • a computer-readable storage medium that stores a computer program, and when the computer program is executed by a processor, the processor executes the following steps:
  • the target voice is obtained according to the target features output by the target conversion model, the voice content of the target voice is the same as that of the voice to be converted, and the sound of the target voice is different from that of the voice to be converted.
  • the above-mentioned computer-readable storage medium obtains the voice to be converted and the original conversion model. Because the original conversion model cannot work in an offline state, its format is converted to an offline format to obtain the target conversion model; features are extracted from the voice to be converted to obtain the features to be converted; the target features are then obtained from the features to be converted and the offline-format target conversion model, and the target voice is obtained from the target features.
  • This voice conversion method not only performs high-quality voice conversion in an offline state, but also runs fast and can achieve real-time voice conversion.
  • the performing feature extraction on the voice to be converted to obtain the features to be converted includes: performing periodic feature extraction and aperiodic feature extraction on the voice to be converted to obtain the periodic features and aperiodic features corresponding to the voice to be converted, where the periodic features include a fundamental frequency and a spectral envelope; and obtaining the features to be converted according to the periodic features and the aperiodic features.
  • the obtaining the features to be converted according to the periodic features and the aperiodic features includes: obtaining a target dimensional feature according to the periodic features and the aperiodic features, the dimension of the target dimensional feature being higher than the sum of the dimensions of the periodic features and the aperiodic features; and performing format conversion on the target dimensional feature to obtain the features to be converted.
  • the target conversion model runs on a recurrent neural network toolkit framework based on the Compute Unified Device Architecture (CUDA).
  • the performing feature extraction on the voice to be converted to obtain the features to be converted includes: segmenting the voice to be converted to obtain multiple segmented voices; and performing feature extraction on each segmented voice to obtain multiple segment features; the inputting the features to be converted into the target conversion model to obtain the target features output by the target conversion model includes: inputting the segment features into the target conversion model in parallel to obtain the target segment feature corresponding to each segment feature; the obtaining the target voice according to the target features output by the target conversion model includes: obtaining the target voice according to the target segment feature corresponding to each segment feature.
  • any two target segment features that are adjacent in time among the plurality of target segment features include overlapping features; the obtaining the target voice according to the target segment feature corresponding to each segment feature includes: obtaining the target voice according to the target segment feature corresponding to each segment feature and the overlapping features of any two target segment features that are adjacent in time among the plurality of target segment features.
  • the obtaining the target voice according to the target segment feature corresponding to each segment feature and the overlapping features of any two target segment features that are adjacent in time includes: acquiring a feature weight set, the feature weight set including a first feature weight and a second feature weight, the first feature weight and the second feature weight being the weights corresponding to the overlapping features in any two target segment features that are adjacent in time; and obtaining the target voice according to the target segment feature corresponding to each segment feature, the overlapping features of any two target segment features that are adjacent in time, and the feature weight set.
  • the voice conversion method, voice conversion apparatus, computer device, and computer-readable storage medium belong to one general inventive concept.
  • the content of the embodiments of the voice conversion method, voice conversion apparatus, computer device, and computer-readable storage medium is mutually applicable.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
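
The weighted merging of overlapping features between time-adjacent target segment features, as described in the items above, can be sketched as follows. This is a minimal illustration only. It assumes that each target segment feature is a list of frames, that adjacent segments share a fixed number of overlapping frames, and that the first and second feature weights are constants that sum to 1. The function name `merge_segments` and the constant weights are hypothetical; the source does not state how the weights are chosen.

```python
from typing import List

Frame = List[float]


def merge_segments(segments: List[List[Frame]], overlap: int,
                   w1: float = 0.5, w2: float = 0.5) -> List[Frame]:
    """Concatenate time-adjacent segments, blending their overlapping frames.

    w1 weights the trailing frames of the earlier segment and w2 the leading
    frames of the later segment (assumed constants, not taken from the source).
    Every segment is assumed to contain more than `overlap` frames.
    """
    if not segments:
        return []
    merged = [list(frame) for frame in segments[0]]
    for seg in segments[1:]:
        if overlap > 0:
            tail = merged[-overlap:]   # overlap region of the earlier segment
            head = seg[:overlap]       # overlap region of the later segment
            blended = [[w1 * a + w2 * b for a, b in zip(fa, fb)]
                       for fa, fb in zip(tail, head)]
            merged[-overlap:] = blended
        merged.extend(list(frame) for frame in seg[overlap:])
    return merged
```

With equal weights the overlapping frames are simply averaged; weights that ramp across the overlap would give a smoother cross-fade, but that choice is outside what the source specifies.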

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A voice conversion method and apparatus, a computer device, and a computer-readable storage medium. The method comprises: acquiring a voice to be converted and an original conversion model, the format of the original conversion model being an online format (202); performing format conversion on the original conversion model to obtain a target conversion model in an offline format (204); performing feature extraction on the voice to obtain features to be converted (206); inputting the features into the target conversion model to obtain target features outputted by the target conversion model (208); and obtaining a target voice according to the target features outputted by the target conversion model, wherein the target voice has the same voice content as the voice to be converted, and the target voice has a different sound from the voice to be converted (210). The voice conversion method may not only perform high-quality voice conversion in an offline state, but also has a fast running speed and can achieve real-time voice conversion.

Description

Voice conversion method, apparatus, computer device, and computer-readable storage medium
Technical field
This application relates to the field of audio processing technology, and in particular to a voice conversion method, apparatus, computer device, and computer-readable storage medium.
Background
Voice conversion is a technology that converts a source voice into a target voice while keeping the semantic content unchanged. The source voice is speech uttered by a first human voice and the target voice is speech uttered by a second human voice; that is, voice conversion turns the source voice uttered by the first human voice into a target voice with the same semantics uttered by the second human voice.
With the rapid development of deep neural networks, voice conversion methods based on deep learning produce converted speech with high similarity, good quality, and good fluency. Current deep-learning-based voice conversion methods mainly comprise two steps: a conversion model is first trained on a large amount of speech data, and the trained model is then used for voice conversion. Training demands substantial computing resources, whereas offline devices have few resources and low performance. Training on them easily exhausts those resources, and even when training is possible it is inefficient and too time-consuming to be practical. As a result, deep-learning-based voice conversion currently relies on online high-performance servers and cannot be used offline.
Summary
In view of the above problem, it is necessary to provide a voice conversion method, apparatus, computer device, and storage medium that can still perform high-quality voice conversion in an offline state.
A voice conversion method, the method including:
acquiring a voice to be converted and an original conversion model, the format of the original conversion model being an online format;
performing format conversion on the original conversion model to obtain a target conversion model in an offline format;
performing feature extraction on the voice to be converted to obtain features to be converted;
inputting the features to be converted into the target conversion model to obtain target features output by the target conversion model;
obtaining a target voice according to the target features output by the target conversion model, wherein the voice content of the target voice is the same as that of the voice to be converted, and the sound of the target voice is different from that of the voice to be converted.
A voice conversion apparatus, the apparatus including:
an acquisition module, configured to acquire a voice to be converted and an original conversion model, the format of the original conversion model being an online format;
a format conversion module, configured to perform format conversion on the original conversion model to obtain a target conversion model in an offline format;
a feature extraction module, configured to perform feature extraction on the voice to be converted to obtain features to be converted;
a feature conversion module, configured to input the features to be converted into the target conversion model to obtain target features output by the target conversion model;
a result module, configured to obtain a target voice according to the target features output by the target conversion model, wherein the voice content of the target voice is the same as that of the voice to be converted, and the sound of the target voice is different from that of the voice to be converted.
A computer device, including a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:
acquiring a voice to be converted and an original conversion model, the format of the original conversion model being an online format;
performing format conversion on the original conversion model to obtain a target conversion model in an offline format;
performing feature extraction on the voice to be converted to obtain features to be converted;
inputting the features to be converted into the target conversion model to obtain target features output by the target conversion model;
obtaining a target voice according to the target features output by the target conversion model, wherein the voice content of the target voice is the same as that of the voice to be converted, and the sound of the target voice is different from that of the voice to be converted.
A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
acquiring a voice to be converted and an original conversion model, the format of the original conversion model being an online format;
performing format conversion on the original conversion model to obtain a target conversion model in an offline format;
performing feature extraction on the voice to be converted to obtain features to be converted;
inputting the features to be converted into the target conversion model to obtain target features output by the target conversion model;
obtaining a target voice according to the target features output by the target conversion model, wherein the voice content of the target voice is the same as that of the voice to be converted, and the sound of the target voice is different from that of the voice to be converted.
The embodiments of this application have the following beneficial effects:
The above voice conversion method, apparatus, computer device, and computer-readable storage medium acquire the voice to be converted and the original conversion model. Because the original conversion model cannot work in an offline state, its format is converted to an offline format to obtain the target conversion model. Features are extracted from the voice to be converted to obtain the features to be converted, the target features are obtained from the features to be converted and the offline-format target conversion model, and the target voice is then obtained from the target features. This voice conversion method not only performs high-quality voice conversion in an offline state, but also runs fast and can achieve real-time voice conversion.
Brief description of the drawings
To describe the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of this application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
In the drawings:
FIG. 1 is an application environment diagram of a voice conversion method in an embodiment;
FIG. 2 is a flowchart of a voice conversion method in an embodiment;
FIG. 3 is a flowchart of a voice conversion method in an embodiment;
FIG. 4 is a schematic diagram of segmenting the voice to be converted in an embodiment;
FIG. 5 is a structural block diagram of a voice conversion apparatus in an embodiment;
FIG. 6 is a structural block diagram of a computer device in an embodiment.
Detailed description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of this application.
FIG. 1 is an application environment diagram of a voice conversion method in an embodiment. As shown in FIG. 1, the voice conversion method is applied to a voice conversion system. The voice conversion system includes a terminal, which may be a desktop terminal or a mobile terminal; the mobile terminal may be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The terminal includes a microphone, a conversion unit, and a player. The microphone acquires the voice to be converted, the conversion unit converts the voice to be converted into a target voice with the same voice content but a different sound, and the player plays the target voice.
As shown in FIG. 2, in one embodiment, a voice conversion method is provided. The method can be applied to a terminal, a server, or another voice conversion apparatus. This embodiment takes application to a voice conversion apparatus as an example. In the offline state, after acquiring the voice to be converted, the voice conversion apparatus can use the following voice conversion method to obtain a target voice with the same voice content as the voice to be converted but a different sound. The voice conversion method includes the following steps:
Step 202: Acquire a voice to be converted and an original conversion model, the format of the original conversion model being an online format.
The voice to be converted is speech uttered by the human voice to be converted, which is to be converted into speech uttered by the target human voice.
The online format is a storage format for files that can be opened or work properly only while connected to a network.
The original conversion model takes the features to be converted of the voice to be converted as input and outputs the target features of the target voice. It is used, while connected to a network, to obtain the target features of the target voice from the features to be converted.
Step 204: Perform format conversion on the original conversion model to obtain a target conversion model in an offline format.
The offline format is a storage format for files that can still be opened or work properly while disconnected from the network.
The target conversion model is used, while disconnected from the network, to obtain the target features of the target voice from the features to be converted of the voice to be converted.
Format conversion is performed on the original conversion model to obtain the target conversion model in the offline format. For example, the original conversion model is a model file trained with the TensorFlow framework (a machine learning library developed by Google, using the Python language) and saved in the online format CheckPoint (ckpt for short); its storage format can be converted to the offline format JetSoft Shield Now (jsn for short) to obtain the target conversion model. An original conversion model in ckpt format records a large amount of information, such as parameters and data used during training, which voice conversion in the offline state does not need. This redundant data is therefore removed when the storage format is converted to jsn, which simplifies and compresses the model file, improves the running speed in the offline state, speeds up voice conversion, and enables real-time voice conversion.
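
The simplification described above, in which data needed only for training is dropped during format conversion, can be illustrated with a minimal sketch. It assumes that a checkpoint is represented as a plain dictionary mapping entry names to values and that training-only entries can be recognized by a name prefix. The function `strip_training_state`, the prefixes, and the entry names are all hypothetical and do not reflect the actual ckpt or jsn layouts.

```python
from typing import Any, Dict, Tuple

# Hypothetical prefixes marking entries that are needed only during training.
TRAINING_ONLY_PREFIXES: Tuple[str, ...] = (
    "optimizer/", "global_step", "learning_rate",
)


def strip_training_state(checkpoint: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy that keeps only the entries needed for inference."""
    return {
        name: value
        for name, value in checkpoint.items()
        if not name.startswith(TRAINING_ONLY_PREFIXES)
    }
```

The smaller dictionary is what an offline model file would serialize, which is one way to read the statement that the conversion simplifies and compresses the model file.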
Step 206: Perform feature extraction on the voice to be converted to obtain features to be converted.
The features to be converted are input into the target conversion model to obtain the target features corresponding to the voice to be converted.
Spectral features of the voice to be converted, such as its Mel spectrum, are obtained from the voice to be converted; features of the voice to be converted are extracted from them, and the features to be converted are determined from these features.
Step 208: Input the features to be converted into the target conversion model to obtain the target features output by the target conversion model.
The target features are used to obtain a target voice that has the same voice content as the voice to be converted but a different sound.
In the offline state, while the target conversion model is running, the features to be converted are input into the target conversion model, and the target conversion model directly outputs the target features corresponding to the features to be converted.
Step 210: Obtain a target voice according to the target features output by the target conversion model, wherein the voice content of the target voice is the same as that of the voice to be converted, and the sound of the target voice is different from that of the voice to be converted.
The target voice is speech uttered by the target human voice whose voice content is the same as that of the voice to be converted and whose sound is different from that of the voice to be converted.
Features of the target voice such as the fundamental frequency, spectral envelope, and aperiodicity can be obtained from the target features; the Mel spectrum of the target voice is determined from them, and the target voice is obtained from that Mel spectrum. For example, the features to be converted are binarized 130-dimensional serialized data, and the target features output by the target conversion model are likewise 130-dimensional serialized data. Denormalization yields the lf0, mgc, and bap feature data of the target voice, which SPTK then converts into f0, sp, and ap features. The f0, sp, and ap of the target voice determine its Mel spectrum, from which the target voice is obtained.
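
The denormalization step in the example above can be sketched as follows. The sketch assumes mean-variance normalization with per-dimension statistics saved at training time, and a frame layout of vuv, lf0 with its two derivatives, mgc with its two derivatives, and bap with its two derivatives (1 + 3 + 123 + 3 = 130). The source states only that denormalization is applied; the normalization type, the stream order inside a frame, and all names below are assumptions.

```python
from typing import Dict, List, Sequence

# Assumed layout of one 130-dimensional frame: (stream name, width).
LAYOUT = [("vuv", 1), ("lf0", 3), ("mgc", 123), ("bap", 3)]


def denormalize(frame: Sequence[float], mean: Sequence[float],
                std: Sequence[float]) -> List[float]:
    """Invert z-score normalization per dimension: x = z * std + mean."""
    return [z * s + m for z, s, m in zip(frame, std, mean)]


def split_streams(frame: Sequence[float]) -> Dict[str, List[float]]:
    """Slice a denormalized frame into its named feature streams."""
    if len(frame) != sum(width for _, width in LAYOUT):
        raise ValueError("expected a 130-dimensional frame")
    streams: Dict[str, List[float]] = {}
    start = 0
    for name, width in LAYOUT:
        streams[name] = list(frame[start:start + width])
        start += width
    return streams
```

Under this assumed layout, the static lf0, mgc, and bap values that would be passed on for conversion to f0, sp, and ap are the first 1, 41, and 1 entries of their respective streams.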
The above voice conversion method acquires the voice to be converted and the original conversion model. Because the original conversion model cannot work in an offline state, its format is converted to an offline format to obtain the target conversion model. Features are extracted from the voice to be converted to obtain the features to be converted, the target features are obtained from the features to be converted and the offline-format target conversion model, and the target voice is then obtained from the target features. This voice conversion method not only performs high-quality voice conversion in an offline state, but also runs fast and can achieve real-time voice conversion.
In one embodiment, step 206 of performing feature extraction on the voice to be converted to obtain the features to be converted includes: performing periodic feature extraction and aperiodic feature extraction on the voice to be converted to obtain the periodic features and aperiodic features corresponding to the voice to be converted, where the periodic features include a fundamental frequency and a spectral envelope; and obtaining the features to be converted according to the periodic features and the aperiodic features.
When a person speaks, several sound sources in the vocal tract produce acoustic energy. Aperiodic sound sources include the aspiration, frication, and plosive sounds produced at the lips, teeth, throat, and vocal tract, while the periodic sound source is produced by vocal cord vibration at the glottis. The voice to be converted therefore contains periodic and aperiodic components, and its spectral features correspondingly include periodic features and aperiodic features. In this embodiment, the Mel spectrum of the voice to be converted is used as the spectral feature for description.
The fundamental frequency (f0) is defined as follows: when the original signal is regarded as composed of a set of sine waves, the sine wave with the lowest frequency is the fundamental, and the others are overtones. The spectral envelope (sp) is the envelope obtained by connecting the amplitude peaks at different frequencies with a smooth curve. The aperiodic parameter (ap) refers to the aperiodic signal parameters of the speech.
The periodic features are the fundamental frequency and the spectral envelope in the Mel spectrum of the voice to be converted.
The aperiodic feature is the aperiodic parameter sequence in the Mel spectrum of the voice to be converted.
From the periodic features and the aperiodic features, processing yields the feature data that serves as the input of the target conversion model; this feature data is the features to be converted. For example, a set of feature data is obtained from the periodic features and the aperiodic features, and this feature data is computed on and format-converted to obtain the features to be converted.
In one embodiment, obtaining the feature to be converted according to the periodic features and the aperiodic feature includes: obtaining a target-dimension feature according to the periodic features and the aperiodic feature, the dimension of the target-dimension feature being higher than the sum of the dimensions of the periodic features and the aperiodic feature; and performing format conversion on the target-dimension feature to obtain the feature to be converted.
Here, the target-dimension feature is a feature obtained from the periodic features and the aperiodic feature whose dimension is higher than that of the periodic features and the aperiodic feature. Mapping the low-dimensional periodic and aperiodic features to a high-dimensional target-dimension feature can improve the quality of the synthesized speech.
Exemplarily, the periodic features f0 and sp and the aperiodic feature ap are obtained from the Mel spectrum of the voice to be converted. The three features are processed with the Speech Signal Processing Toolkit (SPTK) to obtain a 1-dimensional lf0 (the logarithm of f0), a 41-dimensional mgc, and a 1-dimensional band aperiodicity (bap). A 1-dimensional voiced/unvoiced flag (vuv) is computed from lf0, and the first and second derivatives of lf0, mgc and bap are computed, giving 1×2, 41×2 and 1×2 dimensions of data respectively. Finally, vuv, lf0 and its derivatives, mgc and its derivatives, and bap and its derivatives are normalized, giving serialized data with a total of 1 + 3 + 123 + 3 = 130 dimensions. This 130-dimensional serialized data is used as the target-dimension feature.
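The dimension count in this example can be checked with a small sketch. It is only an illustration under stated assumptions: the derivative is approximated with a central difference, the second derivative is taken as the derivative of the first, edge frames are repeated, and the normalization step is omitted. None of these choices, nor the names `delta` and `assemble`, come from the source.

```python
from typing import Dict, List

# Static stream widths quoted in the text: lf0 = 1, mgc = 41, bap = 1.
STATIC_DIMS: Dict[str, int] = {"lf0": 1, "mgc": 41, "bap": 1}


def delta(frames: List[List[float]]) -> List[List[float]]:
    """Central-difference derivative along time (edge frames repeated; an assumption)."""
    n = len(frames)
    out = []
    for t in range(n):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, n - 1)]
        out.append([0.5 * (b - a) for a, b in zip(prev_frame, next_frame)])
    return out


def assemble(streams: Dict[str, List[List[float]]], vuv: List[float]) -> List[List[float]]:
    """Per frame: vuv, then static + first + second derivative of lf0, mgc and bap."""
    rows = [[flag] for flag in vuv]
    for name in ("lf0", "mgc", "bap"):
        static = streams[name]
        first = delta(static)
        second = delta(first)  # assumption: second derivative as delta of delta
        for t, row in enumerate(rows):
            row.extend(static[t] + first[t] + second[t])
    return rows
```

Each row then holds 1 (vuv) + 3 × (1 + 41 + 1) = 130 values, matching the total stated above.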
The target-dimension feature is format-converted so that it meets the input format requirements of the target conversion model; the feature data obtained by the format conversion is the feature to be converted. Exemplarily, when the target conversion model requires binary data as input, the target-dimension feature is converted to binary, and the resulting binary data is the feature to be converted.
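A minimal sketch of such a binary conversion follows. The source does not specify the binary layout, so the little-endian 32-bit float, frame-major layout used here is an assumption, as are the helper names `to_binary` and `from_binary`.

```python
import struct
from typing import List


def to_binary(frames: List[List[float]]) -> bytes:
    """Flatten frames row by row and pack them as little-endian float32 (assumed layout)."""
    flat = [value for frame in frames for value in frame]
    return struct.pack(f"<{len(flat)}f", *flat)


def from_binary(blob: bytes, dim: int) -> List[List[float]]:
    """Inverse of to_binary for frames of width `dim`."""
    count = len(blob) // 4  # 4 bytes per float32
    flat = struct.unpack(f"<{count}f", blob)
    return [list(flat[i:i + dim]) for i in range(0, count, dim)]
```

For the 130-dimensional feature above, each frame would occupy 130 × 4 = 520 bytes under this assumed layout.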
In one embodiment, the target conversion model runs on the Compute Unified Device Architecture (CUDA) Recurrent Neural Network Toolkit (CURRENNT) framework.
Here, CURRENNT is an open-source parallel implementation of deep recurrent neural networks (RNNs) that supports graphics processing units (GPUs) through NVIDIA's Compute Unified Device Architecture (CUDA). CURRENNT supports unidirectional and bidirectional RNNs with long short-term memory (LSTM) cells, which overcome the vanishing-gradient problem.
The target conversion model is placed in CURRENNT and is in the running state. The feature to be converted is then placed in the same CURRENNT instance, where it is input to the target conversion model, and the target conversion model outputs the target feature corresponding to the feature to be converted.
As shown in FIG. 3, in one embodiment, the method further includes:
Step 306: Segment the voice to be converted to obtain multiple segmented voices.
Because the computing resources of an offline device are limited, directly converting a long voice to be converted runs slowly and cannot achieve real-time voice conversion. Segmenting the voice to be converted yields multiple segmented voices; because each segmented voice is short, it can be converted quickly, which greatly improves the running speed. Exemplarily, when the duration of the voice to be converted is greater than a preset duration, the voice to be converted is segmented according to a preset condition. As shown in FIG. 4, the voice to be converted 41 is divided evenly by duration into 3 segments, giving 3 segmented voices 42.
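The even split by duration can be sketched as follows. The function names, the sample-based representation of the voice, and the use of a fixed segment count as the "preset condition" are assumptions made for illustration.

```python
from typing import List, Sequence


def split_evenly(samples: Sequence[float], n_segments: int) -> List[Sequence[float]]:
    """Split samples into n_segments contiguous parts of (nearly) equal length."""
    if n_segments < 1:
        raise ValueError("n_segments must be at least 1")
    n = len(samples)
    bounds = [i * n // n_segments for i in range(n_segments + 1)]
    return [samples[bounds[i]:bounds[i + 1]] for i in range(n_segments)]


def segment_if_long(samples: Sequence[float], sample_rate: int,
                    max_seconds: float, n_segments: int) -> List[Sequence[float]]:
    """Segment only when the voice is longer than the preset duration."""
    if len(samples) <= max_seconds * sample_rate:
        return [samples]
    return split_evenly(samples, n_segments)
```

With `n_segments=3`, this corresponds to the three segmented voices 42 of FIG. 4; any remainder samples end up in the later segments.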
Step 308: Perform feature extraction on the multiple segmented voices to obtain multiple segment features.
Here, a segment feature is the feature to be converted corresponding to each segmented voice.
Feature extraction is performed on each segmented voice separately, and the feature to be converted corresponding to each segmented voice is obtained from the extracted features; this is the segment feature of that segmented voice.
Step 310: Input the segment features into the target conversion model in parallel to obtain the target segment feature corresponding to each segment feature.
Here, a target segment feature is the target feature corresponding to each segment feature.
After the multiple segment features are obtained, multiple cores of the central processing unit (CPU) are used to convert them simultaneously: multiple processes are started, and each process independently inputs one segment feature into the target conversion model to obtain the target segment feature corresponding to that segment feature. Inputting the segment features into the target conversion model in parallel is much faster than converting them one after another, which helps achieve real-time voice conversion.
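One way to sketch this multi-process step is with Python's standard `concurrent.futures` module. The `convert` callable stands in for feeding one segment feature to the target conversion model and is hypothetical; the source does not describe how the processes are launched.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor
from typing import Any, Callable, List, Optional, Sequence


def convert_in_parallel(
    segment_features: Sequence[Any],
    convert: Callable[[Any], Any],
    max_workers: Optional[int] = None,
    executor_cls: Callable[..., Executor] = ProcessPoolExecutor,
) -> List[Any]:
    """Convert every segment feature concurrently; results keep the input order."""
    with executor_cls(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs, so the
        # target segment features stay aligned with their segments in time.
        return list(pool.map(convert, segment_features))
```

With the default `ProcessPoolExecutor`, `convert` must be a picklable top-level function, and on platforms that spawn worker processes the call should sit under an `if __name__ == "__main__":` guard. Passing `executor_cls=ThreadPoolExecutor` is a convenient way to exercise the same logic in a single process.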
Step 312: Obtain the target voice according to the target segment feature corresponding to each segment feature.
The target segment features corresponding to the segment features can be combined to obtain the target feature, from which the target voice is obtained. Alternatively, the target segmented voice corresponding to each target segment feature can be obtained, and the target segmented voices can be combined to obtain the target voice. Exemplarily, the voice to be converted is divided into 5 segmented voices; 5 corresponding segment features are obtained from the 5 segmented voices; the 5 segment features are input into the target conversion model to obtain 5 corresponding target segment features; 5 corresponding target segmented voices are obtained from the 5 target segment features; and the 5 target segmented voices are combined to obtain the target voice.
In one embodiment, any two temporally adjacent target segment features among the multiple target segment features include an overlapping feature, and step 312 of obtaining the target voice according to the target segment feature corresponding to each segment feature includes: obtaining the target voice according to the target segment feature corresponding to each segment feature and the overlapping feature of any two temporally adjacent target segment features among the multiple target segment features.
As shown in FIG. 4, to prevent segmentation of the voice to be converted 41 from causing errors or lost features in the subsequent feature extraction, the segmentation can be performed so that any two temporally adjacent segmented voices 42 among the multiple segmented voices 42 include an overlapping portion 421.
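A possible way to produce segments that share an overlapping portion is sketched below. Extending each segment into the start of the next one, and expressing the overlap as a sample count, are assumptions; the source only requires that adjacent segmented voices overlap.

```python
from typing import List, Sequence


def split_with_overlap(samples: Sequence[float], n_segments: int,
                       overlap: int) -> List[Sequence[float]]:
    """Split evenly, then extend every segment but the last by `overlap` samples."""
    n = len(samples)
    if n_segments < 1 or overlap < 0 or overlap > n // n_segments:
        raise ValueError("invalid n_segments or overlap for this input length")
    bounds = [i * n // n_segments for i in range(n_segments + 1)]
    return [samples[bounds[i]:min(bounds[i + 1] + overlap, n)]
            for i in range(n_segments)]
```

Under this scheme, the last `overlap` samples of one segment are the same samples as the first `overlap` samples of the next, which plays the role of the overlapping portion 421.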
Here, the overlapping feature is the target feature obtained by converting the overlapping portion 421 shared by any two temporally adjacent segmented voices 42 among the multiple segmented voices 42.
The target segment features corresponding to the segment features are merged to obtain a merged feature; the merged feature is adjusted according to the overlapping feature of any two temporally adjacent target segment features to obtain the target feature, and the target voice is then obtained from the target feature. Exemplarily, the voice to be converted is divided into 2 segmented voices, and 2 target segment features are obtained after conversion: target segment feature I is (A + C_A) and target segment feature II is (C_B + B), where C_A and C_B are the two converted copies of the overlapping feature C shared by target segment feature I and target segment feature II. When obtaining the target feature, the first half of the overlapping feature C in target segment feature I (the front half of C_A) and the second half of the overlapping feature C in target segment feature II (the rear half of C_B) can be retained, so the target feature is (A + front half of C_A + rear half of C_B + B). The target voice is obtained from this target feature.
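The half-and-half merge in this example can be sketched as follows, assuming the features are lists of frames, the overlap length in frames is known, and C_A and C_B are aligned frame by frame. The function name is illustrative.

```python
from typing import Any, List


def merge_half_overlap(feature_i: List[Any], feature_ii: List[Any],
                       overlap: int) -> List[Any]:
    """Return A + front half of C_A + rear half of C_B + B.

    feature_i  = A + C_A   (its last `overlap` frames are C_A)
    feature_ii = C_B + B   (its first `overlap` frames are C_B)
    """
    half = overlap // 2
    keep_from_i = feature_i[:len(feature_i) - overlap + half]  # A + front half of C_A
    keep_from_ii = feature_ii[half:]                           # rear half of C_B + B
    return keep_from_i + keep_from_ii
```

The merged sequence has `len(feature_i) + len(feature_ii) - overlap` frames, so the overlapping portion is counted only once.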
In one embodiment, obtaining the target voice according to the target segment feature corresponding to each segment feature and the overlapping feature of any two temporally adjacent target segment features among the multiple target segment features includes: acquiring a feature weight set, the feature weight set including a first feature weight and a second feature weight, which are the weights corresponding to the overlapping feature in any two temporally adjacent target segment features; and obtaining the target voice according to the target segment feature corresponding to each segment feature, the overlapping feature of any two temporally adjacent target segment features among the multiple target segment features, and the feature weight set.
Here, the feature weight set is used to determine the weight that the overlapping feature of any two temporally adjacent target segment features takes in each of those two target segment features.
Exemplarily, the voice to be converted is divided into 2 segmented voices, and 2 target segment features are obtained after conversion: target segment feature I is (A + C_A) and target segment feature II is (C_B + B), and the overlapping feature of target segment feature I and target segment feature II is C. In the feature weight set, the first feature weight m determines the weight of the overlapping feature C in target segment feature I, and the second feature weight n determines the weight of the overlapping feature C in target segment feature II. The target feature of the voice to be converted is (A + m×C_A + n×C_B + B), and the target voice is obtained from this target feature.
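The weighted expression can be read in more than one way. The sketch below takes one possible reading, in which the overlapping frames of the two features are blended frame by frame as m×C_A + n×C_B and placed between A and B. This interpretation, the frame-list representation and the function name are assumptions rather than something the source states.

```python
from typing import List

Frame = List[float]


def merge_weighted(feature_i: List[Frame], feature_ii: List[Frame],
                   overlap: int, m: float, n: float) -> List[Frame]:
    """Return A, then the blended overlap m*C_A + n*C_B, then B (assumed reading)."""
    if overlap == 0:
        return feature_i + feature_ii
    a, c_a = feature_i[:-overlap], feature_i[-overlap:]
    c_b, b = feature_ii[:overlap], feature_ii[overlap:]
    blended = [[m * x + n * y for x, y in zip(frame_a, frame_b)]
               for frame_a, frame_b in zip(c_a, c_b)]
    return a + blended + b
```

Choosing m + n = 1 keeps the blended frames on the same scale as the inputs; m = n = 0.5 gives a plain average of the two converted copies of the overlap.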
As shown in FIG. 5, in one embodiment, a voice conversion apparatus is provided, the apparatus including:
an acquisition module 502, configured to acquire the voice to be converted and an original conversion model, the format of the original conversion model being an online format;
a format conversion module 504, configured to perform format conversion on the original conversion model to obtain a target conversion model in an offline format;
a feature extraction module 506, configured to perform feature extraction on the voice to be converted to obtain the feature to be converted;
a feature conversion module 508, configured to input the feature to be converted into the target conversion model to obtain the target feature output by the target conversion model;
a result module 510, configured to obtain the target voice according to the target feature output by the target conversion model, where the speech content of the target voice is the same as that of the voice to be converted and the timbre of the target voice differs from that of the voice to be converted.
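The data flow between the five modules can be summarized in a short structural sketch. The class and attribute names are hypothetical, and each module is represented only by a caller-supplied function; the source does not prescribe any particular software structure.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class VoiceConversionApparatus:
    """Wires the five modules together; each field is a placeholder callable."""
    acquire: Callable[[], Tuple[Any, Any]]        # module 502: voice + original model
    convert_format: Callable[[Any], Any]          # module 504: online -> offline format
    extract_features: Callable[[Any], Any]        # module 506: voice -> feature to be converted
    convert_features: Callable[[Any, Any], Any]   # module 508: (model, feature) -> target feature
    synthesize: Callable[[Any], Any]              # module 510: target feature -> target voice

    def run(self) -> Any:
        voice, original_model = self.acquire()
        target_model = self.convert_format(original_model)
        features = self.extract_features(voice)
        target_features = self.convert_features(target_model, features)
        return self.synthesize(target_features)
```

Keeping the modules as injected callables makes it straightforward to swap in the segmented, parallel variants of modules 506, 508 and 510 described in the later embodiments.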
The above voice conversion apparatus acquires the voice to be converted and the original conversion model. Because the original conversion model cannot work offline, features are extracted from the voice to be converted to obtain the feature to be converted, and the original conversion model is converted to an offline format; the target feature is then obtained from the feature to be converted and the offline-format target conversion model, and the target voice is obtained from the target feature. This voice conversion method not only performs high-quality voice conversion offline but also runs fast and can achieve real-time voice conversion.
In one embodiment, the feature extraction module 506 is configured to perform periodic feature extraction and aperiodic feature extraction on the voice to be converted to obtain the periodic features and the aperiodic feature corresponding to the voice to be converted, the periodic features including the fundamental frequency and the spectral envelope; and to obtain the feature to be converted according to the periodic features and the aperiodic feature.
In one embodiment, the feature extraction module 506 is specifically configured to obtain a target-dimension feature according to the periodic features and the aperiodic feature, the dimension of the target-dimension feature being higher than the sum of the dimensions of the periodic features and the aperiodic feature; and to perform format conversion on the target-dimension feature to obtain the feature to be converted.
In one embodiment, the target conversion model runs on the Compute Unified Device Architecture (CUDA) Recurrent Neural Network Toolkit (CURRENNT) framework.
In one embodiment, the feature extraction module 506 is configured to segment the voice to be converted to obtain multiple segmented voices and to perform feature extraction on the multiple segmented voices to obtain multiple segment features; the feature conversion module 508 is configured to input the segment features into the target conversion model in parallel to obtain the target segment feature corresponding to each segment feature; and the result module 510 is configured to obtain the target voice according to the target segment feature corresponding to each segment feature.
In one embodiment, any two temporally adjacent target segment features among the multiple target segment features include an overlapping feature, and the result module 510 is configured to obtain the target voice according to the target segment feature corresponding to each segment feature and the overlapping feature of any two temporally adjacent target segment features among the multiple target segment features.
In one embodiment, the result module 510 is configured to acquire a feature weight set, the feature weight set including a first feature weight and a second feature weight, which are the weights corresponding to the overlapping feature in any two temporally adjacent target segment features; and to obtain the target voice according to the target segment feature corresponding to each segment feature, the overlapping feature of any two temporally adjacent target segment features among the multiple target segment features, and the feature weight set.
FIG. 6 shows the internal structure of a computer device in one embodiment. The computer device may be a terminal, a server, or a voice conversion apparatus. As shown in FIG. 6, the computer device includes a processor, a memory and a network interface connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the voice conversion method. The internal memory may also store a computer program which, when executed by the processor, causes the processor to perform the voice conversion method. Those skilled in the art will understand that the structure shown in FIG. 6 is only a block diagram of the part of the structure related to the solution of this application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, including a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:
acquiring the voice to be converted and an original conversion model, the format of the original conversion model being an online format;
performing format conversion on the original conversion model to obtain a target conversion model in an offline format;
performing feature extraction on the voice to be converted to obtain the feature to be converted;
inputting the feature to be converted into the target conversion model to obtain the target feature output by the target conversion model;
obtaining the target voice according to the target feature output by the target conversion model, where the speech content of the target voice is the same as that of the voice to be converted and the timbre of the target voice differs from that of the voice to be converted.
The above computer device acquires the voice to be converted and the original conversion model. Because the original conversion model cannot work offline, features are extracted from the voice to be converted to obtain the feature to be converted, and the original conversion model is converted to an offline format; the target feature is then obtained from the feature to be converted and the offline-format target conversion model, and the target voice is obtained from the target feature. This voice conversion method not only performs high-quality voice conversion offline but also runs fast and can achieve real-time voice conversion.
In one embodiment, performing feature extraction on the voice to be converted to obtain the feature to be converted includes: performing periodic feature extraction and aperiodic feature extraction on the voice to be converted to obtain the periodic features and the aperiodic feature corresponding to the voice to be converted, the periodic features including the fundamental frequency and the spectral envelope; and obtaining the feature to be converted according to the periodic features and the aperiodic feature.
In one embodiment, obtaining the feature to be converted according to the periodic features and the aperiodic feature includes: obtaining a target-dimension feature according to the periodic features and the aperiodic feature, the dimension of the target-dimension feature being higher than the sum of the dimensions of the periodic features and the aperiodic feature; and performing format conversion on the target-dimension feature to obtain the feature to be converted.
In one embodiment, the target conversion model runs on the Compute Unified Device Architecture (CUDA) Recurrent Neural Network Toolkit (CURRENNT) framework.
In one embodiment, performing feature extraction on the voice to be converted to obtain the feature to be converted includes: segmenting the voice to be converted to obtain multiple segmented voices; and performing feature extraction on the multiple segmented voices to obtain multiple segment features. Inputting the feature to be converted into the target conversion model to obtain the target feature output by the target conversion model includes: inputting the segment features into the target conversion model in parallel to obtain the target segment feature corresponding to each segment feature. Obtaining the target voice according to the target feature output by the target conversion model includes: obtaining the target voice according to the target segment feature corresponding to each segment feature.
In one embodiment, any two temporally adjacent target segment features among the multiple target segment features include an overlapping feature; obtaining the target voice according to the target segment feature corresponding to each segment feature includes: obtaining the target voice according to the target segment feature corresponding to each segment feature and the overlapping feature of any two temporally adjacent target segment features among the multiple target segment features.
In one embodiment, obtaining the target voice according to the target segment feature corresponding to each segment feature and the overlapping feature of any two temporally adjacent target segment features among the multiple target segment features includes: acquiring a feature weight set, the feature weight set including a first feature weight and a second feature weight, which are the weights corresponding to the overlapping feature in any two temporally adjacent target segment features; and obtaining the target voice according to the target segment feature corresponding to each segment feature, the overlapping feature of any two temporally adjacent target segment features among the multiple target segment features, and the feature weight set.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
acquiring the voice to be converted and an original conversion model, the format of the original conversion model being an online format;
performing format conversion on the original conversion model to obtain a target conversion model in an offline format;
performing feature extraction on the voice to be converted to obtain the feature to be converted;
inputting the feature to be converted into the target conversion model to obtain the target feature output by the target conversion model;
obtaining the target voice according to the target feature output by the target conversion model, where the speech content of the target voice is the same as that of the voice to be converted and the timbre of the target voice differs from that of the voice to be converted.
With the above computer-readable storage medium, the voice to be converted and the original conversion model are acquired. Because the original conversion model cannot work offline, features are extracted from the voice to be converted to obtain the feature to be converted, and the original conversion model is converted to an offline format; the target feature is then obtained from the feature to be converted and the offline-format target conversion model, and the target voice is obtained from the target feature. This voice conversion method not only performs high-quality voice conversion offline but also runs fast and can achieve real-time voice conversion.
在一个实施例中,所述对所述待转换语音进行特征提取,得到待转换特征,包括:对所述待转换语音进行周期特征提取和非周期特征提取,得到所述待转换语音对应的周期特征和非周期特征,所述周期特征包括基频和频谱包络;根据所述周期特征和所述非周期特征得到待转换特征。In one embodiment, the performing feature extraction on the voice to be converted to obtain the feature to be converted includes: performing periodic feature extraction and aperiodic feature extraction on the voice to be converted to obtain the period corresponding to the voice to be converted Features and aperiodic features, where the periodic features include a fundamental frequency and a spectrum envelope; the features to be converted are obtained according to the periodic features and the aperiodic features.
在一个实施例中,所述根据所述周期特征和所述非周期特征得到待转换特征,包括:根据所述周期特征和所述非周期特征得到目标维度特征,所述目标维度特征的维度高于所述周期特征和所述非周期特征的维度的和;对所述目标维度特征进行格式转换,得到所述待转换特征。In one embodiment, the obtaining the feature to be converted according to the periodic feature and the aperiodic feature includes: obtaining a target dimensional feature according to the periodic feature and the aperiodic feature, and the target dimensional feature has a high dimensionality Based on the sum of the dimensions of the periodic feature and the non-periodic feature; performing format conversion on the target dimensional feature to obtain the feature to be converted.
在一个实施例中,所述目标转换模型基于计算机统一设备架构递归神经网络工具包框架运行。In one embodiment, the target conversion model runs based on the recurrent neural network toolkit framework of a computer unified device architecture.
在一个实施例中,所述对所述待转换语音进行特征提取,得到待转换特征,包括:对所述待转换语音进行分段处理,得到多个分段语音;对所述多个分段语音进行特征提取,得到多个分段特征;所述将所述待转换特征输入所述目标转换模型,得到所述目标转换模型输出的目标特征,包括:将每个所述分段特征并行的输入所述目标转换模型,得到每个所述分段特征对应的目标分段特征;所述根据所述目标转换模型输出的目标特征得到目标语音,包括:根据每个所述分段特征对应的目标分段特征得到目标语音。In one embodiment, the performing feature extraction on the voice to be converted to obtain the feature to be converted includes: performing segmentation processing on the voice to be converted to obtain multiple segmented voices; Perform feature extraction on speech to obtain multiple segmented features; the inputting the features to be converted into the target conversion model to obtain the target features output by the target conversion model includes: parallelizing each of the segmented features Inputting the target conversion model to obtain the target segment feature corresponding to each of the segment features; the obtaining the target voice according to the target feature output by the target conversion model includes: according to each of the segment features corresponding The target voice is obtained by the target segmentation feature.
在一个实施例中,多个所述目标分段特征中的在时间上相邻的任意两个目标分段特征包括重叠特征;所述根据每个所述分段特征对应的目标分段特征得 到目标语音,包括:根据每个所述分段特征对应的目标分段特征和所述多个所述目标分段特征中的在时间上相邻的任意两个目标分段特征的重叠特征得到所述目标语音。In an embodiment, any two of the target segmentation features that are adjacent in time among the plurality of target segmentation features include overlapping features; and the target segmentation feature is obtained according to the target segmentation feature corresponding to each of the segmentation features. The target speech includes: obtaining the result according to the target segment feature corresponding to each of the segment features and the overlapping features of any two target segment features that are adjacent in time among the plurality of target segment features. Describe the target voice.
In one embodiment, obtaining the target voice according to the target segment feature corresponding to each of the segment features and the overlapping feature of any two temporally adjacent target segment features among the multiple target segment features includes: acquiring a feature weight set, where the feature weight set includes a first feature weight and a second feature weight, and the first feature weight and the second feature weight are the weights corresponding to the overlapping feature in any two temporally adjacent target segment features; and obtaining the target voice according to the target segment feature corresponding to each of the segment features, the overlapping feature of any two temporally adjacent target segment features among the multiple target segment features, and the feature weight set.
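One way to read the weighted handling of overlapping features is as a weighted sum over the shared frames of two adjacent segments. The following Python sketch works under that assumption. The function name `merge_with_overlap`, the fixed overlap length, the scalar frames, and the default weights of 0.5 are all hypothetical choices for illustration; the application does not specify how the weights are chosen or applied.

```python
def merge_with_overlap(segments, overlap, w_first=0.5, w_second=0.5):
    """Join temporally ordered segments whose neighbours share `overlap` frames.

    Each overlapping frame is replaced by
    w_first * (frame from the earlier segment) + w_second * (frame from the later segment).
    Assumes every segment holds at least `overlap` frames.
    """
    if not segments:
        return []
    merged = list(segments[0])
    for seg in segments[1:]:
        if overlap > 0:
            tail = merged[-overlap:]   # overlapping frames from the earlier segment
            head = seg[:overlap]       # overlapping frames from the later segment
            merged[-overlap:] = [w_first * a + w_second * b
                                 for a, b in zip(tail, head)]
        merged.extend(seg[overlap:])   # append the non-overlapping remainder
    return merged
```

For example, with `overlap=1`, the segments `[1.0, 2.0, 3.0]` and `[5.0, 4.0, 6.0]` share one frame; the default weights average 3.0 and 5.0 into 4.0 before the rest of the second segment is appended.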
It should be noted that the above voice conversion method, voice conversion apparatus, computer device, and computer-readable storage medium belong to one general inventive concept, and the contents of their respective embodiments are mutually applicable.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or another medium used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not every possible combination of these technical features is described. However, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this specification.
The above embodiments represent only several implementations of this application. Their descriptions are relatively specific and detailed, but they should not be construed as limiting the patent scope of this application. It should be pointed out that a person of ordinary skill in the art can make several variations and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (10)

  1. A voice conversion method, the method comprising:
    acquiring a voice to be converted and an original conversion model, wherein the format of the original conversion model is an online format;
    performing format conversion on the original conversion model to obtain a target conversion model in an offline format;
    performing feature extraction on the voice to be converted to obtain a feature to be converted;
    inputting the feature to be converted into the target conversion model to obtain a target feature output by the target conversion model; and
    obtaining a target voice according to the target feature output by the target conversion model, wherein the speech content of the target voice is the same as that of the voice to be converted, and the sound of the target voice is different from that of the voice to be converted.
  2. The voice conversion method according to claim 1, wherein performing feature extraction on the voice to be converted to obtain the feature to be converted comprises:
    performing periodic feature extraction and aperiodic feature extraction on the voice to be converted to obtain a periodic feature and an aperiodic feature corresponding to the voice to be converted, wherein the periodic feature includes a fundamental frequency and a spectral envelope; and
    obtaining the feature to be converted according to the periodic feature and the aperiodic feature.
  3. The voice conversion method according to claim 2, wherein obtaining the feature to be converted according to the periodic feature and the aperiodic feature comprises:
    obtaining a target-dimension feature according to the periodic feature and the aperiodic feature, wherein the dimension of the target-dimension feature is higher than the sum of the dimensions of the periodic feature and the aperiodic feature; and
    performing format conversion on the target-dimension feature to obtain the feature to be converted.
  4. The voice conversion method according to claim 1, wherein the target conversion model runs on the Compute Unified Device Architecture (CUDA) recurrent neural network toolkit framework.
  5. The voice conversion method according to claim 1, wherein performing feature extraction on the voice to be converted to obtain the feature to be converted comprises:
    segmenting the voice to be converted to obtain multiple segmented voices; and
    performing feature extraction on the multiple segmented voices to obtain multiple segment features;
    inputting the feature to be converted into the target conversion model to obtain the target feature output by the target conversion model comprises:
    inputting each of the segment features into the target conversion model in parallel to obtain a target segment feature corresponding to each of the segment features; and
    obtaining the target voice according to the target feature output by the target conversion model comprises:
    obtaining the target voice according to the target segment feature corresponding to each of the segment features.
  6. The voice conversion method according to claim 5, wherein any two temporally adjacent target segment features among the multiple target segment features include an overlapping feature, and obtaining the target voice according to the target segment feature corresponding to each of the segment features comprises:
    obtaining the target voice according to the target segment feature corresponding to each of the segment features and the overlapping feature of any two temporally adjacent target segment features among the multiple target segment features.
  7. The voice conversion method according to claim 6, wherein obtaining the target voice according to the target segment feature corresponding to each of the segment features and the overlapping feature of any two temporally adjacent target segment features among the multiple target segment features comprises:
    acquiring a feature weight set, wherein the feature weight set includes a first feature weight and a second feature weight, and the first feature weight and the second feature weight are the weights corresponding to the overlapping feature in any two temporally adjacent target segment features; and
    obtaining the target voice according to the target segment feature corresponding to each of the segment features, the overlapping feature of any two temporally adjacent target segment features among the multiple target segment features, and the feature weight set.
  8. A voice conversion apparatus, wherein the apparatus comprises:
    an acquisition module, configured to acquire a voice to be converted and an original conversion model, wherein the format of the original conversion model is an online format;
    a format conversion module, configured to perform format conversion on the original conversion model to obtain a target conversion model in an offline format;
    a feature extraction module, configured to perform feature extraction on the voice to be converted to obtain a feature to be converted;
    a feature conversion module, configured to input the feature to be converted into the target conversion model to obtain a target feature output by the target conversion model; and
    a result module, configured to obtain a target voice according to the target feature output by the target conversion model, wherein the speech content of the target voice is the same as that of the voice to be converted, and the sound of the target voice is different from that of the voice to be converted.
  9. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, causes the processor to perform the steps of the voice conversion method according to any one of claims 1 to 7.
  10. A computer device, comprising a memory and a processor, wherein the memory stores a computer program, and the computer program, when executed by the processor, causes the processor to perform the steps of the voice conversion method according to any one of claims 1 to 7.
PCT/CN2019/126865 2019-12-20 2019-12-20 Voice conversion method and apparatus, computer device and computer-readable storage medium WO2021120145A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2019/126865 WO2021120145A1 (en) 2019-12-20 2019-12-20 Voice conversion method and apparatus, computer device and computer-readable storage medium
CN201980003120.8A CN111108558B (en) 2019-12-20 2019-12-20 Voice conversion method, device, computer equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/126865 WO2021120145A1 (en) 2019-12-20 2019-12-20 Voice conversion method and apparatus, computer device and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2021120145A1 true WO2021120145A1 (en) 2021-06-24

Family

ID=70427470

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/126865 WO2021120145A1 (en) 2019-12-20 2019-12-20 Voice conversion method and apparatus, computer device and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN111108558B (en)
WO (1) WO2021120145A1 (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20040061709A (en) * 2002-12-31 2004-07-07 (주) 코아보이스 Voice Color Converter using Transforming Vocal Tract Characteristic and Method
CN1534595A (en) * 2003-03-28 2004-10-06 中颖电子(上海)有限公司 Speech sound change over synthesis device and its method
CN1645363A (en) * 2005-01-04 2005-07-27 华南理工大学 Portable realtime dialect inter-translationing device and method thereof
US20070168189A1 (en) * 2006-01-19 2007-07-19 Kabushiki Kaisha Toshiba Apparatus and method of processing speech
CN105023570A (en) * 2014-04-30 2015-11-04 安徽科大讯飞信息科技股份有限公司 method and system of transforming speech
CN107430623A (en) * 2015-05-27 2017-12-01 谷歌公司 Offline syntactic model for the dynamic updatable of resource-constrained off-line device
CN107767879A (en) * 2017-10-25 2018-03-06 北京奇虎科技有限公司 Audio conversion method and device based on tone color
CN109637551A (en) * 2018-12-26 2019-04-16 出门问问信息科技有限公司 Phonetics transfer method, device, equipment and storage medium
CN110097890A (en) * 2019-04-16 2019-08-06 北京搜狗科技发展有限公司 A kind of method of speech processing, device and the device for speech processes

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8930182B2 (en) * 2011-03-17 2015-01-06 International Business Machines Corporation Voice transformation with encoded information
US9613620B2 (en) * 2014-07-03 2017-04-04 Google Inc. Methods and systems for voice conversion
US10176819B2 (en) * 2016-07-11 2019-01-08 The Chinese University Of Hong Kong Phonetic posteriorgrams for many-to-one voice conversion
US10896669B2 (en) * 2017-05-19 2021-01-19 Baidu Usa Llc Systems and methods for multi-speaker neural text-to-speech
CN107545903B (en) * 2017-07-19 2020-11-24 南京邮电大学 Voice conversion method based on deep learning
CN107785030B (en) * 2017-10-18 2021-04-30 杭州电子科技大学 Voice conversion method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YING YAOPENG, JI XINGLONG; LIN RUYI; WANG MENGQI; MA YUQIAN: "Design and Development of Trans-Software Text to Speech APP", FUJIAN COMPUTER, FU JIAN DIAN NAO BIAN JI BU, CN, vol. 35, no. 4, 1 April 2019 (2019-04-01), CN, pages 115 - 116, XP055822100, ISSN: 1673-2782, DOI: 10.16707/j.cnki.fjpc.2019.04.042 *

Also Published As

Publication number Publication date
CN111108558A (en) 2020-05-05
CN111108558B (en) 2023-08-04

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19956272

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19956272

Country of ref document: EP

Kind code of ref document: A1
