CN117351974A - Voice conversion method, device, equipment and medium - Google Patents

Voice conversion method, device, equipment and medium

Info

Publication number
CN117351974A
CN117351974A (application CN202311425164.4A)
Authority
CN
China
Prior art keywords: voice, target, features, training, audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311425164.4A
Other languages
Chinese (zh)
Inventor
周芯永
刘忠亮
张璐
陶明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Renyimen Technology Co ltd
Original Assignee
Shanghai Renyimen Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Renyimen Technology Co ltd filed Critical Shanghai Renyimen Technology Co ltd
Priority to CN202311425164.4A
Publication of CN117351974A
Legal status: Pending


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use

Abstract

The application discloses a voice conversion method, device, equipment and medium, comprising the following steps: when the voice to be converted is longer than a preset length threshold, segmenting the voice to be converted into a plurality of voice fragments; sequentially taking each voice fragment of the plurality of voice fragments as a target voice fragment; inputting the target voice fragment into a target automatic speech recognition model to obtain semantic features of the target voice fragment, wherein the semantic features include phonetic posteriorgram features and classification features; inputting the semantic features and target voiceprint features into a target acoustic model to obtain Mel features, the target voiceprint features being the voiceprint features corresponding to the target timbre; and converting the Mel features into audio to obtain converted audio corresponding to the target voice fragment, and outputting the converted audio. In this way the voice conversion delay can be reduced, thereby improving the user experience.

Description

Voice conversion method, device, equipment and medium
Technical Field
The present disclosure relates to the field of audio processing technologies, and in particular, to a method, an apparatus, a device, and a medium for voice conversion.
Background
With the wide application of virtual social products, voice conversion has become an indispensable technical function for protecting user privacy and improving the user experience. However, existing voice conversion technology suffers from high latency: after speaking a sentence, a user typically has to wait about two to three seconds before receiving the converted voice. This long waiting time results in a poor user experience.
Disclosure of Invention
Accordingly, the present application is directed to a method, apparatus, device and medium for voice conversion, which can reduce the voice conversion delay, thereby improving the user experience. The specific scheme is as follows:
in a first aspect, the present application discloses a voice conversion method, including:
when the voice to be converted is greater than a preset length threshold, dividing the voice to be converted into a plurality of voice fragments;
sequentially taking each voice fragment in the plurality of voice fragments as a target audio fragment;
inputting the target voice fragment into a target automatic voice recognition model to obtain semantic features of the target voice fragment; wherein the semantic features include phonetic posterior features and classification features;
inputting the semantic features and the target voiceprint features into a target acoustic model to obtain Mel features; the target voiceprint features are voiceprint features corresponding to target tone colors;
and converting the Mel characteristics into audio to obtain converted audio corresponding to the target voice fragment, and outputting the converted audio.
Optionally, the target acoustic model includes an encoder and a decoder; wherein the window size of the encoder is determined based on the size of the target audio segment.
Optionally, the training process of the target acoustic model includes:
acquiring a first training sample set; the first training sample set comprises first voice samples with different time lengths;
selecting a first voice sample from the first training sample set, inputting the target automatic voice recognition model, and obtaining voice posterior diagram characteristics and classification characteristics of the first voice sample;
inputting the voice posterior graph characteristics, the classification characteristics and the voiceprint characteristics of the target voice sample of the first voice sample into an initial acoustic model for training, and determining the window size of an encoder based on the length of the selected first voice sample in the training process;
when the first training stop condition is satisfied, the current acoustic model is determined as the target acoustic model.
Optionally, the acquiring a first training sample set includes:
randomly selecting a preset number of samples from an original training sample set;
using environmental noise to carry out noise adding treatment on the selected samples to obtain noisy samples;
and determining a sample set formed by all the noisy samples and the original training sample set as a first training sample set.
Optionally, the initial acoustic model further comprises an embedding layer; correspondingly, inputting the voice posterior map feature, the classification feature and the voiceprint feature of the target voice sample of the first voice sample into an initial acoustic model for training, including:
outputting a feature matrix corresponding to voiceprint features of the target voice sample by utilizing the embedded layer; the target voice sample comprises voice samples corresponding to a plurality of speakers;
inputting the feature matrix, the voice posterior diagram feature and the classification feature of the first voice sample into an encoder to obtain coding features;
and inputting the coding feature into a decoder to obtain the Mel feature.
Optionally, the decoder includes a plurality of lightweight convolution units.
Optionally, the training process of the target automatic speech recognition model includes:
obtaining a plurality of second training samples from the second training data set to obtain a training sample group; wherein each second training sample is a single-person voice sample;
inputting the training sample set into an initial automatic speech recognition model for training;
and when the second training stopping condition is met, determining the current automatic voice recognition model as a target automatic voice recognition model.
In a second aspect, the present application discloses a voice conversion apparatus comprising:
the voice segmentation module is used for segmenting the voice to be converted into a plurality of voice fragments when the voice to be converted is larger than a preset length threshold;
the target segment determining module is used for sequentially taking each voice segment in the plurality of voice segments as a target audio segment;
the semantic feature extraction module is used for inputting the target voice fragment into a target automatic voice recognition model to obtain semantic features of the target voice fragment; wherein the semantic features include phonetic posterior features and classification features;
the Mel feature extraction module is used for inputting the semantic features and the target voiceprint features into a target acoustic model to obtain Mel features; the target voiceprint features are voiceprint features corresponding to target tone colors;
the audio conversion module is used for converting the Mel characteristics into audio to obtain converted audio corresponding to the target voice segment;
and the audio output module is used for outputting the converted audio.
In a third aspect, the present application discloses an electronic device comprising a memory and a processor, wherein:
the memory is used for storing a computer program;
the processor is configured to execute the computer program to implement the foregoing voice conversion method.
In a fourth aspect, the present application discloses a computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the foregoing speech conversion method.
As can be seen, when the voice to be converted is longer than the preset length threshold, the voice to be converted is segmented into a plurality of voice fragments; each voice fragment is sequentially taken as the target voice fragment; the target voice fragment is input into the target automatic speech recognition model to obtain the semantic features of the target voice fragment, the semantic features including phonetic posteriorgram features and classification features; the semantic features and the target voiceprint features are input into the target acoustic model to obtain Mel features, the target voiceprint features being the voiceprint features corresponding to the target timbre; and the Mel features are converted into audio to obtain converted audio corresponding to the target voice fragment, which is then output. In other words, each voice fragment is processed in turn: its semantic features are extracted by the target automatic speech recognition model, the semantic features and the target voiceprint features are fed into the target acoustic model to obtain Mel features, and the Mel features are converted into audio and output. As a result, each voice fragment can yield its converted voice as soon as that fragment has been processed, rather than only after the whole utterance has been processed. Moreover, because the target automatic speech recognition model supplies both the phonetic posteriorgram features and the classification features, and these are input together with the target voiceprint features into the acoustic model to obtain the Mel features before conversion into audio, rich semantic features are available during conversion, which improves the accuracy of the converted voice.
The beneficial effect of this application lies in reducing the voice conversion delay while preserving the accuracy of the converted voice, thereby improving the user experience.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings may be obtained according to the provided drawings without inventive effort to a person skilled in the art.
Fig. 1 is a flowchart of a voice conversion method disclosed in an embodiment of the present application;
FIG. 2 is a schematic diagram of voice conversion according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an acoustic model audio processing disclosed in an embodiment of the present application;
fig. 4 is a schematic structural diagram of a voice conversion device according to an embodiment of the present disclosure;
fig. 5 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
As noted above, existing voice conversion technology suffers from high latency: after speaking a sentence, a user typically has to wait about two to three seconds before receiving the converted voice, which leads to a poor user experience. The voice conversion scheme of the present application reduces this conversion delay and thereby improves the user experience.
Referring to fig. 1, an embodiment of the present application discloses a voice conversion method, including:
step S11: and when the voice to be converted is greater than a preset length threshold, segmenting the voice to be converted into a plurality of voice fragments.
It can be understood that the voice to be converted is the recorded user voice, and the preset length threshold can be set according to actual requirements. For example, 16 frames may be used as a window size: the user voice is packetized with this window, one packet of voice data is produced per window, and a corresponding processing result (i.e. converted voice) is output after each packet of data is processed, so that the whole flow forms a streaming output, as sketched below. When the voice to be converted is shorter than or equal to the preset length threshold, it can be directly input into the target automatic speech recognition model, the Mel features are then obtained through the target acoustic model, and the result is converted into audio for output.
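As a concrete illustration of the packetized (windowed) segmentation described above, the following Python sketch splits a recorded waveform into fixed-size frame windows. The 10 ms frame length and the sample rate are assumptions introduced here for illustration; only the 16-frame window size comes from the example in the text.

```python
import numpy as np

def segment_speech(waveform, sample_rate=16000, frame_ms=10, frames_per_window=16):
    """Split a 1-D waveform into fixed-size windows ("packets") for streaming conversion.

    Each window covers `frames_per_window` frames of `frame_ms` milliseconds,
    i.e. 16 x 10 ms = 160 ms of audio with the assumed defaults.
    """
    samples_per_frame = int(sample_rate * frame_ms / 1000)
    window_samples = samples_per_frame * frames_per_window
    return [waveform[i:i + window_samples]
            for i in range(0, len(waveform), window_samples)]

# Example: a 2-second recording is cut into ceil(2000 ms / 160 ms) = 13 windows.
speech = np.random.randn(2 * 16000).astype(np.float32)
print(len(segment_speech(speech)))  # 13
```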
Step S12: and taking each voice segment in the plurality of voice segments as a target audio segment in turn.
Step S13: inputting the target voice fragment into a target automatic voice recognition model to obtain semantic features of the target voice fragment; wherein the semantic features include phonetic posterior features and classification features.
The training process of the target automatic speech recognition model comprises the following steps: obtaining a plurality of second training samples from the second training data set to obtain a training sample group, wherein each second training sample is a single-person voice sample; inputting the training sample group into an initial automatic speech recognition model for training; and when the second training stopping condition is met, determining the current automatic speech recognition model as the target automatic speech recognition model. The plurality of second training samples may be voice samples corresponding to different speakers, so that the model can meet the real-time requirements of a multi-user real-time streaming dialogue scene. The group size (batch size) may be, for example, 32 or 64, and the training samples in a group come from multiple speakers, i.e. multiple voices. The classification features are the features output by the softmax layer, the softmax layer being a layer determined by the softmax (i.e. classification) function. It can be understood that, in the embodiment of the present application, the target automatic speech recognition model processes one voice segment at a time, so that the converted audio corresponding to each voice segment can be obtained one by one. A sketch of extracting these features is given below.
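The relationship between the phonetic posteriorgram (PPG) features and the softmax classification features can be sketched as follows in PyTorch. The GRU encoder, the hidden size, and the number of phoneme classes are illustrative assumptions; the application does not fix a particular ASR architecture.

```python
import torch
import torch.nn as nn

class ASRFeatureExtractor(nn.Module):
    """Toy ASR front end exposing both the PPG features (hidden activations)
    and the softmax-layer classification features."""

    def __init__(self, n_mels=80, hidden=256, n_phones=212):
        super().__init__()
        self.asr_encoder = nn.GRU(n_mels, hidden, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden, n_phones)

    def forward(self, mel_frames):                       # (batch, time, n_mels)
        ppg, _ = self.asr_encoder(mel_frames)            # (batch, time, hidden)
        softmax_feat = self.classifier(ppg).softmax(-1)  # (batch, time, n_phones)
        return ppg, softmax_feat

extractor = ASRFeatureExtractor()
ppg, cls = extractor(torch.randn(1, 16, 80))  # one 16-frame segment
print(ppg.shape, cls.shape)  # torch.Size([1, 16, 256]) torch.Size([1, 16, 212])
```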
Step S14: inputting the semantic features and the target voiceprint features into a target acoustic model to obtain Mel features; the target voiceprint features are voiceprint features corresponding to target tone colors.
In an embodiment of the present application, the target acoustic model includes an encoder and a decoder, wherein the window size of the encoder is determined based on the size of the target audio segment. The decoder includes a plurality of lightweight convolution units. The encoder comprises a plurality of coding units; each coding unit is a Conformer, a variant of the Transformer that combines a CNN (Convolutional Neural Network) with a Transformer. The window size of the encoder is the window size of its coding units; a simplified Conformer-style coding unit is sketched below.
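A minimal sketch of a Conformer-style coding unit (self-attention combined with a depthwise convolution module) is shown below. The dimensions, head count and kernel size are assumptions, and the real encoder described here may differ in depth and internal details.

```python
import torch
import torch.nn as nn

class ConformerUnit(nn.Module):
    """Simplified Conformer-style block: self-attention followed by a depthwise
    convolution module, each with a residual connection (illustrative only)."""

    def __init__(self, dim=256, heads=4, kernel=15):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.conv = nn.Sequential(
            nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim),  # depthwise
            nn.Conv1d(dim, dim, 1),                                        # pointwise
            nn.SiLU(),
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                      # (batch, time, dim)
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm2(x + c)

block = ConformerUnit()
print(block(torch.randn(1, 16, 256)).shape)  # torch.Size([1, 16, 256])
```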
The training process of the acoustic model comprises the following steps: acquiring a first training sample set, the first training sample set comprising first voice samples of different durations; selecting a first voice sample from the first training sample set and inputting it into the target automatic speech recognition model to obtain the phonetic posteriorgram features and classification features of the first voice sample; inputting the phonetic posteriorgram features and classification features of the first voice sample, together with the voiceprint features of the target voice sample, into an initial acoustic model for training, and determining the window size of the encoder based on the length of the selected first voice sample during training; and when the first training stop condition is satisfied, determining the current acoustic model as the target acoustic model. The window size can be set dynamically, for example to 10, 20 or 30 frames: when the input audio clip is longer, for example 3 seconds or more, a larger window is used, and conversely a smaller window is used, as in the sketch below.
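The dynamic window selection described above can be expressed as a simple duration-to-window mapping. The 10/20/30-frame values and the 3-second boundary come from the examples in the text; the intermediate 1-second threshold is an assumption added for illustration.

```python
def select_encoder_window(duration_s: float) -> int:
    """Pick the encoder window size (in frames) from the input clip duration:
    longer clips get larger windows, following the example values in the text."""
    if duration_s >= 3.0:
        return 30
    if duration_s >= 1.0:   # intermediate threshold is an assumption
        return 20
    return 10

print(select_encoder_window(0.16), select_encoder_window(1.5), select_encoder_window(4.0))
# 10 20 30
```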
Acquiring the first training sample set includes: randomly selecting a preset number of samples from an original training sample set; adding environmental noise to the selected samples to obtain noisy samples; and determining the set formed by all the noisy samples together with the original training sample set as the first training sample set. In this way, model robustness can be improved; a sketch of the noise-mixing step follows.
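The noise-augmentation step can be sketched as scaling an environmental-noise clip to a target signal-to-noise ratio before mixing it into the speech. The SNR values and helper names below are illustrative assumptions, not values specified by the application.

```python
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix environmental noise into a speech sample at the requested SNR (in dB)."""
    noise = np.resize(noise, speech.shape)              # loop/trim noise to match length
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: augment a clean sample at a few SNRs (the SNR values are assumptions).
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000).astype(np.float32)
street = rng.standard_normal(8000).astype(np.float32)
noisy_samples = [add_noise(clean, street, snr) for snr in (5, 10, 20)]
```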
Further, the initial acoustic model also comprises an embedding layer. Correspondingly, inputting the phonetic posteriorgram features and classification features of the first voice sample, together with the voiceprint features of the target voice sample, into the initial acoustic model for training includes: using the embedding layer to output a feature matrix corresponding to the voiceprint features of the target voice sample, the target voice sample comprising voice samples corresponding to a plurality of speakers; inputting the feature matrix, the phonetic posteriorgram features and the classification features of the first voice sample into the encoder to obtain encoded features; and inputting the encoded features into the decoder to obtain the Mel features. This forward pass is sketched below.
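Putting the embedding layer, encoder and decoder together, the forward pass described above might look like the sketch below. The dimensions, the number of speakers, and the simple stand-in encoder/decoder modules are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    """Illustrative acoustic model: a speaker embedding (feature matrix) plus an
    encoder over the semantic features (PPG + softmax), decoded to mel frames."""

    def __init__(self, ppg_dim=256, cls_dim=212, spk_count=64, spk_dim=128,
                 hidden=256, n_mels=80):
        super().__init__()
        self.spk_embedding = nn.Embedding(spk_count, spk_dim)   # speaker feature matrix
        self.encoder = nn.Sequential(
            nn.Linear(ppg_dim + cls_dim + spk_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden))
        self.decoder = nn.Conv1d(hidden, n_mels, kernel_size=5, padding=2)

    def forward(self, ppg, cls_feat, speaker_id):
        spk = self.spk_embedding(speaker_id)                     # (batch, spk_dim)
        spk = spk.unsqueeze(1).expand(-1, ppg.size(1), -1)       # broadcast over time
        enc = self.encoder(torch.cat([ppg, cls_feat, spk], dim=-1))
        mel = self.decoder(enc.transpose(1, 2)).transpose(1, 2)  # (batch, time, n_mels)
        return mel

model = AcousticModel()
mel = model(torch.randn(2, 16, 256), torch.randn(2, 16, 212), torch.tensor([3, 7]))
print(mel.shape)  # torch.Size([2, 16, 80])
```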
In the embodiment of the application, the target tone color (timbre) is the timbre selected by the user; the embodiment of the application can provide a plurality of timbres for the user to select from.
Step S15: and converting the Mel characteristics into audio to obtain converted audio corresponding to the target voice fragment, and outputting the converted audio.
Embodiments of the present application may utilize a vocoder to convert the Mel features into audio. In addition, the converted audio of an audio fragment can be output as soon as it is obtained, without waiting for the converted audio corresponding to the other audio fragments. In other embodiments, depending on the actual scene, the converted audio corresponding to a preset number of audio fragments may be collected first, where the preset number is smaller than the number of segmented audio fragments. For example, if the user voice is segmented into 100 audio fragments, the converted audio corresponding to 10 audio fragments can be obtained and packaged for output. In this way, the first-packet delay is reduced and the results are streamed out in parallel, which solves the problem of long user waiting time. A sketch of this streaming output is given below.
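The segment-by-segment (or packaged) streaming output described above can be sketched as the loop below. The `convert_segment` callable stands in for the ASR, acoustic model and vocoder chain and, like the package size, is an assumption introduced here for illustration.

```python
from typing import Callable, Iterable, Iterator, List

def stream_conversion(segments: Iterable, convert_segment: Callable,
                      package_size: int = 1) -> Iterator[List]:
    """Yield converted audio as soon as `package_size` segments are ready,
    instead of waiting for the whole utterance to finish."""
    package = []
    for segment in segments:
        package.append(convert_segment(segment))
        if len(package) >= package_size:
            yield package        # the first packet goes out early, reducing latency
            package = []
    if package:
        yield package            # flush the tail

# Example with a trivial stand-in converter and 10-segment packages.
for out in stream_conversion(range(100), convert_segment=lambda s: s, package_size=10):
    pass  # send `out` to the client as it becomes available
```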
As can be seen, when the voice to be converted is longer than the preset length threshold, it is segmented into a plurality of voice fragments; each voice fragment is sequentially taken as the target voice fragment and input into the target automatic speech recognition model to obtain its semantic features, which include phonetic posteriorgram features and classification features; the semantic features and the target voiceprint features are input into the target acoustic model to obtain Mel features, the target voiceprint features being the voiceprint features corresponding to the target timbre; and the Mel features are converted into audio to obtain the converted audio corresponding to the target voice fragment, which is then output. Thus, after segmentation, each voice fragment is processed individually and its converted voice is output as soon as it is ready. Moreover, during conversion the target automatic speech recognition model supplies both phonetic posteriorgram features and classification features, which are fed together with the target voiceprint features into the acoustic model to obtain the Mel features before conversion into audio; the rich semantic features obtained in this way improve the accuracy of the converted voice. The beneficial effect of this application therefore lies in reducing the voice conversion delay while preserving the accuracy of the converted voice, thereby improving the user experience.
Further, referring to fig. 2, a voice conversion schematic diagram is disclosed in the embodiment of the present application. Recording starts after the user confirms the target timbre; an ASR (Automatic Speech Recognition) model extracts the corresponding ppg (Phonetic Posteriorgram) features and softmax features (classification features) from the recorded user voice, which represent the semantic information in the speech. The acoustic model takes the ppg features, the softmax features and the voiceprint features corresponding to the target timbre as inputs and outputs the mel (i.e. Mel) acoustic features corresponding to the target timbre; the mel features then pass through a vocoder to obtain the voice-changed audio, i.e. the voice of the target speaker. The embodiment of the application can train an ASR model on multi-speaker data, which is responsible for extracting the ppg and softmax features that serve as inputs to the subsequent acoustic model. The acoustic model consists of an encoder and a decoder: the encoder takes the Conformer as its basic unit and is responsible for encoding the ppg and softmax features; the decoder takes LightWeight Conv (i.e. lightweight convolution) as its decoding unit, takes the output of the encoder as input, and is responsible for decoding the Mel features. As the vocoder, a HiFi-GAN (Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis) vocoder is used. The overall per-segment pipeline is sketched below.
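The fig. 2 pipeline (ASR feature extraction, then the acoustic model conditioned on the target-timbre voiceprint, then the vocoder) can be summarized as the function below. `asr_model`, `acoustic_model`, `vocoder` and `extract_mel` are placeholders for the components described above, not a specific published API.

```python
def convert_voice_segment(segment, asr_model, acoustic_model, vocoder,
                          target_voiceprint, extract_mel):
    """One conversion step: a speech segment in, target-timbre audio out.

    1. The ASR model yields PPG and softmax (semantic) features.
    2. The acoustic model maps them, plus the target voiceprint, to Mel features.
    3. The vocoder (e.g. HiFi-GAN) renders the Mel features as a waveform.
    """
    mel_in = extract_mel(segment)                  # acoustic features of the input audio
    ppg, softmax_feat = asr_model(mel_in)
    mel_out = acoustic_model(ppg, softmax_feat, target_voiceprint)
    return vocoder(mel_out)
```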
For training the acoustic model, the training data can be expanded by adding noise to enhance its robustness. Specifically, half of the original data is randomly selected and expanded by adding specific environmental noise (such as street noise, or noise of specific scenes such as a supermarket or a theatre) at different signal-to-noise ratios, after which the corresponding ppg and softmax features are extracted with the ASR model. A voiceprint recognition model is used to extract xvector features (i.e. voiceprint features) of the target speaker's voice segments as the speaker characterization, and the voiceprint features of a plurality of speakers are introduced into the acoustic model to construct a speaker feature matrix; this characterization gives good control over the target timbre of the voice change. In addition, a dynamically selected window size is introduced into the Conformer module of the acoustic model encoder, so that the model adapts to audio input segments of different durations, including "short-time segment" input, simulating the real-time synthesis scenario of actual use. The temporal modeling unit LSTM (Long Short-Term Memory) in the acoustic model decoder is replaced by the better-performing lightweight convolution, which improves the model's ability to stitch real-time audio segments together and increases the inference speed; a simplified lightweight convolution unit is sketched below. Referring to fig. 3, fig. 3 is a schematic diagram of the acoustic model audio processing according to an embodiment of the present application. Therefore, even when only short-time segment training data is available, a model with real-time conversion capability can be obtained by training, achieving voice conversion in real-time scenes; and when both short-time and long-time segment training data are available, a model compatible with both long and short voices is obtained. Here, short-time means shorter than a specified duration and long-time means longer than or equal to the specified duration.
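The replacement of the LSTM decoding unit by lightweight convolution can be illustrated with the simplified block below, roughly following the LightConv idea of a depthwise convolution with softmax-normalized kernels shared across channel groups. The dimensions and head count are assumptions, and the real decoder may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightConv(nn.Module):
    """Simplified lightweight-convolution decoding unit: a depthwise convolution
    whose kernels are softmax-normalized and shared across groups of channels."""

    def __init__(self, dim=256, kernel=7, heads=8):
        super().__init__()
        assert dim % heads == 0
        self.dim, self.kernel, self.heads = dim, kernel, heads
        self.weight = nn.Parameter(torch.randn(heads, 1, kernel))

    def forward(self, x):                                # (batch, time, dim)
        b, t, d = x.shape
        w = F.softmax(self.weight, dim=-1)               # normalize each kernel
        w = w.repeat_interleave(d // self.heads, dim=0)  # share kernels across channels
        x = x.transpose(1, 2)                            # (batch, dim, time)
        out = F.conv1d(x, w, padding=self.kernel // 2, groups=d)
        return out.transpose(1, 2)

unit = LightweightConv()
print(unit(torch.randn(1, 16, 256)).shape)  # torch.Size([1, 16, 256])
```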
Referring to fig. 4, an embodiment of the present application discloses a voice conversion device, including:
the voice segmentation module 11 is configured to segment the voice to be converted into a plurality of voice segments when the voice to be converted is greater than a preset length threshold;
a target segment determining module 12, configured to sequentially take each of the plurality of voice segments as a target audio segment;
the semantic feature extraction module 13 is configured to input the target speech segment into a target automatic speech recognition model, so as to obtain semantic features of the target speech segment; wherein the semantic features include phonetic posterior features and classification features;
a mel feature extraction module 14, configured to input the semantic feature and the target voiceprint feature into a target acoustic model, to obtain a mel feature; the target voiceprint features are voiceprint features corresponding to target tone colors;
an audio conversion module 15, configured to convert the mel feature into audio to obtain converted audio corresponding to the target speech segment;
an audio output module 16 for outputting the converted audio.
As can be seen, in the embodiment of the present application, when the voice to be converted is longer than the preset length threshold, it is segmented into a plurality of voice fragments; each voice fragment is sequentially taken as the target voice fragment and input into the target automatic speech recognition model to obtain its semantic features, which include phonetic posteriorgram features and classification features; the semantic features and the target voiceprint features are input into the target acoustic model to obtain Mel features, the target voiceprint features being the voiceprint features corresponding to the target timbre; and the Mel features are converted into audio to obtain the converted audio corresponding to the target voice fragment, which is then output. Thus, after segmentation, each voice fragment is processed individually and its converted voice is output as soon as it is ready. Moreover, during conversion the target automatic speech recognition model supplies both phonetic posteriorgram features and classification features, which are fed together with the target voiceprint features into the acoustic model to obtain the Mel features before conversion into audio; the rich semantic features obtained in this way improve the accuracy of the converted voice. The beneficial effect of this application therefore lies in reducing the voice conversion delay while preserving the accuracy of the converted voice, thereby improving the user experience.
In an embodiment of the present application, the target acoustic model includes an encoder and a decoder; wherein the window size of the encoder is determined based on the size of the target audio segment. The decoder includes a plurality of lightweight convolution units.
And, the training process of the target acoustic model includes: acquiring a first training sample set; the first training sample set comprises first voice samples with different time lengths; selecting a first voice sample from the first training sample set, inputting the target automatic voice recognition model, and obtaining voice posterior diagram characteristics and classification characteristics of the first voice sample; inputting the voice posterior graph characteristics, the classification characteristics and the voiceprint characteristics of the target voice sample of the first voice sample into an initial acoustic model for training, and determining the window size of an encoder based on the length of the selected first voice sample in the training process; when the first training stop condition is satisfied, the current acoustic model is determined as the target acoustic model.
Wherein the acquiring a first training sample set includes: randomly selecting a preset number of samples from an original training sample set; using environmental noise to carry out noise adding treatment on the selected samples to obtain noisy samples; and determining a sample set formed by all the noisy samples and the original training sample set as a first training sample set.
Further, the initial acoustic model further comprises an embedding layer; correspondingly, inputting the voice posterior map feature, the classification feature and the voiceprint feature of the target voice sample of the first voice sample into an initial acoustic model for training, including: outputting a feature matrix corresponding to voiceprint features of the target voice sample by utilizing the embedded layer; the target voice sample comprises voice samples corresponding to a plurality of speakers; inputting the feature matrix, the voice posterior diagram feature and the classification feature of the first voice sample into an encoder to obtain coding features; and inputting the coding feature into a decoder to obtain the Mel feature.
In addition, the training process of the target automatic voice recognition model comprises the following steps: obtaining a plurality of second training samples from the second training data set to obtain a training sample group; wherein each second training sample is a single-person voice sample; inputting the training sample set into an initial automatic speech recognition model for training; and when the second training stopping condition is met, determining the current automatic voice recognition model as a target automatic voice recognition model.
Referring to fig. 5, an embodiment of the present application discloses an electronic device 20 comprising a processor 21 and a memory 22, wherein the memory 22 is used for storing a computer program, and the processor 21 is configured to execute the computer program to implement the voice conversion method disclosed in the foregoing embodiments.
For the specific process of the above voice conversion method, reference may be made to the corresponding content disclosed in the foregoing embodiment, and no further description is given here.
The memory 22 may be a carrier for storing resources, such as a read-only memory, a random access memory, a magnetic disk or an optical disk, and the storage mode may be transient storage or permanent storage.
In addition, the electronic device 20 further includes a power supply 23, a communication interface 24, an input-output interface 25, and a communication bus 26; wherein the power supply 23 is configured to provide an operating voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and an external device, and the communication protocol to be followed is any communication protocol applicable to the technical solution of the present application, which is not specifically limited herein; the input/output interface 25 is used for acquiring external input data or outputting external output data, and the specific interface type thereof may be selected according to the specific application requirement, which is not limited herein.
Further, the embodiment of the application also discloses a computer readable storage medium for storing a computer program, wherein the computer program is executed by a processor to implement the voice conversion method disclosed in the previous embodiment.
For the specific process of the above voice conversion method, reference may be made to the corresponding content disclosed in the foregoing embodiment, and no further description is given here.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, so that the same or similar parts between the embodiments are referred to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in Random Access Memory (RAM), Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing has described in detail the voice conversion method, apparatus, device and medium provided by the present application. Specific examples have been used herein to illustrate the principles and embodiments of the present application, and the above description of the embodiments is intended only to help in understanding the method and its core ideas. Meanwhile, a person skilled in the art may make changes to the specific embodiments and the application scope in accordance with the ideas of the present application. In view of the above, the content of this description should not be construed as limiting the present application.

Claims (10)

1. A method of speech conversion, comprising:
when the voice to be converted is greater than a preset length threshold, dividing the voice to be converted into a plurality of voice fragments;
sequentially taking each voice fragment in the plurality of voice fragments as a target audio fragment;
inputting the target voice fragment into a target automatic voice recognition model to obtain semantic features of the target voice fragment; wherein the semantic features include phonetic posterior features and classification features;
inputting the semantic features and the target voiceprint features into a target acoustic model to obtain Mel features; the target voiceprint features are voiceprint features corresponding to target tone colors;
and converting the Mel characteristics into audio to obtain converted audio corresponding to the target voice fragment, and outputting the converted audio.
2. The speech conversion method according to claim 1, wherein the target acoustic model comprises an encoder and a decoder; wherein the window size of the encoder is determined based on the size of the target audio segment.
3. The speech conversion method according to claim 2, wherein the training process of the target acoustic model comprises:
acquiring a first training sample set; the first training sample set comprises first voice samples with different time lengths;
selecting a first voice sample from the first training sample set, inputting the target automatic voice recognition model, and obtaining voice posterior diagram characteristics and classification characteristics of the first voice sample;
inputting the voice posterior graph characteristics, the classification characteristics and the voiceprint characteristics of the target voice sample of the first voice sample into an initial acoustic model for training, and determining the window size of an encoder based on the length of the selected first voice sample in the training process;
when the first training stop condition is satisfied, the current acoustic model is determined as the target acoustic model.
4. The method of claim 3, wherein the obtaining a first training sample set comprises:
randomly selecting a preset number of samples from an original training sample set;
using environmental noise to carry out noise adding treatment on the selected samples to obtain noisy samples;
and determining a sample set formed by all the noisy samples and the original training sample set as a first training sample set.
5. The speech conversion method according to claim 3, wherein the initial acoustic model further comprises an embedding layer; correspondingly, inputting the voice posterior map feature, the classification feature and the voiceprint feature of the target voice sample of the first voice sample into an initial acoustic model for training, including:
outputting a feature matrix corresponding to voiceprint features of the target voice sample by utilizing the embedded layer; the target voice sample comprises voice samples corresponding to a plurality of speakers;
inputting the feature matrix, the voice posterior diagram feature and the classification feature of the first voice sample into an encoder to obtain coding features;
and inputting the coding feature into a decoder to obtain the Mel feature.
6. The voice conversion method according to claim 2, wherein the decoder includes a plurality of light-weight convolution units.
7. The speech conversion method according to any one of claims 1 to 6, wherein the training process of the target automatic speech recognition model comprises:
obtaining a plurality of second training samples from the second training data set to obtain a training sample group; wherein each second training sample is a single-person voice sample;
inputting the training sample set into an initial automatic speech recognition model for training;
and when the second training stopping condition is met, determining the current automatic voice recognition model as a target automatic voice recognition model.
8. A speech conversion apparatus, comprising:
the voice segmentation module is used for segmenting the voice to be converted into a plurality of voice fragments when the voice to be converted is larger than a preset length threshold;
the target segment determining module is used for sequentially taking each voice segment in the plurality of voice segments as a target audio segment;
the semantic feature extraction module is used for inputting the target voice fragment into a target automatic voice recognition model to obtain semantic features of the target voice fragment; wherein the semantic features include phonetic posterior features and classification features;
the Mel feature extraction module is used for inputting the semantic features and the target voiceprint features into a target acoustic model to obtain Mel features; the target voiceprint features are voiceprint features corresponding to target tone colors;
the audio conversion module is used for converting the Mel characteristics into audio to obtain converted audio corresponding to the target voice segment;
and the audio output module is used for outputting the converted audio.
9. An electronic device comprising a memory and a processor, wherein:
the memory is used for storing a computer program;
the processor is configured to execute the computer program to implement the speech conversion method according to any one of claims 1 to 7.
10. A computer readable storage medium for storing a computer program, wherein the computer program when executed by a processor implements the speech conversion method according to any one of claims 1 to 7.
CN202311425164.4A 2023-10-30 2023-10-30 Voice conversion method, device, equipment and medium Pending CN117351974A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311425164.4A CN117351974A (en) 2023-10-30 2023-10-30 Voice conversion method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311425164.4A CN117351974A (en) 2023-10-30 2023-10-30 Voice conversion method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN117351974A true CN117351974A (en) 2024-01-05

Family

ID=89355706

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311425164.4A Pending CN117351974A (en) 2023-10-30 2023-10-30 Voice conversion method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN117351974A (en)


Legal Events

Date Code Title Description
PB01 Publication