CN117351974A - Voice conversion method, device, equipment and medium - Google Patents

Voice conversion method, device, equipment and medium

Info

Publication number
CN117351974A
CN117351974A (application CN202311425164.4A)
Authority
CN
China
Prior art keywords: voice, target, features, training, audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311425164.4A
Other languages
Chinese (zh)
Inventor
周芯永
刘忠亮
张璐
陶明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Renyimen Technology Co ltd
Original Assignee
Shanghai Renyimen Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Renyimen Technology Co ltd filed Critical Shanghai Renyimen Technology Co ltd
Priority to CN202311425164.4A
Publication of CN117351974A
Legal status: Pending


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use

Abstract

The application discloses a voice conversion method, device, equipment and medium, comprising the following steps: when the voice to be converted is longer than a preset length threshold, segmenting the voice to be converted into a plurality of voice fragments; sequentially taking each voice fragment of the plurality of voice fragments as a target voice fragment; inputting the target voice fragment into a target automatic speech recognition model to obtain semantic features of the target voice fragment, wherein the semantic features include phonetic posteriorgram features and classification features; inputting the semantic features and target voiceprint features into a target acoustic model to obtain Mel features, the target voiceprint features being the voiceprint features corresponding to the target timbre; and converting the Mel features into audio to obtain converted audio corresponding to the target voice fragment, and outputting the converted audio. In this way the voice conversion delay can be reduced, thereby improving the user experience.

Description

Voice conversion method, device, equipment and medium
Technical Field
The present disclosure relates to the field of audio processing technologies, and in particular, to a method, an apparatus, a device, and a medium for voice conversion.
Background
With the wide application of virtual social products, voice conversion has become an indispensable technical function for protecting user privacy and improving the user experience. However, existing voice conversion technology suffers from high latency: after speaking a sentence, a user typically has to wait about two to three seconds before receiving the converted voice. This long waiting time results in a poor user experience.
Disclosure of Invention
Accordingly, the present application is directed to a method, apparatus, device and medium for voice conversion, which can reduce the voice conversion delay, thereby improving the user experience. The specific scheme is as follows:
in a first aspect, the present application discloses a voice conversion method, including:
when the voice to be converted is greater than a preset length threshold, dividing the voice to be converted into a plurality of voice fragments;
sequentially taking each voice fragment in the plurality of voice fragments as a target audio fragment;
inputting the target voice fragment into a target automatic voice recognition model to obtain semantic features of the target voice fragment; wherein the semantic features include phonetic posterior features and classification features;
inputting the semantic features and the target voiceprint features into a target acoustic model to obtain Mel features; the target voiceprint features are voiceprint features corresponding to target tone colors;
and converting the Mel characteristics into audio to obtain converted audio corresponding to the target voice fragment, and outputting the converted audio.
Optionally, the target acoustic model includes an encoder and a decoder; wherein the window size of the encoder is determined based on the size of the target audio segment.
Optionally, the training process of the target acoustic model includes:
acquiring a first training sample set; the first training sample set comprises first voice samples with different time lengths;
selecting a first voice sample from the first training sample set, inputting the target automatic voice recognition model, and obtaining voice posterior diagram characteristics and classification characteristics of the first voice sample;
inputting the voice posterior graph characteristics, the classification characteristics and the voiceprint characteristics of the target voice sample of the first voice sample into an initial acoustic model for training, and determining the window size of an encoder based on the length of the selected first voice sample in the training process;
when the first training stop condition is satisfied, the current acoustic model is determined as the target acoustic model.
Optionally, the acquiring a first training sample set includes:
randomly selecting a preset number of samples from an original training sample set;
using environmental noise to carry out noise adding treatment on the selected samples to obtain noisy samples;
and determining a sample set formed by all the noisy samples and the original training sample set as a first training sample set.
Optionally, the initial acoustic model further comprises an embedding layer; correspondingly, inputting the voice posterior map feature, the classification feature and the voiceprint feature of the target voice sample of the first voice sample into an initial acoustic model for training, including:
outputting a feature matrix corresponding to voiceprint features of the target voice sample by utilizing the embedded layer; the target voice sample comprises voice samples corresponding to a plurality of speakers;
inputting the feature matrix, the voice posterior diagram feature and the classification feature of the first voice sample into an encoder to obtain coding features;
and inputting the coding feature into a decoder to obtain the Mel feature.
Optionally, the decoder includes a plurality of lightweight convolution units.
Optionally, the training process of the target automatic speech recognition model includes:
obtaining a plurality of second training samples from the second training data set to obtain a training sample group; wherein each second training sample is a single-person voice sample;
inputting the training sample set into an initial automatic speech recognition model for training;
and when the second training stopping condition is met, determining the current automatic voice recognition model as a target automatic voice recognition model.
In a second aspect, the present application discloses a voice conversion apparatus comprising:
the voice segmentation module is used for segmenting the voice to be converted into a plurality of voice fragments when the voice to be converted is larger than a preset length threshold;
the target segment determining module is used for sequentially taking each voice segment in the plurality of voice segments as a target audio segment;
the semantic feature extraction module is used for inputting the target voice fragment into a target automatic voice recognition model to obtain semantic features of the target voice fragment; wherein the semantic features include phonetic posterior features and classification features;
the Mel feature extraction module is used for inputting the semantic features and the target voiceprint features into a target acoustic model to obtain Mel features; the target voiceprint features are voiceprint features corresponding to target tone colors;
the audio conversion module is used for converting the Mel characteristics into audio to obtain converted audio corresponding to the target voice segment;
and the audio output module is used for outputting the converted audio.
In a third aspect, the present application discloses an electronic device comprising a memory and a processor, wherein:
the memory is used for storing a computer program;
the processor is configured to execute the computer program to implement the foregoing voice conversion method.
In a fourth aspect, the present application discloses a computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the foregoing speech conversion method.
As can be seen, when the voice to be converted is longer than the preset length threshold, the voice to be converted is segmented into a plurality of voice fragments; each voice fragment is sequentially taken as the target voice fragment; the target voice fragment is input into the target automatic speech recognition model to obtain the semantic features of the target voice fragment, the semantic features including phonetic posteriorgram features and classification features; the semantic features and the target voiceprint features are input into the target acoustic model to obtain Mel features, the target voiceprint features being the voiceprint features corresponding to the target timbre; and the Mel features are converted into audio to obtain converted audio corresponding to the target voice fragment, which is then output. In other words, each voice fragment is processed in turn: its semantic features are extracted by the target automatic speech recognition model, the semantic features and the target voiceprint features are fed into the target acoustic model to obtain Mel features, and the Mel features are converted into audio and output. As a result, each voice fragment can yield its converted voice as soon as that fragment has been processed, rather than only after the whole utterance has been processed. Moreover, because the target automatic speech recognition model supplies both the phonetic posteriorgram features and the classification features, and these are input together with the target voiceprint features into the acoustic model to obtain the Mel features before conversion into audio, rich semantic features are available during conversion, which improves the accuracy of the converted voice.
The beneficial effect of this application lies in reducing the voice conversion delay while preserving the accuracy of the converted voice, thereby improving the user experience.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings may be obtained according to the provided drawings without inventive effort to a person skilled in the art.
Fig. 1 is a flowchart of a voice conversion method disclosed in an embodiment of the present application;
FIG. 2 is a schematic diagram of voice conversion according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an acoustic model audio processing disclosed in an embodiment of the present application;
fig. 4 is a schematic structural diagram of a voice conversion device according to an embodiment of the present disclosure;
fig. 5 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
As noted above, existing voice conversion technology suffers from high latency: after speaking a sentence, a user typically has to wait about two to three seconds before receiving the converted voice, which leads to a poor user experience. The voice conversion scheme of the present application reduces this conversion delay and thereby improves the user experience.
Referring to fig. 1, an embodiment of the present application discloses a voice conversion method, including:
step S11: and when the voice to be converted is greater than a preset length threshold, segmenting the voice to be converted into a plurality of voice fragments.
It can be understood that the voice to be converted is the recorded user voice, and the preset length threshold can be set according to actual requirements. For example, 16 frames may be used as a window size: the user voice is packetized with this window, one packet of voice data is produced per window, and a corresponding processing result (i.e. converted voice) is output after each packet of data is processed, so that the whole flow forms a streaming output, as sketched below. When the voice to be converted is shorter than or equal to the preset length threshold, it can be directly input into the target automatic speech recognition model, the Mel features are then obtained through the target acoustic model, and the result is converted into audio for output.
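As a concrete illustration of the packetized (windowed) segmentation described above, the following Python sketch splits a recorded waveform into fixed-size frame windows. The 10 ms frame length and the sample rate are assumptions introduced here for illustration; only the 16-frame window size comes from the example in the text.

```python
import numpy as np

def segment_speech(waveform, sample_rate=16000, frame_ms=10, frames_per_window=16):
    """Split a 1-D waveform into fixed-size windows ("packets") for streaming conversion.

    Each window covers `frames_per_window` frames of `frame_ms` milliseconds,
    i.e. 16 x 10 ms = 160 ms of audio with the assumed defaults.
    """
    samples_per_frame = int(sample_rate * frame_ms / 1000)
    window_samples = samples_per_frame * frames_per_window
    return [waveform[i:i + window_samples]
            for i in range(0, len(waveform), window_samples)]

# Example: a 2-second recording is cut into ceil(2000 ms / 160 ms) = 13 windows.
speech = np.random.randn(2 * 16000).astype(np.float32)
print(len(segment_speech(speech)))  # 13
```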
Step S12: and taking each voice segment in the plurality of voice segments as a target audio segment in turn.
Step S13: inputting the target voice fragment into a target automatic voice recognition model to obtain semantic features of the target voice fragment; wherein the semantic features include phonetic posterior features and classification features.
The training process of the target automatic speech recognition model comprises the following steps: obtaining a plurality of second training samples from the second training data set to obtain a training sample group, wherein each second training sample is a single-person voice sample; inputting the training sample group into an initial automatic speech recognition model for training; and when the second training stopping condition is met, determining the current automatic speech recognition model as the target automatic speech recognition model. The plurality of second training samples may be voice samples corresponding to different speakers, so that the model can meet the real-time requirements of a multi-user real-time streaming dialogue scene. The group size (batch size) may be, for example, 32 or 64, and the training samples in a group come from multiple speakers, i.e. multiple voices. The classification features are the features output by the softmax layer, the softmax layer being a layer determined by the softmax (i.e. classification) function. It can be understood that, in the embodiment of the present application, the target automatic speech recognition model processes one voice segment at a time, so that the converted audio corresponding to each voice segment can be obtained one by one. A sketch of extracting these features is given below.
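The relationship between the phonetic posteriorgram (PPG) features and the softmax classification features can be sketched as follows in PyTorch. The GRU encoder, the hidden size, and the number of phoneme classes are illustrative assumptions; the application does not fix a particular ASR architecture.

```python
import torch
import torch.nn as nn

class ASRFeatureExtractor(nn.Module):
    """Toy ASR front end exposing both the PPG features (hidden activations)
    and the softmax-layer classification features."""

    def __init__(self, n_mels=80, hidden=256, n_phones=212):
        super().__init__()
        self.asr_encoder = nn.GRU(n_mels, hidden, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden, n_phones)

    def forward(self, mel_frames):                       # (batch, time, n_mels)
        ppg, _ = self.asr_encoder(mel_frames)            # (batch, time, hidden)
        softmax_feat = self.classifier(ppg).softmax(-1)  # (batch, time, n_phones)
        return ppg, softmax_feat

extractor = ASRFeatureExtractor()
ppg, cls = extractor(torch.randn(1, 16, 80))  # one 16-frame segment
print(ppg.shape, cls.shape)  # torch.Size([1, 16, 256]) torch.Size([1, 16, 212])
```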
Step S14: inputting the semantic features and the target voiceprint features into a target acoustic model to obtain Mel features; the target voiceprint features are voiceprint features corresponding to target tone colors.
In an embodiment of the present application, the target acoustic model includes an encoder and a decoder, wherein the window size of the encoder is determined based on the size of the target audio segment. The decoder includes a plurality of lightweight convolution units. The encoder comprises a plurality of coding units; each coding unit is a Conformer, a variant of the Transformer that combines a CNN (Convolutional Neural Network) with a Transformer. The window size of the encoder is the window size of its coding units; a simplified Conformer-style coding unit is sketched below.
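A minimal sketch of a Conformer-style coding unit (self-attention combined with a depthwise convolution module) is shown below. The dimensions, head count and kernel size are assumptions, and the real encoder described here may differ in depth and internal details.

```python
import torch
import torch.nn as nn

class ConformerUnit(nn.Module):
    """Simplified Conformer-style block: self-attention followed by a depthwise
    convolution module, each with a residual connection (illustrative only)."""

    def __init__(self, dim=256, heads=4, kernel=15):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.conv = nn.Sequential(
            nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim),  # depthwise
            nn.Conv1d(dim, dim, 1),                                        # pointwise
            nn.SiLU(),
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                      # (batch, time, dim)
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm2(x + c)

block = ConformerUnit()
print(block(torch.randn(1, 16, 256)).shape)  # torch.Size([1, 16, 256])
```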
The training process of the acoustic model comprises the following steps: acquiring a first training sample set, the first training sample set comprising first voice samples of different durations; selecting a first voice sample from the first training sample set and inputting it into the target automatic speech recognition model to obtain the phonetic posteriorgram features and classification features of the first voice sample; inputting the phonetic posteriorgram features and classification features of the first voice sample, together with the voiceprint features of the target voice sample, into an initial acoustic model for training, and determining the window size of the encoder based on the length of the selected first voice sample during training; and when the first training stop condition is satisfied, determining the current acoustic model as the target acoustic model. The window size can be set dynamically, for example to 10, 20 or 30 frames: when the input audio clip is longer, for example 3 seconds or more, a larger window is used, and conversely a smaller window is used, as in the sketch below.
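The dynamic window selection described above can be expressed as a simple duration-to-window mapping. The 10/20/30-frame values and the 3-second boundary come from the examples in the text; the intermediate 1-second threshold is an assumption added for illustration.

```python
def select_encoder_window(duration_s: float) -> int:
    """Pick the encoder window size (in frames) from the input clip duration:
    longer clips get larger windows, following the example values in the text."""
    if duration_s >= 3.0:
        return 30
    if duration_s >= 1.0:   # intermediate threshold is an assumption
        return 20
    return 10

print(select_encoder_window(0.16), select_encoder_window(1.5), select_encoder_window(4.0))
# 10 20 30
```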
Acquiring the first training sample set includes: randomly selecting a preset number of samples from an original training sample set; adding environmental noise to the selected samples to obtain noisy samples; and determining the set formed by all the noisy samples together with the original training sample set as the first training sample set. In this way, model robustness can be improved; a sketch of the noise-mixing step follows.
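The noise-augmentation step can be sketched as scaling an environmental-noise clip to a target signal-to-noise ratio before mixing it into the speech. The SNR values and helper names below are illustrative assumptions, not values specified by the application.

```python
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix environmental noise into a speech sample at the requested SNR (in dB)."""
    noise = np.resize(noise, speech.shape)              # loop/trim noise to match length
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: augment a clean sample at a few SNRs (the SNR values are assumptions).
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000).astype(np.float32)
street = rng.standard_normal(8000).astype(np.float32)
noisy_samples = [add_noise(clean, street, snr) for snr in (5, 10, 20)]
```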
Further, the initial acoustic model also comprises an embedding layer. Correspondingly, inputting the phonetic posteriorgram features and classification features of the first voice sample, together with the voiceprint features of the target voice sample, into the initial acoustic model for training includes: using the embedding layer to output a feature matrix corresponding to the voiceprint features of the target voice sample, the target voice sample comprising voice samples corresponding to a plurality of speakers; inputting the feature matrix, the phonetic posteriorgram features and the classification features of the first voice sample into the encoder to obtain encoded features; and inputting the encoded features into the decoder to obtain the Mel features. This forward pass is sketched below.
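Putting the embedding layer, encoder and decoder together, the forward pass described above might look like the sketch below. The dimensions, the number of speakers, and the simple stand-in encoder/decoder modules are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    """Illustrative acoustic model: a speaker embedding (feature matrix) plus an
    encoder over the semantic features (PPG + softmax), decoded to mel frames."""

    def __init__(self, ppg_dim=256, cls_dim=212, spk_count=64, spk_dim=128,
                 hidden=256, n_mels=80):
        super().__init__()
        self.spk_embedding = nn.Embedding(spk_count, spk_dim)   # speaker feature matrix
        self.encoder = nn.Sequential(
            nn.Linear(ppg_dim + cls_dim + spk_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden))
        self.decoder = nn.Conv1d(hidden, n_mels, kernel_size=5, padding=2)

    def forward(self, ppg, cls_feat, speaker_id):
        spk = self.spk_embedding(speaker_id)                     # (batch, spk_dim)
        spk = spk.unsqueeze(1).expand(-1, ppg.size(1), -1)       # broadcast over time
        enc = self.encoder(torch.cat([ppg, cls_feat, spk], dim=-1))
        mel = self.decoder(enc.transpose(1, 2)).transpose(1, 2)  # (batch, time, n_mels)
        return mel

model = AcousticModel()
mel = model(torch.randn(2, 16, 256), torch.randn(2, 16, 212), torch.tensor([3, 7]))
print(mel.shape)  # torch.Size([2, 16, 80])
```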
In the embodiment of the application, the target tone color (timbre) is the timbre selected by the user; the embodiment of the application can provide a plurality of timbres for the user to select from.
Step S15: and converting the Mel characteristics into audio to obtain converted audio corresponding to the target voice fragment, and outputting the converted audio.
Embodiments of the present application may utilize a vocoder to convert the Mel features into audio. In addition, the converted audio of an audio fragment can be output as soon as it is obtained, without waiting for the converted audio corresponding to the other audio fragments. In other embodiments, depending on the actual scene, the converted audio corresponding to a preset number of audio fragments may be collected first, where the preset number is smaller than the number of segmented audio fragments. For example, if the user voice is segmented into 100 audio fragments, the converted audio corresponding to 10 audio fragments can be obtained and packaged for output. In this way, the first-packet delay is reduced and the results are streamed out in parallel, which solves the problem of long user waiting time. A sketch of this streaming output is given below.
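The segment-by-segment (or packaged) streaming output described above can be sketched as the loop below. The `convert_segment` callable stands in for the ASR, acoustic model and vocoder chain and, like the package size, is an assumption introduced here for illustration.

```python
from typing import Callable, Iterable, Iterator, List

def stream_conversion(segments: Iterable, convert_segment: Callable,
                      package_size: int = 1) -> Iterator[List]:
    """Yield converted audio as soon as `package_size` segments are ready,
    instead of waiting for the whole utterance to finish."""
    package = []
    for segment in segments:
        package.append(convert_segment(segment))
        if len(package) >= package_size:
            yield package        # the first packet goes out early, reducing latency
            package = []
    if package:
        yield package            # flush the tail

# Example with a trivial stand-in converter and 10-segment packages.
for out in stream_conversion(range(100), convert_segment=lambda s: s, package_size=10):
    pass  # send `out` to the client as it becomes available
```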
As can be seen, when the voice to be converted is longer than the preset length threshold, it is segmented into a plurality of voice fragments; each voice fragment is sequentially taken as the target voice fragment and input into the target automatic speech recognition model to obtain its semantic features, which include phonetic posteriorgram features and classification features; the semantic features and the target voiceprint features are input into the target acoustic model to obtain Mel features, the target voiceprint features being the voiceprint features corresponding to the target timbre; and the Mel features are converted into audio to obtain the converted audio corresponding to the target voice fragment, which is then output. Thus, after segmentation, each voice fragment is processed individually and its converted voice is output as soon as it is ready. Moreover, during conversion the target automatic speech recognition model supplies both phonetic posteriorgram features and classification features, which are fed together with the target voiceprint features into the acoustic model to obtain the Mel features before conversion into audio; the rich semantic features obtained in this way improve the accuracy of the converted voice. The beneficial effect of this application therefore lies in reducing the voice conversion delay while preserving the accuracy of the converted voice, thereby improving the user experience.
Further, referring to fig. 2, a voice conversion schematic diagram is disclosed in the embodiment of the present application. Recording starts after the user confirms the target timbre; an ASR (Automatic Speech Recognition) model extracts the corresponding ppg (Phonetic Posteriorgram) features and softmax features (classification features) from the recorded user voice, which represent the semantic information in the speech. The acoustic model takes the ppg features, the softmax features and the voiceprint features corresponding to the target timbre as inputs and outputs the mel (i.e. Mel) acoustic features corresponding to the target timbre; the mel features then pass through a vocoder to obtain the voice-changed audio, i.e. the voice of the target speaker. The embodiment of the application can train an ASR model on multi-speaker data, which is responsible for extracting the ppg and softmax features that serve as inputs to the subsequent acoustic model. The acoustic model consists of an encoder and a decoder: the encoder takes the Conformer as its basic unit and is responsible for encoding the ppg and softmax features; the decoder takes LightWeight Conv (i.e. lightweight convolution) as its decoding unit, takes the output of the encoder as input, and is responsible for decoding the Mel features. As the vocoder, a HiFi-GAN (Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis) vocoder is used. The overall per-segment pipeline is sketched below.
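The fig. 2 pipeline (ASR feature extraction, then the acoustic model conditioned on the target-timbre voiceprint, then the vocoder) can be summarized as the function below. `asr_model`, `acoustic_model`, `vocoder` and `extract_mel` are placeholders for the components described above, not a specific published API.

```python
def convert_voice_segment(segment, asr_model, acoustic_model, vocoder,
                          target_voiceprint, extract_mel):
    """One conversion step: a speech segment in, target-timbre audio out.

    1. The ASR model yields PPG and softmax (semantic) features.
    2. The acoustic model maps them, plus the target voiceprint, to Mel features.
    3. The vocoder (e.g. HiFi-GAN) renders the Mel features as a waveform.
    """
    mel_in = extract_mel(segment)                  # acoustic features of the input audio
    ppg, softmax_feat = asr_model(mel_in)
    mel_out = acoustic_model(ppg, softmax_feat, target_voiceprint)
    return vocoder(mel_out)
```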
For training the acoustic model, the training data can be expanded by adding noise to enhance its robustness. Specifically, half of the original data is randomly selected and expanded by adding specific environmental noise (such as street noise, or noise of specific scenes such as a supermarket or a theatre) at different signal-to-noise ratios, after which the corresponding ppg and softmax features are extracted with the ASR model. A voiceprint recognition model is used to extract xvector features (i.e. voiceprint features) of the target speaker's voice segments as the speaker characterization, and the voiceprint features of a plurality of speakers are introduced into the acoustic model to construct a speaker feature matrix; this characterization gives good control over the target timbre of the voice change. In addition, a dynamically selected window size is introduced into the Conformer module of the acoustic model encoder, so that the model adapts to audio input segments of different durations, including "short-time segment" input, simulating the real-time synthesis scenario of actual use. The temporal modeling unit LSTM (Long Short-Term Memory) in the acoustic model decoder is replaced by the better-performing lightweight convolution, which improves the model's ability to stitch real-time audio segments together and increases the inference speed; a simplified lightweight convolution unit is sketched below. Referring to fig. 3, fig. 3 is a schematic diagram of the acoustic model audio processing according to an embodiment of the present application. Therefore, even when only short-time segment training data is available, a model with real-time conversion capability can be obtained by training, achieving voice conversion in real-time scenes; and when both short-time and long-time segment training data are available, a model compatible with both long and short voices is obtained. Here, short-time means shorter than a specified duration and long-time means longer than or equal to the specified duration.
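The replacement of the LSTM decoding unit by lightweight convolution can be illustrated with the simplified block below, roughly following the LightConv idea of a depthwise convolution with softmax-normalized kernels shared across channel groups. The dimensions and head count are assumptions, and the real decoder may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightConv(nn.Module):
    """Simplified lightweight-convolution decoding unit: a depthwise convolution
    whose kernels are softmax-normalized and shared across groups of channels."""

    def __init__(self, dim=256, kernel=7, heads=8):
        super().__init__()
        assert dim % heads == 0
        self.dim, self.kernel, self.heads = dim, kernel, heads
        self.weight = nn.Parameter(torch.randn(heads, 1, kernel))

    def forward(self, x):                                # (batch, time, dim)
        b, t, d = x.shape
        w = F.softmax(self.weight, dim=-1)               # normalize each kernel
        w = w.repeat_interleave(d // self.heads, dim=0)  # share kernels across channels
        x = x.transpose(1, 2)                            # (batch, dim, time)
        out = F.conv1d(x, w, padding=self.kernel // 2, groups=d)
        return out.transpose(1, 2)

unit = LightweightConv()
print(unit(torch.randn(1, 16, 256)).shape)  # torch.Size([1, 16, 256])
```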
Referring to fig. 4, an embodiment of the present application discloses a voice conversion device, including:
the voice segmentation module 11 is configured to segment the voice to be converted into a plurality of voice segments when the voice to be converted is greater than a preset length threshold;
a target segment determining module 12, configured to sequentially take each of the plurality of voice segments as a target audio segment;
the semantic feature extraction module 13 is configured to input the target speech segment into a target automatic speech recognition model, so as to obtain semantic features of the target speech segment; wherein the semantic features include phonetic posterior features and classification features;
a mel feature extraction module 14, configured to input the semantic feature and the target voiceprint feature into a target acoustic model, to obtain a mel feature; the target voiceprint features are voiceprint features corresponding to target tone colors;
an audio conversion module 15, configured to convert the mel feature into audio to obtain converted audio corresponding to the target speech segment;
an audio output module 16 for outputting the converted audio.
As can be seen, in the embodiment of the present application, when the voice to be converted is longer than the preset length threshold, it is segmented into a plurality of voice fragments; each voice fragment is sequentially taken as the target voice fragment and input into the target automatic speech recognition model to obtain its semantic features, which include phonetic posteriorgram features and classification features; the semantic features and the target voiceprint features are input into the target acoustic model to obtain Mel features, the target voiceprint features being the voiceprint features corresponding to the target timbre; and the Mel features are converted into audio to obtain the converted audio corresponding to the target voice fragment, which is then output. Thus, after segmentation, each voice fragment is processed individually and its converted voice is output as soon as it is ready. Moreover, during conversion the target automatic speech recognition model supplies both phonetic posteriorgram features and classification features, which are fed together with the target voiceprint features into the acoustic model to obtain the Mel features before conversion into audio; the rich semantic features obtained in this way improve the accuracy of the converted voice. The beneficial effect of this application therefore lies in reducing the voice conversion delay while preserving the accuracy of the converted voice, thereby improving the user experience.
In an embodiment of the present application, the target acoustic model includes an encoder and a decoder; wherein the window size of the encoder is determined based on the size of the target audio segment. The decoder includes a plurality of lightweight convolution units.
And, the training process of the target acoustic model includes: acquiring a first training sample set; the first training sample set comprises first voice samples with different time lengths; selecting a first voice sample from the first training sample set, inputting the target automatic voice recognition model, and obtaining voice posterior diagram characteristics and classification characteristics of the first voice sample; inputting the voice posterior graph characteristics, the classification characteristics and the voiceprint characteristics of the target voice sample of the first voice sample into an initial acoustic model for training, and determining the window size of an encoder based on the length of the selected first voice sample in the training process; when the first training stop condition is satisfied, the current acoustic model is determined as the target acoustic model.
Wherein the acquiring a first training sample set includes: randomly selecting a preset number of samples from an original training sample set; using environmental noise to carry out noise adding treatment on the selected samples to obtain noisy samples; and determining a sample set formed by all the noisy samples and the original training sample set as a first training sample set.
Further, the initial acoustic model further comprises an embedding layer; correspondingly, inputting the voice posterior map feature, the classification feature and the voiceprint feature of the target voice sample of the first voice sample into an initial acoustic model for training, including: outputting a feature matrix corresponding to voiceprint features of the target voice sample by utilizing the embedded layer; the target voice sample comprises voice samples corresponding to a plurality of speakers; inputting the feature matrix, the voice posterior diagram feature and the classification feature of the first voice sample into an encoder to obtain coding features; and inputting the coding feature into a decoder to obtain the Mel feature.
In addition, the training process of the target automatic voice recognition model comprises the following steps: obtaining a plurality of second training samples from the second training data set to obtain a training sample group; wherein each second training sample is a single-person voice sample; inputting the training sample set into an initial automatic speech recognition model for training; and when the second training stopping condition is met, determining the current automatic voice recognition model as a target automatic voice recognition model.
Referring to fig. 5, an embodiment of the present application discloses an electronic device 20 comprising a processor 21 and a memory 22, wherein the memory 22 is used for storing a computer program, and the processor 21 is configured to execute the computer program to implement the voice conversion method disclosed in the foregoing embodiments.
For the specific process of the above voice conversion method, reference may be made to the corresponding content disclosed in the foregoing embodiment, and no further description is given here.
The memory 22 may be a carrier for storing resources, such as a read-only memory, a random access memory, a magnetic disk or an optical disk, and the storage mode may be transient storage or permanent storage.
In addition, the electronic device 20 further includes a power supply 23, a communication interface 24, an input-output interface 25, and a communication bus 26; wherein the power supply 23 is configured to provide an operating voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and an external device, and the communication protocol to be followed is any communication protocol applicable to the technical solution of the present application, which is not specifically limited herein; the input/output interface 25 is used for acquiring external input data or outputting external output data, and the specific interface type thereof may be selected according to the specific application requirement, which is not limited herein.
Further, the embodiment of the application also discloses a computer readable storage medium for storing a computer program, wherein the computer program is executed by a processor to implement the voice conversion method disclosed in the previous embodiment.
For the specific process of the above voice conversion method, reference may be made to the corresponding content disclosed in the foregoing embodiment, and no further description is given here.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, so that the same or similar parts between the embodiments are referred to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in Random Access Memory (RAM), Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing has described in detail the voice conversion method, apparatus, device and medium provided by the present application. Specific examples have been used herein to illustrate the principles and embodiments of the present application, and the above description of the embodiments is intended only to help in understanding the method and its core ideas. Meanwhile, a person skilled in the art may make changes to the specific embodiments and the application scope in accordance with the ideas of the present application. In view of the above, the content of this description should not be construed as limiting the present application.

Claims (10)

1. A method of speech conversion, comprising:
when the voice to be converted is greater than a preset length threshold, dividing the voice to be converted into a plurality of voice fragments;
sequentially taking each voice fragment in the plurality of voice fragments as a target audio fragment;
inputting the target voice fragment into a target automatic voice recognition model to obtain semantic features of the target voice fragment; wherein the semantic features include phonetic posterior features and classification features;
inputting the semantic features and the target voiceprint features into a target acoustic model to obtain Mel features; the target voiceprint features are voiceprint features corresponding to target tone colors;
and converting the Mel characteristics into audio to obtain converted audio corresponding to the target voice fragment, and outputting the converted audio.
2. The speech conversion method according to claim 1, wherein the target acoustic model comprises an encoder and a decoder; wherein the window size of the encoder is determined based on the size of the target audio segment.
3. The speech conversion method according to claim 2, wherein the training process of the target acoustic model comprises:
acquiring a first training sample set; the first training sample set comprises first voice samples with different time lengths;
selecting a first voice sample from the first training sample set, inputting the target automatic voice recognition model, and obtaining voice posterior diagram characteristics and classification characteristics of the first voice sample;
inputting the voice posterior graph characteristics, the classification characteristics and the voiceprint characteristics of the target voice sample of the first voice sample into an initial acoustic model for training, and determining the window size of an encoder based on the length of the selected first voice sample in the training process;
when the first training stop condition is satisfied, the current acoustic model is determined as the target acoustic model.
4. The method of claim 3, wherein the obtaining a first training sample set comprises:
randomly selecting a preset number of samples from an original training sample set;
using environmental noise to carry out noise adding treatment on the selected samples to obtain noisy samples;
and determining a sample set formed by all the noisy samples and the original training sample set as a first training sample set.
5. The speech conversion method according to claim 3, wherein the initial acoustic model further comprises an embedding layer; correspondingly, inputting the voice posterior map feature, the classification feature and the voiceprint feature of the target voice sample of the first voice sample into an initial acoustic model for training, including:
outputting a feature matrix corresponding to voiceprint features of the target voice sample by utilizing the embedded layer; the target voice sample comprises voice samples corresponding to a plurality of speakers;
inputting the feature matrix, the voice posterior diagram feature and the classification feature of the first voice sample into an encoder to obtain coding features;
and inputting the coding feature into a decoder to obtain the Mel feature.
6. The voice conversion method according to claim 2, wherein the decoder includes a plurality of light-weight convolution units.
7. The speech conversion method according to any one of claims 1 to 6, wherein the training process of the target automatic speech recognition model comprises:
obtaining a plurality of second training samples from the second training data set to obtain a training sample group; wherein each second training sample is a single-person voice sample;
inputting the training sample set into an initial automatic speech recognition model for training;
and when the second training stopping condition is met, determining the current automatic voice recognition model as a target automatic voice recognition model.
8. A speech conversion apparatus, comprising:
the voice segmentation module is used for segmenting the voice to be converted into a plurality of voice fragments when the voice to be converted is larger than a preset length threshold;
the target segment determining module is used for sequentially taking each voice segment in the plurality of voice segments as a target audio segment;
the semantic feature extraction module is used for inputting the target voice fragment into a target automatic voice recognition model to obtain semantic features of the target voice fragment; wherein the semantic features include phonetic posterior features and classification features;
the Mel feature extraction module is used for inputting the semantic features and the target voiceprint features into a target acoustic model to obtain Mel features; the target voiceprint features are voiceprint features corresponding to target tone colors;
the audio conversion module is used for converting the Mel characteristics into audio to obtain converted audio corresponding to the target voice segment;
and the audio output module is used for outputting the converted audio.
9. An electronic device comprising a memory and a processor, wherein:
the memory is used for storing a computer program;
the processor is configured to execute the computer program to implement the speech conversion method according to any one of claims 1 to 7.
10. A computer readable storage medium for storing a computer program, wherein the computer program when executed by a processor implements the speech conversion method according to any one of claims 1 to 7.
CN202311425164.4A 2023-10-30 2023-10-30 Voice conversion method, device, equipment and medium Pending CN117351974A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311425164.4A CN117351974A (en) 2023-10-30 2023-10-30 Voice conversion method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311425164.4A CN117351974A (en) 2023-10-30 2023-10-30 Voice conversion method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN117351974A true CN117351974A (en) 2024-01-05

Family

ID=89355706

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311425164.4A Pending CN117351974A (en) 2023-10-30 2023-10-30 Voice conversion method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN117351974A (en)


Legal Events

Date Code Title Description
PB01 Publication