CN115910083A - Real-time voice conversion method, device, electronic equipment and medium


Info

Publication number
CN115910083A
CN115910083A (application CN202211329075.5A)
Authority
CN
China
Prior art keywords: voice, target, information, voice data, data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211329075.5A
Other languages
Chinese (zh)
Inventor
朱鹏程
宁子谦
薛鹤洋
郭帅
张晴
毕梦霄
吕唐杰
范长杰
胡志鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202211329075.5A priority Critical patent/CN115910083A/en
Publication of CN115910083A publication Critical patent/CN115910083A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The application provides a real-time voice conversion method, apparatus, electronic device and medium, wherein the method comprises the following steps: intercepting, from voice data of a source speaking object recorded in real time, first voice data meeting a voice segmentation condition; processing the first voice data to extract first semantic information; inputting the first semantic information into a pre-trained voice conversion model, and converting, through the voice conversion model, the first semantic information and effective information of the previous historical voice data of the first voice data to obtain target voice characteristic information corresponding to the first semantic information and the voice factor of a target speaking object; and reconstructing the target voice characteristic information to obtain second voice data converted from the first voice data, thereby realizing low-delay streaming reasoning and achieving low-delay, high-performance real-time voice conversion.

Description

Real-time voice conversion method, device, electronic equipment and medium
Technical Field
The present application relates to the field of voice conversion, and in particular, to a real-time voice conversion method, apparatus, electronic device, and medium.
Background
Voice Conversion is a technology that changes the timbre of a speaker into that of another person while keeping the speech content unchanged. That is, on the basis of keeping the original semantics of the voice unchanged, certain characteristics of the voice are changed, specifically including the speaker's timbre, style, accent, and the like.
In real life, voice conversion technology has many practical applications, such as voice changers, video dubbing, voice assistants and personalized conversion, and has broad prospects in helping disabled people with phonation disorders recover damaged voices. However, in practical applications, voice conversion is still limited by data size, computing resources, real-time rate, conversion effect, and the like. Specifically, existing solutions mainly perform whole-segment conversion based on non-parallel corpora. Conventional voice conversion models can perform conversion only after the entire input speech has been acquired, and therefore cannot meet the low-delay, high-performance requirements of real-time applications.
Disclosure of Invention
In view of this, an object of the present application is to provide a real-time voice conversion method, apparatus, electronic device and medium, which can implement low-delay streaming reasoning, thereby implementing low-delay and high-performance real-time voice conversion.
The embodiment of the application provides a real-time voice conversion method, which comprises the following steps:
intercepting first voice data meeting voice segmentation conditions from voice data of a source speaking object recorded in real time;
processing the first voice data, and extracting first semantic information of the first voice data;
inputting the first semantic information into a pre-trained voice conversion model, and converting the first semantic information and effective information of previous historical voice data of the first voice data through the voice conversion model to obtain target voice characteristic information corresponding to the first semantic information and voice factors of a target speaking object; wherein the effective information is information influencing the voice conversion of the first semantic information;
and reconstructing the target voice characteristic information to obtain second voice data after the first voice data is converted.
In some embodiments, there is also provided a real-time speech conversion apparatus, the apparatus comprising:
the intercepting module is used for intercepting first voice data meeting voice segmentation conditions from voice data of a source speaking object recorded in real time;
the extraction module is used for processing the first voice data and extracting first semantic information of the first voice data;
the conversion module is used for inputting the first semantic information into a pre-trained voice conversion model and converting the first semantic information and effective information of previous historical voice data of the first voice data through the voice conversion model to obtain target voice characteristic information corresponding to the first semantic information and voice factors of a target speaking object; the effective information is information influencing the voice conversion of the first semantic information;
and the reconstruction module is used for reconstructing the target voice characteristic information to obtain second voice data after the first voice data is converted.
In some embodiments, there is also provided an electronic device comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is operating, the machine-readable instructions being executable by the processor to perform the steps of the real-time speech conversion method.
In some embodiments, a computer-readable storage medium is also provided, having stored thereon a computer program which, when executed by a processor, performs the steps of the real-time speech conversion method.
Based on this, embodiments of the present application provide a real-time voice conversion method, apparatus, electronic device and medium, in which the voice recorded in real time is input into the voice conversion model in segments rather than as a whole, reducing the delay problem in real-time applications; first semantic information of the segmented first voice data is then recognized, the first semantic information and the effective information of the historical voice data preceding the first voice data are converted by the pre-trained voice conversion model to obtain target voice characteristic information corresponding to the first semantic information and the voice factor of the target speaking object, and finally the target voice characteristic information is reconstructed to obtain the second voice data converted from the first voice data.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
FIG. 1 is a flow chart of a method of real-time voice conversion according to an embodiment of the present application;
fig. 2 is a flowchart illustrating a method for processing the first voice data to extract first semantic information of the first voice data according to an embodiment of the present application;
FIG. 3 is a flow chart of a training method according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a training phase of a training method according to an embodiment of the present application;
FIG. 5 is a flowchart of a method for performing a conversion process on the first semantic information and valid information of previous historical voice data of the first voice data through the voice conversion model according to an embodiment of the present application;
FIG. 6 is a diagram illustrating a structure of a speech conversion model according to an embodiment of the present application;
fig. 7 is a flowchart of a method for obtaining target speech characteristic information corresponding to first semantic information and speech factors of a target speaker object by performing conversion processing on the first semantic information and valid information of previous historical speech data of the first speech data through the speech conversion model according to the embodiment of the present application;
FIG. 8 is a flowchart illustrating a method for generating, by an encoder of a speech conversion model according to the embodiment of the present application, a target speech vector feature corresponding to first semantic information and a target speech factor vector according to the target speech factor vector and the first semantic information and second valid information of previous historical speech data of the first speech data;
FIG. 9 is a flowchart illustrating a method for obtaining target speech feature information corresponding to first semantic information and speech factors of a target speaker by processing the target speech vector feature and third valid information of the previous historical speech data of the first speech data through the decoder according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a real-time speech conversion apparatus according to an embodiment of the present application;
fig. 11 shows a schematic structural diagram of the electronic device.
Detailed Description
In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it should be understood that the drawings in the present application are for illustrative and descriptive purposes only and are not used to limit the scope of protection of the present application. Further, it should be understood that the schematic drawings are not drawn to scale. The flowcharts used in this application illustrate operations implemented according to some embodiments of the present application. It should be understood that the operations of the flow diagrams may be performed out of order, and steps without logical context may be performed in reverse order or simultaneously. One skilled in the art, under the guidance of this application, may add one or more other operations to, or remove one or more operations from, the flowchart.
In addition, the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as presented in the figures, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that in the embodiments of the present application, the term "comprising" is used to indicate the presence of the features stated hereinafter, but does not exclude the addition of further features.
Speech, as the external expression of language, is the most natural form of communication in our daily lives. Speech not only carries the content of the language but also conveys multi-dimensional information such as the speaker's timbre, emotion, and style. Voice Conversion is a technology that changes the timbre of a speaker into that of another person while keeping the speech content unchanged. That is, on the basis of keeping the original semantics of the voice unchanged, certain characteristics of the voice are changed, specifically including the speaker's timbre, style, accent, and the like.
With the development of deep learning, voice conversion technology has also made remarkable progress, and voice conversion models based on deep neural networks have been significantly improved in conversion similarity and naturalness. The emergence of neural vocoders enables the output speech of voice conversion to approach the quality of a real recording.
In real life, the voice conversion technology has a plurality of practical applications, such as a sound changer, a video dubbing, a voice assistant, personalized conversion, and the like, and has a wide development prospect in helping disabled people with sound production disorder to recover damaged voice. However, in practical applications, voice conversion is still limited by data size, computing resources, real-time rate, conversion effect, and the like. By improving the real-time reasoning and optimization of the model, the voice conversion model has the streaming reasoning capability, the model reasoning efficiency is improved, and the method can be used in scenes such as real-time conversation voice change, privacy protection and the like.
In the existing voice conversion technology, the flow of voice conversion mainly comprises three steps of feature analysis and extraction, feature conversion and voice synthesis. The feature analysis and extraction is to extract feature information representing features of the voice data, such as mel spectrum of the voice data, from the voice data of the source speaking object. The feature conversion is to convert feature information of the voice data feature into feature information of a voice factor corresponding to the target speaking object, for example, a mel spectrum of the voice factor corresponding to the target speaking object. The speech synthesis is to recombine the converted feature information into acoustic audio features.
Voice conversion can be classified into parallel-corpus methods and non-parallel-corpus methods according to the characteristics of the training data. The parallel-corpus method uses data containing recordings of the same semantic content by both the source speaker and the target speaker; in contrast, non-parallel data does not need to contain recordings of the same content by different speakers. The move from the parallel-corpus method to the non-parallel-corpus method solves the problems of parallel corpus data being difficult to obtain and small in volume.
The voice conversion method based on parallel corpora uses parallel recordings with the same text, containing both the source speaker and the target speaker, for training. This method mainly performs frame-level alignment of the source and target speakers' voices using Dynamic Time Warping (DTW) to obtain a training set, then trains a voice conversion model on this training set to model the feature mapping between the source and target speakers. Commonly used methods include the conventional Gaussian Mixture Model (GMM), vector quantization, instantiated unit selection, non-negative matrix factorization, partial least squares regression, and the like. The attention mechanism was proposed in 2014 and has since been widely used in fields such as images and natural language processing. In 2019, Google proposed the Parrotron model, which aligns parallel corpora based on an attention mechanism. Because the parallel-corpus method is basically one-to-one voice conversion, its practicality is low; moreover, parallel data of two or more speakers must be available at the same time, so the cost is too high and it is now rarely used.
Non-parallel voice conversion refers to methods that do not require parallel corpus data for conversion. There are two common approaches: the first performs phoneme-level alignment between the source speaker's and target speaker's audio in a non-parallel corpus, turning the problem into parallel-corpus voice conversion; the second uses audio data containing only the target speaker. Mainstream non-parallel voice conversion methods currently include methods using speech recognition features, VAE (Variational Auto-Encoder), GAN (Generative Adversarial Network), and the like, and the frameworks of these methods differ considerably. The core idea of methods based on the phonetic posteriorgram (PPG) feature from speech recognition is to use speech recognition techniques to extract, from the audio, the posterior probabilities of phonetic units such as phones, words and triphone sets. The posterior probabilities are at frame level, so the duration information of the source speech is preserved while speaker-related information is removed to a certain degree, realizing the decoupling of semantic content from speaker timbre and other attributes. The voice conversion model learns the mapping between the posterior probabilities and the spectral features, and audio is finally generated through the vocoder. Compared with parallel-corpus methods, voice conversion based on non-parallel corpora does not require parallel recordings, which are difficult to obtain; it can make wide use of various datasets and can meet the conversion requirements of various scenarios.
Conventional voice conversion models can perform conversion only after the entire input speech has been acquired, and cannot meet the low-delay, high-performance requirements of real-time applications. Based on this, embodiments of the present application provide a real-time voice conversion method, apparatus, electronic device and medium, in which the voice recorded in real time is input into the voice conversion model in segments rather than as a whole, reducing the delay problem in real-time applications; first semantic information of the segmented first voice data is then recognized, the first semantic information and the effective information of the historical voice data preceding the first voice data are converted by the pre-trained voice conversion model to obtain target voice characteristic information corresponding to the first semantic information and the voice factor of the target speaking object, and finally the target voice characteristic information is reconstructed to obtain the second voice data converted from the first voice data.
It should be noted that the real-time speech conversion method provided in the embodiments of the present application can be implemented based on artificial intelligence. Artificial Intelligence (AI) is a comprehensive discipline that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, machine learning/deep learning, and the like.
In the embodiment of the present application, the artificial intelligence technologies mainly involved include Automatic Speech Recognition (ASR), Voice Conversion (VC), and the like.
The voice conversion provided by the embodiment of the application can be applied to a processing device, which may be a terminal device or a server. The processing device may have the capability to perform speech recognition and speech conversion. In this embodiment, by implementing the voice conversion technology, the processing device may input the voice recorded in real time into the voice conversion model in segments, so as to realize the functions of recognizing the first semantic information of the segmented voice, obtaining target voice feature information corresponding to the first semantic information and the voice factor of the target speaking object, and processing the target voice feature information to obtain the converted second voice data for playing.
The processing device may be a terminal device, such as a smart terminal, a computer, a Personal Digital Assistant (PDA), a tablet computer, and the like.
The processing device may also be a server, such as a stand-alone server or a cluster server. When performing real-time voice conversion, the server recognizes the first semantic information of the segmented first voice data, converts the first semantic information and the effective information of the historical voice data preceding the first voice data through the pre-trained voice conversion model to obtain target voice characteristic information corresponding to the first semantic information and the voice factor of the target speaking object, and finally reconstructs the target voice characteristic information to obtain the second voice data converted from the first voice data, which is stored or sent to a terminal device for playing.
The real-time voice conversion method provided by the embodiment of the application can be applied to various application scenes suitable for voice conversion, such as live broadcast, real-time call, video recording and the like, and the interestingness in voice call is increased. Under the scenes, the method provided by the embodiment of the application can be used for converting the voice data of the source speaking object recorded in real time into the voice data of the target speaking object with high quality and low delay.
A method, an apparatus, a device and a medium for real-time voice conversion provided in the embodiments of the present application are described in detail below.
Referring to fig. 1, fig. 1 is a flowchart of a real-time voice conversion method according to an embodiment of the present application, where the real-time voice conversion method includes the following steps S101-S104:
s101, intercepting first voice data meeting a voice segmentation condition from voice data of a source speaking object recorded in real time;
s102, processing the first voice data, and extracting first semantic information of the first voice data;
s103, inputting the first semantic information into a pre-trained voice conversion model, and converting the first semantic information and effective information of previous historical voice data of the first voice data through the voice conversion model to obtain target voice characteristic information corresponding to the first semantic information and voice factors of a target speaking object; the effective information is information influencing the voice conversion of the first semantic information;
s104, reconstructing the target voice characteristic information to obtain second voice data after the first voice data are converted.
The embodiment of the application provides a real-time voice conversion method, which is characterized in that real-time recorded voice is input into a voice conversion model in a segmented mode instead of being input in a whole segment, so that the delay problem in real-time application is reduced; then, recognizing first semantic information of segmented first voice data, converting and processing the first semantic information and effective information of historical voice data before the first voice data through a pre-trained voice conversion model to obtain target voice characteristic information corresponding to the first semantic information and voice factors of a target speaking object, and finally reconstructing the target voice characteristic information to obtain second voice data after the first voice data is converted.
Specifically, in step S101, first voice data meeting a voice segmentation condition is intercepted from voice data of a source speaking object recorded in real time; the first voice data meeting the voice segmentation condition at least comprises one of the following:
recording first voice data with the time length reaching the preset segmentation time length in real time;
recording first voice data with the frame number reaching a preset frame number threshold in real time;
first voice data intercepted when a preset segmentation instruction is received;
and first voice data intercepted when the recording ends.
Step S101 segments the voice data of the source speaking object recorded in real time. The voice segmentation condition may be a single condition or a combination of several conditions.
For example, when intercepting the first voice data of the source speaking object recorded in real time, it may be set to intercept a 5s segment of first voice data only when the recording duration reaches 5s.
Alternatively, in some embodiments, a set of recording durations 5s, 8s, 10s may be set; when the recording duration reaches 5s, a segment of first voice data 5s long is intercepted; timing is restarted, and when the recording duration reaches 8s, a segment of first voice data 8s long is intercepted; timing is restarted again, and when the recording duration reaches 10s, a segment of first voice data 10s long is intercepted, and so on in a cycle.
Alternatively, in some embodiments, it may be arranged that: when the recording duration reaches 5s, intercepting a section of first voice data with the length of 5 s; and when the recording is finished, regardless of the recording duration, intercepting the first voice data from the last interception time point to the recording ending time point.
Alternatively, in some embodiments, it may be arranged that: when the number of recording frames reaches 5 frames, intercepting a section of first voice data of 5 frames; when the recording is finished, regardless of the number of recording frames, the first voice data from the last time point of the capturing to the end point of the recording is captured.
Alternatively, in some embodiments, a segmentation instruction may be generated by an interception operation of a user, and the first voice data may be intercepted when a preset segmentation instruction is received in real time. The intercepting operation of the user may be an intercepting operation of the recording device for the source speaking object, and specifically, the intercepting operation may be clicking, touching, shortcut key, and the like.
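For readability only, the following minimal sketch illustrates how segmentation conditions of the kinds listed above could be checked on a real-time audio stream; the class and parameter names (SegmentCutter, max_seconds, max_frames) are assumptions for illustration and are not taken from the application.

```python
# Hypothetical sketch of the voice segmentation conditions described above.
class SegmentCutter:
    def __init__(self, sample_rate=16000, max_seconds=5.0, max_frames=None):
        self.sample_rate = sample_rate
        self.max_seconds = max_seconds    # duration-based condition
        self.max_frames = max_frames      # frame-count-based condition
        self.buffer = []                  # frames of the segment being accumulated
        self.num_samples = 0

    def push(self, frame, cut_requested=False, recording_ended=False):
        """Append one audio frame; return a finished segment, or None."""
        self.buffer.append(frame)
        self.num_samples += len(frame)
        duration = self.num_samples / self.sample_rate
        if (duration >= self.max_seconds                                      # preset duration reached
                or (self.max_frames and len(self.buffer) >= self.max_frames)  # preset frame count reached
                or cut_requested                                              # segmentation instruction received
                or recording_ended):                                          # recording finished
            segment, self.buffer, self.num_samples = self.buffer, [], 0
            return segment
        return None
```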
In step S102, the first voice data is processed, and first semantic information of the first voice data is extracted.
In the embodiment of the present application, referring to fig. 2, processing the first voice data, and extracting first semantic information of the first voice data includes the following steps S201 to S202;
s201, obtaining a pre-trained voice recognition model;
s202, inputting the first voice data into the voice recognition model, performing tone decoupling on the first voice data through the voice recognition model, removing noise in the first voice data, and extracting semantic information of the first voice data.
The voice recognition model is used for recognizing semantic information of the first voice data.
In the embodiment of the present application, the Speech Recognition model is also called ASR (Automatic Speech Recognition) model.
In the embodiment of the application, in the training stage of the Speech Recognition model, the data of multiple speakers are used for training a Speaker Independent Speech Recognition model (SI-ASR).
That is to say, the speech recognition model in the embodiment of the present application is based on a speech conversion scheme of non-parallel corpora, and an ASR model is used to extract semantic information. The output characteristics of the ASR model only contain semantic information, which is equivalent to the decoupling of the semantic of the source audio and other information (including speaker timbre), thereby improving the similarity of conversion results. Therefore, the voice conversion uses the semantic recognition result output by the voice recognition model as the input of the voice conversion model, so that the requirement on data is reduced, and the conversion effect is greatly improved.
It should be understood that the first semantic information in the embodiment of the present application is not text information but a bottleneck feature, represented in vector form. What matters most in the voice conversion method of the embodiment of the present application is real-time performance, and the steps need to be simplified as much as possible to improve the real-time performance of voice conversion and reduce the time delay. Therefore, in the ASR part, the main steps are to extract Fbank features from the first voice data, run inference with the speech recognition model, and take the bottleneck feature of the penultimate layer; the first voice data does not need to be converted into text information.
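As a non-authoritative sketch of this step, the fragment below extracts Fbank features and takes penultimate-layer bottleneck features from a recognizer; the assumption that the ASR model returns its bottleneck activations alongside its logits, and all names, are illustrative only.

```python
# Hypothetical sketch: Fbank extraction followed by bottleneck-feature (BNF) extraction,
# with no decoding of the first voice data into text.
import torch
import torchaudio

def extract_bnf(waveform, sample_rate, asr_model):
    # waveform: (channels, samples); 80-dim filter-bank features are an assumed setting
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, num_mel_bins=80, sample_frequency=sample_rate)   # (frames, 80)
    with torch.no_grad():
        # asr_model is assumed to expose (logits, penultimate-layer bottleneck features)
        _, bottleneck = asr_model(fbank.unsqueeze(0))
    return bottleneck   # frame-level semantic features used as first semantic information
```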
The speech recognition model can also remove noise in the first speech data, and illustratively, in a training stage of the speech recognition model, the speech recognition model is trained by using multi-speaker data carrying noise, so that the speech recognition model has denoising capability, the noise in the first speech data can be removed, more accurate semantic information of the first speech data can be extracted, and the conversion effect is improved.
In step S103, the first semantic information is input into a pre-trained speech conversion model, and the first semantic information and valid information of previous historical speech data of the first speech data are converted by the speech conversion model, so as to obtain target speech feature information corresponding to the first semantic information and speech factors of a target speaker; and the effective information is information influencing the voice conversion of the first semantic information.
In the embodiment of the present application, referring to fig. 3, fig. 3 shows a flowchart of a training method in the embodiment of the present application; specifically, the speech conversion model is trained through steps S301 to S304:
s301, obtaining a pre-trained voice recognition model;
s302, third voice data of a target speaking object are obtained, and target voice characteristic information of the third voice data is extracted;
s303, extracting third voice data semantic information through the pre-trained voice recognition model;
s304, inputting the semantic information and the target voice characteristic information of the third voice data into a pre-established voice conversion model, and training the voice conversion model until the voice conversion model meets the training completion condition.
Here, the training completion condition includes the number of training iterations reaching a preset number, a detection value of the voice conversion model reaching a preset detection value, and the like. The detection value of the speech conversion model is determined by a loss function.
And when the voice conversion model meets the training completion condition, stopping training to obtain the trained voice conversion model.
In an embodiment of the present application, the target speech feature information is a mel spectrum.
That is, referring to fig. 4, a speech conversion system capable of executing the speech conversion method of the present application needs to be trained in two stages:
the first stage is as follows: speech Recognition model training this stage trains a Speaker Independent Speech Recognition model (SI-ASR) using multi-Speaker data.
The second stage is as follows: and (3) training a voice conversion model, namely extracting BNFs (semantic information of the audio) of the audio in a training set by using the SI-ASR model trained in the first stage, extracting corresponding Mel spectrum characteristics, and training the voice conversion model to learn the relationship between the BNFs and the Mel spectrum.
Referring to fig. 4, after obtaining a trained speech recognition model and a trained speech conversion model through a first stage and a second stage, entering a conversion stage, in the conversion stage, first inputting first speech data to be converted into an ASR model to obtain BNFs, then inputting the BNFs into the speech conversion model to obtain a mel spectrum, and finally reconstructing audio through a vocoder to obtain converted speech.
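To make the conversion stage of fig. 4 concrete, a minimal sketch is given below, reusing the extract_bnf sketch above; the module interfaces (a vc_model accepting a speaker_id, a vocoder taking a mel spectrum) are assumptions rather than the application's exact interfaces.

```python
# Hypothetical conversion-stage pipeline: audio -> BNFs -> mel spectrum -> waveform.
import torch

def convert_segment(waveform, sample_rate, asr_model, vc_model, vocoder, speaker_id):
    bnf = extract_bnf(waveform, sample_rate, asr_model)    # stage-1 SI-ASR model (frozen)
    with torch.no_grad():
        mel = vc_model(bnf, speaker_id=speaker_id)         # stage-2 model: BNFs -> mel spectrum
        audio = vocoder(mel)                               # vocoder reconstructs the waveform
    return audio
```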
In order to implement the streaming reasoning of the voice conversion model according to the embodiment of the present application, the streaming capability of the common structure in the deep learning neural network is first analyzed as follows:
linear layer: the reasoning process for the linear layer is relatively simple, i.e. adding an optional offset to a matrix multiplication. The input characteristic dimension is not changed, and the time dimension is changed along with the segment length of the input data, but the inference result is not influenced. That is, the spliced result after reasoning the data segments has no difference with the whole-segment reasoning result. Therefore, the linear layer in the model can realize the stream-oriented reasoning without any modification in the implementation of the voice conversion model.
Convolutional layer: the convolutional layer has a large influence on streaming because of its padding and receptive field. In a conventional convolutional layer, padding of equal length is usually added at the head and the tail to keep the input and output time dimensions the same. If this strategy is kept during streaming reasoning, it is equivalent to inserting a blank between two consecutively intercepted pieces of first voice data, which causes data discontinuity. The discontinuities appear as bright lines on the spectrogram and, in listening tests, as glitches at the splice points between the second voice data converted from different pieces of first voice data.
Meanwhile, in whole-segment synthesis all data are visible, whereas in streaming reasoning future data obviously cannot be obtained, so causal convolution must be adopted.
In the embodiment of the application, the specific strategy is: during training, all padding is placed at the head of the data; during reasoning, padding is added only at the head of the first segment, and for all subsequent segments no padding is added, but the effective information of the historical voice data preceding the first voice data is added instead, so that streaming reasoning achieves the same effect as whole-segment reasoning.
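A minimal sketch of this strategy follows, assuming stride 1 and a single channel count; the class name and the convention of returning a per-layer cache are illustrative, not the application's implementation.

```python
import torch

class StreamingCausalConv1d(torch.nn.Module):
    """Causal 1-D convolution whose left padding for each chunk is the cached tail
    of the previous chunk's input, so chunked and whole-segment outputs match."""
    def __init__(self, channels, kernel_size):
        super().__init__()
        self.conv = torch.nn.Conv1d(channels, channels, kernel_size)
        self.pad = kernel_size - 1            # frames of history needed on the left

    def forward(self, x, cache=None):
        # x: (batch, channels, time); cache: (batch, channels, self.pad) or None
        if self.pad == 0:                     # kernel_size 1 needs no history
            return self.conv(x), x[..., :0]
        if cache is None:                     # first segment: pad the head with zeros
            cache = x.new_zeros(x.size(0), x.size(1), self.pad)
        x = torch.cat([cache, x], dim=-1)     # prepend effective history instead of blanks
        new_cache = x[..., -self.pad:]        # tail frames the next segment will need
        return self.conv(x), new_cache
```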
Transposed convolution: the situation of the transposed convolution is similar, and the head and tail padding are likewise all moved to the head, but because the convolution method is different, some modifications are needed in the padding calculation.
RNN: the autoregressive structure of an RNN makes it naturally suited to streaming reasoning. However, bidirectional RNNs such as BGRU and Bi-LSTM require a backward pass from the tail to the head of the input data, i.e. they need future information to reason, so they obviously cannot be streamed; they must be changed to unidirectional and reason only with the effective information of the historical voice data.
Attention: with the introduction of the attention mechanism, Transformers are widely used in various models. To realize streaming reasoning, segment-based attention must be used; this essentially limits the scope of the attention, and it is implemented through an attention mask. Unlike whole-segment attention, segmented attention masks out future information and only looks at the current segment, the current segment plus the previous segment, or all historical segments.
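A small illustrative helper (assumed, not from the application) shows how such a segment-limited attention mask can be built: each frame may attend to its own chunk, optionally a limited number of previous chunks, and never to future chunks.

```python
import torch

def chunk_attention_mask(num_frames, chunk_size, num_left_chunks=-1):
    """True means 'may attend'. num_left_chunks = -1 allows all historical chunks;
    0 allows only the current chunk; 1 allows the current and the previous chunk."""
    mask = torch.zeros(num_frames, num_frames, dtype=torch.bool)
    for i in range(num_frames):
        chunk = i // chunk_size
        end = min((chunk + 1) * chunk_size, num_frames)             # end of current chunk
        start = 0 if num_left_chunks < 0 else max(0, (chunk - num_left_chunks) * chunk_size)
        mask[i, start:end] = True                                   # future chunks stay masked
    return mask
```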
Based on this, in the embodiment of the present application, please refer to fig. 5, and fig. 5 is a flowchart illustrating a method for performing conversion processing on the first semantic information and the valid information of the previous historical voice data of the first voice data by using the voice conversion model in the embodiment of the present application, where the method includes the following steps S501 to S502:
s501, acquiring second semantic information of a previous section of historical voice data of the first voice data converted and processed by the voice conversion model, and acquiring first effective information output by each convolution layer when target voice characteristic information of voice factors of a target speaking object corresponding to the second semantic information is acquired; wherein the second semantic information is extracted from a previous segment of historical voice data of the first voice data;
s502, the first semantic information is sequentially input into each convolution layer of the voice conversion model, first effective information corresponding to the convolution layer is added to the head of input data of each convolution layer, and the first semantic information and the effective information of the previous historical voice data of the first voice data are converted through the voice conversion model.
That is to say, in the embodiment of the present application, the historical speech data is a previous segment of the historical speech data of the first speech data, that is, a segment of the historical speech data which is prior to the first speech data in the interception time and is closest to the first speech data.
Here, the second semantic information is extracted from a previous piece of history voice data of the first voice data; the first semantic information is extracted from the first speech data. Or, the second semantic information corresponds to a previous section of historical voice data of the first voice data; the first semantic information corresponds to first voice data.
Here, the valid information includes at least one type of valid information, i.e., first valid information.
It should be understood that the first valid information output by each convolutional layer is the data that convolutional layer needs in order to process the first semantic information of the next segment of first voice data; it is not the processing result of that convolutional layer on the first semantic information.
For example, the first valid information may be tail data of a preset number of frames of input data of each convolution layer.
Here, it should be noted that during inference, only the first segment of first voice data has no previous historical voice data, so the first semantic information of that first segment is sequentially input into each convolutional layer of the voice conversion model with zeros added at the head of each convolutional layer's input data, ensuring that the input and output dimensions of the convolutional layers are the same.
In the embodiment of the present application, the first semantic information is sequentially input into each convolutional layer of the voice conversion model, and the first effective information corresponding to that convolutional layer is added at the head of its input data. Adding the effective information of the historical voice data keeps the input and output dimensions of the convolutional layers the same, and because no blank is inserted between two segments of first voice data, there is no data discontinuity between them: no bright line appears on the spectrogram, and in listening tests no glitch occurs at the audio splice points.
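The chunk-by-chunk use of this first effective information can be pictured with the hypothetical loop below; vc_model.infer_chunk, which returns both the converted features and the per-layer caches, is an assumed interface rather than the application's API.

```python
import torch

def streaming_convert(bnf_chunks, vc_model, speaker_id):
    caches = None                       # first segment: the model pads layer heads with zeros
    outputs = []
    for bnf in bnf_chunks:              # bnf: first semantic information of one voice segment
        with torch.no_grad():
            mel, caches = vc_model.infer_chunk(bnf, speaker_id, caches)  # caches fed back in
        outputs.append(mel)
    return torch.cat(outputs, dim=1)    # in principle equal to whole-segment inference
```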
In some embodiments, the speech conversion model is only capable of converting the first speech data to target speech characteristic information for a particular speaking subject.
In some embodiments, the speech conversion model is capable of converting first speech data into target speech feature information for a plurality of different speaking subjects; through selection operation, the voice conversion model determines a target speaking object and converts the first semantic information into target voice characteristic information corresponding to voice factors of the target speaking object.
Specifically, inputting the first semantic information into a pre-trained speech conversion model, and performing conversion processing on the first semantic information and effective information of previous historical speech data of the first speech data through the speech conversion model to obtain target speech feature information corresponding to the first semantic information and speech factors of a target speaking object, including:
determining a target speech factor vector of a target speaking object;
and inputting the first semantic information and the target voice factor vector of the target speaking object into a pre-trained voice conversion model so that the voice conversion model converts the first semantic information and effective information of previous historical voice data of the first voice data based on the target voice factor vector of the target speaking object to obtain target voice characteristic information corresponding to the first semantic information and the voice factor of the target speaking object.
Specifically, the determining the target speech factor vector of the target speaking object includes:
acquiring identification information of a target speaking object;
determining a target voice factor vector of a target speaking object from an association relation table of a pre-trained voice conversion model according to identification information of the target speaking object; wherein, the incidence relation table represents the incidence relation between the speaking object and the voice factor vector.
Through the selection operation of inputting the identification information of the target speaking object, the voice conversion model determines the target speaking object, so that the flexibility of the voice conversion model is improved, and the voice conversion model is convenient to be applied to various occasions and develop various applications.
The identification information of the target speaking object can be the number, ID, keyword, name, attribute, etc. of the target speaking object. Illustratively, the plurality of speaking subjects are, for example: songshen schoolmate (ID 001), guo de (ID 002), song Xiaobao (ID 003), wang Fei (ID 004).
Inputting 001 determines the speaker with ID 001 as the target speaking object; likewise, inputting a speaker's name or a keyword associated with a speaker determines that speaker as the target speaking object.
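The association relation between identification information and speech factor vectors can be pictured as a learned embedding table, as in the assumed sketch below (names and dimensions are illustrative, not taken from the application).

```python
import torch

class SpeakerTable(torch.nn.Module):
    """Illustrative association table: maps a target speaking object's identification
    information to its trained target speech factor vector."""
    def __init__(self, speaker_ids, dim=256):
        super().__init__()
        self.index = {sid: i for i, sid in enumerate(speaker_ids)}   # e.g. {"001": 0, "002": 1}
        self.embedding = torch.nn.Embedding(len(speaker_ids), dim)   # trained with the model

    def forward(self, speaker_id):
        row = torch.tensor([self.index[speaker_id]])
        return self.embedding(row)                                   # (1, dim) speech factor vector
```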
Referring to fig. 6, fig. 6 is a schematic structural diagram of the speech conversion model. CBHG is used as the encoder of the model and comprises a one-dimensional convolution bank (Convolution Bank), a highway network, and a bidirectional gated recurrent unit (GRU); the CBHG module has strong modeling capability for sequence information and is well suited to the voice conversion task.
The AR part is the decoder, which includes four modules: Prenet, GRU, a linear layer and Postnet. At each decoding step, the outputs of the Prenet and the CBHG are concatenated with the GRU's past hidden-state information, the concatenated result is fed into the GRU, and a new result is obtained through the linear layer.
To implement streaming reasoning, ordinary convolution should be replaced by causal convolution: padding is added only to the left side, i.e. the head, of the input data, so no future information is used. Meanwhile, in addition to producing the inference result, the model should also return the cache of each convolutional layer for the next inference; the cache holds the last N frames of the current layer's input, where N is the receptive field size of the current layer.
Meanwhile, since the CBHG uses bidirectional GRUs, the bidirectional GRUs must be changed to unidirectional, that is, inference relies only on historical voice data and not on future voice data. Here, to preserve the effect as much as possible, the number of RNN units is doubled, so that the modified unidirectional GRU keeps the same parameter count as the original bidirectional GRU.
In the training phase, the data fed into the speech conversion model need not be changed in any way; training proceeds the same as for an ordinary model. During reasoning, padding is needed for the first chunk to keep the input and output lengths the same, and the cache returned by the model is saved. When reasoning over a subsequent chunk, the cache generated in the previous step is input into the model together with the input data to obtain the new output. With the effective information of the historical voice data used in this way, the segmented reasoning result is essentially identical to the whole-segment reasoning result, and the second voice data converted from different segments can be played continuously without pauses.
To reduce the drop in quality of the streaming model compared with the non-streaming model, a strategy of looking at future information is adopted. Since the last layer of the speech conversion model, Postnet, performs fine adjustment on the generated mel spectrum to improve its quality, some future information can be considered in this layer. Causal convolution is completely independent of future information and is implemented by padding enough zeros on the left of the real input data so that one frame of input produces one frame of output. If n frames of future information need to be seen, n of the zeros on the left of the input data are moved to the right side; that is, an input of length n+1 produces one frame of result, so n frames of future information are seen in a practical sense. In the embodiment of the present application, the value of n is exemplarily 1.
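The padding rearrangement described above can be sketched as follows (an assumed helper for a stride-1, dilation-1 convolution; not the application's code): moving n of the causal left zeros to the right lets each output frame see n future frames while keeping the output length equal to the input length.

```python
import torch
import torch.nn.functional as F

def lookahead_conv(conv, x, n=1):
    """Apply a 1-D convolution with (kernel_size - 1 - n) zeros on the left and n on the
    right, so each output frame may look at n future frames. Requires n <= kernel_size - 1."""
    k = conv.kernel_size[0]
    x = F.pad(x, (k - 1 - n, n))   # shift n of the causal left pads to the right side
    return conv(x)                 # output length equals input length

# Example (assumed shapes): conv = torch.nn.Conv1d(80, 80, 5); y = lookahead_conv(conv, x, n=1)
```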
Based on the analysis, please refer to fig. 7, the first semantic information and the effective information of the previous historical voice data of the first voice data are converted through the voice conversion model, so as to obtain target voice characteristic information corresponding to the first semantic information and the voice factor of the target speaking object; comprising the following steps S701-S704;
s701, acquiring a target voice factor vector of a target speaking object;
s702, processing the target voice factor vector, the first semantic information and second effective information of the previous historical voice data of the first voice data through an encoder of a voice conversion model, and generating target voice vector characteristics corresponding to the first semantic information and the target voice factor vector;
s703, outputting the target speech vector characteristics to a decoder of the speech conversion model;
s704, the target voice vector characteristics and third effective information of the previous historical voice data of the first voice data are processed through the decoder to obtain target voice characteristic information corresponding to the first semantic information and the voice factors of the target speaking object.
Here, the target speech factor vector represents the speech factor of the target speaking object, that is, represents the speech characteristics of the target speaking object, such as the timbre characteristics and tone characteristics of the target speaking object when speaking.
The second effective information is hidden state information generated when an encoder of the voice conversion model processes historical voice data; the third effective information is hidden state information generated when a decoder of the voice conversion model processes historical voice data.
The hidden-state information of the encoder and the decoder is output during their past processing and is used to support the reasoning of the voice conversion model. For example, in the sentence "I am thirsty and want to drink water", it is difficult to infer "drink water" from "want" alone; but combining "I am thirsty and want to" greatly increases the probability of inferring "drink water".
Based on the above, the first semantic information is input into the pre-trained voice conversion model, and the first semantic information and the effective information of the historical voice data preceding the first voice data are converted by the voice conversion model to obtain target voice characteristic information corresponding to the first semantic information and the voice factor of the target speaking object, where the effective information comprises multiple types of effective information. In the embodiment of the application, it specifically includes the first effective information, the second effective information and the third effective information. The first effective information is output by each convolutional layer when processing the previous segment of historical voice data and is added, as padding, to the head of that layer's current input data, which avoids padding the head of each convolutional layer's input with zeros and thus solves the problem of unsmooth audio after segmented real-time voice conversion. The second effective information is the hidden-state information of the encoder of the voice conversion model when processing historical voice data. The third effective information is the hidden-state information of the decoder of the voice conversion model when processing historical voice data; here the historical voice data is not limited to the immediately preceding segment. The voice conversion model is a streaming reasoning model: the second effective information improves the accuracy of the encoder's reasoning results, and the third effective information improves the accuracy of the decoder's reasoning results.
Based on the voice conversion model, the effect of real-time voice conversion is improved from multiple aspects.
In some embodiments, referring to fig. 8, generating, by an encoder of a speech conversion model, a target speech vector feature corresponding to first semantic information and a target speech factor vector according to the target speech factor vector and the first semantic information and second valid information of previous historical speech data of the first speech data, includes the following steps S801-S802;
s801, acquiring hidden state information generated in the process of processing historical voice data by an encoder of a voice conversion model as second effective information;
s802, processing the target voice factor vector, the second effective information and the first semantic information through an encoder of a voice conversion model, and generating target voice vector characteristics corresponding to the first semantic information and the target voice factor vector.
That is, the historical speech data plays a role in reasoning besides the convolutional layer, so that the encoder can perform reasoning with reference to the historical speech data, and thus, the reasoning result is more accurate, and the speech conversion quality is improved.
In some embodiments, referring to fig. 9, processing, by the decoder, the target speech vector feature and third valid information of the previous historical speech data of the first speech data to obtain target speech feature information corresponding to the first semantic information and the speech factor of the target speaker object, includes the following steps S901-S902;
s901, acquiring hidden state information generated in the process of processing historical voice data by a decoder of a voice conversion model as third effective information;
s902, processing the target voice vector characteristic and the third effective information through a decoder of the voice conversion model to obtain target voice characteristic information corresponding to the first semantic information and the voice factor of the target speaking object.
That is to say, the historical speech data plays a role in reasoning, and not only can be used for reasoning by referring to the historical speech data through the encoder alone, but also can be used for reasoning by referring to the historical speech data through the decoder, so that the reasoning result is more accurate, and the speech conversion quality is improved.
Therefore, the converting, by the speech conversion model, the first semantic information and the valid information of the previous historical speech data of the first speech data specifically include:
and converting the first semantic information and effective information of the previous historical voice data of the first voice data through a convolutional layer of the voice conversion model and/or an encoder and/or a decoder.
In some embodiments, processing the target speech vector feature and the third valid information by a decoder of the speech conversion model to obtain target speech feature information corresponding to the first semantic information and the speech factor of the target speaker object, including:
processing the target voice vector feature and the third effective information through a decoder of the voice conversion model to obtain first voice feature information corresponding to first semantic information and voice factor vectors;
and padding a preset number of zero frames at the tail of the first voice characteristic information, inputting the zero-padded first voice characteristic information into the output convolutional layer of the decoder, and processing it through the decoder to obtain the target voice characteristic information.
When the decoder processes the target voice vector feature, 0 is supplemented on the left side of the real input of the decoder, and a strategy of viewing future information is adopted, so that the problem of effect reduction of a streaming model compared with a non-streaming model is reduced, and the quality of real-time voice conversion is further improved.
In step S104, reconstructing the target voice feature information to obtain second voice data obtained by converting the first voice data, including:
and reconstructing the target voice characteristic information through a vocoder to obtain second voice data after the first voice data is converted.
That is, the vocoder synthesizes the Mel spectrum corresponding to the first semantic information and the voice factor of the target speaking object into the second voice data, which serves as the converted audio of the first voice data.
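For illustration only, the following minimal sketch reconstructs a waveform from a Mel spectrum; the embodiment does not name a specific vocoder here, so a Griffin-Lim inversion from torchaudio stands in, whereas a trained neural vocoder would normally give higher quality.
```python
import torch
import torchaudio

n_fft, n_mels, sample_rate = 1024, 80, 16000
inv_mel = torchaudio.transforms.InverseMelScale(n_stft=n_fft // 2 + 1,
                                                n_mels=n_mels,
                                                sample_rate=sample_rate)
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft)

target_mel = torch.rand(n_mels, 120)      # target voice feature information (Mel spectrum)
linear_spec = inv_mel(target_mel)         # approximate linear-scale spectrogram
second_voice = griffin_lim(linear_spec)   # second voice data (waveform) after conversion
print(second_voice.shape)
```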
After reconstructing the target voice feature information to obtain second voice data after the first voice data is converted, the voice conversion method further includes:
determining the playing sequence of the second voice data according to the interception time of the first voice data before the second voice data is converted;
and playing the second voice data according to the determined playing sequence.
That is to say, the plurality of first voice data sequentially intercepted during real-time recording are converted to obtain a plurality of second voice data, and the plurality of second voice data are played in sequence according to the interception times of the first voice data before conversion, so that the effect of whole-segment conversion and whole-segment playing is achieved. Moreover, because the effective information of the previous segment of first voice data participates in the conversion of the current segment of first voice data, the problem of pauses when different segments are played continuously is solved.
Here, the plurality of second voice data may be stored in the server according to a playing sequence, and the plurality of second voice data arranged in sequence is sent to the terminal device according to a preset sending rule for playing; or the second voice data obtained by conversion is sent to the terminal equipment in real time, and the terminal equipment plays the data in sequence according to the determined playing sequence.
Playing the second voice data according to the determined playing sequence means playing the next second voice data immediately after the previous second voice data finishes, so as to achieve the effect of converting the voice data of the source speaking object recorded in real time into the second voice data of the target speaking object in real time and without perceptible delay.
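For illustration only, the following minimal sketch orders converted segments by the interception time of the first voice data from which each one was produced; the class name PlaybackQueue and the use of a priority queue are hypothetical choices rather than part of the embodiment.
```python
import heapq

class PlaybackQueue:
    """Collects converted segments and releases them in interception-time order."""
    def __init__(self):
        self._heap = []  # entries are (interception_time, second_voice_data)

    def push(self, intercept_time, second_voice):
        heapq.heappush(self._heap, (intercept_time, second_voice))

    def pop_in_order(self):
        while self._heap:
            yield heapq.heappop(self._heap)[1]

queue = PlaybackQueue()
queue.push(2.0, "segment intercepted at 2 s")
queue.push(0.0, "segment intercepted at 0 s")
queue.push(1.0, "segment intercepted at 1 s")
for segment in queue.pop_in_order():
    print(segment)   # played back in interception order: 0 s, 1 s, 2 s
```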
Based on the real-time voice conversion method, streaming reasoning can be realized at an RTF less than 0.5, and the conversion effect is not obviously reduced.
Specifically, please refer to Table 1 for the experimental results of the present application.
Table 1

RTF    i7-10700    M1
VC     0.033       0.049
Here, RTF is the Real Time Factor, i.e. the real-time rate; VC denotes the voice conversion method in the embodiment of the present application, that is, the real-time conversion model (including a voice recognition model and a voice conversion model) in the embodiment of the present application; the table shows the real-time rate of the voice conversion method in the embodiment of the present application on two CPUs (M1 and i7-10700). As Table 1 shows, the real-time voice conversion method described in the embodiment of the present application can implement streaming reasoning at an RTF less than 0.5, ensuring the real-time performance of voice conversion.
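For illustration only, the following minimal sketch computes an RTF figure of the kind reported in Table 1, assuming RTF is measured as processing time divided by audio duration; the converter used below is a dummy stand-in.
```python
import time

def real_time_factor(convert_fn, audio, sample_rate):
    """RTF = processing time / audio duration; values well below 0.5, as in Table 1,
    indicate the converter comfortably keeps up with real time."""
    start = time.perf_counter()
    convert_fn(audio)
    elapsed = time.perf_counter() - start
    return elapsed * sample_rate / len(audio)

dummy_audio = [0.0] * 16000                       # one second of 16 kHz audio
print(real_time_factor(lambda x: sum(x), dummy_audio, 16000))
```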
Based on the same inventive concept, a real-time voice conversion apparatus corresponding to the real-time voice conversion method in the embodiment of the present application is also provided in the embodiment of the present application, and as the principle of solving the problem of the real-time voice conversion apparatus in the embodiment of the present application is similar to that of the real-time voice conversion method in the embodiment of the present application, reference may be made to the implementation of the real-time voice conversion method in the embodiment of the present application, and repeated parts are not described again.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a real-time speech conversion apparatus provided in an embodiment of the present application; the device comprises:
an intercepting module 1001, configured to intercept first voice data that meets a voice segmentation condition from voice data of a source speaker object recorded in real time;
an extracting module 1002, configured to process the first voice data, and extract first semantic information of the first voice data;
a conversion module 1003, configured to input the first semantic information into a pre-trained voice conversion model, and perform conversion processing on the first semantic information and effective information of previous historical voice data of the first voice data through the voice conversion model to obtain target voice feature information corresponding to the first semantic information and a voice factor of a target speaker; wherein the effective information is information influencing the voice conversion of the first semantic information;
a reconstructing module 1004, configured to reconstruct the target voice feature information to obtain second voice data after the first voice data is converted.
The embodiment of the application provides a real-time voice conversion device, which inputs the voice recorded in real time into the voice conversion model segment by segment rather than as a whole segment, thereby reducing the delay problem in real-time application. Then the first semantic information and the effective information of the historical voice data before the first voice data are converted through the pre-trained voice conversion model to obtain target voice characteristic information corresponding to the first semantic information and the voice factor of the target speaking object; finally the target voice characteristic information is reconstructed to obtain the second voice data after the first voice data is converted. Because the effective information of the historical voice data is used, the influence of voice segmentation on the continuity of the whole voice is reduced, the voice conversion model can reason in a continuous, streaming manner, and the converted segmented second voice data can be played continuously, smoothly and with high quality, meeting the high-performance requirement of real-time application.
In some embodiments, the real-time speech conversion apparatus further includes:
the determining module is used for determining the playing sequence of the second voice data according to the interception time of the first voice data before the conversion of the second voice data after the target voice characteristic information is reconstructed to obtain the second voice data after the conversion of the first voice data;
and the playing module is used for playing the second voice data according to the determined playing sequence.
In some embodiments, in the real-time speech conversion apparatus, when the speech conversion model performs conversion processing on the first semantic information and the valid information of the previous historical speech data, the conversion module is specifically configured to:
acquiring second semantic information of a previous section of historical voice data of the first voice data converted and processed by the voice conversion model, and acquiring first effective information output by each convolution layer when target voice characteristic information of voice factors of a target speaking object corresponding to the second semantic information is acquired; wherein the second semantic information is extracted from a previous section of historical voice data of the first voice data;
and sequentially inputting the first semantic information into each convolution layer of the voice conversion model, and adding first effective information corresponding to the convolution layer to the head of input data of each convolution layer so as to convert the first semantic information and the effective information of the previous historical voice data of the first voice data through the voice conversion model.
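For illustration only, the following minimal sketch caches, for every convolution layer, the tail frames it saw in the previous segment and prepends them to the head of the current segment's input, which is the role the first effective information plays above; the class name StreamingConvStack and all sizes are hypothetical.
```python
import torch
import torch.nn as nn

class StreamingConvStack(nn.Module):
    """A stack of 1-D convolutions whose per-layer input tails are cached across segments."""
    def __init__(self, channels=64, layers=3, kernel_size=3):
        super().__init__()
        self.kernel_size = kernel_size
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size) for _ in range(layers))
        self.caches = [None] * layers  # per-layer "first effective information" from the previous segment

    def forward(self, x):
        # x: (batch, channels, frames) semantic features of the current segment
        for i, conv in enumerate(self.convs):
            if self.caches[i] is None:
                # First segment: no history yet, so pad the head with zeros instead.
                head = x.new_zeros(x.shape[0], x.shape[1], self.kernel_size - 1)
            else:
                head = self.caches[i]
            x = torch.cat([head, x], dim=-1)                            # history added to the head of the input
            self.caches[i] = x[..., -(self.kernel_size - 1):].detach()  # keep the tail for the next segment
            x = conv(x)
        return x

stack = StreamingConvStack()
for _ in range(2):                        # two consecutive segments
    segment = torch.randn(1, 64, 40)      # semantic features of one segment
    print(stack(segment).shape)           # torch.Size([1, 64, 40]) for every segment
```
Prepending real history instead of left zero-padding is what keeps the per-segment outputs consistent with those of a whole-segment (non-streaming) convolution.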
In some embodiments, in the real-time speech conversion apparatus, the historical speech data is a previous segment of the historical speech data of the first speech data.
In some embodiments, in the real-time speech conversion apparatus, when the conversion module inputs the first semantic information into a pre-trained speech conversion model, and performs conversion processing on the first semantic information and effective information of previous historical speech data of the first speech data through the speech conversion model to obtain target speech feature information corresponding to the first semantic information and speech factors of a target speaker, the conversion module is specifically configured to:
determining a target speech factor vector of a target speaking object;
and inputting the first semantic information and the target voice factor vector of the target speaking object into a pre-trained voice conversion model so that the voice conversion model carries out conversion processing on the first semantic information and effective information of previous historical voice data of the first voice data based on the target voice factor vector of the target speaking object to obtain target voice characteristic information corresponding to the first semantic information and the voice factor of the target speaking object.
In some embodiments, in the real-time speech conversion apparatus, when determining the target speech factor vector of the target speaker, the conversion module is specifically configured to:
acquiring identification information of a target speaking object;
determining a target voice factor vector of a target speaking object from an association relation table of a pre-trained voice conversion model according to identification information of the target speaking object; wherein the association relation table represents the association relation between the speaking object and the voice factor vector.
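For illustration only, the following minimal sketch models the association relation table as a plain mapping from identification information to a speech factor vector; the entries below are dummy stand-ins for whatever the pre-trained voice conversion model actually stores, and in practice such a table could equally be realized as an embedding layer indexed by a speaker identifier.
```python
import torch

# Dummy association relation table: identification information -> speech factor vector.
association_table = {
    "speaker_a": torch.randn(256),   # speech factor vector learned for speaking object A
    "speaker_b": torch.randn(256),   # speech factor vector learned for speaking object B
}

def target_speech_factor(identification: str) -> torch.Tensor:
    """Return the target speech factor vector for the given identification information."""
    return association_table[identification]

print(target_speech_factor("speaker_a").shape)   # torch.Size([256])
```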
In some embodiments, in the real-time speech conversion apparatus, when the speech conversion model performs conversion processing on the first semantic information and valid information of previous historical speech data of the first speech data to obtain target speech feature information corresponding to the first semantic information and speech factors of a target speaker, the conversion module is specifically configured to:
acquiring a target voice factor vector of a target speaking object;
processing the target voice factor vector, the first semantic information and second effective information of the previous historical voice data of the first voice data through an encoder of a voice conversion model to generate target voice vector characteristics corresponding to the first semantic information and the target voice factor vector;
outputting the target speech vector characteristics to a decoder of a speech conversion model;
and processing the target voice vector characteristics and third effective information of the historical voice data before the first voice data through the decoder to obtain target voice characteristic information corresponding to the first semantic information and the voice factors of the target speaking object.
In some embodiments, in the real-time speech conversion apparatus, the conversion module, when generating, by an encoder of the speech conversion model, the target speech vector feature corresponding to the first semantic information and the target speech factor vector according to the target speech factor vector, the first semantic information, and the second valid information of the previous historical speech data of the first speech data, is specifically configured to:
acquiring hidden state information generated in the process of processing historical voice data by an encoder of a voice conversion model as second effective information;
and processing the target voice factor vector, the second effective information and the first semantic information through an encoder of a voice conversion model to generate target voice vector characteristics corresponding to the first semantic information and the target voice factor vector.
In some embodiments, in the real-time speech conversion apparatus, when the decoder processes the target speech vector feature and third valid information of the previous historical speech data of the first speech data to obtain the target speech feature information corresponding to the first semantic information and the speech factor of the target speaker, the conversion module is specifically configured to:
obtaining hidden state information generated when a decoder of the voice conversion model processes historical voice data as third effective information;
and processing the target voice vector characteristics and the third effective information through a decoder of the voice conversion model to obtain target voice characteristic information corresponding to the first semantic information and the voice factors of the target speaking object.
In some embodiments, in the real-time speech conversion apparatus, when the decoder of the speech conversion model processes the target speech vector feature and the third valid information to obtain target speech feature information corresponding to the first semantic information and the speech factor of the target speaker, the conversion module is specifically configured to:
processing the target voice vector feature and the third effective information through a decoder of the voice conversion model to obtain first voice feature information corresponding to the first semantic information and the voice factor vector;
and adding 0 of a preset data frame at the tail part of the first voice characteristic information, inputting the first voice characteristic information added with 0 into an output convolution layer of a decoder, and processing the first voice characteristic information added with 0 through the decoder to obtain target voice characteristic information.
In some embodiments, in the real-time speech conversion apparatus, the first speech data satisfying the speech segmentation condition includes at least one of:
first voice data whose real-time recording duration reaches a preset segment duration;
first voice data whose real-time recording frame number reaches a preset frame number threshold;
first voice data intercepted when a preset segmentation instruction is received;
and first voice data intercepted when the real-time recording is finished.
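For illustration only, the following minimal sketch combines the segmentation conditions listed above into a single check; the thresholds and flag names are hypothetical.
```python
def should_segment(duration_s, n_frames, segment_instruction, recording_finished,
                   max_duration_s=1.0, max_frames=100):
    """Return True when the buffered recording should be cut into a first-voice-data segment."""
    return (duration_s >= max_duration_s   # preset segment duration reached
            or n_frames >= max_frames      # preset frame-number threshold reached
            or segment_instruction         # preset segmentation instruction received
            or recording_finished)         # real-time recording has ended

print(should_segment(0.4, 30, False, False))   # False: keep buffering
print(should_segment(1.2, 30, False, False))   # True: duration threshold reached
```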
In some embodiments, in the real-time speech conversion apparatus, when processing the first speech data and extracting the first semantic information of the first speech data, the extracting module is specifically configured to:
acquiring a pre-trained voice recognition model;
and inputting the first voice data into the voice recognition model, performing tone decoupling on the first voice data through the voice recognition model, removing noise in the first voice data, and extracting semantic information of the first voice data.
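For illustration only, the following minimal sketch takes an intermediate bottleneck representation of a recognition network as the semantic information instead of decoded text, which is one common way such timbre-decoupled features are obtained; the tiny network below is a placeholder and not the recognition model of the embodiment.
```python
import torch
import torch.nn as nn

class ToyRecognizer(nn.Module):
    """A placeholder recognition network used only to expose a bottleneck feature."""
    def __init__(self, n_mels=80, hidden=256, n_tokens=100):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)  # bottleneck, largely timbre-insensitive after training
        self.classifier = nn.Linear(hidden, n_tokens)            # token posteriors (not used below)

    def extract_semantics(self, mel):
        # mel: (batch, frames, n_mels) features of the first voice data
        bottleneck, _ = self.encoder(mel)
        return bottleneck                                         # first semantic information

recognizer = ToyRecognizer()
first_voice_mel = torch.randn(1, 50, 80)
print(recognizer.extract_semantics(first_voice_mel).shape)        # torch.Size([1, 50, 256])
```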
In some embodiments, the real-time speech conversion apparatus further includes a training module, configured to obtain a pre-trained speech recognition model;
acquiring third voice data of a target speaking object, and extracting target voice characteristic information of the third voice data;
extracting third voice data semantic information through the pre-trained voice recognition model;
and inputting the semantic information and the target voice characteristic information of the third voice data into a pre-established voice conversion model, and training the voice conversion model until the voice conversion model meets the training completion condition.
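For illustration only, the following minimal sketch trains a stand-in conversion network on pairs of semantic information and target voice feature information extracted from third voice data; the network, loss, and stopping threshold are hypothetical choices rather than the actual training-completion condition.
```python
import torch
import torch.nn as nn

# Stand-in conversion network: maps semantic features to target Mel features.
conversion_model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 80))
optimizer = torch.optim.Adam(conversion_model.parameters(), lr=1e-4)
criterion = nn.L1Loss()

for step in range(100):                      # toy loop over batches drawn from third voice data
    semantic = torch.randn(8, 50, 256)       # semantic information extracted by the recognition model
    target_mel = torch.randn(8, 50, 80)      # target voice feature information of the third voice data
    loss = criterion(conversion_model(semantic), target_mel)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if loss.item() < 0.05:                   # hypothetical stand-in for the training-completion condition
        break
```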
In some embodiments, when reconstructing the target speech feature information to obtain the second speech data after the conversion of the first speech data, the reconstruction module in the real-time speech conversion apparatus is specifically configured to:
and reconstructing the target voice characteristic information through a vocoder to obtain second voice data converted from the first voice data.
Based on the same inventive concept, an electronic device corresponding to the real-time voice conversion method in the foregoing embodiment is also provided in the embodiment of the present application, and because the principle of solving the problem of the electronic device in the embodiment of the present application is similar to that of the real-time voice conversion method in the foregoing embodiment of the present application, the implementation of the electronic device may refer to the implementation of the real-time voice conversion method, and repeated details are not repeated.
Referring to fig. 11, an electronic device 1100 includes: a processor 1102, a memory 1101 and a bus, wherein the memory 1101 stores machine-readable instructions executable by the processor 1102, when the electronic device is running, the processor 1102 communicates with the memory 1101 via the bus, and the machine-readable instructions are executed by the processor 1102 to perform the following steps of the real-time speech conversion method, specifically:
intercepting first voice data meeting voice segmentation conditions from voice data of a source speaking object recorded in real time;
processing the first voice data, and extracting first semantic information of the first voice data;
inputting the first semantic information into a pre-trained voice conversion model, and converting the first semantic information and effective information of previous historical voice data of the first voice data through the voice conversion model to obtain target voice characteristic information corresponding to the first semantic information and voice factors of a target speaking object; wherein the effective information is information influencing the voice conversion of the first semantic information;
and reconstructing the target voice characteristic information to obtain second voice data after the first voice data is converted.
In some embodiments, the machine readable instructions, when executed by the processor, further perform the steps of the real-time speech conversion method of:
after the target voice characteristic information is reconstructed to obtain second voice data after the first voice data is converted, determining the playing sequence of the second voice data according to the interception time of the first voice data before the second voice data is converted;
and playing the second voice data according to the determined playing sequence.
In some embodiments, when the first semantic information and the valid information of the previous historical speech data of the first speech data are converted through the speech conversion model, the processor is specifically configured to perform the following steps:
acquiring second semantic information of a previous section of historical voice data of the first voice data converted and processed by the voice conversion model, and acquiring first effective information output by each convolution layer when target voice characteristic information of voice factors of a target speaking object corresponding to the second semantic information is acquired; wherein the second semantic information is extracted from a previous section of historical voice data of the first voice data;
and sequentially inputting the first semantic information into each convolution layer of the voice conversion model, and adding first effective information corresponding to the convolution layer at the head of input data of each convolution layer so as to convert the first semantic information and effective information of the previous historical voice data of the first voice data through the voice conversion model.
In some embodiments, the historical speech data is a previous segment of the historical speech data of the first speech data.
In some embodiments, when the first semantic information is input into a pre-trained speech conversion model, and the first semantic information and valid information of previous historical speech data of the first speech data are converted by the speech conversion model to obtain target speech feature information corresponding to the first semantic information and speech factors of a target speaker, the processor is specifically configured to perform the following steps:
determining a target voice factor vector of a target speaking object;
and inputting the first semantic information and the target voice factor vector of the target speaking object into a pre-trained voice conversion model so that the voice conversion model converts the first semantic information and effective information of previous historical voice data of the first voice data based on the target voice factor vector of the target speaking object to obtain target voice characteristic information corresponding to the first semantic information and the voice factor of the target speaking object.
In some embodiments, in determining the target speech factor vector of the target speaking object, the processor is specifically configured to perform the following steps:
acquiring identification information of a target speaking object;
determining a target voice factor vector of the target speaking object from an association relation table of a pre-trained voice conversion model according to the identification information of the target speaking object; wherein the association relation table represents the association relation between the speaking object and the voice factor vector.
In some embodiments, when the first semantic information and the valid information of the previous historical speech data are converted by the speech conversion model to obtain the target speech feature information corresponding to the first semantic information and the speech factor of the target speaker, the processor is specifically configured to perform the following steps:
acquiring a target voice factor vector of a target speaking object;
processing the target voice factor vector, the first semantic information and second effective information of the previous historical voice data of the first voice data through an encoder of a voice conversion model to generate target voice vector characteristics corresponding to the first semantic information and the target voice factor vector;
outputting the target speech vector characteristics to a decoder of the speech conversion model;
and processing the target voice vector characteristics and third effective information of the historical voice data before the first voice data through the decoder to obtain target voice characteristic information corresponding to the first semantic information and the voice factors of the target speaking object.
In some embodiments, when generating, by an encoder of the speech conversion model, the target speech vector feature corresponding to the first semantic information and the target speech factor vector from the target speech factor vector and the first semantic information and second valid information of previous historical speech data of the first speech data, the processor is specifically configured to perform the following steps:
acquiring hidden state information generated in the process of processing historical voice data by an encoder of a voice conversion model as second effective information;
and processing the target voice factor vector, the second effective information and the first semantic information through an encoder of a voice conversion model to generate target voice vector characteristics corresponding to the first semantic information and the target voice factor vector.
In some embodiments, when the decoder processes the target speech vector feature and the third valid information of the previous historical speech data of the first speech data to obtain the target speech feature information corresponding to the first semantic information and the speech factor of the target speaker, the processor is specifically configured to perform the following steps:
obtaining hidden state information generated when a decoder of the voice conversion model processes historical voice data as third effective information;
and processing the target voice vector characteristics and the third effective information through a decoder of the voice conversion model to obtain target voice characteristic information corresponding to the first semantic information and the voice factors of the target speaking object.
In some embodiments, when the target speech vector feature and the third valid information are processed by a decoder of the speech conversion model to obtain target speech feature information corresponding to the first semantic information and the speech factor of the target speaker, the processor is specifically configured to perform the following steps:
processing the target voice vector feature and the third effective information through a decoder of the voice conversion model to obtain first voice feature information corresponding to the first semantic information and the voice factor vector;
and adding 0 of a preset data frame at the tail part of the first voice characteristic information, inputting the first voice characteristic information added with 0 into an output convolution layer of a decoder, and processing the first voice characteristic information added with 0 through the decoder to obtain target voice characteristic information.
In some embodiments, the first speech data satisfying the speech segmentation condition includes at least one of:
recording first voice data with the time length reaching the preset segmentation time length in real time;
recording first voice data with the frame number reaching a preset frame number threshold in real time;
intercepting first voice data when a preset segmentation instruction is received;
and recording the intercepted first voice data when the recording is finished.
In some embodiments, when the first speech data is processed and the first semantic information of the first speech data is extracted, the processor is specifically configured to perform the following steps:
acquiring a pre-trained voice recognition model;
and inputting the first voice data into the voice recognition model, performing tone decoupling on the first voice data through the voice recognition model, removing noise in the first voice data, and extracting semantic information of the first voice data.
In some embodiments, the machine readable instructions, when executed by the processor, further perform the steps of the real-time speech conversion method of:
acquiring a pre-trained voice recognition model;
acquiring third voice data of a target speaking object, and extracting target voice characteristic information of the third voice data;
extracting third voice data semantic information through the pre-trained voice recognition model;
and inputting the semantic information and the target voice characteristic information of the third voice data into a pre-established voice conversion model, and training the voice conversion model until the voice conversion model meets the training completion condition.
In some embodiments, the machine readable instructions, when executed by the processor, further perform the steps of the real-time speech conversion method of:
when the target voice feature information is reconstructed to obtain the second voice data after the first voice data is converted, the method is specifically configured to:
and reconstructing the target voice characteristic information through a vocoder to obtain second voice data converted from the first voice data.
Based on the same inventive concept, a storage medium corresponding to the real-time voice conversion method in the foregoing embodiment is also provided in the embodiment of the present application, and since the principle of solving the problem of the storage medium in the embodiment of the present application is similar to that of the real-time voice conversion method in the foregoing embodiment of the present application, the implementation of the storage medium may refer to the implementation of the real-time voice conversion method, and repeated details are not repeated.
A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, causes the processor to perform the following steps:
intercepting first voice data meeting voice segmentation conditions from voice data of a source speaking object recorded in real time;
processing the first voice data, and extracting first semantic information of the first voice data;
inputting the first semantic information into a pre-trained voice conversion model, and converting the first semantic information and effective information of previous historical voice data of the first voice data through the voice conversion model to obtain target voice characteristic information corresponding to the first semantic information and voice factors of a target speaking object; wherein the effective information is information influencing the voice conversion of the first semantic information;
and reconstructing the target voice characteristic information to obtain second voice data after the first voice data is converted.
In some embodiments, the machine readable instructions, when executed by the processor, further perform the steps of the real-time speech conversion method of:
after the target voice characteristic information is reconstructed to obtain second voice data after the first voice data are converted, determining the playing sequence of the second voice data according to the intercepting time of the first voice data before the second voice data are converted;
and playing the second voice data according to the determined playing sequence.
In some embodiments, when the first semantic information and the valid information of the previous historical speech data of the first speech data are converted through the speech conversion model, the processor is specifically configured to perform the following steps:
acquiring second semantic information of a previous section of historical voice data of the first voice data converted and processed by the voice conversion model, and acquiring first effective information output by each convolution layer when target voice characteristic information of voice factors of a target speaking object corresponding to the second semantic information is acquired; wherein the second semantic information is extracted from a previous segment of historical voice data of the first voice data;
and sequentially inputting the first semantic information into each convolution layer of the voice conversion model, and adding first effective information corresponding to the convolution layer to the head of input data of each convolution layer so as to convert the first semantic information and the effective information of the previous historical voice data of the first voice data through the voice conversion model.
In some embodiments, the historical speech data is a previous segment of the historical speech data of the first speech data.
In some embodiments, when the first semantic information is input into a pre-trained speech conversion model, and the first semantic information and valid information of previous historical speech data of the first speech data are converted by the speech conversion model to obtain target speech feature information corresponding to the first semantic information and speech factors of a target speaker, the processor is specifically configured to perform the following steps:
determining a target voice factor vector of a target speaking object;
and inputting the first semantic information and the target voice factor vector of the target speaking object into a pre-trained voice conversion model so that the voice conversion model carries out conversion processing on the first semantic information and effective information of previous historical voice data of the first voice data based on the target voice factor vector of the target speaking object to obtain target voice characteristic information corresponding to the first semantic information and the voice factor of the target speaking object.
In some embodiments, in determining the target speech factor vector of the target speaking object, the processor is specifically configured to perform the following steps:
acquiring identification information of a target speaking object;
determining a target voice factor vector of the target speaking object from an association relation table of a pre-trained voice conversion model according to the identification information of the target speaking object; wherein the association relation table represents the association relation between the speaking object and the voice factor vector.
In some embodiments, when the first semantic information and the valid information of the previous historical speech data are converted by the speech conversion model to obtain the target speech feature information corresponding to the first semantic information and the speech factor of the target speaker, the processor is specifically configured to perform the following steps:
acquiring a target voice factor vector of a target speaking object;
processing the target voice factor vector, the first semantic information and second effective information of the previous historical voice data of the first voice data through an encoder of a voice conversion model to generate target voice vector characteristics corresponding to the first semantic information and the target voice factor vector;
outputting the target speech vector characteristics to a decoder of the speech conversion model;
and processing the target voice vector characteristics and third effective information of the historical voice data before the first voice data through the decoder to obtain target voice characteristic information corresponding to the first semantic information and the voice factors of the target speaking object.
In some embodiments, when generating, by an encoder of the speech conversion model, the target speech vector feature corresponding to the first semantic information and the target speech factor vector according to the target speech factor vector and the first semantic information and second valid information of previous historical speech data of the first speech data, the processor is specifically configured to perform the following steps:
acquiring hidden state information generated in the process of processing historical voice data by an encoder of a voice conversion model as second effective information;
and processing the target voice factor vector, the second effective information and the first semantic information through an encoder of a voice conversion model to generate target voice vector characteristics corresponding to the first semantic information and the target voice factor vector.
In some embodiments, when the decoder processes the target speech vector feature and the third valid information of the previous historical speech data of the first speech data to obtain the target speech feature information corresponding to the first semantic information and the speech factor of the target speaker, the processor is specifically configured to perform the following steps:
acquiring hidden state information generated by a decoder of a voice conversion model in the process of processing historical voice data as third effective information;
and processing the target voice vector characteristic and the third effective information through a decoder of the voice conversion model to obtain target voice characteristic information corresponding to the first semantic information and the voice factor of the target speaking object.
In some embodiments, when the target speech vector feature and the third valid information are processed by the decoder of the speech conversion model to obtain target speech feature information corresponding to the first semantic information and the speech factor of the target speaker, the processor is specifically configured to perform the following steps:
processing the target voice vector feature and the third effective information through a decoder of the voice conversion model to obtain first voice feature information corresponding to the first semantic information and the voice factor vector;
and adding 0 of a preset data frame at the tail part of the first voice characteristic information, inputting the first voice characteristic information added with 0 into an output convolution layer of a decoder, and processing the first voice characteristic information added with 0 through the decoder to obtain target voice characteristic information.
In some embodiments, the first speech data satisfying the speech segmentation condition includes at least one of:
recording first voice data with the time length reaching the preset segmentation time length in real time;
recording first voice data with the frame number reaching a preset frame number threshold in real time;
intercepting first voice data when a preset segmentation instruction is received;
and recording the intercepted first voice data when the recording is finished.
In some embodiments, when the first speech data is processed and the first semantic information of the first speech data is extracted, the processor is specifically configured to perform the following steps:
acquiring a pre-trained voice recognition model;
and inputting the first voice data into the voice recognition model, performing tone decoupling on the first voice data through the voice recognition model, removing noise in the first voice data, and extracting semantic information of the first voice data.
In some embodiments, the machine readable instructions, when executed by the processor, further perform the steps of the real-time speech conversion method of:
acquiring a pre-trained voice recognition model;
acquiring third voice data of a target speaking object, and extracting target voice characteristic information of the third voice data;
extracting third voice data semantic information through the pre-trained voice recognition model;
and inputting the semantic information and the target voice characteristic information of the third voice data into a pre-established voice conversion model, and training the voice conversion model until the voice conversion model meets the training completion condition.
In some embodiments, the machine readable instructions, when executed by the processor, further perform the steps of the real-time speech conversion method of:
when reconstructing the target voice feature information to obtain the second voice data after the first voice data conversion, specifically:
and reconstructing the target voice characteristic information through a vocoder to obtain second voice data after the first voice data is converted.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the method embodiments, and are not described in detail in this application.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described apparatus embodiments are merely illustrative; for example, the division of the modules is merely a logical division, and there may be other divisions in actual implementation; for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted or not executed.
In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or modules through some communication interfaces, and may be in an electrical, mechanical or other form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in software functional units and sold or used as a stand-alone product, may be stored in a non-transitory computer-readable storage medium executable by a processor. Based on such understanding, the technical solutions of the present application, or the portions thereof that substantially contribute to the prior art, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a platform server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, an optical disk, or other media capable of storing program codes.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (17)

1. A real-time speech conversion method, comprising the steps of:
intercepting first voice data meeting voice segmentation conditions from voice data of a source speaking object recorded in real time;
processing the first voice data, and extracting first semantic information of the first voice data;
inputting the first semantic information into a pre-trained voice conversion model, and converting the first semantic information and effective information of previous historical voice data of the first voice data through the voice conversion model to obtain target voice characteristic information corresponding to the first semantic information and voice factors of a target speaking object; the effective information is information influencing the voice conversion of the first semantic information;
and reconstructing the target voice characteristic information to obtain second voice data after the first voice data is converted.
2. The real-time voice conversion method according to claim 1, wherein after reconstructing the target voice feature information to obtain the second voice data after converting the first voice data, the method further comprises:
determining the playing sequence of the second voice data according to the intercepting time of the first voice data before the second voice data is converted;
and playing the second voice data according to the determined playing sequence.
3. The real-time voice conversion method according to claim 1, wherein the performing conversion processing on the first semantic information and the valid information of the previous historical voice data of the first voice data by the voice conversion model comprises:
acquiring second semantic information of a previous section of historical voice data of the first voice data converted and processed by the voice conversion model, and acquiring first effective information output by each convolution layer when target voice characteristic information of voice factors of a target speaking object corresponding to the second semantic information is acquired; wherein the second semantic information is extracted from a previous segment of historical voice data of the first voice data;
and sequentially inputting the first semantic information into each convolution layer of the voice conversion model, and adding first effective information corresponding to the convolution layer at the head of input data of each convolution layer so as to convert the first semantic information and effective information of the previous historical voice data of the first voice data through the voice conversion model.
4. The real-time speech conversion method of claim 1, wherein:
the historical voice data is previous section of historical voice data of the first voice data.
5. The real-time voice conversion method according to claim 1, wherein the inputting the first semantic information into a pre-trained voice conversion model, and performing conversion processing on the first semantic information and effective information of previous historical voice data of the first voice data through the voice conversion model to obtain target voice feature information corresponding to the first semantic information and voice factors of a target speaker object comprises:
determining a target voice factor vector of a target speaking object;
and inputting the first semantic information and the target voice factor vector of the target speaking object into a pre-trained voice conversion model so that the voice conversion model carries out conversion processing on the first semantic information and effective information of previous historical voice data of the first voice data based on the target voice factor vector of the target speaking object to obtain target voice characteristic information corresponding to the first semantic information and the voice factor of the target speaking object.
6. The method of claim 5, wherein determining the target speech factor vector of the target speaker comprises:
acquiring identification information of a target speaking object;
determining a target voice factor vector of a target speaking object from an association relation table of a pre-trained voice conversion model according to identification information of the target speaking object; wherein the association relation table represents the association relation between the speaking object and the voice factor vector.
7. The real-time voice conversion method according to claim 1, wherein the converting, by the voice conversion model, the first semantic information and the effective information of the previous historical voice data of the first voice data to obtain target voice characteristic information corresponding to the first semantic information and the voice factor of the target speaking object comprises:
acquiring a target voice factor vector of a target speaking object;
processing the target voice factor vector, the first semantic information and second effective information of the previous historical voice data of the first voice data through an encoder of a voice conversion model to generate target voice vector characteristics corresponding to the first semantic information and the target voice factor vector;
outputting the target speech vector characteristics to a decoder of the speech conversion model;
and processing the target voice vector characteristics and third effective information of the historical voice data before the first voice data through the decoder to obtain target voice characteristic information corresponding to the first semantic information and the voice factors of the target speaking object.
8. The method of claim 7, wherein generating, by an encoder of a speech conversion model, a target speech vector feature corresponding to the first semantic information and the target speech factor vector according to the target speech factor vector and the first semantic information and second valid information of previous historical speech data of the first speech data comprises:
acquiring hidden state information generated in the process of processing historical voice data by an encoder of a voice conversion model as second effective information;
and processing the target voice factor vector, the second effective information and the first semantic information through an encoder of a voice conversion model to generate target voice vector characteristics corresponding to the first semantic information and the target voice factor vector.
9. The method of claim 7, wherein the processing, by the decoder, the target speech vector feature and third valid information of previous historical speech data of the first speech data to obtain the target speech feature information corresponding to the first semantic information and the speech factor of the target speaker comprises:
obtaining hidden state information generated when a decoder of the voice conversion model processes historical voice data as third effective information;
and processing the target voice vector characteristic and the third effective information through a decoder of the voice conversion model to obtain target voice characteristic information corresponding to the first semantic information and the voice factor of the target speaking object.
10. The method of claim 9, wherein the processing the target speech vector feature and the third valid information by a decoder of the speech conversion model to obtain target speech feature information corresponding to the first semantic information and the speech factor of the target speaker object comprises:
processing the target voice vector feature and the third effective information through a decoder of the voice conversion model to obtain first voice feature information corresponding to the first semantic information and the voice factor vector;
and adding 0 of a preset data frame at the tail part of the first voice characteristic information, inputting the first voice characteristic information added with 0 into an output convolution layer of a decoder, and processing the first voice characteristic information added with 0 through the decoder to obtain target voice characteristic information.
11. The real-time voice conversion method according to claim 1, wherein the first voice data satisfying the voice segmentation condition includes at least one of:
recording first voice data with the time length reaching preset segmented time length in real time;
recording first voice data with the frame number reaching a preset frame number threshold in real time;
first voice data intercepted when a preset segmentation instruction is received;
and recording the intercepted first voice data when the recording is finished.
12. The real-time voice conversion method according to claim 1, wherein processing the first voice data to extract first semantic information of the first voice data comprises:
acquiring a pre-trained voice recognition model;
and inputting the first voice data into the voice recognition model, performing tone decoupling on the first voice data through the voice recognition model, removing noise in the first voice data, and extracting semantic information of the first voice data.
13. The real-time speech conversion method of claim 1, wherein the speech conversion model is trained by:
acquiring a pre-trained voice recognition model;
acquiring third voice data of a target speaking object, and extracting target voice characteristic information of the third voice data;
extracting third voice data semantic information through the pre-trained voice recognition model;
and inputting the semantic information and the target voice characteristic information of the third voice data into a pre-established voice conversion model, and training the voice conversion model until the voice conversion model meets the training completion condition.
14. The real-time voice conversion method according to claim 1, wherein reconstructing the target voice feature information to obtain the second voice data after the first voice data conversion comprises:
and reconstructing the target voice characteristic information through a vocoder to obtain second voice data after the first voice data is converted.
15. A real-time speech conversion apparatus, comprising:
the intercepting module is used for intercepting first voice data meeting a voice segmenting condition from voice data of a source speaking object recorded in real time;
the extraction module is used for processing the first voice data and extracting first semantic information of the first voice data;
the conversion module is used for inputting the first semantic information into a pre-trained voice conversion model and converting the first semantic information and effective information of previous historical voice data of the first voice data through the voice conversion model to obtain target voice characteristic information corresponding to the first semantic information and voice factors of a target speaking object; wherein the effective information is information influencing the voice conversion of the first semantic information;
and the reconstruction module is used for reconstructing the target voice characteristic information to obtain second voice data after the first voice data is converted.
16. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the steps of the real-time speech conversion method according to any one of claims 1 to 14.
17. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, performs the steps of the real-time speech conversion method according to any one of claims 1 to 14.
CN202211329075.5A 2022-10-27 2022-10-27 Real-time voice conversion method, device, electronic equipment and medium Pending CN115910083A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211329075.5A CN115910083A (en) 2022-10-27 2022-10-27 Real-time voice conversion method, device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211329075.5A CN115910083A (en) 2022-10-27 2022-10-27 Real-time voice conversion method, device, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN115910083A true CN115910083A (en) 2023-04-04

Family

ID=86476972

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211329075.5A Pending CN115910083A (en) 2022-10-27 2022-10-27 Real-time voice conversion method, device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN115910083A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117993307A (en) * 2024-04-07 2024-05-07 中国海洋大学 Earth system simulation result consistency assessment method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination