CN117953837A - Singing voice conversion method and device based on fundamental frequency control, electronic equipment and storage medium

Info

Publication number: CN117953837A
Application number: CN202410107322.XA
Legal status: Pending
Original language: Chinese (zh)
Inventors: 陈闽川, 马骏, 王少军, 肖京
Applicant/Assignee: Ping An Technology Shenzhen Co Ltd
Prior art keywords: vector, pitch, fundamental frequency, song, singing voice
Classification: Reverberation, Karaoke And Other Acoustics

Abstract

The invention relates to the field of voice data processing, and discloses a singing voice conversion method based on fundamental frequency control, which comprises the following steps: clipping a first song sample to obtain a key pitch vector, and randomly shifting the key pitch vector to obtain an offset pitch vector; performing vectorization processing on the linear spectrum of a second song sample to obtain a linear spectrum vector; obtaining a first spliced vector based on the key pitch vector and the linear spectrum vector, and decoding the first spliced vector to obtain a reconstructed original waveform; obtaining a second spliced vector based on the offset pitch vector and the linear spectrum vector, and decoding the second spliced vector to obtain an offset waveform; and converting the second song sample into a target song based on the reconstructed original waveform, the offset waveform, and the original waveform of the second song sample. The invention aims to make the converted singing voice closer to the real singing style and pitch characteristics of the first singer of the first song sample.

Description

Singing voice conversion method and device based on fundamental frequency control, electronic equipment and storage medium
Technical Field
The present invention relates to the field of speech data processing, and in particular, to a singing voice conversion method and apparatus based on fundamental frequency control, an electronic device, and a storage medium.
Background
Singing voice conversion based on fundamental frequency control is an advanced sound processing technique. It can accurately convert the vocal style of a source singer into the timbre of a target singer while keeping the original content of the song unchanged, either within the same song, or from the source singer of a second song to the target singer of a first song while keeping the content of the second song unchanged.
Currently, one of the most difficult tasks faced by singing voice conversion models is the precise simulation and reconstruction of pitch. Pitch not only determines the basic tonal structure of a song, but also serves as a key carrier for the rhythm and emotional expression of the whole work.
The fundamental frequency signal (F0), as a core acoustic parameter, plays a decisive role in accurately describing the pitch fluctuation of songs. In current singing voice conversion tasks based on fundamental frequency control, much research has incorporated the fundamental frequency into the model for pitch prediction, and its effectiveness in improving the prosodic naturalness of converted works has been demonstrated.
However, strategies that model the fundamental frequency directly are not perfect. In practice, the inventors found that direct fundamental frequency modeling tends to narrow the variation range of a song's fundamental frequency, or to shrink the variance of its fundamental frequency values, which limits to some extent the real singing styles and pitch characteristics of different singers.
For example, a mobile application developed by a financial insurance enterprise may build a singing voice conversion feature into an application module to provide diversified value-added or interactive entertainment services, so as to retain existing users and attract more new user groups.
For example, after completing an insurance business operation, a user can upload recorded song clips or select song samples provided by the platform, and the singing voice conversion model converts the user's voice into the vocal styles of different well-known singers.
Because of the limitations of the current technology, such as the limited variation range or small variance of the fundamental frequency, the converted singing voice may fall slightly short in some complex emotional expressions or in extremely high and extremely low registers, and cannot completely capture all the fine pitch characteristics of the target singer.
Disclosure of Invention
In view of the above, it is necessary to provide a singing voice conversion method based on fundamental frequency control, which aims to make the converted singing voice closer to the real singing style and pitch characteristics of the target singer.
The singing voice conversion method based on fundamental frequency control provided by the invention comprises the following steps:
Cutting a first song sample to obtain a key pitch vector, and randomly shifting the key pitch vector to obtain an offset pitch vector;
performing vectorization processing on the linear spectrum of the second song sample to obtain a linear spectrum vector;
obtaining a first spliced vector based on the key pitch vector and the linear spectrum vector, and decoding the first spliced vector to obtain a reconstructed original waveform;
Obtaining a second spliced vector based on the offset pitch vector and the linear spectrum vector, and decoding the second spliced vector to obtain an offset waveform;
The second song sample is converted to a target song based on the reconstructed original waveform, the offset waveform, and the original waveform of the second song sample.
Optionally, the clipping the first song sample to obtain the key pitch vector includes:
And extracting an initial pitch vector from the fundamental frequency signal of the first song sample, and cutting non-key areas in the initial pitch vector to obtain the key pitch vector.
Optionally, before the extracting the initial pitch vector from the fundamental frequency signal of the first song sample, the method further includes:
and performing audio sampling on the first song sample to obtain a fundamental frequency signal of the first song sample.
Optionally, the clipping the non-key region in the initial pitch vector to obtain a key pitch vector includes:
and cutting out a non-key region with the pitch change smaller than a preset threshold value in the initial pitch vector, and reserving a key region with the pitch change larger than or equal to the preset threshold value in the initial pitch vector to obtain the key pitch vector.
Optionally, the performing vectorization processing on the linear spectrum of the second song sample to obtain a linear spectrum vector includes:
calculating the similarity between the linear spectrum and all codewords in a preset codebook;
And selecting codewords with the distance smaller than a threshold value from the similarity result to replace the linear spectrum, so as to obtain the linear spectrum vector.
Optionally, the decoding the first spliced vector to obtain a reconstructed original waveform includes:
And decoding the first spliced vector, and restoring the decoded first spliced vector into a time domain signal to obtain the reconstructed original waveform.
Optionally, the decoding the second spliced vector to obtain an offset waveform includes:
And decoding the second spliced vector, and restoring the decoded second spliced vector into a time domain signal to obtain the offset waveform.
In order to solve the above-mentioned problem, the present invention also provides a singing voice conversion apparatus based on fundamental frequency control, the apparatus comprising:
the clipping module is used for clipping the first song sample to obtain a key pitch vector, and randomly shifting the key pitch vector to obtain an offset pitch vector;
the processing module is used for performing vectorization processing on the linear spectrum of the second song sample to obtain a linear spectrum vector;
The first splicing module is used for obtaining a first splicing vector based on the key pitch vector and the linear spectrum vector, and decoding the first splicing vector to obtain a reconstructed original waveform;
the second splicing module is used for obtaining a second splicing vector based on the offset pitch vector and the linear spectrum vector, and decoding the second splicing vector to obtain an offset waveform;
And the conversion module is used for converting the second song sample into a target song based on the reconstructed original waveform, the offset waveform and the original waveform of the second song sample.
In order to solve the above-mentioned problems, the present invention also provides an electronic apparatus including:
at least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores a fundamental frequency control-based singing voice conversion program executable by the at least one processor, the fundamental frequency control-based singing voice conversion program being executed by the at least one processor to enable the at least one processor to perform the fundamental frequency control-based singing voice conversion method described above.
In order to solve the above-mentioned problems, the present invention also provides a computer-readable storage medium having stored thereon a fundamental frequency control-based singing voice conversion program executable by one or more processors to implement the above-mentioned fundamental frequency control-based singing voice conversion method.
Compared with the prior art, the present invention clips the first song sample to obtain a key pitch vector and randomly offsets the key pitch vector to obtain an offset pitch vector, thereby simulating the pitch variation characteristics of the first singer when singing and providing a foundation for style migration.
Vectorizing the linear spectrum of the second song sample yields a linear spectrum vector containing the song's spectral information. A first spliced vector is obtained from the key pitch vector and the linear spectrum vector and decoded into a reconstructed original waveform; a second spliced vector is obtained from the offset pitch vector and the linear spectrum vector and decoded into an offset waveform. The reconstructed original waveform and the offset waveform represent, respectively, the first singer's sound at the original pitch of the first song's style and at the adjusted pitch.
The second song sample is converted into a target song based on the reconstructed original waveform, the offset waveform, and the original waveform of the second song sample. This ensures that the converted song keeps the basic melody of the second song sample while being endowed with the real singing style and pitch characteristics of the first singer in the first song sample.
Drawings
Fig. 1 is a schematic flow chart of a singing voice conversion method based on fundamental frequency control according to an embodiment of the invention;
Fig. 2 is a flow chart of a singing voice conversion method based on fundamental frequency control according to an embodiment of the invention;
Fig. 3 is a schematic diagram of a singing voice conversion device based on fundamental frequency control according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an electronic device implementing a singing voice conversion method based on fundamental frequency control according to an embodiment of the present invention;
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that the descriptions "first", "second", etc. in this disclosure are for descriptive purposes only and are not to be construed as indicating or implying relative importance or the number of technical features indicated. Thus, a feature defined by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with each other, but only on the basis that the combination can be realized by those skilled in the art; when technical solutions are contradictory or cannot be realized, the combination should be considered absent and outside the scope of protection claimed in the present invention.
Referring to fig. 1 and fig. 2, a flow chart of a singing voice conversion method based on fundamental frequency control according to an embodiment of the invention is shown.
The method is performed by an electronic device.
In this embodiment, the singing voice conversion method based on the fundamental frequency control includes:
s1, cutting a first song sample to obtain a key pitch vector, and randomly shifting the key pitch vector to obtain an offset pitch vector;
In this embodiment, a first song sample of a first singer is obtained and stored in a database as a digitized audio file. The song sample is input into the singing voice conversion model, and the first module of the model extracts the fundamental frequency signal from the first song sample to obtain an initial pitch vector. The initial pitch vector is a continuous data sequence representing the overall pitch trend of the song.
The first singer is the target of the conversion from the source singer's vocal style to the target singer's timbre; that is, the first singer is the target singer.
The first song sample refers to a song that the first singer sings.
The initial pitch vector represents a data sequence of the trend of the fundamental frequency (F0) of the song sample.
For example, a 3-minute song may have an initial pitch vector that is a series of floating point numbers of length 9000 (3 minutes × 50 points/second frame rate), each corresponding to the fundamental frequency value at that time.
In an actual song, not all time periods of fundamental frequency variation play a key role in the overall style or emotional expression, so some relatively stable, slowly varying portions of the initial pitch vector are clipped.
For example, when a melody is in a flat passage or a steady chord progression, the change in pitch may not be significant, and such insignificant regions are regarded as non-key regions.
The non-key regions are clipped to yield the key pitch vector, which retains only those portions containing distinct pitch fluctuations or the singer's unique vocal-style features.
To achieve flexible control of pitch, the singing voice conversion model randomly shifts the key pitch vector. For example, adding or subtracting a random value to each fundamental frequency value in the key pitch vector generates a different pitch version: if the original pitch value at a certain time point is 500 Hz, after random offset it may become 520 Hz or 480 Hz. This yields a randomly offset pitch vector and realizes an innovative adjustment of the song's pitch.
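As a minimal sketch of this random-offset step (assuming the key pitch vector is simply a NumPy array of F0 values in Hz; the ±20 Hz offset range and the function name are illustrative, not taken from the patent):

```python
import numpy as np

def random_offset(key_pitch_hz, max_offset_hz=20.0, rng=None):
    """Add an independent random offset (in Hz) to every fundamental
    frequency value of the key pitch vector."""
    rng = rng or np.random.default_rng()
    offsets = rng.uniform(-max_offset_hz, max_offset_hz, size=key_pitch_hz.shape)
    return key_pitch_hz + offsets

key_pitch = np.array([500.0, 505.0, 498.0])   # toy key pitch vector (Hz)
print(random_offset(key_pitch))               # e.g. [518.7 489.2 503.4]
```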
The singing voice conversion model is improved on the basis of the VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) model.
The singing voice conversion model comprises a first module, a second module, a third module and a fourth module, wherein:
The first module is constructed based on the PITS model. PITS is an end-to-end, pitch-controllable TTS model that models pitch using variational inference; it also incorporates adversarial training of the pitch encoder, pitch decoder, and pitch-shifted synthesis to achieve pitch controllability.
The first module provides the pitch-controllable processing of the first song sample. Unlike traditional methods that rely on direct fundamental frequency modeling, the first module models and controls pitch through variational inference, which addresses the small variance of synthesized speech and the limited prosodic expression caused by direct modeling.
The second module is constructed based on a vector quantization (Vector Quantization, VQ) algorithm. The second module is used for processing the linear spectrum of the first song sample, and can keep the characteristics of tone, rhythm and the like of the tone quality of the song.
The third module is built based on a timbre encoder. Timbre encoders refer to a deep learning model used to extract, learn, or generate mel-spectral features of sample songs in the field of singing voice conversion based on fundamental frequency control.
The third module is for capturing and expressing mel-spectra in song samples for feature combinations of mel-spectra, linear spectra, and pitch.
The fourth module is constructed based on a speech self-supervised pre-training model. A speech self-supervised pre-training model is a network that uses unlabeled speech data for self-learning and performance improvement.
The fourth module performs self-supervised learning on a large number of unlabeled song samples and extracts the text information of the sample songs. It avoids the need for extensive manual labeling of song samples and improves the singing voice conversion model's understanding of complex vocal changes.
In one embodiment, the clipping the first song sample to obtain the key pitch vector includes:
And extracting an initial pitch vector from the fundamental frequency signal of the first song sample, and cutting non-key areas in the initial pitch vector to obtain the key pitch vector.
In one embodiment, the extracting the initial pitch vector from the fundamental frequency signal of the first song sample of the first singer includes:
And acquiring a fundamental frequency signal of the first song sample, and extracting a key parameter of the fundamental frequency signal as the initial pitch vector, wherein the key parameter at least comprises one of average pitch, pitch range, pitch change rate and pitch differential sequence.
The first module includes a variational encoder and a variational decoder.
The variational encoder of the first module acquires the fundamental frequency signal of the first song sample and extracts key parameters of the fundamental frequency signal as the initial pitch vector, which helps the singing voice conversion model understand and analyze the pitch characteristics of the first song sample. Key parameters include, but are not limited to: average pitch, pitch range, pitch change rate, and pitch differential sequence.
In one embodiment, before the extracting the initial pitch vector from the fundamental frequency signal of the first song sample, the method further comprises:
and performing audio sampling on the first song sample to obtain a fundamental frequency signal of the first song sample.
Specifically, the digital audio file (e.g., in wav or mp3 format) corresponding to the first song sample is read, and the audio signal is digitally sampled at a preset sampling rate (e.g., 44.1 kHz or 48 kHz) to obtain continuous audio information. The audio information is framed to obtain the fundamental frequency signal of each frame. A preset time-frequency analysis strategy (e.g., short-time Fourier transform (STFT) or cepstrum analysis) converts the fundamental frequency signal of each frame from the time domain to the frequency domain to obtain a spectrogram of each frame, and the fundamental frequency characteristics of the spectrogram are extracted to obtain the initial pitch vector.
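The patent does not name a specific F0 extractor, so the sketch below stands in with a naive spectral-peak picker over STFT frames; the frame length, hop size, and vocal-range bounds are assumed values:

```python
import numpy as np

def extract_f0(audio, sr=44100, frame_len=2048, hop=882,
               fmin=60.0, fmax=1100.0):
    """Frame the signal, take an FFT per frame, and pick the strongest
    spectral peak inside the vocal band as a crude per-frame F0 estimate."""
    window = np.hanning(frame_len)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    band = (freqs >= fmin) & (freqs <= fmax)   # restrict to the vocal range
    f0 = []
    for start in range(0, len(audio) - frame_len, hop):
        frame = audio[start:start + frame_len] * window
        mag = np.abs(np.fft.rfft(frame))
        f0.append(freqs[band][np.argmax(mag[band])])
    return np.array(f0)

sr = 44100
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440.0 * t)          # 1 s of a 440 Hz tone
print(extract_f0(audio, sr)[:3])               # close to 440 Hz (limited by the ~21.5 Hz bin width)
```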
The spectrogram refers to the frequency component distribution of the fundamental frequency signal at a single moment. The spectrogram is a static image showing the frequency composition of the fundamental frequency signal and its corresponding amplitude.
In one embodiment, after the acquiring the fundamental frequency signal of the first song sample, the method further comprises:
and performing smoothing preprocessing on the fundamental frequency signal, wherein the smoothing preprocessing at least comprises one of a moving average algorithm, a linear filter algorithm, and a nonlinear filter algorithm.
The moving average algorithm replaces the fundamental frequency value of the current frame with the average of the fundamental frequency values of adjacent frames, reducing the instability caused by single-frame fundamental frequency fluctuations.
The linear filter algorithm applies a low-pass filter (such as a Butterworth filter or an exponential smoothing filter) to the fundamental frequency sequence to reduce high-frequency noise and abrupt points.
The nonlinear filter algorithm refers to advanced filtering algorithms such as Kalman filtering and particle filtering, which combine model prediction with actual observation data to optimize the fundamental frequency estimate; it is particularly suitable for environments with dynamic changes and noise interference.
The fundamental frequency signal is smoothed, and key parameters are extracted from the smoothed fundamental frequency signal as the initial pitch vector. This eliminates the influence of noise and detection errors on the fundamental frequency signal and improves the continuity and stability of the pitch track.
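A minimal sketch of the moving-average variant of the smoothing preprocessing (the window size is an assumption; a production system might use one of the filter algorithms above instead):

```python
import numpy as np

def moving_average_smooth(f0, win=5):
    """Replace each frame's F0 with the mean over a `win`-frame window."""
    kernel = np.ones(win) / win
    # mode="same" keeps the sequence length; edge frames average against
    # implicit zeros, so real code would pad by reflection first.
    return np.convolve(f0, kernel, mode="same")

noisy = np.array([500.0, 530.0, 498.0, 505.0, 560.0, 502.0, 499.0])
print(moving_average_smooth(noisy))
```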
In one embodiment, the clipping non-critical regions in the initial pitch vector to obtain a critical pitch vector includes:
and cutting out a non-key region with the pitch change smaller than a preset threshold value in the initial pitch vector, and reserving a key region with the pitch change larger than or equal to the preset threshold value in the initial pitch vector to obtain the key pitch vector.
In an actual song, not all fundamental frequency variations in all time periods play a key role in the overall style or emotional expression. Therefore, non-key regions whose pitch change is smaller than a preset threshold (e.g., 10 Hz) are clipped from the initial pitch vector; these non-key regions are the relatively stable, slowly varying portions of the initial pitch vector.
For example, when a melody is in a flat passage or a steady chord progression, the change in pitch may not be significant, and such insignificant regions are regarded as non-key regions.
The non-key regions are clipped to yield the key pitch vector, which retains only those portions containing distinct pitch fluctuations or the singer's unique vocal-style features. This lets the singing voice conversion model focus on capturing and mimicking the key regions that contain significant pitch fluctuations, tempo changes, and the singer's personal vocal style.
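A minimal sketch of the clipping step, assuming the initial pitch vector is a per-frame F0 array and using the 10 Hz threshold mentioned above; measuring "pitch change" as the frame-to-frame difference is an assumption, since the patent leaves the exact measure open:

```python
import numpy as np

def clip_non_key_regions(f0, threshold_hz=10.0):
    """Keep only frames whose frame-to-frame pitch change reaches the
    preset threshold; everything below it is a non-key region."""
    delta = np.abs(np.diff(f0, prepend=f0[0]))  # first frame gets delta 0
    return f0[delta >= threshold_hz]

initial_pitch = np.array([440.0, 441.0, 442.0, 480.0, 483.0, 530.0])
print(clip_non_key_regions(initial_pitch))      # [480. 530.]
```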
In one embodiment, the randomly shifting the key pitch vector to obtain an shifted pitch vector includes:
Randomly shifting each fundamental frequency value of the key pitch vector according to a preset pitch shift strategy;
and when the key pitch vector after random offset meets a preset condition, obtaining the offset pitch vector.
The pitch shift strategy refers to a purposeful, and sometimes constrained, adjustment of the original pitch information in order to achieve the transition from the source singer's to the target singer's voice style during fundamental-frequency-controlled singing voice conversion.
To realize flexible control of pitch, one of the innovations of the invention is to input the key pitch vector into the variational decoder of the first module for constraint, so that the randomly offset pitch vector stays close to the key pitch vector. This prevents the offset amplitude from becoming too large, preserves the relative distance between the key pitch vector and the offset pitch vector, and improves the effectiveness of pitch control.
Meanwhile, random offsets allow the pitch characteristics of the source singer to be flexibly adjusted to match those of the target singer, making the converted singing voice closer to the target singer's real singing style.
Each fundamental frequency value of the key pitch vector is adjusted according to the preset pitch shift strategy; when the randomly offset pitch values meet preset conditions (for example, remaining within a reasonable human voice frequency range and not destroying the aesthetics of the original melody), the offset pitch vector is obtained.
For example, adding or subtracting a random value to each fundamental frequency value in the key pitch vector generates a different pitch version: an original pitch value of 500 Hz at a certain time point may become 520 Hz or 480 Hz after random offset, yielding a randomly offset pitch vector and realizing an innovative adjustment of the song's pitch.
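A sketch of shifting under a preset condition, where the "reasonable human voice frequency range" is an assumed 80-1100 Hz and the retry logic is illustrative:

```python
import numpy as np

VOICE_RANGE_HZ = (80.0, 1100.0)   # assumed "reasonable human voice range"

def shifted_pitch_vector(key_pitch, max_offset_hz=20.0, max_tries=100, rng=None):
    """Re-draw random offsets until the shifted vector satisfies the preset
    condition that every value stays inside the voice range."""
    rng = rng or np.random.default_rng()
    lo, hi = VOICE_RANGE_HZ
    candidate = key_pitch
    for _ in range(max_tries):
        candidate = key_pitch + rng.uniform(-max_offset_hz, max_offset_hz,
                                            size=key_pitch.shape)
        if np.all((candidate >= lo) & (candidate <= hi)):
            return candidate
    return np.clip(candidate, lo, hi)   # fallback: clamp after max_tries
```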
In step S1, obtaining the key pitch vector and the offset pitch vector provides the following benefits:
1. By clipping the non-key regions of the initial pitch vector, the key pitch vector retains only the portions containing distinct pitch fluctuations or the singer's unique vocal-style features, which helps capture and retain the source singer's personalized pitch features during singing voice conversion.
2. Randomly offsetting the key pitch vector to obtain the offset pitch vector lets the singing voice conversion model creatively adjust the pitch of songs and simulate the target singer's unique vocal range, timbre, and melodic expressiveness, bringing the result closer to the target singer's real singing style.
3. Whereas traditional direct fundamental frequency modeling may yield output with small pitch variation and limited richness, introducing key pitch vectors and pitch offsets helps overcome this limitation, giving singing voice conversion greater fineness and flexibility in pitch reconstruction.
S2, performing vectorization processing on the linear spectrum of the second song sample to obtain a linear spectrum vector;
In this embodiment, the second song sample refers to a song that the second singer sings.
The second singer is the source of the conversion from the source singer's vocal style to the target singer's timbre; that is, the second singer is the source singer.
The first song sample and the second song sample may be the same song or two different songs, which is not limited herein.
The linear spectrum refers to a frequency domain representation obtained by linearly transforming an audio signal. The linear spectrum can help the singing voice conversion model understand the change condition of tone quality, tone color and intensity of songs along with time, and is one of important tools for analyzing the structure, tone and rhythm characteristics of music signals.
The second module of the singing voice conversion model converts each frame of the second song sample from the time domain to the frequency domain to obtain a time-frequency spectrogram of each frame, and a linear spectrum is extracted from the time-frequency spectrogram; the linear spectrum includes, but is not limited to, amplitude spectrum and phase spectrum information.
The linear spectrum is vectorized to reduce the data dimension while retaining key features, yielding the linear spectrum vector.
In one embodiment, the performing vectorization processing on the linear spectrum of the second song sample to obtain a linear spectrum vector includes:
calculating the similarity between the linear spectrum and all codewords in a preset codebook;
And selecting codewords with the distance smaller than a threshold value from the similarity result to replace the linear spectrum, so as to obtain the linear spectrum vector.
Carrying out vectorization processing on the linear spectrum of the second song sample to reduce the data dimension and maintain key characteristics, and obtaining a linear spectrum vector, wherein the vectorization processing comprises the following steps:
1. the initial vector of the linear spectrum of each frame of the second song sample is taken as a point in a high-dimensional space.
2. A preset codebook (codebook) is set, which contains a series of pre-trained representative vectors (codewords). In speech signal processing or audio coding, a codebook is a pre-trained set of representative tonal or spectral features. For example, during vector quantization (Vector Quantization, VQ), the initial vector of the original high-dimensional linear spectrum is mapped to one or several nearest codewords in the codebook, thereby performing data compression and maintaining key information.
3. The similarity (e.g., euclidean distance or cosine distance) between the linear spectrum of the current frame and all codewords in the codebook is calculated.
4. And screening out the code words which are most similar to the initial vector of the linear spectrum of the current frame and have the distance smaller than a threshold value as the approximate representation of the frame.
5. And replacing the initial vector of the linear spectrum of the current frame with the screened code word, thereby realizing the quantization process and obtaining the linear spectrum vector.
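The five steps above amount to standard nearest-codeword vector quantization; a compact sketch follows (codebook contents, dimensions, and the distance threshold are placeholders):

```python
import numpy as np

def quantize_frames(spectra, codebook, max_dist=1e3):
    """Replace each frame's linear-spectrum vector with the nearest
    codeword, but only when that codeword is closer than `max_dist`.
    spectra:  (n_frames, dim) array of per-frame linear spectra
    codebook: (n_codes, dim) array of pre-trained codewords"""
    # Pairwise Euclidean distances between every frame and every codeword.
    dists = np.linalg.norm(spectra[:, None, :] - codebook[None, :, :], axis=-1)
    nearest = dists.argmin(axis=1)                     # step 3: most similar
    accepted = dists[np.arange(len(spectra)), nearest] < max_dist  # step 4
    out = spectra.copy()
    out[accepted] = codebook[nearest[accepted]]        # step 5: replace
    return out
```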
In one embodiment, before the vectorizing the linear spectrum of the second song sample to obtain a linear spectrum vector, the method further includes:
Converting the time domain signal of each frame of the second song sample into a frequency domain to obtain a time-frequency spectrogram;
and extracting the frequency domain characteristics of the time-frequency spectrogram to obtain the linear spectrum.
The time-frequency spectrogram is dynamic: it reflects the frequency characteristics of the fundamental frequency signal as they change over time and shows the spectral changes of the signal in different time periods. A time-frequency spectrogram is thus a two-dimensional or three-dimensional image in which the abscissa represents time, the ordinate generally represents frequency, and intensity information such as color or gray scale represents the signal amplitude or energy density at the corresponding time-frequency point.
In the audio signal processing, the time domain signal may be a sound waveform captured by a microphone, with time as an abscissa and sound intensity as an ordinate.
In step S2, one of the innovations of the present invention is that the linear spectrum extracted from the song is input into a vector quantizer (VQ), and the quantizer is trained jointly with the flow model of the third module, so as to remove the small amount of fundamental frequency information contained in the linear spectrum.
Meanwhile, the vectorized linear spectrum vector better captures and retains the main characteristics of the song sample, such as timbre and rhythmic structure.
S3, obtaining a first spliced vector based on the key pitch vector and the linear spectrum vector, and decoding the first spliced vector to obtain a reconstructed original waveform;
In this embodiment, the key pitch vector and the linear spectrum vector are spliced to obtain the first spliced vector. The spliced vector contains both the song's variation trend in the pitch dimension and its detailed features in the frequency domain, so it can reflect the source singer's unique singing style more comprehensively.
By combining the key pitch vector and the linear spectrum vector, the singing voice conversion model can take both factors into account during training and prediction, so that the converted singing voice, while maintaining the original melodic structure, is closer to the target singer's real singing in timbre, rhythm, and emotional expression.
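In the simplest reading, "splicing" is frame-wise feature concatenation; a sketch under that assumption (all shapes are illustrative):

```python
import numpy as np

# Frame-wise concatenation of the two conditioning streams.
key_pitch   = np.random.rand(900, 1)     # (frames, 1) pitch feature
spec_vector = np.random.rand(900, 128)   # (frames, 128) quantized spectrum
first_spliced = np.concatenate([key_pitch, spec_vector], axis=-1)
print(first_spliced.shape)               # (900, 129)
```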
In one embodiment, the decoding the first spliced vector to obtain a reconstructed original waveform includes:
And decoding the first spliced vector, and restoring the decoded first spliced vector into a time domain signal to obtain the reconstructed original waveform.
The first spliced vector (containing the key pitch vector and the linear spectrum vector) and the second spliced vector (containing the offset pitch vector and the linear spectrum vector) serve as input data to the first decoder of the singing voice conversion model.
The first decoder performs decoding and inverse transformation operations (e.g., an inverse Fourier transform (IFFT)) on the first spliced vector, gradually restoring it to a waveform that approximates the original singing style of the source singer, thereby ensuring the recovery of a time-domain waveform signal from the frequency-domain features and obtaining the reconstructed original waveform.
The first decoder is a model trained to decode time-domain audio signals from the spliced high-dimensional feature vectors.
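The trained first decoder is a neural model, but its final inverse-transformation stage can be illustrated with a plain inverse-FFT overlap-add reconstruction (window choice and normalization are assumptions):

```python
import numpy as np

def istft_overlap_add(stft_frames, hop, frame_len):
    """Restore a time-domain waveform from complex STFT frames via
    inverse FFT plus windowed overlap-add."""
    window = np.hanning(frame_len)
    n = hop * (len(stft_frames) - 1) + frame_len
    out = np.zeros(n)
    norm = np.zeros(n)
    for i, frame in enumerate(stft_frames):
        start = i * hop
        out[start:start + frame_len] += np.fft.irfft(frame, n=frame_len) * window
        norm[start:start + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-8)   # undo the summed window energy
```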
S4, obtaining a second spliced vector based on the offset pitch vector and the linear spectrum vector, and decoding the second spliced vector to obtain an offset waveform;
In this embodiment, the offset pitch vector is the result of random or strategic adjustment of the key pitch vector, intended to simulate the vocal range characteristics and style of the target singer.
The second spliced vector, formed by splicing the offset pitch vector and the linear spectrum vector, better guides the singing voice conversion model to migrate the voice style from the source singer to the target singer while maintaining the melodic structure of the original song.
In one embodiment, the decoding the second spliced vector to obtain an offset waveform includes:
And decoding the second spliced vector, and restoring the decoded second spliced vector into a time domain signal to obtain the offset waveform.
The first decoder performs decoding and inverse transformation operations (e.g., an inverse Fourier transform (IFFT)) on the second spliced vector; the waveform generated after decoding reflects the vocal range characteristics and style of the target singer, ensuring the recovery of a time-domain waveform signal from the frequency-domain features and obtaining the offset waveform.
The reconstructed original waveform retains the main pitch and spectral characteristics of the target singer, while the offset waveform fuses in the target singer's vocal range and style characteristics.
S5, converting the second song sample into a target song based on the reconstructed original waveform, the offset waveform and the original waveform of the second song sample.
In one embodiment, the converting the second song sample to the target song based on the reconstructed original waveform, the offset waveform, and the original waveform of the second song sample includes:
And calculating probability distribution values among the reconstructed original waveform, the offset waveform and the actual waveform of the second song sample, and converting the second song sample from the singing voice of the second singer to the singing voice of the first singer when the probability distribution values are smaller than a threshold value to obtain the target song.
In this embodiment, the reconstructed original waveform, the offset waveform, and the real waveform of the second song sample are input into the discriminator of the singing voice conversion model, and probability distribution values of the reconstructed original waveform and the offset waveform are calculated in the discriminator's parameter space. The probability distribution value of the reconstructed original waveform is compared with that of the real waveform to obtain a first difference value, and the probability distribution value of the offset waveform is compared with that of the real waveform to obtain a second difference value. If the first difference value and the second difference value are both below a preset threshold, the converted singing voice is considered close to the style of the first singer.
And converting the second song sample from the singing voice of the second singer to the singing voice of the first singer to obtain a converted target song.
The discriminator is a deep learning network used to distinguish among the real waveform, the reconstructed original waveform, and the offset waveform. That is, one of the innovations of the present invention is to treat the singing voice conversion model as a generator and pair it with a discriminator to form a generative adversarial network (GAN). Within this adversarial framework, the real waveform and the reconstructed original and offset waveforms output by the singing voice conversion model are fed into the discriminator, and when the probability distribution value output by the discriminator is smaller than the threshold, the converted singing voice is considered close to the style of the first singer.
The discriminator analyzes each input waveform and outputs a probability distribution value; each score represents the degree to which the discriminator judges the waveform to be a real waveform.
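A sketch of the acceptance test described above, assuming the discriminator has already been run and has produced one scalar score per waveform; the threshold value is a placeholder:

```python
def passes_style_check(p_recon, p_offset, p_real, threshold=0.1):
    """Accept the conversion when both generated waveforms score within
    `threshold` of the real waveform under the discriminator."""
    first_diff = abs(p_recon - p_real)    # reconstructed original vs. real
    second_diff = abs(p_offset - p_real)  # offset waveform vs. real
    return first_diff < threshold and second_diff < threshold

print(passes_style_check(0.62, 0.58, 0.65))   # True: both diffs below 0.1
```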
In practical applications, a financial insurance enterprise may integrate the singing voice conversion model into its mobile application and design an interactive module named "singing voice conversion". After completing business operations such as purchasing insurance or querying a policy, the user can try different interactive entertainment activities through this module, such as uploading his or her own singing segments and selecting a well-known singer to imitate.
Using the pre-trained singing voice conversion model, the user's original singing segment is decoded in combination with the key pitch vector and the linear spectrum vector to accurately simulate the target singer's vocal characteristics. Even if the user's performance is lacking in certain vocal ranges, the technique compensates by dynamically adjusting the fundamental frequency and randomly offsetting the key pitch vector, approaching the target singer's real singing as closely as possible.
The user gains an engaging personalized experience and can share the converted songs to social platforms, improving the financial insurance enterprise's brand affinity and user stickiness. Meanwhile, this novel, technology-forward service model helps attract more young users, further widening market coverage and achieving gains in both business growth and user experience.
In step S5, probability distribution values among the reconstructed original waveform, the offset waveform, and the real waveform of the second song sample are calculated; when the probability distribution values are smaller than the threshold, the second song sample is converted from the singing voice of the second singer to the singing voice of the first singer to obtain the converted target song. This both ensures that the converted song retains the basic melody of the second song sample and endows it with the real singing style and pitch characteristics of the target singer in the first song sample.
In steps S1-S5, the method clips the first song sample to obtain the key pitch vector and randomly offsets it to obtain the offset pitch vector, simulating the pitch variation characteristics of the first singer when singing and providing a foundation for style migration.
Vectorizing the linear spectrum of the second song sample yields a linear spectrum vector containing the song's spectral information. The first spliced vector is obtained from the key pitch vector and the linear spectrum vector and decoded into the reconstructed original waveform; the second spliced vector is obtained from the offset pitch vector and the linear spectrum vector and decoded into the offset waveform. The reconstructed original waveform and the offset waveform represent, respectively, the first singer's sound at the original pitch of the first song's style and at the adjusted pitch.
The second song sample is converted into the target song based on the reconstructed original waveform, the offset waveform, and the original waveform of the second song sample. This ensures that the converted song keeps the basic melody of the second song sample while being endowed with the real singing style and pitch characteristics of the first singer in the first song sample.
Fig. 3 is a schematic block diagram of a singing voice conversion device based on fundamental frequency control according to an embodiment of the invention.
The singing voice conversion device 100 based on fundamental frequency control according to the present invention may be installed in an electronic apparatus. Depending on the implementation, the singing voice conversion device 100 based on fundamental frequency control may include a clipping module 110, a processing module 120, a first splicing module 130, a second splicing module 140, and a conversion module 150. A module of the invention, which may also be referred to as a unit, is a series of computer program segments stored in the memory of the electronic device that can be executed by the processor of the electronic device and perform a fixed function.
In the present embodiment, the functions concerning the respective modules/units are as follows:
A clipping module 110, configured to clip the first song sample to obtain a key pitch vector, and randomly shift the key pitch vector to obtain a shifted pitch vector;
A processing module 120, configured to perform vectorization processing on the linear spectrum of the second song sample to obtain a linear spectrum vector;
a first splicing module 130, configured to obtain a first spliced vector based on the key pitch vector and the linear spectrum vector, and perform decoding on the first spliced vector to obtain a reconstructed original waveform;
A second splicing module 140, configured to obtain a second splicing vector based on the offset pitch vector and the linear spectrum vector, and perform decoding on the second splicing vector to obtain an offset waveform;
a conversion module 150, configured to convert the second song sample into a target song based on the reconstructed original waveform, the offset waveform, and the original waveform of the second song sample.
In one embodiment, the clipping the first song sample to obtain the key pitch vector includes:
And extracting an initial pitch vector from the fundamental frequency signal of the first song sample, and cutting non-key areas in the initial pitch vector to obtain the key pitch vector.
In one embodiment, before the extracting the initial pitch vector from the fundamental frequency signal of the first song sample, the method further comprises:
and performing audio sampling on the first song sample to obtain a fundamental frequency signal of the first song sample.
In one embodiment, the clipping non-critical regions in the initial pitch vector to obtain a critical pitch vector includes:
and cutting out a non-key region with the pitch change smaller than a preset threshold value in the initial pitch vector, and reserving a key region with the pitch change larger than or equal to the preset threshold value in the initial pitch vector to obtain the key pitch vector.
In one embodiment, the performing vectorization processing on the linear spectrum of the second song sample to obtain a linear spectrum vector includes:
calculating the similarity between the linear spectrum and all codewords in a preset codebook;
And selecting codewords with the distance smaller than a threshold value from the similarity result to replace the linear spectrum, so as to obtain the linear spectrum vector.
In one embodiment, the decoding the first spliced vector to obtain a reconstructed original waveform includes:
And decoding the first spliced vector, and restoring the decoded first spliced vector into a time domain signal to obtain the reconstructed original waveform.
In one embodiment, the decoding the second spliced vector to obtain an offset waveform includes:
And decoding the second spliced vector, and restoring the decoded second spliced vector into a time domain signal to obtain the offset waveform.
Fig. 4 is a schematic structural diagram of an electronic device for implementing a singing voice conversion method based on fundamental frequency control according to an embodiment of the present invention.
In the present embodiment, the electronic device 1 includes, but is not limited to, a memory 11, a processor 12, and a network interface 13, which are communicably connected to each other via a system bus; the memory 11 stores a singing voice conversion program 10 based on fundamental frequency control that is executable by the processor 12. Fig. 4 shows only the electronic device 1 with the components 11-13 and the singing voice conversion program 10 based on fundamental frequency control; it will be understood by those skilled in the art that the structure shown in fig. 4 does not limit the electronic device 1, which may include fewer or more components than shown, combine certain components, or arrange components differently.
The memory 11 includes an internal memory and at least one type of readable storage medium. The internal memory provides a buffer for the operation of the electronic device 1; the readable storage medium may be a non-volatile storage medium such as a flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, or optical disk. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1; in other embodiments, the non-volatile storage medium may also be an external storage device of the electronic device 1, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the electronic device 1. In this embodiment, the readable storage medium of the memory 11 is generally used to store the operating system and various application software installed in the electronic device 1, for example, the code of the singing voice conversion program 10 based on fundamental frequency control in one embodiment of the present invention. Further, the memory 11 may be used to temporarily store various types of data that have been output or are to be output.
The processor 12 may in some embodiments be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 12 is typically used to control the overall operation of the electronic device 1, such as performing control and processing related to data interaction or communication with other devices. In this embodiment, the processor 12 is configured to run program code or process data stored in the memory 11, for example, to execute the singing voice conversion program 10 based on fundamental frequency control.
The network interface 13 may comprise a wireless network interface or a wired network interface, the network interface 13 being used for establishing a communication connection between the electronic device 1 and a terminal (not shown).
Optionally, the electronic device 1 may further comprise a user interface, which may comprise a Display (Display), an input unit such as a Keyboard (Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch, or the like. The display may also be referred to as a display screen or display unit, as appropriate, for displaying information processed in the electronic device 1 and for displaying a visual user interface.
It should be understood that the embodiments described are for illustrative purposes only and are not limited to this configuration in the scope of the patent application.
The singing voice conversion program 10 based on fundamental frequency control stored in the memory 11 in the electronic device 1 is a combination of instructions, which when executed in the processor 12, can realize:
Cutting a first song sample to obtain a key pitch vector, and randomly shifting the key pitch vector to obtain an offset pitch vector;
performing vectorization processing on the linear spectrum of the second song sample to obtain a linear spectrum vector;
obtaining a first spliced vector based on the key pitch vector and the linear spectrum vector, and decoding the first spliced vector to obtain a reconstructed original waveform;
Obtaining a second spliced vector based on the offset pitch vector and the linear spectrum vector, and decoding the second spliced vector to obtain an offset waveform;
The second song sample is converted to a target song based on the reconstructed original waveform, the offset waveform, and the original waveform of the second song sample.
Specifically, for the implementation of the singing voice conversion program 10 based on fundamental frequency control by the processor 12, reference may be made to the description of the related steps in the embodiment corresponding to fig. 1, which is not repeated herein.
Further, the modules/units integrated in the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. The computer-readable medium may be non-volatile or volatile. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a read-only memory (ROM).
The computer readable storage medium stores the singing voice conversion program 10 based on the fundamental frequency control, where the singing voice conversion program 10 based on the fundamental frequency control may be executed by one or more processors, and the specific embodiment of the computer readable storage medium is substantially the same as the above embodiments of the singing voice conversion method based on the fundamental frequency control, which is not described herein.
In the several embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units can be realized in a form of hardware or a form of hardware and a form of software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means through software or hardware. Terms such as first and second are used to denote names and do not denote any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims (10)

1. A singing voice conversion method based on fundamental frequency control, the method comprising:
clipping a first song sample to obtain a key pitch vector, and randomly shifting the key pitch vector to obtain an offset pitch vector;
performing vectorization processing on the linear spectrum of the second song sample to obtain a linear spectrum vector;
obtaining a first spliced vector based on the key pitch vector and the linear spectrum vector, and decoding the first spliced vector to obtain a reconstructed original waveform;
obtaining a second spliced vector based on the offset pitch vector and the linear spectrum vector, and decoding the second spliced vector to obtain an offset waveform;
converting the second song sample into a target song based on the reconstructed original waveform, the offset waveform, and the original waveform of the second song sample.
2. The singing voice conversion method based on fundamental frequency control as recited in claim 1, wherein said clipping the first song sample to obtain a key pitch vector comprises:
extracting an initial pitch vector from the fundamental frequency signal of the first song sample, and clipping non-key areas in the initial pitch vector to obtain the key pitch vector.
3. The singing voice conversion method based on fundamental frequency control as recited in claim 2, wherein before said extracting an initial pitch vector from the fundamental frequency signal of the first song sample, the method further comprises:
performing audio sampling on the first song sample to obtain the fundamental frequency signal of the first song sample.
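Claim 3 leaves the fundamental frequency extraction method open. As one hedged illustration, a probabilistic-YIN extraction with librosa could look like the sketch below; pyin, the pitch range, the file path, and the NaN handling are all choices of this sketch, not requirements of the claim.

```python
# One possible fundamental frequency (F0) extraction; the claim does not
# prescribe an algorithm, so pyin here is an assumption.
import librosa
import numpy as np

y, sr = librosa.load("first_song_sample.wav", sr=None, mono=True)  # hypothetical path
f0, voiced_flag, voiced_prob = librosa.pyin(
    y,
    fmin=librosa.note_to_hz("C2"),  # ~65 Hz, below typical singing pitch
    fmax=librosa.note_to_hz("C6"),  # ~1047 Hz, above typical singing pitch
    sr=sr,
)
f0 = np.nan_to_num(f0)  # pyin marks unvoiced frames as NaN; zero-fill them here
```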
4. The singing voice conversion method based on fundamental frequency control as recited in claim 2, wherein said clipping non-key areas in the initial pitch vector to obtain the key pitch vector comprises:
clipping out the non-key regions of the initial pitch vector in which the pitch change is smaller than a preset threshold, and retaining the key regions in which the pitch change is greater than or equal to the preset threshold, to obtain the key pitch vector.
5. The singing voice conversion method based on fundamental frequency control as recited in claim 1, wherein said performing vectorization processing on the linear spectrum of the second song sample to obtain a linear spectrum vector comprises:
calculating the similarity between the linear spectrum and all codewords in a preset codebook;
selecting, from the similarity results, the codewords whose distance is smaller than a threshold to replace the linear spectrum, so as to obtain the linear spectrum vector.
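Claim 5 describes a vector quantization step: each linear spectrum frame is compared against a preset codebook and replaced by a sufficiently close codeword. A sketch under assumed sizes and a Euclidean distance (neither is fixed by the claim):

```python
# Sketch of the codebook replacement in claim 5; codebook size, frame
# dimension, distance metric, and threshold are all assumptions.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 80))   # 256 codewords of dimension 80 (assumed)
frames = rng.normal(size=(100, 80))     # linear-spectrum frames (assumed shape)

# Pairwise Euclidean distances between every frame and every codeword.
dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
nearest = dists.argmin(axis=1)          # closest codeword per frame
close_enough = dists.min(axis=1) < 14.0 # assumed threshold on the distance

# Replace a frame by its codeword only when the distance is under the threshold.
linear_spectrum_vector = np.where(close_enough[:, None], codebook[nearest], frames)
```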
6. The singing voice conversion method based on fundamental frequency control as recited in claim 1, wherein said decoding the first spliced vector to obtain a reconstructed original waveform comprises:
decoding the first spliced vector, and restoring the decoded first spliced vector into a time domain signal to obtain the reconstructed original waveform.
7. The singing voice conversion method based on fundamental frequency control as recited in claim 1, wherein said decoding the second spliced vector to obtain an offset waveform comprises:
decoding the second spliced vector, and restoring the decoded second spliced vector into a time domain signal to obtain the offset waveform.
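Claims 6 and 7 both end by restoring the decoded vector to a time domain signal. The patent does not disclose the decoder architecture, so as a generic stand-in the sketch below restores a magnitude linear spectrum to a waveform with Griffin-Lim phase recovery; a learned vocoder would play this role in a real system.

```python
# Generic spectrum-to-waveform restoration as a stand-in for the undisclosed
# decoder; Griffin-Lim iteratively recovers phase from a magnitude spectrum.
import librosa
import numpy as np

sr = 22050
tone = librosa.tone(440.0, sr=sr, length=sr)         # one second of A4, as toy input
magnitude = np.abs(librosa.stft(tone))               # toy "linear spectrum"
waveform = librosa.griffinlim(magnitude, n_iter=32)  # back to the time domain
```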
8. A singing voice conversion apparatus based on fundamental frequency control, the apparatus comprising:
a clipping module, configured to clip a first song sample to obtain a key pitch vector, and randomly shift the key pitch vector to obtain an offset pitch vector;
a processing module, configured to perform vectorization processing on the linear spectrum of a second song sample to obtain a linear spectrum vector;
a first splicing module, configured to obtain a first spliced vector based on the key pitch vector and the linear spectrum vector, and decode the first spliced vector to obtain a reconstructed original waveform;
a second splicing module, configured to obtain a second spliced vector based on the offset pitch vector and the linear spectrum vector, and decode the second spliced vector to obtain an offset waveform; and
a conversion module, configured to convert the second song sample into a target song based on the reconstructed original waveform, the offset waveform, and the original waveform of the second song sample.
9. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a fundamental frequency control-based singing voice conversion program executable by the at least one processor, the fundamental frequency control-based singing voice conversion program being executed by the at least one processor to enable the at least one processor to perform the fundamental frequency control-based singing voice conversion method as recited in any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon a fundamental frequency control-based singing voice conversion program executable by one or more processors to implement a fundamental frequency control-based singing voice conversion method as recited in any one of claims 1 to 7.
CN202410107322.XA 2024-01-25 2024-01-25 Singing voice conversion method and device based on fundamental frequency control, electronic equipment and storage medium Pending CN117953837A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410107322.XA CN117953837A (en) 2024-01-25 2024-01-25 Singing voice conversion method and device based on fundamental frequency control, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117953837A 2024-04-30

Family

ID=90793900

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410107322.XA Pending CN117953837A (en) 2024-01-25 2024-01-25 Singing voice conversion method and device based on fundamental frequency control, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117953837A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination