CN112735434A - Voice communication method and system with voiceprint cloning function - Google Patents
- Publication number: CN112735434A (application CN202011432039.2A)
- Authority
- CN
- China
- Prior art keywords
- voice
- converted
- target
- log spectrum
- time
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/02—Speaker identification or verification: preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
- G10L17/04—Speaker identification or verification: training, enrolment or model building
- G10L21/0216—Speech enhancement: noise filtering characterised by the method used for estimating noise
- G10L25/03—Speech or voice analysis techniques characterised by the type of extracted parameters
- G10L25/27—Speech or voice analysis techniques characterised by the analysis technique
- G10L25/63—Speech or voice analysis specially adapted for estimating an emotional state
- G10L2021/02166—Microphone arrays; beamforming
Abstract
The invention discloses a voice call method and system with a voiceprint cloning function. Speech to be converted is picked up and fed into a pre-trained voice conversion module for a specific person, which converts it into target speech whose content, intonation and emotion remain consistent with the source speech; the specific person's voice is then transmitted to the listener's loudspeaker. Advantages: compared with existing voice-changing schemes, the system achieves voiceprint cloning of any specific character designated by the user, so it can imitate and disguise as that character; by converting sound to sound rather than text to speech, the prosody and emotion of the source speech are better preserved, making the speech more vivid and lively; and the voice conversion algorithm together with the design of the computing platform supports real-time calls, better fulfilling the purpose of calling under an imitated, disguised identity.
Description
Technical Field
The invention relates to a voice call method and a voice call system with a voiceprint cloning function, and belongs to the technical field of voice signal processing.
Background
Scenes such as game sound effects, television dubbing and anime-style virtual avatars have a strong demand for personalized voice generation. A real-time voice changer, as an improvement over altering a recording after the fact, can change the voice directly and thus support voice calls. Continuous progress in technologies represented by speech synthesis and voice conversion provides technical support for imitating the speech of specific characters in games, television shows, virtual avatars and the like.
The patent "A real-time voice-changing method based on an intelligent terminal" proposes changing the fundamental frequency and the poles and zeros of the response function. The patent "A method for voice-changed calls over a wireless network based on Android" provides a voice-changing call method on Android devices. The patent "A high-quality real-time voice-changing method based on speech analysis and synthesis" changes the voice by interpolating or clipping the signal, modifying the fundamental frequency and formant positions, and adjusting duration, pitch and timbre. However, these methods can only make the voice thicker or thinner; they cannot generate the voice of a specific target person.
The patent "Live-broadcast microphone" provides a microphone that can equalize the sound, adjust reverberation for more entertaining effects, apply electronic voice changing and so on, and can connect to a live-broadcast platform over a wireless network for real-time streaming; it can also serve as a recording device, storing the captured audio locally and, through a connected mobile device, editing it and uploading it to the cloud. The patent "Design of an end-to-end voice camouflage system based on Bluetooth" disguises and protects the user's call content through a voice-changing mode and a Bluetooth-based end-to-end real-time voice system, realizing real-time voice changing when answering and making calls, and can also simulate different scenes by adding background sound so as to disguise the user's location. These methods have good real-time performance but cannot synthesize the voice of a specific target character.
The patents "Design and realization of real-person voice-changing equipment based on a deep learning algorithm" and "Intelligent voice conversion algorithm based on deep learning" provide an electronic voice-changing module that can convert anyone's voice into the voice of any required target speaker in real time. A new real-time voice-to-voice changing algorithm is constructed by adopting the methods and ideas of a speech recognition front end and a text-to-speech synthesis back end. This new design demonstrates the quality and real-time performance of text-to-speech synthesis, but because the speech is generated through text-to-speech synthesis, the method cannot stably match real human speech in emotion and naturalness.
Disclosure of Invention
The technical problem to be solved by the present invention is to overcome the defects of the prior art, and to provide a voice communication method and system with a voiceprint cloning function.
In order to solve the technical problem, the invention provides a voice call method with a voiceprint clone function, which is characterized in that a voice to be converted is picked up and input into a pre-trained voice conversion module of a specific person, the voice to be converted is converted into a target voice, and the content, tone and emotion of the target voice are kept consistent with the voice to be converted;
transmitting the specific person voice to a speaker of a listener.
Further, the process of picking up the voice to be converted includes:
the speech to be converted is picked up by an array of microphones with narrow directivity.
Further, the process of inputting the voice to be converted into the voice of the specific person into the pre-trained voice conversion module of the specific person includes:
extracting the voice features of the voice to be converted, wherein the voice features comprise fundamental frequency, log spectrum and non-periodic components;
converting the fundamental frequency of the voice to be converted by using a predetermined logarithmic linear function about the voice of the specific person to obtain a converted target fundamental frequency;
copying the non-periodic component of the voice to be converted into a target non-periodic component;
predicting the difference between the log spectrums of the specific person's voice and the voice to be converted by using a long short-term memory (LSTM) model, and determining a target log spectrum;
and integrating the target fundamental frequency, the target aperiodic component and the target log spectrum to generate the target voice.
Further, the log-linear function is:

log F0_t = μ_t + (σ_t / σ_s)(log F0_s − μ_s)  (1)

wherein F0_t is the target fundamental frequency, F0_s is the fundamental frequency of the speech to be converted, μ_s and σ_s are respectively the mean and standard deviation of the log fundamental frequency of the speech to be converted, and μ_t and σ_t are respectively the mean and standard deviation of the log fundamental frequency of the specific person's voice.
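A minimal numpy sketch of this log-linear fundamental-frequency conversion. The function and variable names are illustrative; the statistics μ and σ would be estimated from recordings of the source and target speakers, and unvoiced frames (conventionally coded as F0 = 0) are passed through unchanged — an assumption, since the text does not state how unvoiced frames are handled:

```python
import numpy as np

def convert_f0(f0_src, mu_s, sigma_s, mu_t, sigma_t):
    """Log-linear F0 conversion per formula (1).

    f0_src: per-frame fundamental frequency in Hz (0 marks unvoiced frames).
    mu_*/sigma_*: mean and std of log-F0 for source (s) and target (t) speakers.
    """
    f0_tgt = np.zeros_like(f0_src, dtype=float)
    voiced = f0_src > 0
    log_f0 = np.log(f0_src[voiced])
    # shift/scale in the log domain, then back to Hz
    f0_tgt[voiced] = np.exp(mu_t + (sigma_t / sigma_s) * (log_f0 - mu_s))
    return f0_tgt
```

With identical source and target statistics the conversion is the identity on voiced frames, which is a quick sanity check on the formula.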
Further, the process of predicting the difference between the log spectrums of the specific person's voice and the voice to be converted by using the long short-term memory (LSTM) model and determining the target log spectrum includes:
the difference between the log spectrum of the specific person's voice and the log spectrum of the voice to be converted is expressed as Δ_t = y_t − x_t;
the structure of the LSTM model is shown in formulas (2) to (7):

c̃_t = tanh(W_xc·x_t + W_hc·h_{t-1} + b_c) (2)
i_t = σ(W_xi·x_t + W_hi·h_{t-1} + b_i) (3)
f_t = σ(W_xf·x_t + W_hf·h_{t-1} + W_cf·c_{t-1} + b_f) (4)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t (5)
o_t = σ(W_xo·x_t + W_ho·h_{t-1} + W_co·c_{t-1} + b_o) (6)
h_t = o_t ⊙ tanh(c_t) (7)

wherein y_t is the log spectrum of the t-th frame of the specific person's voice, x_t is the log spectrum of the t-th frame of the speech to be converted, h_t is the hidden unit vector of the LSTM model at time t, o_t denotes the output gate at time t, i_t the input gate at time t, and f_t the forgetting gate at time t; t−1 denotes the previous time; c̃_t is an intermediate variable and c_t is the cell unit vector specific to the LSTM model; W_kl are the respective weights and b_l the respective biases, with subscript k being x, c or h and subscript l being c, i, f or o; σ is an activation function, and ⊙ denotes point-to-point (element-wise) multiplication;
at the start time, h_0 and c_0 are initialized; at time t = 1, the log spectrum x_1 of the 1st frame is input, and the temporary cell unit vector c̃_1 is obtained by formula (2); the input gate vector i_1 and the forgetting gate vector f_1 are obtained by formulas (3) and (4); the cell unit vector c_1 is updated by formula (5); the output gate o_1 is computed by formula (6); finally, the hidden unit vector h_1 output by this layer is computed by formula (7); and so on for any time t until the sequence ends;
the above structure is repeated several times to form a multi-layer LSTM, with the output h_t of one layer serving as the input x_t of the next layer; finally, the output h_t of the last LSTM layer is passed through a fully connected network to output the residual Δ_t, and the predicted residual is superimposed on the input log spectrum x_t to obtain the converted log spectrum.
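The recurrence in formulas (2)–(7) and the residual superposition can be sketched in plain numpy as follows. This is a didactic single-layer, single-sequence forward pass, not the trained multi-layer model; the weight-dictionary keys and function names are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following formulas (2)-(7).

    W/b are dicts keyed as in the text: W['xc'], W['hc'], ..., including the
    peephole weights W['cf'] and W['co']; b['c'], b['i'], b['f'], b['o'].
    """
    c_tilde = np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])                 # (2)
    i_t = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + b['i'])                     # (3)
    f_t = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + W['cf'] @ c_prev + b['f'])  # (4)
    c_t = f_t * c_prev + i_t * c_tilde                                           # (5)
    o_t = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + W['co'] @ c_prev + b['o'])  # (6)
    h_t = o_t * np.tanh(c_t)                                                     # (7)
    return h_t, c_t

def predict_target_logspec(x_seq, h0, c0, W, b, W_fc, b_fc):
    """Run the LSTM over the source log-spectrum sequence and superimpose the
    predicted residual Delta_t on each input frame x_t."""
    h, c = h0, c0
    y_pred = []
    for x_t in x_seq:
        h, c = lstm_step(x_t, h, c, W, b)
        delta_t = W_fc @ h + b_fc      # fully connected output layer -> residual
        y_pred.append(x_t + delta_t)   # converted log spectrum
    return np.stack(y_pred)
```

With all weights zero the predicted residual is zero and the output equals the input, which exercises the residual structure without needing trained parameters.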
A voice call system having a voiceprint cloning function, comprising:
the picking module is used for picking up the voice to be converted and inputting the voice to the pre-trained voice conversion module of the specific person;
the processing module is used for converting the voice to be converted into target voice, and the content, tone and emotion of the target voice are kept consistent with the voice to be converted;
and the transmission module is used for transmitting the voice of the specific person to a loudspeaker of a receiver.
Further, the pick-up module includes a microphone array module for picking up the voice to be converted by a microphone array having a narrow directivity.
Further, the processing module comprises:
the extraction module is used for extracting the voice characteristics of the voice to be converted, wherein the voice characteristics comprise fundamental frequency, log spectrum and non-periodic components;
the base frequency conversion module is used for converting the base frequency of the voice to be converted by utilizing a predetermined logarithmic linear function about the voice of the specific person to obtain a converted target base frequency;
the aperiodic component conversion module is used for copying the aperiodic component of the voice to be converted into a target aperiodic component;
the log spectrum conversion module is used for predicting the difference between the log spectrums of the specific person's voice and the voice to be converted by using the long short-term memory (LSTM) model and determining a target log spectrum;
and the synthesis module is used for integrating the target fundamental frequency, the target aperiodic component and the target log spectrum to generate the target voice.
Further, the fundamental frequency conversion module comprises a function determination module for determining the log-linear function as:

log F0_t = μ_t + (σ_t / σ_s)(log F0_s − μ_s)  (1)

wherein F0_t is the target fundamental frequency, F0_s is the fundamental frequency of the speech to be converted, μ_s and σ_s are respectively the mean and standard deviation of the log fundamental frequency of the speech to be converted, and μ_t and σ_t are respectively the mean and standard deviation of the log fundamental frequency of the specific person's voice.
Further, the log spectrum conversion module comprises a processing module,
for expressing the difference between the log spectrums of the specific person's voice and the voice to be converted as Δ_t = y_t − x_t;
and for adopting the structure of the long short-term memory (LSTM) model as shown in formulas (2) to (7):

c̃_t = tanh(W_xc·x_t + W_hc·h_{t-1} + b_c) (2)
i_t = σ(W_xi·x_t + W_hi·h_{t-1} + b_i) (3)
f_t = σ(W_xf·x_t + W_hf·h_{t-1} + W_cf·c_{t-1} + b_f) (4)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t (5)
o_t = σ(W_xo·x_t + W_ho·h_{t-1} + W_co·c_{t-1} + b_o) (6)
h_t = o_t ⊙ tanh(c_t) (7)

wherein y_t is the log spectrum of the t-th frame of the specific person's voice, x_t is the log spectrum of the t-th frame of the speech to be converted, h_t is the hidden unit vector of the LSTM model at time t, o_t denotes the output gate at time t, i_t the input gate at time t, and f_t the forgetting gate at time t; t−1 denotes the previous time; c̃_t is an intermediate variable and c_t is the cell unit vector specific to the LSTM model; W_kl are the respective weights and b_l the respective biases, with subscript k being x, c or h and subscript l being c, i, f or o; σ is an activation function, and ⊙ denotes point-to-point (element-wise) multiplication;
the structure of the LSTM model is used as follows: at the start time, h_0 and c_0 are initialized; at time t = 1, the log spectrum x_1 of the 1st frame is input, and the temporary cell unit vector c̃_1 is obtained by formula (2); the input gate vector i_1 and the forgetting gate vector f_1 are obtained by formulas (3) and (4); the cell unit vector c_1 is updated by formula (5); the output gate o_1 is computed by formula (6); finally, the hidden unit vector h_1 output by this layer is computed by formula (7); and so on for any time t until the sequence ends;
the above structure is repeated several times to form a multi-layer LSTM, with the output h_t of one layer serving as the input x_t of the next layer; finally, the output h_t of the last LSTM layer is passed through a fully connected network to output the residual Δ_t, and the predicted residual is superimposed on the input log spectrum x_t to obtain the converted log spectrum.
The invention achieves the following beneficial effects:
Compared with existing voice-changing schemes, the system achieves voiceprint cloning of any specific character designated by the user, so it can imitate and disguise as that character; by converting sound to sound rather than text to speech, the prosody and emotion of the source speech are better preserved, making the speech more vivid and lively; and the voice conversion algorithm together with the design of the computing platform supports real-time calls, better fulfilling the purpose of calling under an imitated, disguised identity.
Drawings
FIG. 1 is a schematic diagram of a voice call system for voiceprint cloning via voice conversion according to the present invention;
FIG. 2 is a general schematic diagram of a voice conversion scheme employed by the present invention;
FIG. 3 is a diagram illustrating log spectrum training and conversion in the speech conversion scheme employed in the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the embodiments described below are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present invention provides a voice communication method with voiceprint cloning function, as shown in fig. 1-3.
(1) Picking up the voice of user A through a microphone array with narrow directivity. The microphone array combines a group of microphone units within a small area according to a certain spatial distribution, and achieves strong directivity through beamforming, improving the array's ability to pick up a clean signal in a noisy environment. The beamforming method forms a narrow cone-shaped beam that suppresses noise and interference in the environment and accepts only sound from the direction of speaker A (the sound source), achieving speech enhancement; this narrow directivity implements spatial filtering, so the cleaner picked-up voice of user A serves as the input data of the voice conversion module.
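As one concrete (hypothetical) realization of the beamforming described above, a delay-and-sum beamformer time-aligns the microphone signals toward the look direction and averages them; the patent does not name a specific beamforming algorithm, so this is an illustrative sketch with fractional delays applied in the frequency domain:

```python
import numpy as np

def delay_and_sum(signals, mic_positions, look_dir, fs, c0=343.0):
    """Delay-and-sum beamformer.

    signals: (n_mics, n_samples) array; mic_positions: (n_mics, 3) in metres;
    look_dir: unit vector pointing toward the desired source; fs: sample rate.
    A mic closer to the source (larger p @ look_dir) hears the wave earlier,
    so its signal is delayed by (p @ look_dir) / c0 to align it with the origin.
    """
    n = signals.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    delays = mic_positions @ look_dir / c0                          # seconds per mic
    spec = np.fft.rfft(signals, axis=1)
    steer = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])  # apply the delay
    aligned = np.fft.irfft(spec * steer, n=n, axis=1)
    return aligned.mean(axis=0)
```

Sounds arriving from other directions remain misaligned after steering and partially cancel in the average, which is the noise- and interference-suppression effect described above.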
(2) Changing voice by using a trained voice conversion module of a specific person, changing the voice of a user A into the voice of a user B, and simultaneously keeping the content, tone and emotion of the voice unchanged; the conversion is realized according to the following steps:
1) extracting speech features from speech picked up by a microphone array by using a vocoder, wherein the speech features comprise three parts, namely fundamental frequency, logarithmic spectrum and non-periodic components, as shown in FIG. 2;
2) the fundamental frequency F0_t of the target character's voice is obtained by the log-linear conversion

log F0_t = μ_t + (σ_t / σ_s)(log F0_s − μ_s)  (1)

wherein F0_s is the fundamental frequency of the source speech, μ_s and σ_s are respectively the mean and standard deviation of the log fundamental frequency of the source speech, and μ_t and σ_t are respectively the mean and standard deviation of the log fundamental frequency of the target voice;
3) the aperiodic component of the source voice is directly copied into the aperiodic component of the target character voice;
4) the log spectrum is a sequence of vectors. Let {x_t, t = 1, ..., T} denote the log spectrum sequence of the source speech and {y_t, t = 1, ..., T} the log spectrum sequence of the target character's voice. To reduce the difficulty of prediction, and considering that x_t and y_t carry the same speech content, the difference Δ_t = y_t − x_t is predicted instead. The conversion is realized by a Long Short-Term Memory model (LSTM); the LSTM has a recurrent structure and memory cells, giving it the capacity to extract long-term temporal information. The basic structure of the LSTM is shown in equations (2) to (7):

c̃_t = tanh(W_xc·x_t + W_hc·h_{t-1} + b_c) (2)
i_t = σ(W_xi·x_t + W_hi·h_{t-1} + b_i) (3)
f_t = σ(W_xf·x_t + W_hf·h_{t-1} + W_cf·c_{t-1} + b_f) (4)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t (5)
o_t = σ(W_xo·x_t + W_ho·h_{t-1} + W_co·c_{t-1} + b_o) (6)
h_t = o_t ⊙ tanh(c_t) (7)

wherein x_t is the log spectrum of the t-th frame, h_t is the hidden unit vector at time t, c̃_t is an intermediate variable, and c_t is the cell unit vector specific to the LSTM; W_xc is the weight connecting the input and the cell unit, W_hc the weight connecting the hidden unit and the cell unit, W_xi the weight connecting the input and the input gate, W_hi the weight connecting the hidden unit and the input gate, W_xf the weight connecting the input and the forgetting gate, W_hf the weight connecting the hidden unit and the forgetting gate, W_cf the weight connecting the cell unit and the forgetting gate, W_xo the weight connecting the input and the output gate, W_ho the weight connecting the hidden unit and the output gate, and W_co the weight connecting the cell unit and the output gate; b_c is the bias of the cell network, b_i the bias of the input gate network, b_f the bias of the forgetting gate network, and b_o the bias of the output gate network; i, f and o are respectively the input gate, the forgetting gate and the output gate; σ is an activation function, generally the Sigmoid function; and ⊙ is point-to-point (element-wise) multiplication. As the formulas show, the LSTM maps the input vector sequence x_t through the cell unit vectors c_t and the hidden unit vectors h_t. The above structure can be repeated several times to form a multi-layer LSTM, with the output h_t of one layer serving as the input x_t of the next layer. Finally, the output h_t of the last LSTM layer is passed through a fully connected network to output the residual Δ_t, and the predicted residual is superimposed on the input log spectrum x_t to obtain the converted log spectrum;
5) the converted logarithmic spectrum, the base frequency after the logarithmic linear conversion and the non-periodic component obtained by copying are sent into a vocoder to generate final converted voice;
the voice change is implemented by means of a trained voice conversion module of a specific person, and the voice of the user A is changed into the voice of the user B through the conversion and synthesis of voice spectrum parameters; the voiceprint of any specific character designated by the user can be cloned, so that the function of imitating and disguising the specific character is realized; through the conversion from sound to sound, the rhythm and emotion of the source speech can be better reserved than those from text to speech, so that the speech is more vivid and lively.
The converted voice is output and transmitted over the network to the loudspeaker of the remote party, so that the remote party feels they are talking with the imitated speaker. The module writes the voice-changed data into the sound card designated by the voice call software; the software collects audio data from the sound card, i.e. converts the voice samples into a digital signal, encodes it and sends it, and upon receiving an encoded frame the remote side decodes it to recover data that the sound card can play directly.
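The encode/transmit/decode loop described above can be illustrated with G.711-style μ-law companding, a common telephony codec — chosen here purely as an example, since the patent does not name a codec, and the helper names are mine:

```python
import numpy as np

MU = 255.0  # mu-law compression parameter (8-bit telephony)

def mulaw_encode(x):
    """Compand float samples in [-1, 1] to 8-bit mu-law codes."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return np.round((y + 1) / 2 * MU).astype(np.uint8)

def mulaw_decode(codes):
    """Expand 8-bit mu-law codes back to float samples in [-1, 1]."""
    y = codes.astype(float) / MU * 2 - 1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU
```

The round trip is lossy but perceptually close: the logarithmic companding spends the 8 bits where the ear is most sensitive, which is why codecs of this family are standard for narrowband voice calls.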
In step (1), the voice of user A is picked up through a microphone array with narrow directivity. In this layout, the phase difference Δφ of signals synchronously collected by a pair of microphones is measured, and from the signal frequency f, the sound propagation speed c_0 and the spacing of the pair of microphones, the position of speaker A is obtained; once speaker A is located, the microphone array can steer its beam toward A, and this strong intelligent directional function significantly reduces the influence of ambient noise and echo. The voice data picked up by the microphone array serves as the input of the subsequent voice conversion module.
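Under a far-field assumption, the phase difference Δφ measured between a microphone pair with spacing d at frequency f relates to the arrival angle θ (measured from broadside) by Δφ = 2πfd·sinθ/c_0, which inverts to the sketch below (the function name is illustrative; the patent does not spell out the localization formula):

```python
import numpy as np

def doa_from_phase(delta_phi, f, d, c0=343.0):
    """Far-field direction of arrival (radians from broadside) for a mic pair.

    delta_phi: measured phase difference at frequency f (Hz);
    d: microphone spacing in metres; c0: sound propagation speed in m/s.
    """
    s = c0 * delta_phi / (2 * np.pi * f * d)
    return np.arcsin(np.clip(s, -1.0, 1.0))  # clip guards measurement noise
```

Note that the estimate is unambiguous only while fd·sinθ/c_0 stays below half a wavelength (|Δφ| < π); closer spacings tolerate higher frequencies before spatial aliasing sets in.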
In step (2), a voice conversion module for a specific person implements the voice change, turning the voice of user A into the voice of user B while keeping the content, intonation and emotion of the speech unchanged. For each frame of the source speech, a World vocoder extracts a 1-dimensional fundamental frequency feature, a 129-dimensional log spectrum feature and a 129-dimensional aperiodic component; the fundamental frequency is converted with formula (1); a 3-layer LSTM network with 100 hidden units per layer converts the log spectrum; the 129-dimensional aperiodic component is copied directly; and the three parts are fed into the World vocoder to output the voice waveform. On an Intel i7 CPU with 8 GB of memory, the conversion can be completed in real time.
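The per-stream division of labour in step (2) — map the fundamental frequency, predict the log spectrum, copy the aperiodicity — can be sketched at the level of World-style feature arrays. The converter callables below stand in for the trained log-linear function and the 3-layer, 100-unit LSTM; actual feature extraction and waveform synthesis would be done by a vocoder such as World and are not reproduced here:

```python
import numpy as np

F0_DIM, SP_DIM, AP_DIM = 1, 129, 129  # per-frame dimensions stated in the text

def convert_features(feats, f0_map, logspec_map):
    """Apply the three per-stream conversion rules to a frame-sequence dict.

    feats: {'f0': (T,), 'logspec': (T, SP_DIM), 'ap': (T, AP_DIM)} arrays.
    f0_map / logspec_map: hypothetical hooks for the trained converters.
    """
    return {
        'f0': f0_map(feats['f0']),                 # log-linear F0 conversion
        'logspec': logspec_map(feats['logspec']),  # LSTM residual prediction
        'ap': feats['ap'].copy(),                  # aperiodicity copied directly
    }
```

Keeping the three streams separate mirrors the design above: only the log spectrum needs a learned model, which is what makes frame-by-frame real-time operation plausible on a CPU.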
Step (3): the converted voice is output and transmitted over the network to the loudspeaker of the remote party, so that the remote party feels they are talking with the imitated speaker; a full-duplex sound card separates the input voice from the converted voice.
The voice-changed data is written into the sound card designated by the voice call software; the software collects audio data from the sound card, i.e. converts the voice samples into a digital signal, encodes it and sends it, and upon receiving an encoded frame the remote side decodes it to recover data that the sound card can play directly.
Correspondingly, the present application also provides a voice communication system with voiceprint cloning function, comprising:
the picking module is used for picking up the voice to be converted and inputting the voice to the pre-trained voice conversion module of the specific person;
the processing module is used for converting the voice to be converted into target voice, and the content, tone and emotion of the target voice are kept consistent with the voice to be converted;
and the transmission module is used for transmitting the voice of the specific person to a loudspeaker of a receiver.
The pick-up module comprises a microphone array module for picking up the voice to be converted by a microphone array with narrow directivity.
The processing module comprises:
the extraction module is used for extracting the voice characteristics of the voice to be converted, wherein the voice characteristics comprise fundamental frequency, log spectrum and non-periodic components;
the base frequency conversion module is used for converting the base frequency of the voice to be converted by utilizing a predetermined logarithmic linear function about the voice of the specific person to obtain a converted target base frequency;
the aperiodic component conversion module is used for copying the aperiodic component of the voice to be converted into a target aperiodic component;
the log spectrum conversion module is used for predicting the log spectrum difference of the specific human voice and the voice to be converted by utilizing the long-time memory model and determining a target log spectrum;
and the synthesis module is used for integrating the target fundamental frequency, the target aperiodic component and the target log spectrum to generate the target voice.
The fundamental frequency conversion module comprises a function determination module for determining a log-linear function as:
log F0_t = μ_t + (σ_t/σ_s)·(log F0_s − μ_s) (1)

wherein F0_t is the target fundamental frequency, F0_s is the fundamental frequency of the voice to be converted, μ_s and σ_s are respectively the mean and standard deviation of the log fundamental frequency of the voice to be converted, and μ_t and σ_t are respectively the mean and standard deviation of the log fundamental frequency of the specific person's voice.
The log spectrum conversion module comprises a log spectrum processing module,
for representing the difference between the log spectrums of the specific person's voice and the voice to be converted as Δ_t = y_t − x_t;
it is also used for adopting the structure of the long-short-term memory model shown in formulas (2) to (7):

c̃_t = tanh(W_xc·x_t + W_hc·h_{t-1} + b_c) (2)
i_t = σ(W_xi·x_t + W_hi·h_{t-1} + b_i) (3)
f_t = σ(W_xf·x_t + W_hf·h_{t-1} + W_cf·c_{t-1} + b_f) (4)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t (5)
o_t = σ(W_xo·x_t + W_ho·h_{t-1} + W_co·c_{t-1} + b_o) (6)
h_t = o_t ⊙ tanh(c_t) (7)

wherein y_t is the log spectrum of the t-th frame of the specific person's voice, x_t is the log spectrum of the t-th frame of the voice to be converted, h_t is the hidden unit vector of the long-short-term memory model at time t, o_t denotes the output gate at time t, i_t the input gate at time t, f_t the forgetting gate at time t, t−1 denotes the previous time, c_t is the cell unit vector specific to the long-short-term memory model and c̃_t the intermediate (temporary) cell vector, W_kl are the respective weights and b_l the respective biases (subscript k is x, c or h; subscript l is c, i, f or o), σ is an activation function, and ⊙ is point-wise (element-wise) multiplication;
the structure of the long-short-term memory model is used as follows:
at the start time, h_0 and c_0 are initialized; at time t = 1, the log spectrum x_1 of the 1st frame is input and the temporary cell unit vector c̃_1 is obtained by formula (2); the input gate vector i_1 and the forgetting gate vector f_1 are obtained by formulas (3) and (4); the cell unit vector c_1 is updated by formula (5); the output gate o_1 is computed by formula (6); finally, the hidden unit vector h_1 output by this layer is obtained by formula (7); and so on for any time t until the sequence ends;
the above structure is repeated several times to form a multi-layer LSTM, the output h_t of each layer serving as the input x_t of the next layer; finally, the output h_t of the last LSTM layer is passed through a fully-connected network to output the residual Δ_t, and the predicted residual is superimposed on the input log spectrum x_t to obtain the converted log spectrum.
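The recursion of formulas (2)–(7) can be sketched as a single-layer forward pass. The weights are randomly initialized stand-ins for a trained model, the peephole terms of formulas (4) and (6) use the previous cell vector c_{t-1} as written above, and all names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(xs, W, b):
    """Run formulas (2)-(7) over a sequence of frame vectors xs.
    W maps two-letter keys kl (k in {x, h, c}, l in {c, i, f, o}) to
    weight matrices; b maps l to bias vectors. Returns all h_t."""
    hidden = len(b["c"])
    h = np.zeros(hidden)  # h_0
    c = np.zeros(hidden)  # c_0
    hs = []
    for x in xs:
        c_prev = c                                                          # c_{t-1}
        c_tilde = np.tanh(W["xc"] @ x + W["hc"] @ h + b["c"])               # (2)
        i = sigmoid(W["xi"] @ x + W["hi"] @ h + b["i"])                     # (3)
        f = sigmoid(W["xf"] @ x + W["hf"] @ h + W["cf"] @ c_prev + b["f"])  # (4)
        c = f * c_prev + i * c_tilde                                        # (5)
        o = sigmoid(W["xo"] @ x + W["ho"] @ h + W["co"] @ c_prev + b["o"])  # (6)
        h = o * np.tanh(c)                                                  # (7)
        hs.append(h)
    return np.stack(hs)

# Toy usage: 5 frames of a 129-dim log spectrum, 100 hidden units
rng = np.random.default_rng(0)
dim, hidden = 129, 100
keys = ["xc", "xi", "xf", "xo", "hc", "hi", "hf", "ho", "cf", "co"]
W = {k: 0.01 * rng.standard_normal((hidden, dim if k[0] == "x" else hidden))
     for k in keys}
b = {l: np.zeros(hidden) for l in "cifo"}
hs = lstm_forward(rng.standard_normal((5, dim)), W, b)
```

Stacking three such layers and feeding the last layer's h_t through a fully-connected layer would yield the residual Δ_t described above.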
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A voice communication method with a voiceprint clone function is characterized in that,
picking up a voice to be converted, inputting the voice to be converted into a pre-trained voice conversion module of a specific person, and converting the voice to be converted into a target voice, wherein the content, tone and emotion of the target voice are consistent with the voice to be converted;
transmitting the specific person voice to a speaker of a listener.
2. The voice call method with voiceprint cloning function according to claim 1, wherein the process of picking up the voice to be converted comprises:
the speech to be converted is picked up by an array of microphones with narrow directivity.
3. The voice call method with voiceprint cloning function according to claim 1,
the process of inputting the voice to be converted into the pre-trained specific-person voice conversion module comprises the following steps:
extracting the voice features of the voice to be converted, wherein the voice features comprise fundamental frequency, log spectrum and non-periodic components;
converting the fundamental frequency of the voice to be converted by using a predetermined logarithmic linear function about the voice of the specific person to obtain a converted target fundamental frequency;
copying the non-periodic component of the voice to be converted into a target non-periodic component;
predicting the difference of the log spectrums of the specific human voice and the voice to be converted by utilizing a long-time and short-time memory model, and determining a target log spectrum;
and integrating the target fundamental frequency, the target aperiodic component and the target log spectrum to generate the target voice.
4. The voice call method with voiceprint cloning function according to claim 3,
the log-linear function is:
log F0_t = μ_t + (σ_t/σ_s)·(log F0_s − μ_s) (1)

wherein F0_t is the target fundamental frequency, F0_s is the fundamental frequency of the voice to be converted, μ_s and σ_s are respectively the mean and standard deviation of the log fundamental frequency of the voice to be converted, and μ_t and σ_t are respectively the mean and standard deviation of the log fundamental frequency of the specific person's voice.
5. The voice call method with voiceprint cloning function according to claim 3,
the process of predicting the difference of the log spectrums of the specific human voice and the voice to be converted by utilizing the long-time and short-time memory model and determining the target log spectrums comprises the following steps:
the difference between the log spectrums of the specific person's voice and the voice to be converted is expressed as Δ_t = y_t − x_t;
The structure of the long-short-term memory model is shown in formulas (2) to (7):

c̃_t = tanh(W_xc·x_t + W_hc·h_{t-1} + b_c) (2)
i_t = σ(W_xi·x_t + W_hi·h_{t-1} + b_i) (3)
f_t = σ(W_xf·x_t + W_hf·h_{t-1} + W_cf·c_{t-1} + b_f) (4)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t (5)
o_t = σ(W_xo·x_t + W_ho·h_{t-1} + W_co·c_{t-1} + b_o) (6)
h_t = o_t ⊙ tanh(c_t) (7)

wherein y_t is the log spectrum of the t-th frame of the specific person's voice, x_t is the log spectrum of the t-th frame of the voice to be converted, h_t is the hidden unit vector of the long-short-term memory model at time t, o_t denotes the output gate at time t, i_t the input gate at time t, f_t the forgetting gate at time t, t−1 denotes the previous time, c_t is the cell unit vector specific to the long-short-term memory model and c̃_t the intermediate (temporary) cell vector, W_kl are the respective weights and b_l the respective biases (subscript k is x, c or h; subscript l is c, i, f or o), σ is an activation function, and ⊙ is point-wise (element-wise) multiplication;
at the start time, h_0 and c_0 are initialized; at time t = 1, the log spectrum x_1 of the 1st frame is input and the temporary cell unit vector c̃_1 is obtained by formula (2); the input gate vector i_1 and the forgetting gate vector f_1 are obtained by formulas (3) and (4); the cell unit vector c_1 is updated by formula (5); the output gate o_1 is computed by formula (6); finally, the hidden unit vector h_1 output by this layer is obtained by formula (7); and so on for any time t until the sequence ends;
the above structure is repeated several times to form a multi-layer LSTM, the output h_t of each layer serving as the input x_t of the next layer; finally, the output h_t of the last LSTM layer is passed through a fully-connected network to output the residual Δ_t, and the predicted residual is superimposed on the input log spectrum x_t to obtain the converted log spectrum.
6. A voice call system having a voiceprint cloning function, comprising:
the picking module is used for picking up the voice to be converted and inputting the voice to the pre-trained voice conversion module of the specific person;
the processing module is used for converting the voice to be converted into target voice, and the content, tone and emotion of the target voice are kept consistent with the voice to be converted;
and the transmission module is used for transmitting the voice of the specific person to a loudspeaker of a receiver.
7. The system according to claim 6, wherein the picking-up module comprises a microphone array module for picking up the voice to be converted by a microphone array having a narrow directivity.
8. The voice call system with voiceprint cloning capability of claim 6,
the processing module comprises:
the extraction module is used for extracting the voice characteristics of the voice to be converted, wherein the voice characteristics comprise fundamental frequency, log spectrum and non-periodic components;
the base frequency conversion module is used for converting the base frequency of the voice to be converted by utilizing a predetermined logarithmic linear function about the voice of the specific person to obtain a converted target base frequency;
the aperiodic component conversion module is used for copying the aperiodic component of the voice to be converted into a target aperiodic component;
the log spectrum conversion module is used for predicting the log spectrum difference of the specific human voice and the voice to be converted by utilizing the long-time memory model and determining a target log spectrum;
and the synthesis module is used for integrating the target fundamental frequency, the target aperiodic component and the target log spectrum to generate the target voice.
9. The voice call system with voiceprint cloning capability of claim 8,
the fundamental frequency conversion module comprises a function determination module for determining a log-linear function as:
log F0_t = μ_t + (σ_t/σ_s)·(log F0_s − μ_s) (1)

wherein F0_t is the target fundamental frequency, F0_s is the fundamental frequency of the voice to be converted, μ_s and σ_s are respectively the mean and standard deviation of the log fundamental frequency of the voice to be converted, and μ_t and σ_t are respectively the mean and standard deviation of the log fundamental frequency of the specific person's voice.
10. The voice call system with voiceprint cloning capability of claim 8, wherein the log spectrum conversion module comprises a log spectrum processing module,
for representing the difference between the log spectrums of the specific person's voice and the voice to be converted as Δ_t = y_t − x_t;
it is also used for adopting the structure of the long-short-term memory model shown in formulas (2) to (7):

c̃_t = tanh(W_xc·x_t + W_hc·h_{t-1} + b_c) (2)
i_t = σ(W_xi·x_t + W_hi·h_{t-1} + b_i) (3)
f_t = σ(W_xf·x_t + W_hf·h_{t-1} + W_cf·c_{t-1} + b_f) (4)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t (5)
o_t = σ(W_xo·x_t + W_ho·h_{t-1} + W_co·c_{t-1} + b_o) (6)
h_t = o_t ⊙ tanh(c_t) (7)

wherein y_t is the log spectrum of the t-th frame of the specific person's voice, x_t is the log spectrum of the t-th frame of the voice to be converted, h_t is the hidden unit vector of the long-short-term memory model at time t, o_t denotes the output gate at time t, i_t the input gate at time t, f_t the forgetting gate at time t, t−1 denotes the previous time, c_t is the cell unit vector specific to the long-short-term memory model and c̃_t the intermediate (temporary) cell vector, W_kl are the respective weights and b_l the respective biases (subscript k is x, c or h; subscript l is c, i, f or o), σ is an activation function, and ⊙ is point-wise (element-wise) multiplication;
the structure of the long-short-term memory model is used as follows:
at the start time, h_0 and c_0 are initialized; at time t = 1, the log spectrum x_1 of the 1st frame is input and the temporary cell unit vector c̃_1 is obtained by formula (2); the input gate vector i_1 and the forgetting gate vector f_1 are obtained by formulas (3) and (4); the cell unit vector c_1 is updated by formula (5); the output gate o_1 is computed by formula (6); finally, the hidden unit vector h_1 output by this layer is obtained by formula (7); and so on for any time t until the sequence ends;
the above structure is repeated several times to form a multi-layer LSTM, the output h_t of each layer serving as the input x_t of the next layer; finally, the output h_t of the last LSTM layer is passed through a fully-connected network to output the residual Δ_t, and the predicted residual is superimposed on the input log spectrum x_t to obtain the converted log spectrum.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011432039.2A CN112735434A (en) | 2020-12-09 | 2020-12-09 | Voice communication method and system with voiceprint cloning function |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112735434A true CN112735434A (en) | 2021-04-30 |
Family
ID=75598732
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011432039.2A Pending CN112735434A (en) | 2020-12-09 | 2020-12-09 | Voice communication method and system with voiceprint cloning function |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112735434A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115497480A (en) * | 2021-06-18 | 2022-12-20 | 海信集团控股股份有限公司 | Sound repeated engraving method, device, equipment and medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102306492A (en) * | 2011-09-09 | 2012-01-04 | 中国人民解放军理工大学 | Voice conversion method based on convolutive nonnegative matrix factorization |
CN102982809A (en) * | 2012-12-11 | 2013-03-20 | 中国科学技术大学 | Conversion method for sound of speaker |
CN106782602A (en) * | 2016-12-01 | 2017-05-31 | 南京邮电大学 | Speech-emotion recognition method based on length time memory network and convolutional neural networks |
CN109767778A (en) * | 2018-12-27 | 2019-05-17 | 中国人民解放军陆军工程大学 | Bi-LSTM and WaveNet fused voice conversion method |
US20190251952A1 (en) * | 2018-02-09 | 2019-08-15 | Baidu Usa Llc | Systems and methods for neural voice cloning with a few samples |
CN111223474A (en) * | 2020-01-15 | 2020-06-02 | 武汉水象电子科技有限公司 | Voice cloning method and system based on multi-neural network |
Non-Patent Citations (5)
Title |
---|
Yao Tianren, "Digital Speech Processing", Huazhong University of Science and Technology Press, 30 April 1992, pages 232-233 |
Sang Shengju et al., "Digital Entertainment Technology and CAD", China Railway Publishing House, 31 August 2009, pages 286-287 |
Miao Xiaokong et al., "Speech deep forgery based on parameter conversion and its threat assessment to voiceprint authentication", Journal of Cyber Security, vol. 5, no. 6, pages 53-56 |
Jiang Gang et al., "Industrial Robots", Southwest Jiaotong University Press, 31 January 2011, page 148 |
Wei Xu et al., "Speech enhancement algorithm based on beamforming and multi-reference-source noise cancellation", Computer and Modernization, no. 196, 31 December 2011, page 46 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Szöke et al. | Building and evaluation of a real room impulse response dataset | |
JP7258182B2 (en) | Speech processing method, device, electronic device and computer program | |
CN110491404B (en) | Voice processing method, device, terminal equipment and storage medium | |
CN111833896B (en) | Voice enhancement method, system, device and storage medium for fusing feedback signals | |
CN110085245B (en) | Voice definition enhancing method based on acoustic feature conversion | |
US20180358003A1 (en) | Methods and apparatus for improving speech communication and speech interface quality using neural networks | |
CN104157293B (en) | The signal processing method of targeted voice signal pickup in a kind of enhancing acoustic environment | |
CN108877823B (en) | Speech enhancement method and device | |
CN111341303B (en) | Training method and device of acoustic model, and voice recognition method and device | |
CN109887489B (en) | Speech dereverberation method based on depth features for generating countermeasure network | |
CN112382301B (en) | Noise-containing voice gender identification method and system based on lightweight neural network | |
CN113241085B (en) | Echo cancellation method, device, equipment and readable storage medium | |
CN113823273B (en) | Audio signal processing method, device, electronic equipment and storage medium | |
CN111627455A (en) | Audio data noise reduction method and device and computer readable storage medium | |
CN114267372A (en) | Voice noise reduction method, system, electronic device and storage medium | |
CN112735434A (en) | Voice communication method and system with voiceprint cloning function | |
CN111353258A (en) | Echo suppression method based on coding and decoding neural network, audio device and equipment | |
CN113763978B (en) | Voice signal processing method, device, electronic equipment and storage medium | |
CN115705839A (en) | Voice playing method and device, computer equipment and storage medium | |
CN114120965A (en) | Audio processing method, electronic device, and storage medium | |
CN113990337A (en) | Audio optimization method and related device, electronic equipment and storage medium | |
CN111696566A (en) | Voice processing method, apparatus and medium | |
CN112720527B (en) | Music dance self-programming robot | |
CN115762552B (en) | Method for training echo cancellation model, echo cancellation method and corresponding device | |
Huemmer et al. | Online environmental adaptation of CNN-based acoustic models using spatial diffuseness features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||