CN112735434A - Voice communication method and system with voiceprint cloning function - Google Patents

Voice communication method and system with voiceprint cloning function

Info

Publication number
CN112735434A
Authority
CN
China
Prior art keywords
voice
converted
target
log spectrum
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011432039.2A
Other languages
Chinese (zh)
Inventor
孙蒙
贾冲
张雄伟
邹霞
李莉
康凯
曹铁勇
杨吉斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Army Engineering University of PLA
Original Assignee
Army Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Army Engineering University of PLA filed Critical Army Engineering University of PLA
Priority to CN202011432039.2A
Publication of CN112735434A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/04: Training, enrolment or model building
    • G10L 21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208: Noise filtering
    • G10L 21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L 2021/02161: Number of inputs available containing the signal or the noise to be suppressed
    • G10L 2021/02166: Microphone arrays; Beamforming
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state


Abstract

The invention discloses a voice call method and system with a voiceprint cloning function. Speech to be converted is picked up and fed into a pre-trained voice conversion module for a specific person, which converts it into the target voice while keeping the content, intonation and emotion of the target voice consistent with the speech to be converted; the specific person's voice is then transmitted to the listener's loudspeaker. Advantages: compared with existing voice-changing schemes, the system achieves voiceprint cloning of any specific character designated by the user, and can thus imitate and impersonate that character; because the conversion is sound-to-sound, the prosody and emotion of the source speech are preserved better than in text-to-speech synthesis, making the speech more vivid and lively; and the voice conversion algorithm together with the computing-platform design supports real-time calls, better fulfilling the purpose of conversing under an imitated, disguised identity.

Description

Voice communication method and system with voiceprint cloning function
Technical Field
The invention relates to a voice call method and a voice call system with a voiceprint cloning function, and belongs to the technical field of voice signal processing.
Background
Scenarios such as game sound effects, film and television dubbing, and two-dimensional (anime-style) virtual avatars have a strong demand for personalized voice generation. A real-time voice changer, as an improvement over changing a voice only after recording, can transform the voice directly during a call. Continuous progress in technologies such as speech synthesis and voice conversion provides technical support for imitating the voice of a specific character in games, film and television, virtual avatars and the like.
The patent "A real-time voice changing method based on an intelligent terminal" proposes changing the fundamental frequency and the poles and zeros of the response function. The patent "A method for voice-changed calls over a wireless network based on Android" provides a voice-changing call method on Android devices. The patent "A high-quality real-time voice changing method based on speech analysis and synthesis" changes the voice by interpolating or cutting the signal, modifying the fundamental frequency and formant positions, and adjusting duration, pitch and timbre. However, these methods can only make a voice sound deeper or thinner; they cannot generate the voice of a specific target person.
The patent "Live broadcasting microphone" provides a microphone that can equalize the sound, adjust reverberation for entertaining sound effects, apply electronic voice changing and so on, and can connect to a live-streaming platform over a wireless network for real-time broadcasting; it can also serve as a recording device, storing the captured audio locally, and editing and uploading it to the cloud through a connected mobile device. The patent "Design of an end-to-end voice camouflage system based on Bluetooth" disguises and protects the user's call content through voice changing in an end-to-end real-time Bluetooth voice system, realizing real-time voice changing while answering and making calls, and can also simulate different scenes by adding background sound so as to disguise the user's location. These methods have good real-time performance but cannot synthesize the voice of a specific target character.
The patent "Design and implementation of real-person voice-changing equipment based on a deep-learning intelligent voice conversion algorithm" provides an electronic voice-changing module that can convert anyone's voice into that of any required target speaker in real time. It builds a new real-time voice-to-voice conversion pipeline from the methods and ideas of a speech recognition front end and a text-to-speech synthesis back end. This design demonstrates the effect and real-time performance of text-to-speech synthesis, but because the speech is generated by text-to-speech synthesis, the method cannot consistently match real human speech in emotion and naturalness.
Disclosure of Invention
The technical problem to be solved by the present invention is to overcome the defects of the prior art, and to provide a voice communication method and system with a voiceprint cloning function.
In order to solve this technical problem, the invention provides a voice call method with a voiceprint cloning function: a voice to be converted is picked up and input into a pre-trained voice conversion module of a specific person, which converts it into a target voice whose content, intonation and emotion are kept consistent with the voice to be converted;
the specific person's voice is then transmitted to a loudspeaker of the listener.
Further, the process of picking up the voice to be converted includes:
the speech to be converted is picked up by an array of microphones with narrow directivity.
Further, the process of inputting the voice to be converted into the pre-trained specific-person voice conversion module and converting it into the specific person's voice includes:
extracting the voice features of the voice to be converted, wherein the voice features comprise fundamental frequency, log spectrum and non-periodic components;
converting the fundamental frequency of the voice to be converted by using a predetermined logarithmic linear function about the voice of the specific person to obtain a converted target fundamental frequency;
copying the non-periodic component of the voice to be converted into a target non-periodic component;
predicting the difference between the log spectra of the specific person's voice and the voice to be converted by using a long short-term memory (LSTM) model, and determining a target log spectrum;
and integrating the target fundamental frequency, the target aperiodic component and the target log spectrum to generate the target voice.
Further, the log-linear function is:

\log F0_t = \mu_t + \frac{\sigma_t}{\sigma_s}\,(\log F0_s - \mu_s)   (1)

where F0_t is the target fundamental frequency, F0_s is the fundamental frequency of the speech to be converted, μ_s and σ_s are respectively the mean and standard deviation of the fundamental frequency (in the log domain) of the speech to be converted, and μ_t and σ_t are respectively the mean and standard deviation of the fundamental frequency (in the log domain) of the specific person's voice.
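As an illustration, a minimal NumPy sketch of this fundamental-frequency conversion follows (the function name and the handling of unvoiced frames are illustrative assumptions; the statistics are assumed to have been estimated in advance from recordings of the two speakers):

    import numpy as np

    def convert_f0(f0_source, mu_s, sigma_s, mu_t, sigma_t):
        """Log-linear F0 conversion per equation (1).

        f0_source: per-frame fundamental frequency of the speech to be
        converted; zero values mark unvoiced frames and are left unchanged.
        mu_*, sigma_*: mean and standard deviation of log-F0 for the source
        speaker (s) and for the specific target person (t).
        """
        f0_target = np.zeros_like(f0_source)
        voiced = f0_source > 0
        log_f0 = np.log(f0_source[voiced])
        f0_target[voiced] = np.exp(mu_t + (sigma_t / sigma_s) * (log_f0 - mu_s))
        return f0_target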
Further, predicting the difference between the log spectra of the specific person's voice and the voice to be converted with the long short-term memory model and determining the target log spectrum comprises:

expressing the difference between the log spectrum of the specific person's voice and that of the voice to be converted as Δ_t = y_t − x_t;

the structure of the long short-term memory model is given by equations (2) to (7):

\tilde{c}_t = \tanh(W_{xc}x_t + W_{hc}h_{t-1} + b_c)   (2)

i_t = \sigma(W_{xi}x_t + W_{hi}h_{t-1} + b_i)   (3)

f_t = \sigma(W_{xf}x_t + W_{hf}h_{t-1} + W_{cf}c_{t-1} + b_f)   (4)

c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t   (5)

o_t = \sigma(W_{xo}x_t + W_{ho}h_{t-1} + W_{co}c_{t-1} + b_o)   (6)

h_t = o_t \odot \tanh(c_t)   (7)

where y_t is the log spectrum of the t-th frame of the specific person's voice, x_t is the log spectrum of the t-th frame of the voice to be converted, h_t is the hidden unit vector of the long short-term memory model at time t, o_t, i_t and f_t denote the output gate, input gate and forget gate at time t, t−1 denotes the previous time step, \tilde{c}_t is an intermediate variable (the candidate cell state), c_t is the cell state vector specific to the long short-term memory model, the W_{kl} are the corresponding weights and the b_l the corresponding biases, with subscript k being x, h or c and subscript l being c, i, f or o, σ is the activation function, and ⊙ denotes element-wise multiplication;

at the start, h_0 and c_0 are initialized; at time t = 1, the log spectrum x_1 of the 1st frame is input and the candidate cell vector \tilde{c}_1 is computed by equation (2); the input gate vector i_1 and the forget gate vector f_1 are obtained by equations (3) and (4); the cell vector c_1 is updated by equation (5); the output gate o_1 is computed by equation (6); finally, the hidden unit vector h_1 output by this layer is computed by equation (7); and so on for any time t until the end of the sequence;

this structure is repeated several times to form a multi-layer LSTM, the output h_t of one layer serving as the input x_t of the next layer; finally, the output h_t of the last LSTM layer is passed through a fully connected network to output the residual Δ_t, and the predicted residual is superimposed on the input log spectrum x_t to obtain the converted log spectrum.
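For concreteness, a minimal NumPy sketch of a single time step of equations (2) to (7) follows (the dictionary-based parameter layout and the sigmoid helper are illustrative assumptions, not part of the patent):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def lstm_step(x_t, h_prev, c_prev, W, b):
        """One LSTM time step following equations (2) to (7).

        W: dict of weight matrices keyed 'xc', 'hc', 'xi', 'hi', 'xf',
        'hf', 'cf', 'xo', 'ho', 'co'; b: dict of bias vectors keyed
        'c', 'i', 'f', 'o'. W['cf'] and W['co'] are the peephole-style
        connections from the previous cell state in equations (4) and (6).
        """
        c_cand = np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])                  # (2)
        i_t = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + b['i'])                     # (3)
        f_t = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + W['cf'] @ c_prev + b['f'])  # (4)
        c_t = f_t * c_prev + i_t * c_cand                                            # (5)
        o_t = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + W['co'] @ c_prev + b['o'])  # (6)
        h_t = o_t * np.tanh(c_t)                                                     # (7)
        return h_t, c_t

Stacking such steps over time and over layers, and adding the final fully connected layer, yields the residual predictor described above.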
A voice call system having a voiceprint cloning function, comprising:
the pickup module is used to pick up the voice to be converted and input it to the pre-trained specific-person voice conversion module;
the processing module is used for converting the voice to be converted into target voice, and the content, tone and emotion of the target voice are kept consistent with the voice to be converted;
and the transmission module is used for transmitting the voice of the specific person to a loudspeaker of a receiver.
Further, the pick-up module includes a microphone array module for picking up the voice to be converted by a microphone array having a narrow directivity.
Further, the processing module comprises:
the extraction module is used for extracting the voice characteristics of the voice to be converted, wherein the voice characteristics comprise fundamental frequency, log spectrum and non-periodic components;
the base frequency conversion module is used for converting the base frequency of the voice to be converted by utilizing a predetermined logarithmic linear function about the voice of the specific person to obtain a converted target base frequency;
the aperiodic component conversion module is used for copying the aperiodic component of the voice to be converted into a target aperiodic component;
the log spectrum conversion module is used for predicting the difference between the log spectra of the specific person's voice and the voice to be converted by using the long short-term memory model, and determining a target log spectrum;
and the synthesis module is used for integrating the target fundamental frequency, the target aperiodic component and the target log spectrum to generate the target voice.
Further, the fundamental frequency conversion module comprises a function determination module for determining the log-linear function as:

\log F0_t = \mu_t + \frac{\sigma_t}{\sigma_s}\,(\log F0_s - \mu_s)   (1)

where F0_t is the target fundamental frequency, F0_s is the fundamental frequency of the speech to be converted, μ_s and σ_s are respectively the mean and standard deviation of the fundamental frequency (in the log domain) of the speech to be converted, and μ_t and σ_t are respectively the mean and standard deviation of the fundamental frequency (in the log domain) of the specific person's voice.
Further, the log spectrum conversion module comprises a log spectrum conversion processing module,

configured to express the difference between the log spectra of the specific person's voice and the voice to be converted as Δ_t = y_t − x_t;

and further configured to adopt the long short-term memory model structure given by equations (2) to (7):

\tilde{c}_t = \tanh(W_{xc}x_t + W_{hc}h_{t-1} + b_c)   (2)

i_t = \sigma(W_{xi}x_t + W_{hi}h_{t-1} + b_i)   (3)

f_t = \sigma(W_{xf}x_t + W_{hf}h_{t-1} + W_{cf}c_{t-1} + b_f)   (4)

c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t   (5)

o_t = \sigma(W_{xo}x_t + W_{ho}h_{t-1} + W_{co}c_{t-1} + b_o)   (6)

h_t = o_t \odot \tanh(c_t)   (7)

where y_t is the log spectrum of the t-th frame of the specific person's voice, x_t is the log spectrum of the t-th frame of the voice to be converted, h_t is the hidden unit vector of the long short-term memory model at time t, o_t, i_t and f_t denote the output gate, input gate and forget gate at time t, t−1 denotes the previous time step, \tilde{c}_t is an intermediate variable (the candidate cell state), c_t is the cell state vector specific to the long short-term memory model, the W_{kl} are the corresponding weights and the b_l the corresponding biases, with subscript k being x, h or c and subscript l being c, i, f or o, σ is the activation function, and ⊙ denotes element-wise multiplication;

the long short-term memory model structure being used as follows:

at the start, h_0 and c_0 are initialized; at time t = 1, the log spectrum x_1 of the 1st frame is input and the candidate cell vector \tilde{c}_1 is computed by equation (2); the input gate vector i_1 and the forget gate vector f_1 are obtained by equations (3) and (4); the cell vector c_1 is updated by equation (5); the output gate o_1 is computed by equation (6); finally, the hidden unit vector h_1 output by this layer is computed by equation (7); and so on for any time t until the end of the sequence;

this structure is repeated several times to form a multi-layer LSTM, the output h_t of one layer serving as the input x_t of the next layer; finally, the output h_t of the last LSTM layer is passed through a fully connected network to output the residual Δ_t, and the predicted residual is superimposed on the input log spectrum x_t to obtain the converted log spectrum.
The invention achieves the following beneficial effects:
compared with the existing sound changing scheme, the system realizes the voiceprint cloning aiming at any specific character which can be specified by a user, and can play a role of imitating and disguising the specific character; through the conversion from sound to sound, the rhythm and emotion of the source speech can be better reserved than those from text to speech, so that the speech is more vivid and lively; the algorithm of the voice conversion module and the design of the computing platform can realize the real-time call function, and the purpose of calling by imitating and disguising identities can be better fulfilled.
Drawings
FIG. 1 is a schematic diagram of a voice call system for voiceprint cloning via voice conversion according to the present invention;
FIG. 2 is a general schematic diagram of a voice conversion scheme employed by the present invention;
FIG. 3 is a diagram illustrating log spectrum training and conversion in the speech conversion scheme employed by the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the embodiments described below are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present invention provides a voice communication method with voiceprint cloning function, as shown in fig. 1-3.
(1) The voice of user A is picked up by a microphone array with narrow directivity. The microphone array combines a group of microphone units within a small area according to a certain spatial distribution, and achieves strong directivity through beamforming, improving its ability to pick up a clean signal in a noisy environment. Beamforming forms a narrow cone-shaped beam that suppresses noise and interference in the environment and accepts only sound from the direction of speaker A (the sound source), achieving speech enhancement; this narrow directivity implements spatial filtering, so that cleaner speech from user A is picked up as the input data of the voice conversion module.
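A minimal sketch of the delay-and-sum principle behind such beamforming follows (the frequency-domain implementation and the notion of precomputed steering delays are illustrative assumptions; the patent does not fix a particular beamforming algorithm):

    import numpy as np

    def delay_and_sum(channels, steering_delays_s, fs):
        """Steer a microphone array toward one direction by delay-and-sum.

        channels: (num_mics, num_samples) synchronously sampled signals.
        steering_delays_s: per-microphone delays (seconds) that time-align
        a wavefront arriving from the target direction. Applying them in
        the frequency domain allows non-integer sample delays.
        """
        num_mics, num_samples = channels.shape
        spectra = np.fft.rfft(channels, axis=1)
        freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)
        # Multiplying by exp(-j*2*pi*f*tau) delays each channel by tau.
        phase = np.exp(-2j * np.pi * freqs[None, :]
                       * np.asarray(steering_delays_s)[:, None])
        aligned = spectra * phase
        # Averaging reinforces the steered direction and attenuates others.
        return np.fft.irfft(aligned.mean(axis=0), n=num_samples)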
(2) The voice is changed with the trained specific-person voice conversion module: the voice of user A is changed into the voice of user B while the content, intonation and emotion of the speech are kept unchanged; the conversion is realized by the following steps (a sketch of the full pipeline follows the list):
1) a vocoder extracts speech features from the speech picked up by the microphone array; the features comprise three parts, namely the fundamental frequency, the log spectrum and the aperiodic component, as shown in FIG. 2;
2) the fundamental frequency F0_t of the target character's voice is obtained by the log-linear conversion

\log F0_t = \mu_t + \frac{\sigma_t}{\sigma_s}\,(\log F0_s - \mu_s)   (1)

where F0_s is the fundamental frequency of the source speech, μ_s and σ_s are respectively the mean and standard deviation of the source fundamental frequency (in the log domain), and μ_t and σ_t are respectively the mean and standard deviation of the target fundamental frequency (in the log domain);
3) the aperiodic component of the source speech is copied directly as the aperiodic component of the target character's voice;
4) the log spectrum is a sequence of vectors; let {x_t, t = 1, ..., T} denote the log spectrum sequence of the source speech and {y_t, t = 1, ..., T} denote the log spectrum sequence of the target character's voice; since x_t and y_t share the same speech content, to reduce the difficulty of prediction the model predicts their difference, i.e. Δ_t = y_t − x_t; the conversion is realized by a Long Short-Term Memory model (LSTM), whose recurrent structure and memory cells give it the capacity to extract long-range temporal information; the basic structure of the LSTM is given by equations (2) to (7):

\tilde{c}_t = \tanh(W_{xc}x_t + W_{hc}h_{t-1} + b_c)   (2)

i_t = \sigma(W_{xi}x_t + W_{hi}h_{t-1} + b_i)   (3)

f_t = \sigma(W_{xf}x_t + W_{hf}h_{t-1} + W_{cf}c_{t-1} + b_f)   (4)

c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t   (5)

o_t = \sigma(W_{xo}x_t + W_{ho}h_{t-1} + W_{co}c_{t-1} + b_o)   (6)

h_t = o_t \odot \tanh(c_t)   (7)

where x_t is the log spectrum of the t-th frame, h_t is the hidden unit vector at time t, \tilde{c}_t is an intermediate variable (the candidate cell state), c_t is the cell state vector specific to the LSTM, W_{xc} is the weight connecting the input to the cell unit, W_{hc} the weight connecting the hidden unit to the cell unit, W_{xi} the weight connecting the input to the input gate, W_{hi} the weight connecting the hidden unit to the input gate, W_{xf} the weight connecting the input to the forget gate, W_{hf} the weight connecting the hidden unit to the forget gate, W_{cf} the weight connecting the cell unit to the forget gate, W_{xo} the weight connecting the input to the output gate, W_{ho} the weight connecting the hidden unit to the output gate, W_{co} the weight connecting the cell unit to the output gate, b_c, b_i, b_f and b_o the biases of the cell, input-gate, forget-gate and output-gate networks respectively, i, f and o the input, forget and output gates respectively, σ the activation function (a Sigmoid function is generally adopted), and ⊙ element-wise multiplication; as the equations show, the LSTM maps the input vector sequence x_t, via the cell state vector c_t, to the hidden unit vector h_t; this structure can be repeated several times to form a multi-layer LSTM, with the output h_t of one layer serving as the input x_t of the next layer; finally, the output h_t of the last LSTM layer is passed through a fully connected network to output the residual Δ_t, and the predicted residual is superimposed on the input log spectrum x_t to obtain the converted log spectrum;
5) the converted log spectrum, the log-linearly converted fundamental frequency and the copied aperiodic component are fed into the vocoder to generate the final converted voice;
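To make steps 1) to 5) concrete, here is a sketch of the analysis-conversion-synthesis loop using the open-source pyworld bindings of the WORLD vocoder; convert_f0 is the equation (1) sketch given earlier, lstm_convert_logspec stands for the trained residual LSTM of equations (2) to (7), and the FFT size (256, giving a 129-dimensional spectrum) is an assumption consistent with the dimensions quoted below:

    import numpy as np
    import pyworld  # open-source WORLD vocoder bindings

    def clone_voice(wave, fs, f0_stats, lstm_convert_logspec, fft_size=256):
        """Analysis, conversion and synthesis for steps 1) to 5)."""
        wave = np.ascontiguousarray(wave, dtype=np.float64)
        # 1) feature extraction: F0, spectral envelope, aperiodicity.
        f0, t = pyworld.dio(wave, fs)
        f0 = pyworld.stonemask(wave, f0, t, fs)
        sp = pyworld.cheaptrick(wave, f0, t, fs, fft_size=fft_size)  # (T, 129)
        ap = pyworld.d4c(wave, f0, t, fs, fft_size=fft_size)
        # 2) log-linear F0 conversion, equation (1).
        mu_s, sigma_s, mu_t, sigma_t = f0_stats
        f0_conv = convert_f0(f0, mu_s, sigma_s, mu_t, sigma_t)
        # 3) the aperiodic component ap is copied unchanged.
        # 4) predict the log-spectrum residual and superimpose it.
        log_sp = np.log(sp)
        log_sp_conv = log_sp + lstm_convert_logspec(log_sp)
        # 5) synthesize the converted voice waveform.
        return pyworld.synthesize(f0_conv, np.exp(log_sp_conv), ap, fs)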
the voice change is implemented by means of a trained voice conversion module of a specific person, and the voice of the user A is changed into the voice of the user B through the conversion and synthesis of voice spectrum parameters; the voiceprint of any specific character designated by the user can be cloned, so that the function of imitating and disguising the specific character is realized; through the conversion from sound to sound, the rhythm and emotion of the source speech can be better reserved than those from text to speech, so that the speech is more vivid and lively.
The converted voice is output and transmitted over the network to the loudspeaker of the remote party, so that the remote party feels that he or she is conversing with the target character. The module writes the voice-changed data to the sound card designated by the voice communication software; the software captures audio from that sound card (i.e., the voice samples are digitized and encoded before transmission), and upon receiving an encoded frame, the other side decodes it to recover data that its sound card can play directly.
In step (1), the voice of user A is picked up by the microphone array with narrow directivity. In this layout, the phase difference Δφ between signals synchronously collected by a pair of microphones is measured and, from the signal frequency f, the speed of sound c_0 and the spacing of the microphone pair, the direction of the source is obtained; once the position of speaker A has been located, the array steers its beam toward A, and this strong, intelligent directivity markedly reduces the influence of ambient noise and echo. The voice data picked up by the microphone array serves as the input of the subsequent voice conversion module.
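Under a far-field assumption, the phase difference of a narrowband component relates to the arrival angle θ (measured from the array broadside) by Δφ = 2π f d sin(θ) / c_0, where d is the microphone spacing. A minimal sketch of the corresponding direction estimate follows (the sign and angle conventions are illustrative assumptions):

    import numpy as np

    def doa_from_phase(delta_phi, freq_hz, mic_spacing_m, c0=343.0):
        """Estimate the arrival angle from the phase difference measured
        between one pair of synchronously sampled microphones."""
        tau = delta_phi / (2.0 * np.pi * freq_hz)         # time difference of arrival
        s = np.clip(c0 * tau / mic_spacing_m, -1.0, 1.0)  # guard the arcsin domain
        return np.degrees(np.arcsin(s))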
In step (2), the specific-person voice conversion module performs the voice change, turning the voice of user A into the voice of user B while keeping the content, intonation and emotion unchanged. A World vocoder extracts, for each frame of the source speech, a 1-dimensional fundamental frequency feature, a 129-dimensional log spectrum feature and a 129-dimensional aperiodic component; the fundamental frequency is converted with formula (1); a 3-layer LSTM with 100 hidden units per layer converts the log spectrum; the 129-dimensional aperiodic component is copied directly; the three parts are then fed into the World vocoder to output the voice waveform. On an Intel i7 CPU with 8 GB of memory, the conversion can be completed in real time.
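A sketch of that topology in PyTorch follows (a 3-layer LSTM with 100 hidden units per layer predicting a 129-dimensional residual; note that torch.nn.LSTM uses the standard gate equations without the peephole terms W_cf and W_co of equations (4) and (6), a small deviation from the structure above):

    import torch
    import torch.nn as nn

    class LogSpecResidualLSTM(nn.Module):
        """3-layer LSTM plus a fully connected layer that predicts the
        log-spectrum residual, which is added back onto the input."""

        def __init__(self, feat_dim=129, hidden=100, layers=3):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                                batch_first=True)
            self.fc = nn.Linear(hidden, feat_dim)

        def forward(self, log_spec):        # log_spec: (batch, frames, 129)
            h, _ = self.lstm(log_spec)      # h: (batch, frames, 100)
            residual = self.fc(h)           # predicted delta_t per frame
            return log_spec + residual      # converted log spectrum

Training would minimize, for example, the mean squared error between the network output and the time-aligned log spectrum of the target speaker.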
Step (3): the converted voice is output and transmitted over the network to the loudspeaker of the remote party, so that the remote party feels that he or she is conversing with the target character; the input voice and the converted voice are kept separate by means of a full-duplex sound card.
The voice-changed data is written to the sound card designated by the voice communication software; the software captures audio from that sound card (i.e., the voice samples are digitized and encoded before transmission), and upon receiving an encoded frame, the other side decodes it to recover data that its sound card can play directly.
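A sketch of handing the converted audio to the call software through a designated sound card, using the python-sounddevice bindings (the virtual-device name is a placeholder assumption; any output device that the communication software is configured to capture would play the same role):

    import numpy as np
    import sounddevice as sd  # PortAudio bindings

    def write_to_call_device(converted, fs, device_name="Virtual Cable Input"):
        """Play the converted speech into the output device from which the
        voice communication software captures, encodes and transmits audio."""
        samples = np.asarray(converted, dtype=np.float32).reshape(-1, 1)
        with sd.OutputStream(device=device_name, samplerate=fs,
                             channels=1, dtype='float32') as stream:
            stream.write(samples)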
Correspondingly, the present application also provides a voice communication system with voiceprint cloning function, comprising:
the pickup module is used to pick up the voice to be converted and input it to the pre-trained specific-person voice conversion module;
the processing module is used for converting the voice to be converted into target voice, and the content, tone and emotion of the target voice are kept consistent with the voice to be converted;
and the transmission module is used for transmitting the voice of the specific person to a loudspeaker of a receiver.
The pick-up module comprises a microphone array module for picking up the voice to be converted by a microphone array with narrow directivity.
The processing module comprises:
the extraction module is used for extracting the voice characteristics of the voice to be converted, wherein the voice characteristics comprise fundamental frequency, log spectrum and non-periodic components;
the base frequency conversion module is used for converting the base frequency of the voice to be converted by utilizing a predetermined logarithmic linear function about the voice of the specific person to obtain a converted target base frequency;
the aperiodic component conversion module is used for copying the aperiodic component of the voice to be converted into a target aperiodic component;
the log spectrum conversion module is used for predicting the difference between the log spectra of the specific person's voice and the voice to be converted by using the long short-term memory model and determining a target log spectrum;
and the synthesis module is used for integrating the target fundamental frequency, the target aperiodic component and the target log spectrum to generate the target voice.
The fundamental frequency conversion module comprises a function determination module for determining the log-linear function as:

\log F0_t = \mu_t + \frac{\sigma_t}{\sigma_s}\,(\log F0_s - \mu_s)   (1)

where F0_t is the target fundamental frequency, F0_s is the fundamental frequency of the speech to be converted, μ_s and σ_s are respectively the mean and standard deviation of the fundamental frequency (in the log domain) of the speech to be converted, and μ_t and σ_t are respectively the mean and standard deviation of the fundamental frequency (in the log domain) of the specific person's voice.
The log spectrum conversion module comprises a log spectrum conversion processing module,

configured to express the difference between the log spectra of the specific person's voice and the voice to be converted as Δ_t = y_t − x_t;

and further configured to adopt the long short-term memory model structure given by equations (2) to (7):

\tilde{c}_t = \tanh(W_{xc}x_t + W_{hc}h_{t-1} + b_c)   (2)

i_t = \sigma(W_{xi}x_t + W_{hi}h_{t-1} + b_i)   (3)

f_t = \sigma(W_{xf}x_t + W_{hf}h_{t-1} + W_{cf}c_{t-1} + b_f)   (4)

c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t   (5)

o_t = \sigma(W_{xo}x_t + W_{ho}h_{t-1} + W_{co}c_{t-1} + b_o)   (6)

h_t = o_t \odot \tanh(c_t)   (7)

where y_t is the log spectrum of the t-th frame of the specific person's voice, x_t is the log spectrum of the t-th frame of the voice to be converted, h_t is the hidden unit vector of the long short-term memory model at time t, o_t, i_t and f_t denote the output gate, input gate and forget gate at time t, t−1 denotes the previous time step, \tilde{c}_t is an intermediate variable (the candidate cell state), c_t is the cell state vector specific to the long short-term memory model, the W_{kl} are the corresponding weights and the b_l the corresponding biases, with subscript k being x, h or c and subscript l being c, i, f or o, σ is the activation function, and ⊙ denotes element-wise multiplication;

the long short-term memory model structure being used as follows:

at the start, h_0 and c_0 are initialized; at time t = 1, the log spectrum x_1 of the 1st frame is input and the candidate cell vector \tilde{c}_1 is computed by equation (2); the input gate vector i_1 and the forget gate vector f_1 are obtained by equations (3) and (4); the cell vector c_1 is updated by equation (5); the output gate o_1 is computed by equation (6); finally, the hidden unit vector h_1 output by this layer is computed by equation (7); and so on for any time t until the end of the sequence;

this structure is repeated several times to form a multi-layer LSTM, the output h_t of one layer serving as the input x_t of the next layer; finally, the output h_t of the last LSTM layer is passed through a fully connected network to output the residual Δ_t, and the predicted residual is superimposed on the input log spectrum x_t to obtain the converted log spectrum.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A voice communication method with a voiceprint clone function is characterized in that,
picking up a voice to be converted, inputting the voice to be converted into a pre-trained voice conversion module of a specific person, and converting the voice to be converted into a target voice, wherein the content, tone and emotion of the target voice are consistent with the voice to be converted;
transmitting the specific person voice to a speaker of a listener.
2. The voice call method with voiceprint cloning function according to claim 1, wherein the process of picking up the voice to be converted comprises:
the speech to be converted is picked up by an array of microphones with narrow directivity.
3. The voice call method with voiceprint cloning function according to claim 1,
the process of inputting the voice to be converted into the pre-trained specific-person voice conversion module and converting it into the specific person's voice comprises the following steps:
extracting the voice features of the voice to be converted, wherein the voice features comprise fundamental frequency, log spectrum and non-periodic components;
converting the fundamental frequency of the voice to be converted by using a predetermined logarithmic linear function about the voice of the specific person to obtain a converted target fundamental frequency;
copying the non-periodic component of the voice to be converted into a target non-periodic component;
predicting the difference between the log spectra of the specific person's voice and the voice to be converted by using a long short-term memory model, and determining a target log spectrum;
and integrating the target fundamental frequency, the target aperiodic component and the target log spectrum to generate the target voice.
4. The voice call method with voiceprint cloning function according to claim 3,
the log-linear function is:

\log F0_t = \mu_t + \frac{\sigma_t}{\sigma_s}\,(\log F0_s - \mu_s)   (1)

where F0_t is the target fundamental frequency, F0_s is the fundamental frequency of the speech to be converted, μ_s and σ_s are respectively the mean and standard deviation of the fundamental frequency (in the log domain) of the speech to be converted, and μ_t and σ_t are respectively the mean and standard deviation of the fundamental frequency (in the log domain) of the specific person's voice.
5. The voice call method with voiceprint cloning function according to claim 3,
the process of predicting the difference between the log spectra of the specific person's voice and the voice to be converted by using the long short-term memory model, and determining the target log spectrum, comprises:

expressing the difference between the log spectrum of the specific person's voice and that of the voice to be converted as Δ_t = y_t − x_t;

the structure of the long short-term memory model being given by equations (2) to (7):

\tilde{c}_t = \tanh(W_{xc}x_t + W_{hc}h_{t-1} + b_c)   (2)

i_t = \sigma(W_{xi}x_t + W_{hi}h_{t-1} + b_i)   (3)

f_t = \sigma(W_{xf}x_t + W_{hf}h_{t-1} + W_{cf}c_{t-1} + b_f)   (4)

c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t   (5)

o_t = \sigma(W_{xo}x_t + W_{ho}h_{t-1} + W_{co}c_{t-1} + b_o)   (6)

h_t = o_t \odot \tanh(c_t)   (7)

where y_t is the log spectrum of the t-th frame of the specific person's voice, x_t is the log spectrum of the t-th frame of the voice to be converted, h_t is the hidden unit vector of the long short-term memory model at time t, o_t, i_t and f_t denote the output gate, input gate and forget gate at time t, t−1 denotes the previous time step, \tilde{c}_t is an intermediate variable (the candidate cell state), c_t is the cell state vector specific to the long short-term memory model, the W_{kl} are the corresponding weights and the b_l the corresponding biases, with subscript k being x, h or c and subscript l being c, i, f or o, σ is the activation function, and ⊙ denotes element-wise multiplication;

at the start, h_0 and c_0 are initialized; at time t = 1, the log spectrum x_1 of the 1st frame is input and the candidate cell vector \tilde{c}_1 is computed by equation (2); the input gate vector i_1 and the forget gate vector f_1 are obtained by equations (3) and (4); the cell vector c_1 is updated by equation (5); the output gate o_1 is computed by equation (6); finally, the hidden unit vector h_1 output by this layer is computed by equation (7); and so on for any time t until the end of the sequence;

this structure is repeated several times to form a multi-layer LSTM, the output h_t of one layer serving as the input x_t of the next layer; finally, the output h_t of the last LSTM layer is passed through a fully connected network to output the residual Δ_t, and the predicted residual is superimposed on the input log spectrum x_t to obtain the converted log spectrum.
6. A voice call system having a voiceprint cloning function, comprising:
the pickup module is used to pick up the voice to be converted and input it to the pre-trained specific-person voice conversion module;
the processing module is used for converting the voice to be converted into target voice, and the content, tone and emotion of the target voice are kept consistent with the voice to be converted;
and the transmission module is used for transmitting the voice of the specific person to a loudspeaker of a receiver.
7. The system according to claim 6, wherein the pickup module comprises a microphone array module for picking up the voice to be converted through a microphone array having narrow directivity.
8. The voice call system with voiceprint cloning capability of claim 6,
the processing module comprises:
the extraction module is used for extracting the voice characteristics of the voice to be converted, wherein the voice characteristics comprise fundamental frequency, log spectrum and non-periodic components;
the base frequency conversion module is used for converting the base frequency of the voice to be converted by utilizing a predetermined logarithmic linear function about the voice of the specific person to obtain a converted target base frequency;
the aperiodic component conversion module is used for copying the aperiodic component of the voice to be converted into a target aperiodic component;
the log spectrum conversion module is used for predicting the difference between the log spectra of the specific person's voice and the voice to be converted by using the long short-term memory model and determining a target log spectrum;
and the synthesis module is used for integrating the target fundamental frequency, the target aperiodic component and the target log spectrum to generate the target voice.
9. The voice call system with voiceprint cloning capability of claim 8,
the fundamental frequency conversion module comprises a function determination module for determining the log-linear function as:

\log F0_t = \mu_t + \frac{\sigma_t}{\sigma_s}\,(\log F0_s - \mu_s)   (1)

where F0_t is the target fundamental frequency, F0_s is the fundamental frequency of the speech to be converted, μ_s and σ_s are respectively the mean and standard deviation of the fundamental frequency (in the log domain) of the speech to be converted, and μ_t and σ_t are respectively the mean and standard deviation of the fundamental frequency (in the log domain) of the specific person's voice.
10. The voice call system with voiceprint cloning capability of claim 8, wherein the log spectrum conversion module comprises a log spectrum conversion processing module,

configured to express the difference between the log spectra of the specific person's voice and the voice to be converted as Δ_t = y_t − x_t;

and further configured to adopt the long short-term memory model structure given by equations (2) to (7):

\tilde{c}_t = \tanh(W_{xc}x_t + W_{hc}h_{t-1} + b_c)   (2)

i_t = \sigma(W_{xi}x_t + W_{hi}h_{t-1} + b_i)   (3)

f_t = \sigma(W_{xf}x_t + W_{hf}h_{t-1} + W_{cf}c_{t-1} + b_f)   (4)

c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t   (5)

o_t = \sigma(W_{xo}x_t + W_{ho}h_{t-1} + W_{co}c_{t-1} + b_o)   (6)

h_t = o_t \odot \tanh(c_t)   (7)

where y_t is the log spectrum of the t-th frame of the specific person's voice, x_t is the log spectrum of the t-th frame of the voice to be converted, h_t is the hidden unit vector of the long short-term memory model at time t, o_t, i_t and f_t denote the output gate, input gate and forget gate at time t, t−1 denotes the previous time step, \tilde{c}_t is an intermediate variable (the candidate cell state), c_t is the cell state vector specific to the long short-term memory model, the W_{kl} are the corresponding weights and the b_l the corresponding biases, with subscript k being x, h or c and subscript l being c, i, f or o, σ is the activation function, and ⊙ denotes element-wise multiplication;

the long short-term memory model structure being used as follows:

at the start, h_0 and c_0 are initialized; at time t = 1, the log spectrum x_1 of the 1st frame is input and the candidate cell vector \tilde{c}_1 is computed by equation (2); the input gate vector i_1 and the forget gate vector f_1 are obtained by equations (3) and (4); the cell vector c_1 is updated by equation (5); the output gate o_1 is computed by equation (6); finally, the hidden unit vector h_1 output by this layer is computed by equation (7); and so on for any time t until the end of the sequence;

this structure is repeated several times to form a multi-layer LSTM, the output h_t of one layer serving as the input x_t of the next layer; finally, the output h_t of the last LSTM layer is passed through a fully connected network to output the residual Δ_t, and the predicted residual is superimposed on the input log spectrum x_t to obtain the converted log spectrum.
CN202011432039.2A 2020-12-09 2020-12-09 Voice communication method and system with voiceprint cloning function Pending CN112735434A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011432039.2A CN112735434A (en) 2020-12-09 2020-12-09 Voice communication method and system with voiceprint cloning function

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011432039.2A CN112735434A (en) 2020-12-09 2020-12-09 Voice communication method and system with voiceprint cloning function

Publications (1)

Publication Number Publication Date
CN112735434A true CN112735434A (en) 2021-04-30

Family

ID=75598732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011432039.2A Pending CN112735434A (en) 2020-12-09 2020-12-09 Voice communication method and system with voiceprint cloning function

Country Status (1)

Country Link
CN (1) CN112735434A (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102306492A (en) * 2011-09-09 2012-01-04 中国人民解放军理工大学 Voice conversion method based on convolutive nonnegative matrix factorization
CN102982809A (en) * 2012-12-11 2013-03-20 中国科学技术大学 Conversion method for sound of speaker
CN106782602A (en) * 2016-12-01 2017-05-31 南京邮电大学 Speech-emotion recognition method based on length time memory network and convolutional neural networks
US20190251952A1 (en) * 2018-02-09 2019-08-15 Baidu Usa Llc Systems and methods for neural voice cloning with a few samples
CN109767778A (en) * 2018-12-27 2019-05-17 中国人民解放军陆军工程大学 Bi-LSTM and WaveNet fused voice conversion method
CN111223474A (en) * 2020-01-15 2020-06-02 武汉水象电子科技有限公司 Voice cloning method and system based on multi-neural network

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
姚天任: "Digital Speech Processing" (《数字语音处理》), Huazhong University of Science and Technology Press, 30 April 1992, pages 232-233 *
桑胜举 et al.: "Digital Entertainment Technology and CAD" (《数字娱乐技术与CAD》), China Railway Publishing House, 31 August 2009, pages 286-287 *
苗晓孔 et al.: "Parameter-conversion-based speech deepfakes and an assessment of their threat to voiceprint authentication" (基于参数转换的语音深度伪造及其对声纹认证的威胁评估), Journal of Cyber Security (信息安全学报), vol. 5, no. 6, pages 53-56 *
蒋刚 et al.: "Industrial Robots" (《工业机器人》), Southwest Jiaotong University Press, 31 January 2011, page 148 *
魏序 et al.: "Speech enhancement algorithm based on beamforming and multi-reference noise cancellation" (基于波束形成与多参考源噪声对消的语音增强算法), Computer and Modernization (《计算机与现代化》), no. 196, 31 December 2011, page 46 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115497480A (en) * 2021-06-18 2022-12-20 海信集团控股股份有限公司 Sound repeated engraving method, device, equipment and medium

Similar Documents

Publication Publication Date Title
Szöke et al. Building and evaluation of a real room impulse response dataset
JP7258182B2 (en) Speech processing method, device, electronic device and computer program
CN110491404B (en) Voice processing method, device, terminal equipment and storage medium
CN111833896B (en) Voice enhancement method, system, device and storage medium for fusing feedback signals
CN110085245B (en) Voice definition enhancing method based on acoustic feature conversion
US20180358003A1 (en) Methods and apparatus for improving speech communication and speech interface quality using neural networks
CN104157293B (en) The signal processing method of targeted voice signal pickup in a kind of enhancing acoustic environment
CN108877823B (en) Speech enhancement method and device
CN111341303B (en) Training method and device of acoustic model, and voice recognition method and device
CN109887489B (en) Speech dereverberation method based on depth features for generating countermeasure network
CN112382301B (en) Noise-containing voice gender identification method and system based on lightweight neural network
CN113241085B (en) Echo cancellation method, device, equipment and readable storage medium
CN113823273B (en) Audio signal processing method, device, electronic equipment and storage medium
CN111627455A (en) Audio data noise reduction method and device and computer readable storage medium
CN114267372A (en) Voice noise reduction method, system, electronic device and storage medium
CN112735434A (en) Voice communication method and system with voiceprint cloning function
CN111353258A (en) Echo suppression method based on coding and decoding neural network, audio device and equipment
CN113763978B (en) Voice signal processing method, device, electronic equipment and storage medium
CN115705839A (en) Voice playing method and device, computer equipment and storage medium
CN114120965A (en) Audio processing method, electronic device, and storage medium
CN113990337A (en) Audio optimization method and related device, electronic equipment and storage medium
CN111696566A (en) Voice processing method, apparatus and medium
CN112720527B (en) Music dance self-programming robot
CN115762552B (en) Method for training echo cancellation model, echo cancellation method and corresponding device
Huemmer et al. Online environmental adaptation of CNN-based acoustic models using spatial diffuseness features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination