CN112735434A - Voice communication method and system with voiceprint cloning function - Google Patents
- Publication number: CN112735434A (application CN202011432039.2A)
- Authority
- CN
- China
- Prior art keywords
- voice
- converted
- target
- log spectrum
- time
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/02—Speaker identification or verification: preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
- G10L17/04—Speaker identification or verification: training, enrolment or model building
- G10L21/0216—Speech enhancement: noise filtering characterised by the method used for estimating noise
- G10L25/03—Speech or voice analysis techniques characterised by the type of extracted parameters
- G10L25/27—Speech or voice analysis techniques characterised by the analysis technique
- G10L25/63—Speech or voice analysis specially adapted for estimating an emotional state
- G10L2021/02166—Microphone arrays; beamforming
Abstract
The invention discloses a voice call method and system with a voiceprint cloning function. Speech to be converted is picked up and fed into a pre-trained voice conversion module for a specific person, which converts it into target speech whose content, intonation and emotion remain consistent with the source speech; the specific person's voice is then transmitted to the listener's loudspeaker. Advantages: compared with existing voice-changing schemes, the system achieves voiceprint cloning of any specific character designated by the user, so it can imitate and disguise as that character; by converting sound to sound rather than text to speech, the prosody and emotion of the source speech are better preserved, making the speech more vivid and lively; and the voice conversion algorithm together with the design of the computing platform supports real-time calls, better fulfilling the purpose of calling under an imitated, disguised identity.
Description
Technical Field
The invention relates to a voice call method and a voice call system with a voiceprint cloning function, and belongs to the technical field of voice signal processing.
Background
Scenes such as game sound effects, television dubbing and anime-style virtual avatars have a strong demand for personalized voice generation. A real-time voice changer, as an improvement over altering a recording after the fact, can change the voice directly and thus support voice calls. Continuous progress in technologies represented by speech synthesis and voice conversion provides technical support for imitating the speech of specific characters in games, television shows, virtual avatars and the like.
The patent "A real-time voice-changing method based on an intelligent terminal" proposes changing the fundamental frequency and the poles and zeros of the response function. The patent "A method for voice-changed calls over a wireless network based on Android" provides a voice-changing call method on Android devices. The patent "A high-quality real-time voice-changing method based on speech analysis and synthesis" changes the voice by interpolating or clipping the signal, modifying the fundamental frequency and formant positions, and adjusting duration, pitch and timbre. However, these methods can only make the voice thicker or thinner; they cannot generate the voice of a specific target person.
The patent "Live-broadcast microphone" provides a microphone that can equalize the sound, adjust reverberation for more entertaining effects, apply electronic voice changing and so on, and can connect to a live-broadcast platform over a wireless network for real-time streaming; it can also serve as a recording device, storing the captured audio locally and, through a connected mobile device, editing it and uploading it to the cloud. The patent "Design of an end-to-end voice camouflage system based on Bluetooth" disguises and protects the user's call content through a voice-changing mode and a Bluetooth-based end-to-end real-time voice system, realizing real-time voice changing when answering and making calls, and can also simulate different scenes by adding background sound so as to disguise the user's location. These methods have good real-time performance but cannot synthesize the voice of a specific target character.
The patents "Design and realization of real-person voice-changing equipment based on a deep learning algorithm" and "Intelligent voice conversion algorithm based on deep learning" provide an electronic voice-changing module that can convert anyone's voice into the voice of any required target speaker in real time. A new real-time voice-to-voice changing algorithm is constructed by adopting the methods and ideas of a speech recognition front end and a text-to-speech synthesis back end. This new design demonstrates the quality and real-time performance of text-to-speech synthesis, but because the speech is generated through text-to-speech synthesis, the method cannot stably match real human speech in emotion and naturalness.
Disclosure of Invention
The technical problem to be solved by the present invention is to overcome the defects of the prior art, and to provide a voice communication method and system with a voiceprint cloning function.
In order to solve the technical problem, the invention provides a voice call method with a voiceprint clone function, which is characterized in that a voice to be converted is picked up and input into a pre-trained voice conversion module of a specific person, the voice to be converted is converted into a target voice, and the content, tone and emotion of the target voice are kept consistent with the voice to be converted;
transmitting the specific person voice to a speaker of a listener.
Further, the process of picking up the voice to be converted includes:
the speech to be converted is picked up by an array of microphones with narrow directivity.
Further, the process of inputting the voice to be converted into the voice of the specific person into the pre-trained voice conversion module of the specific person includes:
extracting the voice features of the voice to be converted, wherein the voice features comprise fundamental frequency, log spectrum and non-periodic components;
converting the fundamental frequency of the voice to be converted by using a predetermined logarithmic linear function about the voice of the specific person to obtain a converted target fundamental frequency;
copying the non-periodic component of the voice to be converted into a target non-periodic component;
predicting the difference between the log spectrums of the specific person's voice and the voice to be converted by using a long short-term memory (LSTM) model, and determining a target log spectrum;
and integrating the target fundamental frequency, the target aperiodic component and the target log spectrum to generate the target voice.
Further, the log-linear function is:

log F0_t = μ_t + (σ_t / σ_s)(log F0_s − μ_s)  (1)

wherein F0_t is the target fundamental frequency, F0_s is the fundamental frequency of the speech to be converted, μ_s and σ_s are respectively the mean and standard deviation of the log fundamental frequency of the speech to be converted, and μ_t and σ_t are respectively the mean and standard deviation of the log fundamental frequency of the specific person's voice.
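A minimal numpy sketch of this log-linear fundamental-frequency conversion. The function and variable names are illustrative; the statistics μ and σ would be estimated from recordings of the source and target speakers, and unvoiced frames (conventionally coded as F0 = 0) are passed through unchanged — an assumption, since the text does not state how unvoiced frames are handled:

```python
import numpy as np

def convert_f0(f0_src, mu_s, sigma_s, mu_t, sigma_t):
    """Log-linear F0 conversion per formula (1).

    f0_src: per-frame fundamental frequency in Hz (0 marks unvoiced frames).
    mu_*/sigma_*: mean and std of log-F0 for source (s) and target (t) speakers.
    """
    f0_tgt = np.zeros_like(f0_src, dtype=float)
    voiced = f0_src > 0
    log_f0 = np.log(f0_src[voiced])
    # shift/scale in the log domain, then back to Hz
    f0_tgt[voiced] = np.exp(mu_t + (sigma_t / sigma_s) * (log_f0 - mu_s))
    return f0_tgt
```

With identical source and target statistics the conversion is the identity on voiced frames, which is a quick sanity check on the formula.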
Further, the process of predicting the difference between the log spectrums of the specific person's voice and the voice to be converted by using the long short-term memory (LSTM) model and determining the target log spectrum includes:
the difference between the log spectrum of the specific person's voice and the log spectrum of the voice to be converted is expressed as Δ_t = y_t − x_t;
the structure of the LSTM model is shown in formulas (2) to (7):

c̃_t = tanh(W_xc·x_t + W_hc·h_{t-1} + b_c) (2)
i_t = σ(W_xi·x_t + W_hi·h_{t-1} + b_i) (3)
f_t = σ(W_xf·x_t + W_hf·h_{t-1} + W_cf·c_{t-1} + b_f) (4)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t (5)
o_t = σ(W_xo·x_t + W_ho·h_{t-1} + W_co·c_{t-1} + b_o) (6)
h_t = o_t ⊙ tanh(c_t) (7)

wherein y_t is the log spectrum of the t-th frame of the specific person's voice, x_t is the log spectrum of the t-th frame of the speech to be converted, h_t is the hidden unit vector of the LSTM model at time t, o_t denotes the output gate at time t, i_t the input gate at time t, and f_t the forgetting gate at time t; t−1 denotes the previous time; c̃_t is an intermediate variable and c_t is the cell unit vector specific to the LSTM model; W_kl are the respective weights and b_l the respective biases, with subscript k being x, c or h and subscript l being c, i, f or o; σ is an activation function, and ⊙ denotes point-to-point (element-wise) multiplication;
at the start time, h_0 and c_0 are initialized; at time t = 1, the log spectrum x_1 of the 1st frame is input, and the temporary cell unit vector c̃_1 is obtained by formula (2); the input gate vector i_1 and the forgetting gate vector f_1 are obtained by formulas (3) and (4); the cell unit vector c_1 is updated by formula (5); the output gate o_1 is computed by formula (6); finally, the hidden unit vector h_1 output by this layer is computed by formula (7); and so on for any time t until the sequence ends;
the above structure is repeated several times to form a multi-layer LSTM, with the output h_t of one layer serving as the input x_t of the next layer; finally, the output h_t of the last LSTM layer is passed through a fully connected network to output the residual Δ_t, and the predicted residual is superimposed on the input log spectrum x_t to obtain the converted log spectrum.
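The recurrence in formulas (2)–(7) and the residual superposition can be sketched in plain numpy as follows. This is a didactic single-layer, single-sequence forward pass, not the trained multi-layer model; the weight-dictionary keys and function names are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following formulas (2)-(7).

    W/b are dicts keyed as in the text: W['xc'], W['hc'], ..., including the
    peephole weights W['cf'] and W['co']; b['c'], b['i'], b['f'], b['o'].
    """
    c_tilde = np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])                 # (2)
    i_t = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + b['i'])                     # (3)
    f_t = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + W['cf'] @ c_prev + b['f'])  # (4)
    c_t = f_t * c_prev + i_t * c_tilde                                           # (5)
    o_t = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + W['co'] @ c_prev + b['o'])  # (6)
    h_t = o_t * np.tanh(c_t)                                                     # (7)
    return h_t, c_t

def predict_target_logspec(x_seq, h0, c0, W, b, W_fc, b_fc):
    """Run the LSTM over the source log-spectrum sequence and superimpose the
    predicted residual Delta_t on each input frame x_t."""
    h, c = h0, c0
    y_pred = []
    for x_t in x_seq:
        h, c = lstm_step(x_t, h, c, W, b)
        delta_t = W_fc @ h + b_fc      # fully connected output layer -> residual
        y_pred.append(x_t + delta_t)   # converted log spectrum
    return np.stack(y_pred)
```

With all weights zero the predicted residual is zero and the output equals the input, which exercises the residual structure without needing trained parameters.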
A voice call system having a voiceprint cloning function, comprising:
the picking module is used for picking up the voice to be converted and inputting the voice to the pre-trained voice conversion module of the specific person;
the processing module is used for converting the voice to be converted into target voice, and the content, tone and emotion of the target voice are kept consistent with the voice to be converted;
and the transmission module is used for transmitting the voice of the specific person to a loudspeaker of a receiver.
Further, the pick-up module includes a microphone array module for picking up the voice to be converted by a microphone array having a narrow directivity.
Further, the processing module comprises:
the extraction module is used for extracting the voice characteristics of the voice to be converted, wherein the voice characteristics comprise fundamental frequency, log spectrum and non-periodic components;
the base frequency conversion module is used for converting the base frequency of the voice to be converted by utilizing a predetermined logarithmic linear function about the voice of the specific person to obtain a converted target base frequency;
the aperiodic component conversion module is used for copying the aperiodic component of the voice to be converted into a target aperiodic component;
the log spectrum conversion module is used for predicting the difference between the log spectrums of the specific person's voice and the voice to be converted by using the long short-term memory (LSTM) model and determining a target log spectrum;
and the synthesis module is used for integrating the target fundamental frequency, the target aperiodic component and the target log spectrum to generate the target voice.
Further, the fundamental frequency conversion module comprises a function determination module for determining the log-linear function as:

log F0_t = μ_t + (σ_t / σ_s)(log F0_s − μ_s)  (1)

wherein F0_t is the target fundamental frequency, F0_s is the fundamental frequency of the speech to be converted, μ_s and σ_s are respectively the mean and standard deviation of the log fundamental frequency of the speech to be converted, and μ_t and σ_t are respectively the mean and standard deviation of the log fundamental frequency of the specific person's voice.
Further, the log spectrum conversion module comprises a processing module,
for expressing the difference between the log spectrums of the specific person's voice and the voice to be converted as Δ_t = y_t − x_t;
and for adopting the structure of the long short-term memory (LSTM) model as shown in formulas (2) to (7):

c̃_t = tanh(W_xc·x_t + W_hc·h_{t-1} + b_c) (2)
i_t = σ(W_xi·x_t + W_hi·h_{t-1} + b_i) (3)
f_t = σ(W_xf·x_t + W_hf·h_{t-1} + W_cf·c_{t-1} + b_f) (4)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t (5)
o_t = σ(W_xo·x_t + W_ho·h_{t-1} + W_co·c_{t-1} + b_o) (6)
h_t = o_t ⊙ tanh(c_t) (7)

wherein y_t is the log spectrum of the t-th frame of the specific person's voice, x_t is the log spectrum of the t-th frame of the speech to be converted, h_t is the hidden unit vector of the LSTM model at time t, o_t denotes the output gate at time t, i_t the input gate at time t, and f_t the forgetting gate at time t; t−1 denotes the previous time; c̃_t is an intermediate variable and c_t is the cell unit vector specific to the LSTM model; W_kl are the respective weights and b_l the respective biases, with subscript k being x, c or h and subscript l being c, i, f or o; σ is an activation function, and ⊙ denotes point-to-point (element-wise) multiplication;
the structure of the LSTM model is used as follows: at the start time, h_0 and c_0 are initialized; at time t = 1, the log spectrum x_1 of the 1st frame is input, and the temporary cell unit vector c̃_1 is obtained by formula (2); the input gate vector i_1 and the forgetting gate vector f_1 are obtained by formulas (3) and (4); the cell unit vector c_1 is updated by formula (5); the output gate o_1 is computed by formula (6); finally, the hidden unit vector h_1 output by this layer is computed by formula (7); and so on for any time t until the sequence ends;
the above structure is repeated several times to form a multi-layer LSTM, with the output h_t of one layer serving as the input x_t of the next layer; finally, the output h_t of the last LSTM layer is passed through a fully connected network to output the residual Δ_t, and the predicted residual is superimposed on the input log spectrum x_t to obtain the converted log spectrum.
The invention achieves the following beneficial effects:
Compared with existing voice-changing schemes, the system achieves voiceprint cloning of any specific character designated by the user, so it can imitate and disguise as that character; by converting sound to sound rather than text to speech, the prosody and emotion of the source speech are better preserved, making the speech more vivid and lively; and the voice conversion algorithm together with the design of the computing platform supports real-time calls, better fulfilling the purpose of calling under an imitated, disguised identity.
Drawings
FIG. 1 is a schematic diagram of a voice call system for voiceprint cloning via voice conversion according to the present invention;
FIG. 2 is a general schematic diagram of a voice conversion scheme employed by the present invention;
FIG. 3 is a diagram illustrating log spectrum training and conversion in the speech conversion scheme employed in the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the embodiments described below are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present invention provides a voice communication method with voiceprint cloning function, as shown in fig. 1-3.
(1) Picking up the voice of user A through a microphone array with narrow directivity. The microphone array combines a group of microphone units within a small area according to a certain spatial distribution, and achieves strong directivity through beamforming, improving the array's ability to pick up a clean signal in a noisy environment. The beamforming method forms a narrow cone-shaped beam that suppresses noise and interference in the environment and accepts only sound from the direction of speaker A (the sound source), achieving speech enhancement; this narrow directivity implements spatial filtering, so the cleaner picked-up voice of user A serves as the input data of the voice conversion module.
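As one concrete (hypothetical) realization of the beamforming described above, a delay-and-sum beamformer time-aligns the microphone signals toward the look direction and averages them; the patent does not name a specific beamforming algorithm, so this is an illustrative sketch with fractional delays applied in the frequency domain:

```python
import numpy as np

def delay_and_sum(signals, mic_positions, look_dir, fs, c0=343.0):
    """Delay-and-sum beamformer.

    signals: (n_mics, n_samples) array; mic_positions: (n_mics, 3) in metres;
    look_dir: unit vector pointing toward the desired source; fs: sample rate.
    A mic closer to the source (larger p @ look_dir) hears the wave earlier,
    so its signal is delayed by (p @ look_dir) / c0 to align it with the origin.
    """
    n = signals.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    delays = mic_positions @ look_dir / c0                          # seconds per mic
    spec = np.fft.rfft(signals, axis=1)
    steer = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])  # apply the delay
    aligned = np.fft.irfft(spec * steer, n=n, axis=1)
    return aligned.mean(axis=0)
```

Sounds arriving from other directions remain misaligned after steering and partially cancel in the average, which is the noise- and interference-suppression effect described above.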
(2) Changing voice by using a trained voice conversion module of a specific person, changing the voice of a user A into the voice of a user B, and simultaneously keeping the content, tone and emotion of the voice unchanged; the conversion is realized according to the following steps:
1) extracting speech features from speech picked up by a microphone array by using a vocoder, wherein the speech features comprise three parts, namely fundamental frequency, logarithmic spectrum and non-periodic components, as shown in FIG. 2;
2) the fundamental frequency F0_t of the target character's voice is obtained by the log-linear conversion

log F0_t = μ_t + (σ_t / σ_s)(log F0_s − μ_s)  (1)

wherein F0_s is the fundamental frequency of the source speech, μ_s and σ_s are respectively the mean and standard deviation of the log fundamental frequency of the source speech, and μ_t and σ_t are respectively the mean and standard deviation of the log fundamental frequency of the target voice;
3) the aperiodic component of the source voice is directly copied into the aperiodic component of the target character voice;
4) the log spectrum is a sequence of vectors. Let {x_t, t = 1, ..., T} denote the log spectrum sequence of the source speech and {y_t, t = 1, ..., T} the log spectrum sequence of the target character's voice. To reduce the difficulty of prediction, and considering that x_t and y_t carry the same speech content, the difference Δ_t = y_t − x_t is predicted instead. The conversion is realized by a Long Short-Term Memory model (LSTM); the LSTM has a recurrent structure and memory cells, giving it the capacity to extract long-term temporal information. The basic structure of the LSTM is shown in equations (2) to (7):

c̃_t = tanh(W_xc·x_t + W_hc·h_{t-1} + b_c) (2)
i_t = σ(W_xi·x_t + W_hi·h_{t-1} + b_i) (3)
f_t = σ(W_xf·x_t + W_hf·h_{t-1} + W_cf·c_{t-1} + b_f) (4)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t (5)
o_t = σ(W_xo·x_t + W_ho·h_{t-1} + W_co·c_{t-1} + b_o) (6)
h_t = o_t ⊙ tanh(c_t) (7)

wherein x_t is the log spectrum of the t-th frame, h_t is the hidden unit vector at time t, c̃_t is an intermediate variable, and c_t is the cell unit vector specific to the LSTM; W_xc is the weight connecting the input and the cell unit, W_hc the weight connecting the hidden unit and the cell unit, W_xi the weight connecting the input and the input gate, W_hi the weight connecting the hidden unit and the input gate, W_xf the weight connecting the input and the forgetting gate, W_hf the weight connecting the hidden unit and the forgetting gate, W_cf the weight connecting the cell unit and the forgetting gate, W_xo the weight connecting the input and the output gate, W_ho the weight connecting the hidden unit and the output gate, and W_co the weight connecting the cell unit and the output gate; b_c is the bias of the cell network, b_i the bias of the input gate network, b_f the bias of the forgetting gate network, and b_o the bias of the output gate network; i, f and o are respectively the input gate, the forgetting gate and the output gate; σ is an activation function, generally the Sigmoid function; and ⊙ is point-to-point (element-wise) multiplication. As the formulas show, the LSTM maps the input vector sequence x_t through the cell unit vectors c_t and the hidden unit vectors h_t. The above structure can be repeated several times to form a multi-layer LSTM, with the output h_t of one layer serving as the input x_t of the next layer. Finally, the output h_t of the last LSTM layer is passed through a fully connected network to output the residual Δ_t, and the predicted residual is superimposed on the input log spectrum x_t to obtain the converted log spectrum;
5) the converted logarithmic spectrum, the base frequency after the logarithmic linear conversion and the non-periodic component obtained by copying are sent into a vocoder to generate final converted voice;
the voice change is implemented by means of a trained voice conversion module of a specific person, and the voice of the user A is changed into the voice of the user B through the conversion and synthesis of voice spectrum parameters; the voiceprint of any specific character designated by the user can be cloned, so that the function of imitating and disguising the specific character is realized; through the conversion from sound to sound, the rhythm and emotion of the source speech can be better reserved than those from text to speech, so that the speech is more vivid and lively.
The converted voice is output and transmitted over the network to the loudspeaker of the remote party, so that the remote party feels they are talking with the imitated speaker. The module writes the voice-changed data into the sound card designated by the voice call software; the software collects audio data from the sound card, i.e. converts the voice samples into a digital signal, encodes it and sends it, and upon receiving an encoded frame the remote side decodes it to recover data that the sound card can play directly.
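The encode/transmit/decode loop described above can be illustrated with G.711-style μ-law companding, a common telephony codec — chosen here purely as an example, since the patent does not name a codec, and the helper names are mine:

```python
import numpy as np

MU = 255.0  # mu-law compression parameter (8-bit telephony)

def mulaw_encode(x):
    """Compand float samples in [-1, 1] to 8-bit mu-law codes."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return np.round((y + 1) / 2 * MU).astype(np.uint8)

def mulaw_decode(codes):
    """Expand 8-bit mu-law codes back to float samples in [-1, 1]."""
    y = codes.astype(float) / MU * 2 - 1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU
```

The round trip is lossy but perceptually close: the logarithmic companding spends the 8 bits where the ear is most sensitive, which is why codecs of this family are standard for narrowband voice calls.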
In step (1), the voice of user A is picked up through a microphone array with narrow directivity. In this layout, the phase difference Δφ of signals synchronously collected by a pair of microphones is measured, and from the signal frequency f, the sound propagation speed c_0 and the spacing of the pair of microphones, the position of speaker A is obtained; once speaker A is located, the microphone array can steer its beam toward A, and this strong intelligent directional function significantly reduces the influence of ambient noise and echo. The voice data picked up by the microphone array serves as the input of the subsequent voice conversion module.
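Under a far-field assumption, the phase difference Δφ measured between a microphone pair with spacing d at frequency f relates to the arrival angle θ (measured from broadside) by Δφ = 2πfd·sinθ/c_0, which inverts to the sketch below (the function name is illustrative; the patent does not spell out the localization formula):

```python
import numpy as np

def doa_from_phase(delta_phi, f, d, c0=343.0):
    """Far-field direction of arrival (radians from broadside) for a mic pair.

    delta_phi: measured phase difference at frequency f (Hz);
    d: microphone spacing in metres; c0: sound propagation speed in m/s.
    """
    s = c0 * delta_phi / (2 * np.pi * f * d)
    return np.arcsin(np.clip(s, -1.0, 1.0))  # clip guards measurement noise
```

Note that the estimate is unambiguous only while fd·sinθ/c_0 stays below half a wavelength (|Δφ| < π); closer spacings tolerate higher frequencies before spatial aliasing sets in.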
In step (2), a voice conversion module for a specific person implements the voice change, turning the voice of user A into the voice of user B while keeping the content, intonation and emotion of the speech unchanged. For each frame of the source speech, a World vocoder extracts a 1-dimensional fundamental frequency feature, a 129-dimensional log spectrum feature and a 129-dimensional aperiodic component; the fundamental frequency is converted with formula (1); a 3-layer LSTM network with 100 hidden units per layer converts the log spectrum; the 129-dimensional aperiodic component is copied directly; and the three parts are fed into the World vocoder to output the voice waveform. On an Intel i7 CPU with 8 GB of memory, the conversion can be completed in real time.
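The per-stream division of labour in step (2) — map the fundamental frequency, predict the log spectrum, copy the aperiodicity — can be sketched at the level of World-style feature arrays. The converter callables below stand in for the trained log-linear function and the 3-layer, 100-unit LSTM; actual feature extraction and waveform synthesis would be done by a vocoder such as World and are not reproduced here:

```python
import numpy as np

F0_DIM, SP_DIM, AP_DIM = 1, 129, 129  # per-frame dimensions stated in the text

def convert_features(feats, f0_map, logspec_map):
    """Apply the three per-stream conversion rules to a frame-sequence dict.

    feats: {'f0': (T,), 'logspec': (T, SP_DIM), 'ap': (T, AP_DIM)} arrays.
    f0_map / logspec_map: hypothetical hooks for the trained converters.
    """
    return {
        'f0': f0_map(feats['f0']),                 # log-linear F0 conversion
        'logspec': logspec_map(feats['logspec']),  # LSTM residual prediction
        'ap': feats['ap'].copy(),                  # aperiodicity copied directly
    }
```

Keeping the three streams separate mirrors the design above: only the log spectrum needs a learned model, which is what makes frame-by-frame real-time operation plausible on a CPU.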
Step (3): the converted voice is output and transmitted over the network to the loudspeaker of the remote party, so that the remote party feels they are talking with the imitated speaker; a full-duplex sound card separates the input voice from the converted voice.
The voice-changed data is written into the sound card designated by the voice call software; the software collects audio data from the sound card, i.e. converts the voice samples into a digital signal, encodes it and sends it, and upon receiving an encoded frame the remote side decodes it to recover data that the sound card can play directly.
Correspondingly, the present application also provides a voice communication system with voiceprint cloning function, comprising:
the picking module is used for picking up the voice to be converted and inputting the voice to the pre-trained voice conversion module of the specific person;
the processing module is used for converting the voice to be converted into target voice, and the content, tone and emotion of the target voice are kept consistent with the voice to be converted;
and the transmission module is used for transmitting the voice of the specific person to a loudspeaker of a receiver.
The pick-up module comprises a microphone array module for picking up the voice to be converted by a microphone array with narrow directivity.
The processing module comprises:
the extraction module is used for extracting the voice characteristics of the voice to be converted, wherein the voice characteristics comprise fundamental frequency, log spectrum and non-periodic components;
the base frequency conversion module is used for converting the base frequency of the voice to be converted by utilizing a predetermined logarithmic linear function about the voice of the specific person to obtain a converted target base frequency;
the aperiodic component conversion module is used for copying the aperiodic component of the voice to be converted into a target aperiodic component;
the log spectrum conversion module is used for predicting the log spectrum difference of the specific human voice and the voice to be converted by utilizing the long-time memory model and determining a target log spectrum;
and the synthesis module is used for integrating the target fundamental frequency, the target aperiodic component and the target log spectrum to generate the target voice.
The fundamental frequency conversion module comprises a function determination module for determining a log-linear function as:
log F0_t = μ_t + (σ_t/σ_s)·(log F0_s − μ_s) (1)

wherein F0_t is the target fundamental frequency, F0_s is the fundamental frequency of the voice to be converted, μ_s and σ_s are respectively the mean and standard deviation of the log fundamental frequency of the voice to be converted, and μ_t and σ_t are respectively the mean and standard deviation of the log fundamental frequency of the specific person's voice.
The log spectrum conversion module comprises a log spectrum processing module,
for representing the difference between the log spectrums of the specific person's voice and the voice to be converted as Δ_t = y_t − x_t;
it is also used for adopting the structure of the long-short-term memory model shown in formulas (2) to (7):

c̃_t = tanh(W_xc·x_t + W_hc·h_{t-1} + b_c) (2)
i_t = σ(W_xi·x_t + W_hi·h_{t-1} + b_i) (3)
f_t = σ(W_xf·x_t + W_hf·h_{t-1} + W_cf·c_{t-1} + b_f) (4)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t (5)
o_t = σ(W_xo·x_t + W_ho·h_{t-1} + W_co·c_{t-1} + b_o) (6)
h_t = o_t ⊙ tanh(c_t) (7)

wherein y_t is the log spectrum of the t-th frame of the specific person's voice, x_t is the log spectrum of the t-th frame of the voice to be converted, h_t is the hidden unit vector of the long-short-term memory model at time t, o_t denotes the output gate at time t, i_t the input gate at time t, f_t the forgetting gate at time t, t−1 denotes the previous time, c_t is the cell unit vector specific to the long-short-term memory model and c̃_t the intermediate (temporary) cell vector, W_kl are the respective weights and b_l the respective biases (subscript k is x, c or h; subscript l is c, i, f or o), σ is an activation function, and ⊙ is point-wise (element-wise) multiplication;
the structure of the long-short-term memory model is used as follows:
at the start time, h_0 and c_0 are initialized; at time t = 1, the log spectrum x_1 of the 1st frame is input and the temporary cell unit vector c̃_1 is obtained by formula (2); the input gate vector i_1 and the forgetting gate vector f_1 are obtained by formulas (3) and (4); the cell unit vector c_1 is updated by formula (5); the output gate o_1 is computed by formula (6); finally, the hidden unit vector h_1 output by this layer is obtained by formula (7); and so on for any time t until the sequence ends;
the above structure is repeated several times to form a multi-layer LSTM, the output h_t of each layer serving as the input x_t of the next layer; finally, the output h_t of the last LSTM layer is passed through a fully-connected network to output the residual Δ_t, and the predicted residual is superimposed on the input log spectrum x_t to obtain the converted log spectrum.
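The recursion of formulas (2)–(7) can be sketched as a single-layer forward pass. The weights are randomly initialized stand-ins for a trained model, the peephole terms of formulas (4) and (6) use the previous cell vector c_{t-1} as written above, and all names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(xs, W, b):
    """Run formulas (2)-(7) over a sequence of frame vectors xs.
    W maps two-letter keys kl (k in {x, h, c}, l in {c, i, f, o}) to
    weight matrices; b maps l to bias vectors. Returns all h_t."""
    hidden = len(b["c"])
    h = np.zeros(hidden)  # h_0
    c = np.zeros(hidden)  # c_0
    hs = []
    for x in xs:
        c_prev = c                                                          # c_{t-1}
        c_tilde = np.tanh(W["xc"] @ x + W["hc"] @ h + b["c"])               # (2)
        i = sigmoid(W["xi"] @ x + W["hi"] @ h + b["i"])                     # (3)
        f = sigmoid(W["xf"] @ x + W["hf"] @ h + W["cf"] @ c_prev + b["f"])  # (4)
        c = f * c_prev + i * c_tilde                                        # (5)
        o = sigmoid(W["xo"] @ x + W["ho"] @ h + W["co"] @ c_prev + b["o"])  # (6)
        h = o * np.tanh(c)                                                  # (7)
        hs.append(h)
    return np.stack(hs)

# Toy usage: 5 frames of a 129-dim log spectrum, 100 hidden units
rng = np.random.default_rng(0)
dim, hidden = 129, 100
keys = ["xc", "xi", "xf", "xo", "hc", "hi", "hf", "ho", "cf", "co"]
W = {k: 0.01 * rng.standard_normal((hidden, dim if k[0] == "x" else hidden))
     for k in keys}
b = {l: np.zeros(hidden) for l in "cifo"}
hs = lstm_forward(rng.standard_normal((5, dim)), W, b)
```

Stacking three such layers and feeding the last layer's h_t through a fully-connected layer would yield the residual Δ_t described above.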
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A voice communication method with a voiceprint clone function is characterized in that,
picking up a voice to be converted, inputting the voice to be converted into a pre-trained voice conversion module of a specific person, and converting the voice to be converted into a target voice, wherein the content, tone and emotion of the target voice are consistent with the voice to be converted;
transmitting the specific person voice to a speaker of a listener.
2. The voice call method with voiceprint cloning function according to claim 1, wherein the process of picking up the voice to be converted comprises:
the speech to be converted is picked up by an array of microphones with narrow directivity.
3. The voice call method with voiceprint cloning function according to claim 1,
the process of inputting the voice to be converted into the pre-trained specific-person voice conversion module comprises the following steps:
extracting the voice features of the voice to be converted, wherein the voice features comprise fundamental frequency, log spectrum and non-periodic components;
converting the fundamental frequency of the voice to be converted by using a predetermined logarithmic linear function about the voice of the specific person to obtain a converted target fundamental frequency;
copying the non-periodic component of the voice to be converted into a target non-periodic component;
predicting the difference of the log spectrums of the specific human voice and the voice to be converted by utilizing a long-time and short-time memory model, and determining a target log spectrum;
and integrating the target fundamental frequency, the target aperiodic component and the target log spectrum to generate the target voice.
4. The voice call method with voiceprint cloning function according to claim 3,
the log-linear function is:
log F0_t = μ_t + (σ_t/σ_s)·(log F0_s − μ_s) (1)

wherein F0_t is the target fundamental frequency, F0_s is the fundamental frequency of the voice to be converted, μ_s and σ_s are respectively the mean and standard deviation of the log fundamental frequency of the voice to be converted, and μ_t and σ_t are respectively the mean and standard deviation of the log fundamental frequency of the specific person's voice.
5. The voice call method with voiceprint cloning function according to claim 3,
the process of predicting the difference of the log spectrums of the specific human voice and the voice to be converted by utilizing the long-time and short-time memory model and determining the target log spectrums comprises the following steps:
the difference between the log spectrums of the specific person's voice and the voice to be converted is expressed as Δ_t = y_t − x_t;
The structure of the long-short-term memory model is shown in formulas (2) to (7):

c̃_t = tanh(W_xc·x_t + W_hc·h_{t-1} + b_c) (2)
i_t = σ(W_xi·x_t + W_hi·h_{t-1} + b_i) (3)
f_t = σ(W_xf·x_t + W_hf·h_{t-1} + W_cf·c_{t-1} + b_f) (4)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t (5)
o_t = σ(W_xo·x_t + W_ho·h_{t-1} + W_co·c_{t-1} + b_o) (6)
h_t = o_t ⊙ tanh(c_t) (7)

wherein y_t is the log spectrum of the t-th frame of the specific person's voice, x_t is the log spectrum of the t-th frame of the voice to be converted, h_t is the hidden unit vector of the long-short-term memory model at time t, o_t denotes the output gate at time t, i_t the input gate at time t, f_t the forgetting gate at time t, t−1 denotes the previous time, c_t is the cell unit vector specific to the long-short-term memory model and c̃_t the intermediate (temporary) cell vector, W_kl are the respective weights and b_l the respective biases (subscript k is x, c or h; subscript l is c, i, f or o), σ is an activation function, and ⊙ is point-wise (element-wise) multiplication;
at the start time, h_0 and c_0 are initialized; at time t = 1, the log spectrum x_1 of the 1st frame is input and the temporary cell unit vector c̃_1 is obtained by formula (2); the input gate vector i_1 and the forgetting gate vector f_1 are obtained by formulas (3) and (4); the cell unit vector c_1 is updated by formula (5); the output gate o_1 is computed by formula (6); finally, the hidden unit vector h_1 output by this layer is obtained by formula (7); and so on for any time t until the sequence ends;
the above structure is repeated several times to form a multi-layer LSTM, the output h_t of each layer serving as the input x_t of the next layer; finally, the output h_t of the last LSTM layer is passed through a fully-connected network to output the residual Δ_t, and the predicted residual is superimposed on the input log spectrum x_t to obtain the converted log spectrum.
6. A voice call system having a voiceprint cloning function, comprising:
the picking module is used for picking up the voice to be converted and inputting the voice to the pre-trained voice conversion module of the specific person;
the processing module is used for converting the voice to be converted into target voice, and the content, tone and emotion of the target voice are kept consistent with the voice to be converted;
and the transmission module is used for transmitting the voice of the specific person to a loudspeaker of a receiver.
7. The system according to claim 6, wherein the picking-up module comprises a microphone array module for picking up the voice to be converted by a microphone array having a narrow directivity.
8. The voice call system with voiceprint cloning capability of claim 6,
the processing module comprises:
the extraction module is used for extracting the voice characteristics of the voice to be converted, wherein the voice characteristics comprise fundamental frequency, log spectrum and non-periodic components;
the base frequency conversion module is used for converting the base frequency of the voice to be converted by utilizing a predetermined logarithmic linear function about the voice of the specific person to obtain a converted target base frequency;
the aperiodic component conversion module is used for copying the aperiodic component of the voice to be converted into a target aperiodic component;
the log spectrum conversion module is used for predicting the log spectrum difference of the specific human voice and the voice to be converted by utilizing the long-time memory model and determining a target log spectrum;
and the synthesis module is used for integrating the target fundamental frequency, the target aperiodic component and the target log spectrum to generate the target voice.
9. The voice call system with voiceprint cloning capability of claim 8,
the fundamental frequency conversion module comprises a function determination module for determining a log-linear function as:
log F0_t = μ_t + (σ_t/σ_s)·(log F0_s − μ_s) (1)

wherein F0_t is the target fundamental frequency, F0_s is the fundamental frequency of the voice to be converted, μ_s and σ_s are respectively the mean and standard deviation of the log fundamental frequency of the voice to be converted, and μ_t and σ_t are respectively the mean and standard deviation of the log fundamental frequency of the specific person's voice.
10. The voice call system with voiceprint cloning capability of claim 8, wherein the log spectrum conversion module comprises a log spectrum processing module,
for representing the difference between the log spectrums of the specific person's voice and the voice to be converted as Δ_t = y_t − x_t;
it is also used for adopting the structure of the long-short-term memory model shown in formulas (2) to (7):

c̃_t = tanh(W_xc·x_t + W_hc·h_{t-1} + b_c) (2)
i_t = σ(W_xi·x_t + W_hi·h_{t-1} + b_i) (3)
f_t = σ(W_xf·x_t + W_hf·h_{t-1} + W_cf·c_{t-1} + b_f) (4)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t (5)
o_t = σ(W_xo·x_t + W_ho·h_{t-1} + W_co·c_{t-1} + b_o) (6)
h_t = o_t ⊙ tanh(c_t) (7)

wherein y_t is the log spectrum of the t-th frame of the specific person's voice, x_t is the log spectrum of the t-th frame of the voice to be converted, h_t is the hidden unit vector of the long-short-term memory model at time t, o_t denotes the output gate at time t, i_t the input gate at time t, f_t the forgetting gate at time t, t−1 denotes the previous time, c_t is the cell unit vector specific to the long-short-term memory model and c̃_t the intermediate (temporary) cell vector, W_kl are the respective weights and b_l the respective biases (subscript k is x, c or h; subscript l is c, i, f or o), σ is an activation function, and ⊙ is point-wise (element-wise) multiplication;
the structure of the long-short-term memory model is used as follows:
at the start time, h_0 and c_0 are initialized; at time t = 1, the log spectrum x_1 of the 1st frame is input and the temporary cell unit vector c̃_1 is obtained by formula (2); the input gate vector i_1 and the forgetting gate vector f_1 are obtained by formulas (3) and (4); the cell unit vector c_1 is updated by formula (5); the output gate o_1 is computed by formula (6); finally, the hidden unit vector h_1 output by this layer is obtained by formula (7); and so on for any time t until the sequence ends;
the above structure is repeated several times to form a multi-layer LSTM, the output h_t of each layer serving as the input x_t of the next layer; finally, the output h_t of the last LSTM layer is passed through a fully-connected network to output the residual Δ_t, and the predicted residual is superimposed on the input log spectrum x_t to obtain the converted log spectrum.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011432039.2A CN112735434A (en) | 2020-12-09 | 2020-12-09 | Voice communication method and system with voiceprint cloning function |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112735434A true CN112735434A (en) | 2021-04-30 |
Family
ID=75598732
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011432039.2A Pending CN112735434A (en) | 2020-12-09 | 2020-12-09 | Voice communication method and system with voiceprint cloning function |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112735434A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115497480A (en) * | 2021-06-18 | 2022-12-20 | 海信集团控股股份有限公司 | Sound repeated engraving method, device, equipment and medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102306492A (en) * | 2011-09-09 | 2012-01-04 | 中国人民解放军理工大学 | Voice conversion method based on convolutive nonnegative matrix factorization |
CN102982809A (en) * | 2012-12-11 | 2013-03-20 | 中国科学技术大学 | Conversion method for sound of speaker |
CN106782602A (en) * | 2016-12-01 | 2017-05-31 | 南京邮电大学 | Speech-emotion recognition method based on length time memory network and convolutional neural networks |
CN109767778A (en) * | 2018-12-27 | 2019-05-17 | 中国人民解放军陆军工程大学 | Bi-LSTM and WaveNet fused voice conversion method |
US20190251952A1 (en) * | 2018-02-09 | 2019-08-15 | Baidu Usa Llc | Systems and methods for neural voice cloning with a few samples |
CN111223474A (en) * | 2020-01-15 | 2020-06-02 | 武汉水象电子科技有限公司 | Voice cloning method and system based on multi-neural network |
Non-Patent Citations (5)
Title |
---|
Yao Tianren, "Digital Speech Processing", Huazhong University of Science and Technology Press, 30 April 1992, pages 232-233 |
Sang Shengju et al., "Digital Entertainment Technology and CAD", China Railway Publishing House, 31 August 2009, pages 286-287 |
Miao Xiaokong et al., "Speech deep forgery based on parameter conversion and its threat assessment to voiceprint authentication", Journal of Cyber Security, vol. 5, no. 6, pages 53-56 |
Jiang Gang et al., "Industrial Robots", Southwest Jiaotong University Press, 31 January 2011, page 148 |
Wei Xu et al., "Speech enhancement algorithm based on beamforming and multi-reference-source noise cancellation", Computer and Modernization, no. 196, 31 December 2011, page 46 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Szöke et al. | Building and evaluation of a real room impulse response dataset | |
JP7258182B2 (en) | Speech processing method, device, electronic device and computer program | |
CN110491404B (en) | Voice processing method, device, terminal equipment and storage medium | |
CN111833896B (en) | Voice enhancement method, system, device and storage medium for fusing feedback signals | |
CN110085245B (en) | Voice definition enhancing method based on acoustic feature conversion | |
US20180358003A1 (en) | Methods and apparatus for improving speech communication and speech interface quality using neural networks | |
CN104157293B (en) | The signal processing method of targeted voice signal pickup in a kind of enhancing acoustic environment | |
CN108877823B (en) | Speech enhancement method and device | |
CN111341303B (en) | Training method and device of acoustic model, and voice recognition method and device | |
CN109887489B (en) | Speech dereverberation method based on depth features for generating countermeasure network | |
CN112382301B (en) | Noise-containing voice gender identification method and system based on lightweight neural network | |
CN113241085B (en) | Echo cancellation method, device, equipment and readable storage medium | |
CN113823273B (en) | Audio signal processing method, device, electronic equipment and storage medium | |
CN111627455A (en) | Audio data noise reduction method and device and computer readable storage medium | |
CN114267372A (en) | Voice noise reduction method, system, electronic device and storage medium | |
CN112735434A (en) | Voice communication method and system with voiceprint cloning function | |
CN111353258A (en) | Echo suppression method based on coding and decoding neural network, audio device and equipment | |
CN113763978B (en) | Voice signal processing method, device, electronic equipment and storage medium | |
CN115705839A (en) | Voice playing method and device, computer equipment and storage medium | |
CN114120965A (en) | Audio processing method, electronic device, and storage medium | |
CN113990337A (en) | Audio optimization method and related device, electronic equipment and storage medium | |
CN111696566A (en) | Voice processing method, apparatus and medium | |
CN112720527B (en) | Music dance self-programming robot | |
CN115762552B (en) | Method for training echo cancellation model, echo cancellation method and corresponding device | |
Huemmer et al. | Online environmental adaptation of CNN-based acoustic models using spatial diffuseness features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||