CN109616131B - Digital real-time voice sound changing method - Google Patents


Info

Publication number: CN109616131B (application number CN201811342131.2A)
Authority: CN (China)
Prior art keywords: fundamental tone, voice, original, pitch, sound
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN109616131A
Inventors: 陈锴, 刘晓峻, 狄敏
Assignees (current and original; the listed assignees may be inaccurate, as Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list): Jiangsu Province Nanjing University Of Science And Technology Electronic Information Technology Co ltd; Nanjing Nanda Electronic Wisdom Service Robot Research Institute Co ltd; Nanjing University
Application filed by Jiangsu Province Nanjing University Of Science And Technology Electronic Information Technology Co ltd, Nanjing Nanda Electronic Wisdom Service Robot Research Institute Co ltd, and Nanjing University
Priority: CN201811342131.2A (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Published as CN109616131A (application) and, upon grant, as CN109616131B


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 — Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 — Changing voice quality, e.g. pitch or formants
    • G10L21/007 — Changing voice quality, e.g. pitch or formants, characterised by the process used

Abstract

The invention discloses a digital real-time voice changing method in which the non-unvoiced part of the original speech is adjusted and analyzed, signals from a specific-speaker pitch library are retrieved according to the comparison result to replace the original pitch, and the voice-changed signal is then obtained through synthesis and superposition. The method offers high naturalness and intelligibility; the changed voice is difficult to restore to the original, giving strong confidentiality; and it operates with low delay and low computational complexity.

Description

Digital real-time voice sound changing method
Technical Field
The invention relates to a voice changing method and belongs to the technical field of audio processing.
Background
Voice changing is an important speech processing technology, widely applied to voice interaction, secure communication, and special sound effects in consumer electronics.
Traditional voice changing mainly relies on frequency-modification techniques, which suffer from three defects: first, the naturalness of the changed voice is low and its intelligibility is reduced; second, the modification is simple and the original voice can easily be recovered, compromising secure communication; finally, the computational complexity is high and the processing delay large, limiting real-time operation.
Disclosure of Invention
Purpose of the invention: to overcome the defects of the prior art, the invention provides a real-time digital voice changing method that addresses three problems of current mainstream approaches: (1) low naturalness and intelligibility of the changed voice; (2) easy restoration of the changed voice to the original; (3) high processing delay and high computational complexity.
Technical scheme: to achieve the above purpose, the invention adopts the following technical scheme.
A digital real-time voice changing method comprises the following steps:
Step 1: distinguish unvoiced from non-unvoiced sound in the speech through initial/final segmentation.
Step 2: decompose the non-unvoiced sound through linear prediction, dividing the original speech into an original pitch (fundamental tone) model and an original vocal-tract model.
Step 3: adjust the original pitch according to the actual requirement, for example by changing the fundamental frequency or the speed at which it changes.
Step 4: compare the adjusted pitch with the pitch information in a specific-speaker pitch library to find the best-matching pitch signal.
Step 5: reconstruct and optimize the pitch information to obtain the corrected pitch signal.
Step 6: synthesize the corrected pitch with the vocal-tract model to form the voice-changed non-unvoiced signal.
Step 7: combine the original unvoiced signal with the voice-changed non-unvoiced signal to form the adjusted speech signal.
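For illustration, the two adjustments named in step 3 — changing the fundamental frequency and changing how fast it moves — can be sketched on a frame-wise F0 contour. The function name and the reading of "change speed" as scaling excursions around the mean F0 are assumptions made for this sketch, not the patented implementation:

```python
import numpy as np

def adjust_f0(f0_contour, freq_scale=1.0, rate_scale=1.0):
    """Adjust a frame-wise F0 contour (Hz); illustrative only.

    freq_scale multiplies every voiced F0 value (shifts the pitch);
    rate_scale scales deviations around the mean F0 (exaggerates or
    flattens the pitch movement without moving the average).
    """
    f0 = np.asarray(f0_contour, dtype=float)
    voiced = f0 > 0                      # convention: 0 marks unvoiced frames
    mean_f0 = f0[voiced].mean() if voiced.any() else 0.0
    adjusted = f0.copy()
    # scale the excursion around the mean, then shift the whole contour
    adjusted[voiced] = (mean_f0 + (f0[voiced] - mean_f0) * rate_scale) * freq_scale
    return adjusted

contour = [0.0, 100.0, 110.0, 120.0, 0.0]   # Hz per frame, 0 = unvoiced
up = adjust_f0(contour, freq_scale=2.0)      # raise pitch one octave
flat = adjust_f0(contour, rate_scale=0.0)    # remove pitch movement
```

Unvoiced frames (F0 = 0) pass through untouched, matching the method's rule that only non-unvoiced sound is modified.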
Preferably, the specific-speaker pitch library is derived mainly from analyzing and extracting that speaker's voice, and contains the pitch signals of the common syllables and words the speaker pronounces.
Preferably, in step 2 the speech is decomposed by linear prediction into a vocal-tract model and the original pitch, and the vocal-tract model parameters are retained for later speech synthesis.
Preferably, the adjusted original pitch is compared with all pitch signals in the specific-speaker pitch library, and the most similar pitch segments are obtained by correlation comparison, pattern matching, or machine learning.
Preferably, the specific-speaker pitch library is stored in a cloud system and served by a dedicated real-time retrieval system.
Preferably, the method is implemented on a DSP-plus-ARM system.
Preferably, the DSP performs the initial/final segmentation and linear prediction and extracts the original pitch of the non-unvoiced signal.
Preferably, the DSP synthesizes the adjusted pitch and the vocal-tract model into the non-unvoiced signal, then superimposes the original unvoiced signal to form the voice-changed speech signal.
Compared with the prior art, the invention has the following beneficial effects:
1. All pitch information used during voice changing comes from pitch extracted from natural speech, and no frequency conversion is applied directly to the voice, so naturalness and intelligibility are preserved.
2. The pitch information of the changed voice comes entirely from a specific speaker's pitch library, and the speaker-characteristic information of the original signal is completely removed, so other systems cannot easily restore the original voice.
3. The voice changing has low computational complexity and small processing delay, and, combined with cloud processing, lends itself to real-time implementation.
Drawings
FIG. 1 is a schematic diagram of the voice changing system.
FIG. 2 is a block diagram of an implementation of the present invention based on a floating point DSP and ARM system.
Detailed Description
The present invention is further illustrated by the accompanying drawings and the following detailed description, which are to be understood as merely illustrative of the invention and not limiting of its scope; after reading this disclosure, various equivalent modifications made by those skilled in the art fall within the scope of the appended claims.
A digital real-time voice changing method, as shown in FIG. 1, comprises the following 7 parts:
1. distinguish unvoiced from non-unvoiced sound (voiced sounds, voiced consonants, fricatives) in the speech through initial/final segmentation;
2. decompose the non-unvoiced sound (voiced sounds, voiced consonants, fricatives) through linear prediction, dividing the original speech into an original pitch model and an original vocal-tract model;
3. adjust the original pitch according to the actual requirement, e.g. changing the fundamental frequency or its rate of change;
4. compare the adjusted pitch with the pitch information in the specific-speaker pitch library to find the best-matching pitch signal;
5. reconstruct and optimize the pitch information to obtain the corrected pitch signal;
6. synthesize the corrected pitch with the vocal-tract model to form the voice-changed non-unvoiced signal;
7. combine the unvoiced signal with the voice-changed non-unvoiced signal to form the adjusted speech signal.
The initial/final segmentation distinguishes the unvoiced and non-unvoiced parts of the speech, where the non-unvoiced parts comprise voiced sounds, voiced consonants, and fricatives; in the synthesis stage, the system superimposes the adjusted non-unvoiced sound and the original unvoiced sound to form the new, voice-changed speech signal.
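The patent does not spell out the segmentation algorithm. As a stand-in, each frame can be labeled with a short-time-energy and zero-crossing-rate heuristic: unvoiced speech is typically low-energy and noise-like (many zero crossings). The thresholds, frame length, and function name below are illustrative assumptions:

```python
import numpy as np

def classify_frames(signal, frame_len=160, energy_thr=0.01, zcr_thr=0.3):
    """Label each frame 'unvoiced' or 'non-unvoiced' (voiced sound,
    voiced consonant, or fricative) — a simple proxy for the
    initial/final segmentation described above."""
    labels = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = np.asarray(signal[start:start + frame_len], dtype=float)
        energy = float(np.mean(frame ** 2))                       # short-time energy
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0)  # crossings/sample
        labels.append('unvoiced' if energy < energy_thr or zcr > zcr_thr
                      else 'non-unvoiced')
    return labels

fs = 8000
t = np.arange(fs // 10) / fs                        # 100 ms of samples
voiced = 0.5 * np.sin(2 * np.pi * 150 * t)          # strong 150 Hz tone ≈ voiced
rng = np.random.default_rng(0)
noise = 0.05 * rng.standard_normal(len(t))          # weak noise ≈ unvoiced
labels = classify_frames(np.concatenate([voiced, noise]))
```

The tone frames are high-energy with few crossings and come out non-unvoiced; the noise frames are low-energy and come out unvoiced.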
The specific-speaker pitch library is derived mainly from analyzing and extracting that speaker's speech, and includes the pitch signals produced during the pronunciation of common syllables and words. Building the pitch library for a specific speaker requires a dedicated training procedure.
The speech is decomposed by linear prediction into a vocal-tract model and the original pitch, and the vocal-tract model parameters are retained for later speech synthesis.
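A compact sketch of that decomposition using autocorrelation-method LPC: the predictor coefficients model the vocal tract, and inverse filtering leaves a residual that carries the pitch. This is the generic textbook formulation, not necessarily the patent's exact implementation:

```python
import numpy as np
from scipy.signal import lfilter
from scipy.linalg import solve_toeplitz

def lpc_decompose(frame, order=10):
    """Split a speech frame into a vocal-tract model (prediction-error
    filter A(z)) and an excitation residual that carries the pitch."""
    x = np.asarray(frame, dtype=float)
    # autocorrelation at lags 0..order
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    # Yule-Walker: symmetric Toeplitz system for the predictor coefficients
    a = solve_toeplitz(r[:order], r[1:order + 1])
    A = np.concatenate(([1.0], -a))          # analysis filter A(z)
    residual = lfilter(A, [1.0], x)          # inverse-filtered excitation
    return A, residual

# crude voiced frame: a 100 Hz impulse train through a 2-pole "vocal tract"
fs = 8000
n = np.arange(400)
excitation = (n % 80 == 0).astype(float)
frame = lfilter([1.0], [1.0, -1.3, 0.8], excitation)
A, residual = lpc_decompose(frame, order=2)
# A should approximately recover the synthesis denominator [1, -1.3, 0.8]
```

Because the test signal really is an order-2 all-pole process, the estimated A(z) comes out very close to the true filter, and the residual approximates the pulse train, i.e. the pitch.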
The original pitch is adjusted according to the user's requirements, including adjusting the fundamental frequency, adjusting its rate of change, and so on.
The adjusted original pitch is compared with all pitch signals in the specific-speaker pitch library; the most similar pitch segments are obtained by correlation comparison, pattern matching, machine learning, or similar methods, and are then optimized. The main purpose of the optimization is to keep the pitch continuous and improve the naturalness of the voice, finally yielding the corrected pitch.
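One way to realize that continuity optimization is to cross-fade a few frames at each joint between retrieved pitch segments; the patent leaves the method open, so the blending scheme and names below are purely illustrative:

```python
import numpy as np

def smooth_pitch_joints(segments, blend=3):
    """Concatenate retrieved pitch segments, cross-fading `blend` frames
    at each joint so the F0 track stays continuous."""
    out = np.asarray(segments[0], dtype=float)
    for seg in segments[1:]:
        seg = np.asarray(seg, dtype=float)
        k = min(blend, len(out), len(seg))
        w = np.linspace(0.0, 1.0, k)                 # linear cross-fade weights
        out[-k:] = (1 - w) * out[-k:] + w * seg[:k]  # blend overlapping frames
        out = np.concatenate([out, seg[k:]])
    return out

a = [100.0, 100.0, 100.0, 100.0]   # frames at 100 Hz
b = [120.0, 120.0, 120.0, 120.0]   # next retrieved segment at 120 Hz
track = smooth_pitch_joints([a, b], blend=3)
```

Instead of a 20 Hz step at the joint, the track ramps 100 → 110 → 120 Hz, which is the kind of continuity the optimization step is after.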
The specific-speaker pitch library can be stored in a cloud system, with a dedicated retrieval system used to improve the efficiency and utilization of the system.
The corrected pitch and the vocal-tract model are synthesized to form the voice-changed non-unvoiced speech segment.
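This synthesis is the mirror image of the LPC analysis: drive the retained all-pole vocal-tract filter 1/A(z) with an excitation built at the corrected pitch. A minimal sketch, with the filter coefficients and pitch period chosen only for illustration:

```python
import numpy as np
from scipy.signal import lfilter

def synthesize_voiced(pitch_excitation, vocal_tract_A):
    """Form the modified non-unvoiced segment by filtering the corrected
    pitch excitation through the retained vocal-tract model 1/A(z)."""
    return lfilter([1.0], np.asarray(vocal_tract_A, dtype=float),
                   np.asarray(pitch_excitation, dtype=float))

# pulse train at the corrected pitch period through the stored model
period = 64                          # e.g. 125 Hz at an 8 kHz sample rate
excitation = np.zeros(256)
excitation[::period] = 1.0
A = [1.0, -1.3, 0.8]                 # vocal-tract model kept from analysis
voiced_out = synthesize_voiced(excitation, A)
```

The first output samples follow the recursion y[n] = x[n] + 1.3·y[n-1] - 0.8·y[n-2], confirming the all-pole filter is applied as intended.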
The voice changing system adjusts and analyzes the non-unvoiced part of the original voice, retrieves signals from the specific-speaker pitch library to replace the original pitch according to the comparison result, and then obtains the voice-changed signal through synthesis and superposition. The specific-speaker pitch library results from analyzing and extracting that speaker's speech.
As shown in FIG. 2, the whole system is realized on a floating-point DSP plus an ARM system:
1. the ARM transmits the system's adjustment requirements to the floating-point DSP;
2. microphone data is passed through an ADC (analog-to-digital converter) to the floating-point DSP as the system input;
3. the floating-point DSP feeds signals through a DAC (digital-to-analog converter) to a loudspeaker for playback as the system output;
4. the floating-point DSP performs initial/final segmentation, linear prediction, and related functions, and extracts the original pitch of the non-unvoiced signal;
5. the floating-point DSP adjusts the original pitch and transmits it via the ARM to the cloud;
6. the cloud compares the adjusted original pitch with the specific-speaker pitch library, finds the most similar pitch signal, and transmits it back to the floating-point DSP;
7. the floating-point DSP synthesizes the adjusted pitch and the vocal-tract model into the non-unvoiced signal, and then superimposes the original unvoiced signal to form the voice-changed speech signal.
The foregoing is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make various modifications and adaptations without departing from the principles of the invention, and such modifications and adaptations are also to be regarded as within the scope of the invention.

Claims (6)

1. A digital real-time voice changing method, comprising the following steps:
step 1, distinguishing unvoiced from non-unvoiced sound in the speech through initial/final segmentation;
step 2, decomposing the non-unvoiced sound through linear prediction, dividing the original speech into an original pitch model and an original vocal-tract model;
step 3, adjusting the original pitch according to the actual requirement;
step 4, comparing the adjusted pitch with the pitch information in a specific-speaker pitch library to find the best-matching pitch signal; the specific-speaker pitch library is derived mainly from analyzing and extracting that speaker's voice, and comprises the pitch signals of the common syllables and words the speaker pronounces; the adjusted original pitch is compared with all pitch signals in the specific-speaker pitch library, and the most similar pitch segments are obtained by correlation comparison, pattern matching, or machine learning;
step 5, reconstructing and optimizing the pitch information to obtain a corrected pitch signal;
step 6, synthesizing the corrected pitch with the vocal-tract model to form the voice-changed non-unvoiced signal;
step 7, combining the original unvoiced signal with the voice-changed non-unvoiced signal to form the adjusted speech signal; a DSP synthesizes the adjusted pitch and the vocal-tract model into the non-unvoiced signal and then superimposes the original unvoiced signal to form the voice-changed speech signal.
2. The digital real-time voice changing method according to claim 1, wherein: in step 2, the speech is decomposed by linear prediction into a vocal-tract model and the original pitch, and the vocal-tract model parameters are retained for later speech synthesis.
3. The digital real-time voice changing method according to claim 2, wherein: the specific-speaker pitch library is stored in a cloud system and served by a dedicated real-time retrieval system.
4. The digital real-time voice changing method according to claim 3, wherein: the method is implemented on a DSP-plus-ARM system.
5. The digital real-time voice changing method according to claim 4, wherein: the DSP performs the initial/final segmentation and linear prediction and extracts the original pitch of the non-unvoiced signal.
6. The digital real-time voice changing method according to claim 5, wherein: in step 3, adjusting the original pitch according to the actual requirement comprises changing the fundamental frequency and/or changing the rate at which the fundamental frequency changes.
CN201811342131.2A · Priority date 2018-11-12 · Filing date 2018-11-12 · Digital real-time voice sound changing method · Active · CN109616131B (en)

Priority Applications (1)

CN201811342131.2A (CN109616131B) · Priority date 2018-11-12 · Filing date 2018-11-12 · Digital real-time voice sound changing method


Publications (2)

CN109616131A (en) — published 2019-04-12
CN109616131B (en) — granted 2023-07-07

Family

ID=66003036

Family Applications (1)

CN201811342131.2A — Active — CN109616131B (en)

Country Status (1)

CN — CN109616131B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110364177A (en) * 2019-07-11 2019-10-22 努比亚技术有限公司 Method of speech processing, mobile terminal and computer readable storage medium
CN110942765B (en) * 2019-11-11 2022-05-27 珠海格力电器股份有限公司 Method, device, server and storage medium for constructing corpus
CN111739547B (en) * 2020-07-24 2020-11-24 深圳市声扬科技有限公司 Voice matching method and device, computer equipment and storage medium
CN113486964A (en) * 2021-07-13 2021-10-08 盛景智能科技(嘉兴)有限公司 Voice activity detection method and device, electronic equipment and storage medium

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1152776A (en) * 1995-10-26 1997-06-25 索尼公司 Method and arrangement for phoneme signal duplicating, decoding and synthesizing
CN1567428A (en) * 2003-06-19 2005-01-19 北京中科信利技术有限公司 Phoneme changing method based on digital signal processing
CN101354889A (en) * 2008-09-18 2009-01-28 北京中星微电子有限公司 Method and apparatus for tonal modification of voice
CN101399044A (en) * 2007-09-29 2009-04-01 国际商业机器公司 Voice conversion method and system
CN101510424A (en) * 2009-03-12 2009-08-19 孟智平 Method and system for encoding and synthesizing speech based on speech primitive
CN102592590A (en) * 2012-02-21 2012-07-18 华南理工大学 Arbitrarily adjustable method and device for changing phoneme naturally
CN102982809A (en) * 2012-12-11 2013-03-20 中国科学技术大学 Conversion method for sound of speaker
CN103489443A (en) * 2013-09-17 2014-01-01 湖南大学 Method and device for imitating sound
CN203386472U (en) * 2013-04-26 2014-01-08 天津科技大学 Character voice changer
CN105023570A (en) * 2014-04-30 2015-11-04 安徽科大讯飞信息科技股份有限公司 method and system of transforming speech
CN107924678A * 2015-09-16 2018-04-17 株式会社东芝 Speech synthesis device, speech synthesis method, speech synthesis program, speech synthesis model training device, speech synthesis model training method, and speech synthesis model training program
CN108682413A (en) * 2018-04-24 2018-10-19 上海师范大学 A kind of emotion direct system based on voice conversion

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW430778B (en) * 1998-06-15 2001-04-21 Yamaha Corp Voice converter with extraction and modification of attribute data


Also Published As

Publication number Publication date
CN109616131A (en) 2019-04-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant