CN111210834A - Voice exaggeration system - Google Patents

Voice exaggeration system

Info

Publication number
CN111210834A
CN111210834A
Authority
CN
China
Prior art keywords
exaggeration
voice
exaggerated
phoneme
feature vector
Prior art date
Legal status
Pending
Application number
CN201811386157.7A
Other languages
Chinese (zh)
Inventor
骆成品
钟建生
李坤
孙立发
Current Assignee
Speechx Ltd
Original Assignee
Speechx Ltd
Priority date
Filing date
Publication date
Application filed by Speechx Ltd filed Critical Speechx Ltd
Priority to CN201811386157.7A
Publication of CN111210834A
Legal status: Pending


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013 Adapting to target pitch
    • G10L2021/0135 Voice conversion or morphing


Abstract

The invention discloses a voice exaggeration system comprising a voice input module, a voice exaggeration module, and a voice output module. The voice input module acquires normal speech and exaggerated speech before the voice exaggeration model is established, acquires the speech to be exaggerated after the model is established, and transmits the speech to the voice exaggeration module. The voice exaggeration module uses a deep neural network to obtain exaggeration parameters, namely the differences in pitch, duration, and volume between the exaggerated speech and the normal speech; it establishes a voice exaggeration model from these parameters, processes speech annotated with the required degree of exaggeration, and outputs the exaggerated speech. By exaggerating correct pronunciation, the invention strengthens the learner's perception of correct pronunciation, assists language learning, and can also be applied to fields such as speech synthesis (TTS).

Description

Voice exaggeration system
Technical Field
The invention relates to a voice exaggeration system.
Background
In non-native language learning, listening and speaking are the key skills. A learner can truly master a non-native language only by hearing and understanding it correctly; listening is the foundation of speaking, and correct pronunciation can be produced only after correct pronunciation has been heard. Existing language-learning aids can only replay correct speech repeatedly; they cannot exaggerate correct pronunciation to strengthen the learner's perception of it.
Disclosure of Invention
The invention provides a voice exaggeration system, which solves the problem that prior-art devices cannot exaggerate correct pronunciation to strengthen the learner's perception of it.
The technical scheme of the invention is realized as follows:
A voice exaggeration system comprises:
a voice input module, which acquires normal speech and exaggerated speech before the voice exaggeration model is established, acquires the speech to be exaggerated after the model is established, and transmits the speech to the voice exaggeration module;
a voice exaggeration module, which uses a deep neural network to obtain the exaggeration parameters of the exaggerated speech relative to the normal speech in pitch, duration, and volume, establishes a voice exaggeration model from these parameters, processes speech annotated with the required degree of exaggeration, and outputs the exaggerated speech.
Preferably, the normal speech and the exaggerated speech each comprise several phonemes; the exaggeration levels of the current exaggerated phoneme and of the three preceding and three following exaggerated phonemes are extracted and, together with the ID of the current exaggerated phoneme, form an input feature vector; the input feature vector is fed into the deep neural network, the exaggeration parameters are trained, and an output feature vector is obtained.
Preferably, the current exaggerated phoneme is divided into five frames, and for each frame the pitch difference, duration difference, and volume difference relative to the corresponding phoneme of the normal speech are extracted to form the output feature vector.
Preferably, the output feature vector is a 1 × 15 matrix.
Preferably, the degree of exaggeration is encoded as a 2-bit binary number, the ID of the exaggerated phoneme is encoded as a 6-bit binary number, and the input feature vector is a 1 × 20 matrix.
Preferably, the degrees of exaggeration are none, weak, and strong.
The beneficial effects of the invention are as follows: exaggerating correct pronunciation strengthens the learner's perception of correct pronunciation, assists language learning, and can also be applied to fields such as speech synthesis (TTS).
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed to describe them are briefly introduced below. The drawings in the following description show only some embodiments of the invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a functional block diagram of an embodiment of a speech exaggeration system according to the present invention.
In the figure, 1-voice input module; 2-voice exaggeration module.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the invention.
As shown in FIG. 1, the present invention provides a voice exaggeration system comprising:
the voice input module 1, which acquires normal speech and exaggerated speech of different degrees of exaggeration before the voice exaggeration model is established, acquires the speech to be exaggerated after the model is established, and transmits the speech to the voice exaggeration module 2;
the voice exaggeration module 2, which uses a deep neural network to obtain the exaggeration parameters of the differently exaggerated speech relative to the normal speech in three aspects (pitch, duration, and volume), establishes a voice exaggeration model from these parameters, processes speech annotated with the required degree of exaggeration, and outputs the exaggerated speech.
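The patent does not specify the architecture of the deep neural network, so the following is only a rough sketch: a small feed-forward network mapping a 1 × 20 input feature vector to a 1 × 15 vector of exaggeration parameters. The hidden-layer sizes, the initialisation, and the function names (`init_mlp`, `forward`) are illustrative assumptions, not details from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes=(20, 64, 64, 15)):
    """Randomly initialised weights for a small feed-forward network
    mapping the 1 x 20 input feature vector to the 1 x 15 vector of
    pitch/duration/volume differences (hidden sizes are illustrative)."""
    return [(rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """ReLU hidden layers, linear output (a regression of differences)."""
    for w, b in params[:-1]:
        x = np.maximum(x @ w + b, 0.0)
    w, b = params[-1]
    return x @ w + b
```

Training such a network on (input vector, output vector) pairs extracted from paired normal/exaggerated recordings would yield the exaggeration model the text describes.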
The invention exaggerates speech mainly in three aspects: pitch, duration, and volume. Pitch refers chiefly to the fundamental frequency of the sound (for example, female voices are generally higher-pitched and male voices lower-pitched) and is measured in semitones; duration refers to how long the sound lasts and is measured in s; volume reflects the perceived loudness of the sound and is measured in dB.
The normal speech and the exaggerated speech each comprise several phonemes. The exaggeration levels of the current exaggerated phoneme and of the three preceding and three following exaggerated phonemes are extracted and, together with the ID of the current exaggerated phoneme, form an input feature vector; the input feature vector is fed into the deep neural network, the exaggeration parameters are trained, and an output feature vector is obtained.
The current exaggerated phoneme is divided into five frames, and for each frame the pitch difference, duration difference, and volume difference relative to the corresponding phoneme of the normal speech are extracted to form the output feature vector. The output feature vector is therefore a 1 × 15 matrix (5 frames × 3 differences).
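The 1 × 15 output vector described above (five frames, each contributing a pitch, a duration, and a volume difference) can be sketched as follows; the function name and the per-frame triple representation are assumptions for illustration:

```python
import numpy as np

def output_feature_vector(exaggerated_frames, normal_frames):
    """Build the 1 x 15 output vector: for each of the five frames,
    the (pitch, duration, volume) differences between the exaggerated
    phoneme and the corresponding normal phoneme.

    Each argument is a list of five (pitch, duration, volume) triples,
    one per frame of the phoneme.
    """
    assert len(exaggerated_frames) == len(normal_frames) == 5
    diffs = [e - n
             for ex, nm in zip(exaggerated_frames, normal_frames)
             for e, n in zip(ex, nm)]
    return np.array(diffs).reshape(1, 15)
```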
The degree of exaggeration is encoded as a 2-bit binary number, the ID of the exaggerated phoneme is encoded as a 6-bit binary number, and the input feature vector is a 1 × 20 matrix. The degrees of exaggeration of the exaggerated speech are none, weak, and strong.
The exaggeration levels are encoded as no exaggeration (0, 0), weak exaggeration (0, 1), and strong exaggeration (1, 0). The levels of the current exaggerated phoneme and of the three phonemes before and after it (7 × 2 = 14 bits), together with the 6-bit binary ID of the exaggerated phoneme within the current utterance, form the 1 × 20 feature vector.
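Under the encoding above, the 1 × 20 input vector packs seven 2-bit exaggeration levels and one 6-bit phoneme ID. A minimal sketch (function and constant names are assumptions):

```python
import numpy as np

# 2-bit codes for the three exaggeration levels described in the text
LEVEL_CODE = {"none": (0, 0), "weak": (0, 1), "strong": (1, 0)}

def input_feature_vector(levels, phoneme_id):
    """Build the 1 x 20 input vector: the 2-bit exaggeration levels of
    the current phoneme, its three predecessors, and its three
    successors (7 x 2 = 14 bits), followed by the 6-bit phoneme ID.

    `levels` is a 7-element sequence of level names (index 3 is the
    current phoneme); `phoneme_id` is an integer in [0, 63].
    """
    assert len(levels) == 7 and 0 <= phoneme_id < 64
    bits = [b for name in levels for b in LEVEL_CODE[name]]
    bits += [int(b) for b in format(phoneme_id, "06b")]
    return np.array(bits).reshape(1, 20)
```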
The phoneme to be exaggerated is fed into the neural network to obtain the differences between the exaggerated phoneme and the normal phoneme in the three parameters. Let the original values be pitch P1, duration D1, and volume I1, and the exaggerated values be pitch P2, duration D2, and volume I2, with units of semitones, s, and dB respectively:
P2 = P1 + ΔP
D2 = D1 + ΔD
I2 = I1 + ΔI
where ΔP is the pitch difference, ΔD is the duration difference, and ΔI is the volume difference.
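Applying the predicted differences is then a direct element-wise addition; a minimal sketch (the function name is an assumption):

```python
def apply_exaggeration(p1, d1, i1, d_pitch, d_duration, d_volume):
    """Apply the network-predicted differences:
    P2 = P1 + dP, D2 = D1 + dD, I2 = I1 + dI
    (units: semitones, seconds, dB)."""
    return p1 + d_pitch, d1 + d_duration, i1 + d_volume
```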
Taking volume adjustment as an example, the five frames of the divided phoneme are processed separately. Suppose the volume increases for the five frames are 5 dB, 4 dB, 3 dB, 4 dB, and 5 dB, and that within each frame the gain is constant from 0.1 to 0.9 of the frame length. For example, if the first frame is 100 ms long, then from 10 ms to 90 ms the sound pressure p2 is:
p2 = p1 × 10^(5/20)
where the sound pressures p2 and p1 are in Pa, and p1 is taken as a standard pressure of 0.02 Pa.
From 0 ms to 10 ms and from 90 ms to 110 ms, however, the gain transitions linearly. Taking the transition from 0 ms to 10 ms as an example:
k1 = I2 / x = 5 / 10 = 0.5
b = I2 − k1 × x = 5 − 0.5 × 10 = 0
I = I1 × 10^((k1 × x + b) / 20), 0 < x < 10 ms
When the quantity being adjusted is volume, k1 represents the speed of the volume transition (the slope) in dB/ms, and b represents the volume at the start of the transition in dB.
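The per-frame gain envelope described above can be sketched as follows, assuming a 100 ms frame with 10 ms linear ramps at each end. The patent's example only works out the fade-in; the symmetric fade-out and the function names are assumptions:

```python
def volume_gain_db(t_ms, frame_ms=100.0, ramp_ms=10.0, gain_db=5.0):
    """Piecewise gain envelope for one frame: linear ramp up over the
    first `ramp_ms`, constant `gain_db` through the middle of the
    frame, and a linear ramp down at the end. The ramp slope is
    k1 = gain_db / ramp_ms in dB/ms (0.5 in the text's example)."""
    k1 = gain_db / ramp_ms
    if t_ms < ramp_ms:                       # fade-in: I = k1 * t + b, b = 0
        return k1 * t_ms
    if t_ms <= frame_ms - ramp_ms:           # constant region
        return gain_db
    return max(0.0, k1 * (frame_ms - t_ms))  # fade-out (assumed symmetric)

def apply_gain(p1, gain_db):
    """Convert a dB gain to a pressure ratio: p2 = p1 * 10**(gain/20)."""
    return p1 * 10.0 ** (gain_db / 20.0)
```

Evaluating the envelope per sample and multiplying the waveform by `apply_gain(1.0, g)` reproduces the constant-then-ramped adjustment the text describes.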
For the semitone, assuming that the difference between the semitones of the two pitches is Δ semitone, and the two pitches are F1 and F2 in Hz, respectively, there is a conversion formula:
Figure BDA0001873001950000042
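The conversion between a semitone difference and a frequency ratio follows the standard relation F2 = F1 × 2^(Δsemitone/12), equivalently Δsemitone = 12 × log2(F2/F1). A small sketch (function names are assumptions):

```python
import math

def semitone_shift(f1_hz, d_semitones):
    """Shift a frequency by d semitones: F2 = F1 * 2**(d / 12)."""
    return f1_hz * 2.0 ** (d_semitones / 12.0)

def semitone_difference(f1_hz, f2_hz):
    """Semitone difference between two pitches: 12 * log2(F2 / F1)."""
    return 12.0 * math.log2(f2_hz / f1_hz)
```

For example, shifting 220 Hz up by 12 semitones gives 440 Hz, one octave.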
The beneficial effects of the invention are as follows: exaggerating correct pronunciation strengthens the learner's perception of correct pronunciation, assists language learning, and can also be applied to fields such as speech synthesis (TTS).
The above discloses the improvement points of the invention; technical content not disclosed in detail can be implemented by a person skilled in the art using the prior art.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (6)

1. A speech exaggeration system, characterized in that it comprises:
a voice input module, which acquires normal speech and exaggerated speech before the voice exaggeration model is established, acquires the speech to be exaggerated after the model is established, and transmits the speech to the voice exaggeration module;
a voice exaggeration module, which uses a deep neural network to obtain the exaggeration parameters of the exaggerated speech relative to the normal speech in pitch, duration, and volume, establishes a voice exaggeration model from these parameters, processes speech annotated with the required degree of exaggeration, and outputs the exaggerated speech.
2. The speech exaggeration system of claim 1, wherein: the normal speech and the exaggerated speech each comprise several phonemes; the exaggeration levels of the current exaggerated phoneme and of the three preceding and three following exaggerated phonemes are extracted and, together with the ID of the current exaggerated phoneme, form an input feature vector; the input feature vector is fed into the deep neural network, the exaggeration parameters are trained, and an output feature vector is obtained.
3. The speech exaggeration system of claim 2, wherein: the current exaggerated phoneme is divided into five frames, and for each frame the pitch difference, duration difference, and volume difference relative to the corresponding phoneme of the normal speech are extracted to form the output feature vector.
4. The speech exaggeration system of claim 3, wherein: the output feature vector is a 1 × 15 matrix.
5. The speech exaggeration system of claim 2, wherein: the degree of exaggeration is encoded as a 2-bit binary number, the ID of the exaggerated phoneme is encoded as a 6-bit binary number, and the input feature vector is a 1 × 20 matrix.
6. The speech exaggeration system of claim 2 or 5, wherein: the degrees of exaggeration are none, weak, and strong.
CN201811386157.7A 2018-11-20 2018-11-20 Voice exaggeration system Pending CN111210834A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811386157.7A CN111210834A (en) 2018-11-20 2018-11-20 Voice exaggeration system


Publications (1)

Publication Number Publication Date
CN111210834A (en) 2020-05-29

Family

ID=70786365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811386157.7A Pending CN111210834A (en) 2018-11-20 2018-11-20 Voice exaggeration system

Country Status (1)

Country Link
CN (1) CN111210834A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1118493A (en) * 1994-08-01 1996-03-13 中国科学院声学研究所 Language and speech converting system with synchronous fundamental tone waves
CN102664017A (en) * 2012-04-25 2012-09-12 武汉大学 Three-dimensional (3D) audio quality objective evaluation method
CN106203626A (en) * 2016-06-30 2016-12-07 北京奇虎科技有限公司 Car steering behavioral value method and device, automobile
CN107682561A (en) * 2017-11-10 2018-02-09 广东欧珀移动通信有限公司 volume adjusting method, device, terminal and storage medium


Similar Documents

Publication Publication Date Title
Tran et al. Improvement to a NAM-captured whisper-to-speech system
CN108597496A (en) Voice generation method and device based on generation type countermeasure network
Pettinato et al. Vowel space area in later childhood and adolescence: Effects of age, sex and ease of communication
CN108831436A (en) A method of text speech synthesis after simulation speaker's mood optimization translation
CN108831463B (en) Lip language synthesis method and device, electronic equipment and storage medium
EP1280137B1 (en) Method for speaker identification
CN109493846B (en) English accent recognition system
CN105765654A (en) Hearing assistance device with fundamental frequency modification
Sparks et al. Investigating the MESA (multipoint electrotactile speech aid): The transmission of connected discourse
TW202036535A (en) System and method for improving speech comprehension of abnormal articulation capable of ensuring that training corpuses are completely synchronized with source corpuses to save labor and time costs
US20160210982A1 (en) Method and Apparatus to Enhance Speech Understanding
CN102176313A (en) Formant-frequency-based Mandarin single final vioce visualizing method
Arunachalam A strategic approach to recognize the speech of the children with hearing impairment: different sets of features and models
CN111210834A (en) Voice exaggeration system
CN104240699A (en) Simple and effective phrase speech recognition method
Shahrul Malay word pronunciation application for pre-school children using vowel recognition
Li et al. An unsupervised two-talker speech separation system based on CASA
Koster Acoustic-phonetic characteristics of hyperarticulated speech for different speaking styles
CN109346058B (en) Voice acoustic feature expansion system
Tseng Speech Production of Mandarin-speaking Children with Hearing Impairment and Normal Hearing.
Liu et al. Intelligibility of American English vowels of native and non-native speakers in quiet and speech-shaped noise
Pickett Sound patterns of speech: An introductory sketch
Sahoo et al. Word extraction from speech recognition using correlation coefficients
CN117711374B (en) Audio-visual consistent personalized voice synthesis system, synthesis method and training method
Pavithran et al. An ASR model for Individuals with Hearing Impairment using Hidden Markov Model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200529