CN111210834A - Voice exaggeration system - Google Patents

Voice exaggeration system

Info

Publication number
CN111210834A
CN111210834A
Authority
CN
China
Prior art keywords
exaggeration
voice
exaggerated
phoneme
feature vector
Prior art date
Legal status
Pending
Application number
CN201811386157.7A
Other languages
Chinese (zh)
Inventor
骆成品
钟建生
李坤
孙立发
Current Assignee
Speechx Ltd
Original Assignee
Speechx Ltd
Priority date
Filing date
Publication date
Application filed by Speechx Ltd filed Critical Speechx Ltd
Priority to CN201811386157.7A
Publication of CN111210834A
Legal status: Pending


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013 Adapting to target pitch
    • G10L2021/0135 Voice conversion or morphing


Abstract

The invention discloses a voice exaggeration system comprising a voice input module, a voice exaggeration module, and a voice output module. The voice input module acquires normal speech and exaggerated speech before the voice exaggeration model is established, acquires the speech to be exaggerated after the model is established, and transmits the speech to the voice exaggeration module. The voice exaggeration module uses a deep neural network to obtain exaggeration parameters, namely the differences in pitch, duration, and volume between the exaggerated speech and the normal speech; it establishes a voice exaggeration model from these parameters, processes speech annotated with the required degree of exaggeration, and outputs the exaggerated speech. By exaggerating correct pronunciation, the invention strengthens the learner's perception of correct pronunciation, assists language learning, and can also be applied to fields such as speech synthesis (TTS).

Description

Voice exaggeration system
Technical Field
The invention relates to a voice exaggeration system.
Background
In non-native language learning, listening and speaking are the key skills. A learner can truly master a non-native language only by hearing and understanding it correctly; listening is the foundation of speaking, and correct pronunciation can be produced only after correct pronunciation has been heard. Existing language-learning aids can only replay correct speech repeatedly; they cannot exaggerate correct pronunciation to strengthen the learner's perception of it.
Disclosure of Invention
The invention provides a voice exaggeration system, which solves the problem that prior-art devices cannot exaggerate correct pronunciation to strengthen the learner's perception of it.
The technical scheme of the invention is realized as follows:
A voice exaggeration system comprises:
a voice input module, which acquires normal speech and exaggerated speech before the voice exaggeration model is established, acquires the speech to be exaggerated after the model is established, and transmits the speech to the voice exaggeration module;
a voice exaggeration module, which uses a deep neural network to obtain the exaggeration parameters of the exaggerated speech relative to the normal speech in pitch, duration, and volume, establishes a voice exaggeration model from these parameters, processes speech annotated with the required degree of exaggeration, and outputs the exaggerated speech.
Preferably, the normal speech and the exaggerated speech each comprise several phonemes; the exaggeration levels of the current exaggerated phoneme and of the three preceding and three following exaggerated phonemes are extracted and, together with the ID of the current exaggerated phoneme, form an input feature vector; the input feature vector is fed into the deep neural network, the exaggeration parameters are trained, and an output feature vector is obtained.
Preferably, the current exaggerated phoneme is divided into five frames, and for each frame the pitch difference, duration difference, and volume difference relative to the corresponding phoneme of the normal speech are extracted to form the output feature vector.
Preferably, the output feature vector is a 1 × 15 matrix.
Preferably, the degree of exaggeration is encoded as a 2-bit binary number, the ID of the exaggerated phoneme is encoded as a 6-bit binary number, and the input feature vector is a 1 × 20 matrix.
Preferably, the degrees of exaggeration are none, weak, and strong.
The beneficial effects of the invention are as follows: exaggerating correct pronunciation strengthens the learner's perception of correct pronunciation, assists language learning, and can also be applied to fields such as speech synthesis (TTS).
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed to describe them are briefly introduced below. The drawings in the following description show only some embodiments of the invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a functional block diagram of an embodiment of a speech exaggeration system according to the present invention.
In the figure, 1-voice input module; 2-voice exaggeration module.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the invention.
As shown in FIG. 1, the present invention provides a voice exaggeration system comprising:
the voice input module 1, which acquires normal speech and exaggerated speech of different degrees of exaggeration before the voice exaggeration model is established, acquires the speech to be exaggerated after the model is established, and transmits the speech to the voice exaggeration module 2;
the voice exaggeration module 2, which uses a deep neural network to obtain the exaggeration parameters of the differently exaggerated speech relative to the normal speech in three aspects (pitch, duration, and volume), establishes a voice exaggeration model from these parameters, processes speech annotated with the required degree of exaggeration, and outputs the exaggerated speech.
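The patent does not specify the architecture of the deep neural network, so the following is only a rough sketch: a small feed-forward network mapping a 1 × 20 input feature vector to a 1 × 15 vector of exaggeration parameters. The hidden-layer sizes, the initialisation, and the function names (`init_mlp`, `forward`) are illustrative assumptions, not details from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes=(20, 64, 64, 15)):
    """Randomly initialised weights for a small feed-forward network
    mapping the 1 x 20 input feature vector to the 1 x 15 vector of
    pitch/duration/volume differences (hidden sizes are illustrative)."""
    return [(rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """ReLU hidden layers, linear output (a regression of differences)."""
    for w, b in params[:-1]:
        x = np.maximum(x @ w + b, 0.0)
    w, b = params[-1]
    return x @ w + b
```

Training such a network on (input vector, output vector) pairs extracted from paired normal/exaggerated recordings would yield the exaggeration model the text describes.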
The invention exaggerates speech mainly in three aspects: pitch, duration, and volume. Pitch refers chiefly to the fundamental frequency of the sound (for example, female voices are generally higher-pitched and male voices lower-pitched) and is measured in semitones; duration refers to how long the sound lasts and is measured in s; volume reflects the perceived loudness of the sound and is measured in dB.
The normal speech and the exaggerated speech each comprise several phonemes. The exaggeration levels of the current exaggerated phoneme and of the three preceding and three following exaggerated phonemes are extracted and, together with the ID of the current exaggerated phoneme, form an input feature vector; the input feature vector is fed into the deep neural network, the exaggeration parameters are trained, and an output feature vector is obtained.
The current exaggerated phoneme is divided into five frames, and for each frame the pitch difference, duration difference, and volume difference relative to the corresponding phoneme of the normal speech are extracted to form the output feature vector. The output feature vector is therefore a 1 × 15 matrix (5 frames × 3 differences).
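The 1 × 15 output vector described above (five frames, each contributing a pitch, a duration, and a volume difference) can be sketched as follows; the function name and the per-frame triple representation are assumptions for illustration:

```python
import numpy as np

def output_feature_vector(exaggerated_frames, normal_frames):
    """Build the 1 x 15 output vector: for each of the five frames,
    the (pitch, duration, volume) differences between the exaggerated
    phoneme and the corresponding normal phoneme.

    Each argument is a list of five (pitch, duration, volume) triples,
    one per frame of the phoneme.
    """
    assert len(exaggerated_frames) == len(normal_frames) == 5
    diffs = [e - n
             for ex, nm in zip(exaggerated_frames, normal_frames)
             for e, n in zip(ex, nm)]
    return np.array(diffs).reshape(1, 15)
```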
The degree of exaggeration is encoded as a 2-bit binary number, the ID of the exaggerated phoneme is encoded as a 6-bit binary number, and the input feature vector is a 1 × 20 matrix. The degrees of exaggeration of the exaggerated speech are none, weak, and strong.
The exaggeration levels are encoded as no exaggeration (0, 0), weak exaggeration (0, 1), and strong exaggeration (1, 0). The levels of the current exaggerated phoneme and of the three phonemes before and after it (7 × 2 = 14 bits), together with the 6-bit binary ID of the exaggerated phoneme within the current utterance, form the 1 × 20 feature vector.
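Under the encoding above, the 1 × 20 input vector packs seven 2-bit exaggeration levels and one 6-bit phoneme ID. A minimal sketch (function and constant names are assumptions):

```python
import numpy as np

# 2-bit codes for the three exaggeration levels described in the text
LEVEL_CODE = {"none": (0, 0), "weak": (0, 1), "strong": (1, 0)}

def input_feature_vector(levels, phoneme_id):
    """Build the 1 x 20 input vector: the 2-bit exaggeration levels of
    the current phoneme, its three predecessors, and its three
    successors (7 x 2 = 14 bits), followed by the 6-bit phoneme ID.

    `levels` is a 7-element sequence of level names (index 3 is the
    current phoneme); `phoneme_id` is an integer in [0, 63].
    """
    assert len(levels) == 7 and 0 <= phoneme_id < 64
    bits = [b for name in levels for b in LEVEL_CODE[name]]
    bits += [int(b) for b in format(phoneme_id, "06b")]
    return np.array(bits).reshape(1, 20)
```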
The phoneme to be exaggerated is fed into the neural network to obtain the differences between the exaggerated phoneme and the normal phoneme in the three parameters. Let the original values be pitch P1, duration D1, and volume I1, and the exaggerated values be pitch P2, duration D2, and volume I2, with units of semitones, s, and dB respectively:
P2 = P1 + ΔP
D2 = D1 + ΔD
I2 = I1 + ΔI
where ΔP is the pitch difference, ΔD is the duration difference, and ΔI is the volume difference.
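Applying the predicted differences is then a direct element-wise addition; a minimal sketch (the function name is an assumption):

```python
def apply_exaggeration(p1, d1, i1, d_pitch, d_duration, d_volume):
    """Apply the network-predicted differences:
    P2 = P1 + dP, D2 = D1 + dD, I2 = I1 + dI
    (units: semitones, seconds, dB)."""
    return p1 + d_pitch, d1 + d_duration, i1 + d_volume
```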
Taking volume adjustment as an example, the five frames of the divided phoneme are processed separately. Suppose the volume increases for the five frames are 5 dB, 4 dB, 3 dB, 4 dB, and 5 dB, and that within each frame the gain is constant from 0.1 to 0.9 of the frame length. For example, if the first frame is 100 ms long, then from 10 ms to 90 ms the sound pressure p2 is:
p2 = p1 × 10^(5/20)
where the sound pressures p2 and p1 are in Pa, and p1 is taken as a standard pressure of 0.02 Pa.
From 0 ms to 10 ms and from 90 ms to 110 ms, however, the gain transitions linearly. Taking the transition from 0 ms to 10 ms as an example:
k1 = I2 / x = 5 / 10 = 0.5
b = I2 − k1 × x = 5 − 0.5 × 10 = 0
I = I1 × 10^((k1 × x + b) / 20), 0 < x < 10 ms
When the quantity being adjusted is volume, k1 represents the speed of the volume transition (the slope) in dB/ms, and b represents the volume at the start of the transition in dB.
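The per-frame gain envelope described above can be sketched as follows, assuming a 100 ms frame with 10 ms linear ramps at each end. The patent's example only works out the fade-in; the symmetric fade-out and the function names are assumptions:

```python
def volume_gain_db(t_ms, frame_ms=100.0, ramp_ms=10.0, gain_db=5.0):
    """Piecewise gain envelope for one frame: linear ramp up over the
    first `ramp_ms`, constant `gain_db` through the middle of the
    frame, and a linear ramp down at the end. The ramp slope is
    k1 = gain_db / ramp_ms in dB/ms (0.5 in the text's example)."""
    k1 = gain_db / ramp_ms
    if t_ms < ramp_ms:                       # fade-in: I = k1 * t + b, b = 0
        return k1 * t_ms
    if t_ms <= frame_ms - ramp_ms:           # constant region
        return gain_db
    return max(0.0, k1 * (frame_ms - t_ms))  # fade-out (assumed symmetric)

def apply_gain(p1, gain_db):
    """Convert a dB gain to a pressure ratio: p2 = p1 * 10**(gain/20)."""
    return p1 * 10.0 ** (gain_db / 20.0)
```

Evaluating the envelope per sample and multiplying the waveform by `apply_gain(1.0, g)` reproduces the constant-then-ramped adjustment the text describes.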
For the semitone, assuming that the difference between the semitones of the two pitches is Δ semitone, and the two pitches are F1 and F2 in Hz, respectively, there is a conversion formula:
Figure BDA0001873001950000042
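The conversion between a semitone difference and a frequency ratio follows the standard relation F2 = F1 × 2^(Δsemitone/12), equivalently Δsemitone = 12 × log2(F2/F1). A small sketch (function names are assumptions):

```python
import math

def semitone_shift(f1_hz, d_semitones):
    """Shift a frequency by d semitones: F2 = F1 * 2**(d / 12)."""
    return f1_hz * 2.0 ** (d_semitones / 12.0)

def semitone_difference(f1_hz, f2_hz):
    """Semitone difference between two pitches: 12 * log2(F2 / F1)."""
    return 12.0 * math.log2(f2_hz / f1_hz)
```

For example, shifting 220 Hz up by 12 semitones gives 440 Hz, one octave.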
The beneficial effects of the invention are as follows: exaggerating correct pronunciation strengthens the learner's perception of correct pronunciation, assists language learning, and can also be applied to fields such as speech synthesis (TTS).
The above discloses the improvement points of the invention; technical content not disclosed in detail can be implemented by a person skilled in the art using the prior art.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (6)

1. A speech exaggeration system, characterized in that it comprises:
a voice input module, which acquires normal speech and exaggerated speech before the voice exaggeration model is established, acquires the speech to be exaggerated after the model is established, and transmits the speech to the voice exaggeration module;
a voice exaggeration module, which uses a deep neural network to obtain the exaggeration parameters of the exaggerated speech relative to the normal speech in pitch, duration, and volume, establishes a voice exaggeration model from these parameters, processes speech annotated with the required degree of exaggeration, and outputs the exaggerated speech.
2. The speech exaggeration system of claim 1, wherein: the normal speech and the exaggerated speech each comprise several phonemes; the exaggeration levels of the current exaggerated phoneme and of the three preceding and three following exaggerated phonemes are extracted and, together with the ID of the current exaggerated phoneme, form an input feature vector; the input feature vector is fed into the deep neural network, the exaggeration parameters are trained, and an output feature vector is obtained.
3. The speech exaggeration system of claim 2, wherein: the current exaggerated phoneme is divided into five frames, and for each frame the pitch difference, duration difference, and volume difference relative to the corresponding phoneme of the normal speech are extracted to form the output feature vector.
4. The speech exaggeration system of claim 3, wherein: the output feature vector is a 1 × 15 matrix.
5. The speech exaggeration system of claim 2, wherein: the degree of exaggeration is encoded as a 2-bit binary number, the ID of the exaggerated phoneme is encoded as a 6-bit binary number, and the input feature vector is a 1 × 20 matrix.
6. The speech exaggeration system of claim 2 or 5, wherein: the degrees of exaggeration are none, weak, and strong.
CN201811386157.7A 2018-11-20 2018-11-20 Voice exaggeration system Pending CN111210834A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811386157.7A CN111210834A (en) 2018-11-20 2018-11-20 Voice exaggeration system


Publications (1)

Publication Number Publication Date
CN111210834A (en) 2020-05-29

Family

ID=70786365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811386157.7A Pending CN111210834A (en) 2018-11-20 2018-11-20 Voice exaggeration system

Country Status (1)

Country Link
CN (1) CN111210834A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1118493A (en) * 1994-08-01 1996-03-13 中国科学院声学研究所 Language and speech converting system with synchronous fundamental tone waves
CN102664017A (en) * 2012-04-25 2012-09-12 武汉大学 Three-dimensional (3D) audio quality objective evaluation method
CN106203626A (en) * 2016-06-30 2016-12-07 北京奇虎科技有限公司 Car steering behavioral value method and device, automobile
CN107682561A (en) * 2017-11-10 2018-02-09 广东欧珀移动通信有限公司 volume adjusting method, device, terminal and storage medium


Similar Documents

Publication Publication Date Title
Tran et al. Improvement to a NAM-captured whisper-to-speech system
CN108597496A (en) Voice generation method and device based on generation type countermeasure network
Pettinato et al. Vowel space area in later childhood and adolescence: Effects of age, sex and ease of communication
CN108831436A (en) A method of text speech synthesis after simulation speaker's mood optimization translation
CN108831463B (en) Lip language synthesis method and device, electronic equipment and storage medium
EP1280137B1 (en) Method for speaker identification
CN109493846B (en) English accent recognition system
CN105765654A (en) Hearing assistance device with fundamental frequency modification
Sparks et al. Investigating the MESA (multipoint electrotactile speech aid): The transmission of connected discourse
TW202036535A (en) System and method for improving speech comprehension of abnormal articulation capable of ensuring that training corpuses are completely synchronized with source corpuses to save labor and time costs
US20160210982A1 (en) Method and Apparatus to Enhance Speech Understanding
CN102176313A (en) Formant-frequency-based Mandarin single final vioce visualizing method
Arunachalam A strategic approach to recognize the speech of the children with hearing impairment: different sets of features and models
CN111210834A (en) Voice exaggeration system
CN104240699A (en) Simple and effective phrase speech recognition method
Shahrul Malay word pronunciation application for pre-school children using vowel recognition
Li et al. An unsupervised two-talker speech separation system based on CASA
Koster Acoustic-phonetic characteristics of hyperarticulated speech for different speaking styles
CN109346058B (en) Voice acoustic feature expansion system
Tseng Speech Production of Mandarin-speaking Children with Hearing Impairment and Normal Hearing.
Liu et al. Intelligibility of American English vowels of native and non-native speakers in quiet and speech-shaped noise
Pickett Sound patterns of speech: An introductory sketch
Sahoo et al. Word extraction from speech recognition using correlation coefficients
CN117711374B (en) Audio-visual consistent personalized voice synthesis system, synthesis method and training method
Pavithran et al. An ASR model for Individuals with Hearing Impairment using Hidden Markov Model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200529