CN108108357A - Accent conversion method and device, electronic equipment - Google Patents

Accent conversion method and device, electronic equipment

Info

Publication number
CN108108357A
Authority
CN
China
Prior art keywords: source, feature vector, voice data, accent, vector
Prior art date
Legal status: Granted
Application number
CN201810029495.9A
Other languages
Chinese (zh)
Other versions
CN108108357B (en)
Inventor
王雪云 (Wang Xueyun)
Current Assignee
BOE Technology Group Co Ltd
Original Assignee
BOE Technology Group Co Ltd
Priority date
Filing date
Publication date
Application filed by BOE Technology Group Co Ltd filed Critical BOE Technology Group Co Ltd
Priority to CN201810029495.9A priority Critical patent/CN108108357B/en
Publication of CN108108357A publication Critical patent/CN108108357A/en
Application granted granted Critical
Publication of CN108108357B publication Critical patent/CN108108357B/en
Current legal status: Active

Classifications

    • G06F 40/40: Handling natural language data; processing or translation of natural language
    • G06F 40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G10L 15/005: Speech recognition; language recognition
    • G10L 15/063: Speech recognition; creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 25/18: Speech or voice analysis techniques; the extracted parameters being spectral information of each sub-band
    • G10L 25/24: Speech or voice analysis techniques; the extracted parameters being the cepstrum
    • G10L 25/30: Speech or voice analysis techniques; characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Computation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to an accent conversion method and device, and to electronic equipment. The method includes: obtaining source voice data with a first accent; obtaining a source speech feature vector corresponding to the source voice data; calling a feature conversion model to convert the source speech feature vector into a target speech feature vector; and synthesizing target speech data with a second accent based on the target speech feature vector. According to the embodiments of the present invention, accent conversion is performed on the source voice data of both parties so that both parties hear the same or a similar accent, reducing communication barriers caused by accent differences and improving communication efficiency. In addition, the embodiments preserve each party's voice characteristics when speaking, allowing both parties to perceive the other's content and emotion, further improving communication efficiency.

Description

Accent conversion method and device, electronic equipment
Technical field
The present invention relates to the field of voice processing technology, and in particular to an accent conversion method and device, and electronic equipment.
Background art
With economic development, users communicate in work and in daily life with users from other countries or regions. Both parties may communicate in the same language, yet each unconsciously embeds their own accent into that language, affecting the other party's comprehension. Taking English as an example, there are British English, American English, Australian English, Chinese-accented English, Indian English, and so on. Moreover, in telephone communication scenarios, the lack of auxiliary cues such as facial expressions and gestures makes the effect more pronounced.
Summary of the invention
The present invention provides an accent conversion method and device, and electronic equipment, to overcome deficiencies in the related art.
According to a first aspect of the embodiments of the present invention, an accent conversion method is provided. The method includes:
obtaining source voice data with a first accent;
obtaining a source speech feature vector corresponding to the source voice data;
calling a feature conversion model to convert the source speech feature vector into a target speech feature vector;
synthesizing target speech data with a second accent based on the target speech feature vector.
Optionally, the source speech feature vector includes at least one of the following: a fundamental frequency feature vector, a speech rate feature vector, an energy feature vector, and a spectrum feature vector of the source voice data.
Optionally, the source speech feature vector includes the fundamental frequency feature vector of the source voice data, and obtaining the source speech feature vector corresponding to the source voice data includes:
obtaining the fundamental frequency feature vector of the source voice data using an autocorrelation method.
Optionally, the source speech feature vector includes the speech rate feature vector of the source voice data, and obtaining the source speech feature vector corresponding to the source voice data includes:
obtaining the boundaries of syllables in the source voice data using a voice visualization tool;
determining the duration of the source voice data and the number of words it contains according to the boundaries;
obtaining the speech rate feature vector of the source voice data according to the duration and the number.
Optionally, the source speech feature vector includes the energy feature vector and the spectrum feature vector, and obtaining the source speech feature vector corresponding to the source voice data includes:
encoding the source voice data using linear predictive coding (LPC) to obtain the LPC coefficients;
computing the cepstrum of the coefficients to obtain the linear prediction cepstral coefficients, which serve as the energy feature vector and the spectrum feature vector of the source voice data.
Optionally, the source speech feature vector includes the energy feature vector and the spectrum feature vector, and obtaining the source speech feature vector corresponding to the source voice data includes:
obtaining the energy envelope spectrum corresponding to the source voice data from its frequency according to the Mel formula;
feeding the energy envelope spectrum into a Mel filter bank to obtain the Mel frequency scale;
applying a logarithmic transform and a discrete cosine transform to the Mel frequency scale to obtain Mel-frequency cepstral coefficients (MFCC), which serve as the energy feature vector and the spectrum feature vector of the source voice data.
Optionally, the method further includes:
obtaining a set number of sample voice data pairs, each pair including voice data obtained by reading the same content aloud in the first accent and in the second accent respectively;
training an initial feature conversion model using the sample voice data pairs, and stopping training when a termination condition is met, to obtain the trained feature conversion model;
wherein the termination condition includes: the loss value between the speech feature vector output by the initial feature conversion model and the speech feature vector corresponding to the voice data with the second accent is less than or equal to a loss threshold.
According to a second aspect of the embodiments of the present invention, an accent conversion device is provided. The device includes:
a source voice data acquisition module, configured to obtain source voice data with a first accent;
a source feature vector acquisition module, configured to obtain a source speech feature vector corresponding to the source voice data;
a target feature vector acquisition module, configured to call a feature conversion model to convert the source speech feature vector into a target speech feature vector;
a target voice synthesis module, configured to synthesize target speech data with a second accent based on the target speech feature vector.
Optionally, the source speech feature vector obtained by the source feature vector acquisition module includes at least one of the following: a fundamental frequency feature vector, a speech rate feature vector, an energy feature vector, and a spectrum feature vector of the source voice data.
Optionally, the source speech feature vector obtained by the source feature vector acquisition module includes the fundamental frequency feature vector of the source voice data, and the source feature vector acquisition module is configured to obtain the fundamental frequency feature vector of the source voice data using an autocorrelation method.
Optionally, the source speech feature vector obtained by the source feature vector acquisition module includes the speech rate feature vector of the source voice data, and the source feature vector acquisition module includes:
a syllable boundary acquiring unit, configured to obtain the boundaries of syllables in the source voice data using a voice visualization tool;
a duration and word count acquiring unit, configured to determine the duration of the source voice data and the number of words it contains according to the boundaries;
a speech rate feature vector acquiring unit, configured to obtain the speech rate feature vector of the source voice data according to the duration and the number.
Optionally, the source speech feature vector obtained by the source feature vector acquisition module includes the energy feature vector and the spectrum feature vector, and the source feature vector acquisition module includes:
an LPC coefficient acquiring unit, configured to encode the source voice data using linear predictive coding (LPC) to obtain the LPC coefficients;
a feature vector acquiring unit, configured to compute the cepstrum of the coefficients to obtain the linear prediction cepstral coefficients, which serve as the energy feature vector and the spectrum feature vector of the source voice data.
Optionally, the source speech feature vector obtained by the source feature vector acquisition module includes the energy feature vector and the spectrum feature vector, and the source feature vector acquisition module includes:
an envelope spectrum acquiring unit, configured to obtain the energy envelope spectrum corresponding to the source voice data from its frequency according to the Mel formula;
a scale acquiring unit, configured to feed the energy envelope spectrum into a Mel filter bank to obtain the Mel frequency scale;
an MFCC acquiring unit, configured to apply a logarithmic transform and a discrete cosine transform to the Mel frequency scale to obtain Mel-frequency cepstral coefficients (MFCC), which serve as the energy feature vector and the spectrum feature vector of the source voice data.
Optionally, the device further includes:
a sample data pair acquisition module, configured to obtain a set number of sample voice data pairs, each pair including voice data obtained by reading the same content aloud in the first accent and in the second accent respectively;
a conversion model training module, configured to train an initial feature conversion model using the sample voice data pairs, and to stop training when a termination condition is met, to obtain the trained feature conversion model;
wherein the termination condition includes: the loss value between the speech feature vector output by the initial feature conversion model and the speech feature vector corresponding to the voice data with the second accent is less than or equal to a loss threshold.
According to a third aspect of the embodiments of the present invention, electronic equipment is provided. The electronic equipment includes:
a receiver;
a loudspeaker;
a processor; and
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the executable instructions in the memory to implement the steps of the method described in the first aspect.
According to the above embodiments, the source speech feature vector corresponding to the source voice data with the first accent is obtained; a feature conversion model is then called to convert the source speech feature vector into a target speech feature vector matching the second accent; finally, target speech data with the second accent is synthesized based on the target speech feature vector. As can be seen, in these embodiments accent conversion is performed on the source voice data of both parties so that both parties hear the same or a similar accent, reducing communication barriers caused by accent differences and improving communication efficiency. In addition, these embodiments preserve each party's voice characteristics when speaking, allowing both parties to perceive the other's content and emotion, further improving communication efficiency.
It should be appreciated that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present invention.
Description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present invention and, together with the description, serve to explain the principles of the present invention.
Fig. 1 is a schematic diagram of an application scenario of an accent conversion method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of an accent conversion method according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of obtaining a speech rate feature vector according to an embodiment of the present invention;
Fig. 4 is a schematic flowchart of obtaining LPCC as the energy feature vector and the spectrum feature vector according to an embodiment of the present invention;
Fig. 5 is a schematic flowchart of obtaining MFCC as the energy feature vector and the spectrum feature vector according to an embodiment of the present invention;
Fig. 6 is a schematic flowchart of training a feature conversion model according to an embodiment of the present invention;
Figs. 7 to 11 are block diagrams of an accent conversion device according to embodiments of the present invention;
Fig. 12 is a schematic structural diagram of electronic equipment according to an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present invention as detailed in the appended claims.
With economic development, users communicate in work and in daily life with users from other countries or regions. Both parties may communicate in the same language, yet each unconsciously embeds their own accent into that language, affecting the other party's comprehension. Taking English as an example, there are British English, American English, Australian English, Chinese-accented English, Indian English, and so on. Moreover, in telephone communication scenarios, the lack of auxiliary cues such as facial expressions and gestures makes the effect more pronounced. To solve the above problems, an embodiment of the present invention provides an accent conversion method. Fig. 1 is a schematic diagram of an application scenario of the accent conversion method according to an embodiment of the present invention. Referring to Fig. 1, an Australian and a Chinese speaker are on a telephone call. The Australian has a first accent, whose English may be called "Australian English"; the Chinese speaker has a second accent, whose English may be called "Chinese-accented English". If the Australian hears "Chinese-accented English" directly and the Chinese speaker hears "Australian English" directly, at best it is uncomfortable and reduces communication efficiency; at worst the other party's meaning is misunderstood. In this embodiment, "Chinese-accented English" and "Australian English" are each converted into English with the other party's accent, so that the Australian hears "Australian English" and the Chinese speaker hears "Chinese-accented English". Since only the accent is changed and not the content, both parties can communicate quickly and efficiently, improving the user experience.
Fig. 2 is a schematic flowchart of an accent conversion method according to an embodiment of the present invention. The accent conversion method in this embodiment may be applied to devices such as terminals and servers. For ease of description, this embodiment is illustrated with the accent conversion method applied to a communication server. In a voice call scenario, the user with the first accent (hereinafter the first user) is the initiating party of the voice call, and the user with the second accent (hereinafter the second user) is the receiving party. Referring to Fig. 2, the accent conversion method includes steps 201 to 204:
201. Obtain the source voice data with the first accent.
In this embodiment, a voice capture device collects the source voice data of the first user and sends it to the communication server, which thereby obtains the source voice data with the first accent.
During a voice call, the first user's environment may contain noise, such as other voices, echo, or wind, which affects the performance of the accent conversion method. For this reason, in this embodiment the voice capture device preprocesses the collected initial voice data (as distinct from the subsequent source voice data) to obtain the source voice data. Of course, the preprocessing of the source voice data may also be performed by the communication server; this is not limited here.
The preprocessing includes at least one of the following: noise filtering, echo removal, useful-signal enhancement, and signal padding and segmentation. These may be implemented using schemes available in the related art, which are not limited here.
202. Obtain the source speech feature vector corresponding to the source voice data.
In this embodiment, the communication server extracts the speech features of the source voice data and uses them to form the corresponding source speech feature vector. It will be appreciated that these speech features relate to the first accent and may reflect features such as the pitch, speech rate, loudness, timbre, and pronunciation of the first accent. For example, pitch is related to the fundamental frequency of the source voice data, speech rate to its duration, loudness to its energy, timbre to its harmonics, and pronunciation to its spectrum.
Based on the above analysis, the source speech feature vector may include at least one of the following: a fundamental frequency feature vector, a speech rate feature vector, an energy feature vector, and a spectrum feature vector of the source voice data. It will be appreciated that one or more of these vectors may be included, adjusted according to the specific scenario. In one embodiment, the source speech feature vector includes all four feature vectors.
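For concreteness, one way to assemble the four optional feature vectors into a single source speech feature vector is simple concatenation after resampling each to a fixed length. This is an assumption for illustration; the patent does not specify how the vectors are combined.

```python
import numpy as np

def source_feature_vector(f0_vec, rate_vec, energy_vec, spectrum_vec, dim=32):
    """Concatenate the four feature vectors, resampling each to length `dim`."""
    def resample(v):
        v = np.asarray(v, dtype=float).ravel()
        return np.interp(np.linspace(0.0, 1.0, dim),
                         np.linspace(0.0, 1.0, len(v)), v)
    return np.concatenate([resample(v) for v in
                           (f0_vec, rate_vec, energy_vec, spectrum_vec)])
```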
In one embodiment, the source speech feature vector may include the fundamental frequency feature vector. This is because male and female users differ markedly in pitch, pitch is determined by the fundamental frequency component of the source voice data, and the harmonics of the source voice data affect timbre. The steps of obtaining the fundamental frequency feature vector may include the following.
First, this embodiment applies a 60-500 Hz band-pass filter to remove the higher harmonic components of the source voice data, so that the fundamental frequency, distributed around 100-200 Hz, can be detected in the source voice data.
Then, the source voice data is divided into several speech frames. In general, source voice data is a short-time stationary signal: its speech characteristics are essentially constant, or vary only slowly, over 10-30 ms, so a segment of that length can be taken for spectral analysis. In one embodiment, the frame length is 30 ms.
Next, a voiced/unvoiced decision is made on each speech frame to determine the moments at which voiced sound becomes unvoiced, and the voiced segments are extracted. For example, the short-time energy of the speech frames is analyzed using a Blackman window to obtain the energy within the windows. Voiced sound has a fundamental period while unvoiced sound does not; that is, the short-time average energy of voiced segments is far greater than that of unvoiced segments, so voiced and unvoiced segments can be distinguished by computing the short-time energy.
Finally, the autocorrelation function of the voiced segment signal is computed, and the fundamental period of the source voice data is calculated from the periodicity of the autocorrelation function. The autocorrelation function reflects the similarity between a signal sequence x(n) and its delayed copy x(n+m), where n is the sample index and m is the delay. If x(n) has period Np, its autocorrelation function is also quasi-periodic, varying with the same period as x(n). Voiced signals are quasi-periodic, so the autocorrelation function of a voiced signal has peaks at integer multiples of the fundamental period, while that of an unvoiced signal has no obvious peaks. For a voiced signal, once the peak positions are detected, the fundamental period of the source voice data can be estimated. Of course, the fundamental frequency feature of the source voice data may also be obtained by other methods, which can equally realize the scheme of the present application and are not described here again.
In this embodiment, after the fundamental frequency feature of the source voice data is obtained, it can be converted into a fundamental frequency feature vector of a specified form.
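A minimal NumPy sketch of the autocorrelation pitch estimate described above is given below; the frame length, the 60-500 Hz search band, and the simple energy gate used as the voiced/unvoiced decision are illustrative choices, not values fixed by the patent.

```python
import numpy as np

def estimate_f0(frame, sr, fmin=60.0, fmax=500.0):
    """Estimate F0 of one voiced frame from autocorrelation peaks."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # lag range for 60-500 Hz
    lag = lo + int(np.argmax(ac[lo:hi]))      # peak lag ~ fundamental period
    return sr / lag

def f0_contour(y, sr, frame_ms=30, energy_gate=0.01):
    n = int(sr * frame_ms / 1000)             # 30 ms frames
    f0 = []
    for i in range(0, len(y) - n, n):
        frame = y[i:i + n]
        # Short-time energy gate: skip unvoiced/silent frames.
        if np.mean(frame ** 2) > energy_gate:
            f0.append(estimate_f0(frame, sr))
    return np.array(f0)  # one F0 value per voiced frame
```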
In another embodiment, the source speech feature vector may include the speech rate feature vector of the source voice data. Referring to Fig. 3, the steps of obtaining the speech rate feature vector may include:
301. Obtain the boundaries of syllables in the source voice data, such as the transition moments between unvoiced and voiced sound, using the Visual Speech voice visualization tool. 302. Determine the duration of each speech frame in the source voice data and the number of words it contains according to the boundaries. 303. Obtain the speech rate of each speech frame according to the duration and the number, and determine the speech rate feature vector of the source voice data from the speech frames and their speech rates.
In practical applications, the Visual Speech tool provides a visual graphical interface in which the user can manually correct the boundaries, improving interactivity.
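Assuming word boundaries have already been produced (for example, by the visualization tool above), a speech rate feature vector could be assembled per analysis frame as a words-per-second value. The (start, end) interval format and the 0.5 s frame length below are hypothetical choices for the example.

```python
import numpy as np

def speech_rate_vector(word_bounds, frame_len_s=0.5, total_s=None):
    """word_bounds: list of (start_s, end_s) word intervals from the boundary tool."""
    total_s = total_s or max(e for _, e in word_bounds)
    rates = []
    t = 0.0
    while t < total_s:
        # Count words whose midpoint falls inside this analysis frame.
        n_words = sum(1 for s, e in word_bounds
                      if t <= (s + e) / 2 < t + frame_len_s)
        rates.append(n_words / frame_len_s)  # words per second in this frame
        t += frame_len_s
    return np.array(rates)
```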
In another embodiment, the source speech feature vector includes the energy feature vector and the spectrum feature vector. In this embodiment, the source speech feature vector corresponding to the source voice data is obtained using the linear prediction cepstral coefficient (LPCC) method based on linear predictive coding (LPC). Referring to Fig. 4, the steps include:
401. Encode the source voice data using linear predictive coding (LPC); the LPC coefficients a_k are obtained by minimizing the mean square error (LMS) between the actual speech samples and the linearly predicted samples.
402. Compute the cepstrum of the coefficients a_k to obtain the linear prediction cepstral coefficients C_n; the C_n serve as the energy feature vector and the spectrum feature vector of the source voice data. The coefficients a_k and C_n are related by the standard recursion:
C_1 = a_1;
C_n = a_n + sum_{k=1..n-1} (k/n) * C_k * a_(n-k), for 1 < n <= P;
C_n = sum_{k=n-P..n-1} (k/n) * C_k * a_(n-k), for n > P;
where k is the summation index, n is the order of the LPCC coefficient, and P is the order of the linear prediction model.
In practical applications, the LPC coefficients may be computed by the autocorrelation method, the covariance method, the lattice method, and so on; for details, refer to the related art, which is not described in detail here.
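The sketch below illustrates steps 401 and 402 under the assumption that librosa's LPC routine is available; the prediction order P = 12, the 13 cepstral coefficients, and the standard LPC-to-cepstrum recursion shown are common defaults rather than values stated by the patent.

```python
import numpy as np
import librosa

def lpcc(frame, order=12, n_ceps=13):
    # Step 401: LPC by least squares. librosa returns [1, -a_1, ..., -a_P],
    # so the predictor coefficients are the negated tail.
    a = -librosa.lpc(frame.astype(float), order=order)[1:]
    # Step 402: standard LPC-to-cepstrum recursion (1-indexed formula,
    # 0-indexed arrays: c[i] = C_{i+1}, a[j] = a_{j+1}).
    c = np.zeros(n_ceps)
    c[0] = a[0]
    for n in range(1, n_ceps):
        acc = a[n] if n < order else 0.0
        for k in range(max(0, n - order), n):
            acc += (k + 1) / (n + 1) * c[k] * a[n - k - 1]
        c[n] = acc
    return c  # linear prediction cepstral coefficients C_n
```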
In another embodiment, the source speech feature vector includes the energy feature vector and the spectrum feature vector. In this embodiment, the source speech feature vector corresponding to the source voice data is obtained using the Mel-frequency cepstral coefficient (MFCC) method. Referring to Fig. 5, the steps include:
501. Based on the frequency of the source voice data, obtain the energy envelope spectrum corresponding to the source voice data according to the Mel formula, which may be expressed as:
Mel(f) = 2595 * lg(1 + f/700);
where f is the frequency of the source voice data.
502. Feed the energy envelope spectrum into the Mel filter bank to obtain the Mel frequency scale M(k).
503. Apply a logarithmic transform and a discrete cosine transform to the Mel frequency scale to obtain the Mel-frequency cepstral coefficients (MFCC), which serve as the energy feature vector and the spectrum feature vector of the source voice data. The MFCC may be computed by the standard formula:
MFCC(n) = sum_{k=1..M} lg(M(k)) * cos(pi * n * (k - 0.5) / M), n = 1, 2, ..., L;
where M is the number of filters in the Mel filter bank and L is the order. The standard MFCC is a 13-dimensional vector, including an energy dimension and spectral dimensions, and reflects the static characteristics of the speech features; that is, the first 13 MFCC coefficients are selected as the final result.
203. Call the feature conversion model to convert the source speech feature vector into a target speech feature vector.
In this embodiment, the communication server calls a pre-trained feature conversion model and inputs the source speech feature vector into it; the feature conversion model converts it into the target speech feature vector.
It will be appreciated that the feature conversion model may include at least one of the following: a convolutional neural network, a recurrent neural network, a long short-term memory (LSTM) network, and a nonlinear transformation method. In one embodiment of the present invention, the feature conversion model includes a convolutional neural network with ReLU nonlinearities.
The above feature conversion model may be trained in advance before use. The training process may include:
obtaining a set number of sample voice data pairs, each pair including voice data obtained by reading the same content aloud in the first accent and in the second accent respectively;
training the initial feature conversion model using the sample voice data pairs, and stopping training when the termination condition is met, to obtain the trained feature conversion model;
wherein the termination condition includes: the loss value between the speech feature vector output by the initial feature conversion model and the speech feature vector corresponding to the voice data with the second accent is less than or equal to a loss threshold.
In one embodiment, in the scenario shown in Fig. 1, sample content is selected, and the Australian and the Chinese speaker read each sentence of the sample content aloud while keeping the same emotion (e.g., happy, sad, impassioned). Each sentence thus corresponds to voice data in the "Australian English" form and voice data in the "Chinese-accented English" form, forming one sample voice data pair. The above steps are repeated to obtain several sample voice data pairs; in this embodiment, the set number is at least 10,000. Referring to Fig. 6, the voice data in the "Australian English" form of each pair is taken as the source voice data, and the voice data in the "Chinese-accented English" form as the target speech data. The source speech feature vector of the source voice data is then obtained and input to the initial feature conversion model to obtain the actual speech feature vector output by the model. Finally, the actual speech feature vector and the target speech feature vector are taken as inputs to the loss function to obtain its loss value. If the loss value is greater than the loss threshold, training of the initial feature conversion model continues until the loss value is less than or equal to the loss threshold, yielding the trained feature conversion model.
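A toy PyTorch version of this training loop is sketched below, assuming the paired source and target feature vectors have already been extracted and stacked into fixed-dimension tensors. The single-hidden-layer network, MSE loss, and hyperparameters are placeholders; the patent's own model is convolutional with ReLU nonlinearities.

```python
import torch
import torch.nn as nn

def train_conversion_model(src_feats, tgt_feats,
                           loss_threshold=1e-3, max_epochs=1000):
    """src_feats, tgt_feats: float tensors of shape (num_pairs, feat_dim)."""
    feat_dim = src_feats.shape[1]
    model = nn.Sequential(              # placeholder feature conversion model
        nn.Linear(feat_dim, 256), nn.ReLU(),
        nn.Linear(256, feat_dim),
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(max_epochs):
        opt.zero_grad()
        loss = loss_fn(model(src_feats), tgt_feats)  # vs. second-accent features
        loss.backward()
        opt.step()
        if loss.item() <= loss_threshold:  # termination condition from the text
            break
    return model
```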
204. Synthesize target speech data with the second accent based on the target speech feature vector.
In this embodiment, the communication server converts the target speech feature vector into target speech data with the second accent using a speech synthesis method, which may include waveform concatenation, parametric synthesis, rule-based synthesis, or the like. The target speech data is then output to the terminal used by the second user, who can thereupon hear it.
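As a crude stand-in for this synthesis step (a real system would use one of the synthesis methods named above, or a proper vocoder), librosa can approximately invert MFCC features back to a waveform via Griffin-Lim. The output path and sample rate below are assumptions for the example.

```python
import librosa
import soundfile as sf

def synthesize_from_mfcc(target_mfcc, sr=16000, out_path="target_speech.wav"):
    # Invert MFCC -> Mel spectrogram -> waveform (Griffin-Lim approximation).
    y = librosa.feature.inverse.mfcc_to_audio(target_mfcc, sr=sr)
    sf.write(out_path, y, sr)
    return y
```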
As can be seen, in this embodiment accent conversion is performed on the source voice data of both parties so that both parties hear the same or a similar accent, reducing communication barriers caused by accent differences and improving communication efficiency. In addition, this embodiment preserves each party's voice characteristics when speaking, allowing both parties to perceive the other's content and emotion, further improving communication efficiency.
An embodiment of the present invention further provides an accent conversion device. Fig. 7 is a block diagram of an accent conversion device according to an embodiment of the present invention. Referring to Fig. 7, the accent conversion device 700 includes:
a source voice data acquisition module 701, configured to obtain source voice data with a first accent;
a source feature vector acquisition module 702, configured to obtain a source speech feature vector corresponding to the source voice data;
a target feature vector acquisition module 703, configured to call a feature conversion model to convert the source speech feature vector into a target speech feature vector;
a target voice synthesis module 704, configured to synthesize target speech data with a second accent based on the target speech feature vector.
In this embodiment, the source voice data acquisition module 701 obtains the source voice data with the first accent. It will be understood that the source voice data acquisition module 701 may preprocess the acquired source voice data to obtain source voice data in the required form, and it then sends the source voice data to the source feature vector acquisition module 702. The source feature vector acquisition module 702 obtains the source speech feature vector from the source voice data and outputs it to the target feature vector acquisition module 703. The target feature vector acquisition module 703 calls the feature conversion model to convert the source speech feature vector into the target speech feature vector. The target voice synthesis module 704 synthesizes the target speech data with the second accent based on the target speech feature vector and outputs it to the designated device.
In this embodiment, accent conversion is performed on the source voice data of both parties so that both parties hear the same or a similar accent, reducing communication barriers caused by accent differences and improving communication efficiency. In addition, this embodiment preserves each party's voice characteristics when speaking, allowing both parties to perceive the other's content and emotion, further improving communication efficiency.
In one embodiment, the source speech feature vector obtained by the source feature vector acquisition module 702 includes at least one of the following: a fundamental frequency feature vector, a speech rate feature vector, an energy feature vector, and a spectrum feature vector of the source voice data.
In one embodiment, the source speech feature vector obtained by the source feature vector acquisition module 702 includes the fundamental frequency feature vector of the source voice data, and the source feature vector acquisition module 702 is configured to obtain the fundamental frequency feature vector of the source voice data using an autocorrelation method.
In one embodiment, the source speech feature vector obtained by the source feature vector acquisition module 702 includes the speech rate feature vector of the source voice data. Referring to Fig. 8, the source feature vector acquisition module 702 includes:
a syllable boundary acquiring unit 801, configured to obtain the boundaries of syllables in the source voice data using the voice visualization tool;
a duration and word count acquiring unit 802, configured to determine the duration of the source voice data and the number of words it contains according to the boundaries;
a speech rate feature vector acquiring unit 803, configured to obtain the speech rate feature vector of the source voice data according to the duration and the number.
In one embodiment, the source speech feature vector obtained by the source feature vector acquisition module 702 includes the energy feature vector and the spectrum feature vector. Referring to Fig. 9, the source feature vector acquisition module 702 includes:
an LPC coefficient acquiring unit 901, configured to encode the source voice data using linear predictive coding (LPC) to obtain the LPC coefficients;
a feature vector acquiring unit 902, configured to compute the cepstrum of the coefficients to obtain the linear prediction cepstral coefficients, which serve as the energy feature vector and the spectrum feature vector of the source voice data.
In one embodiment, the source speech feature vector obtained by the source feature vector acquisition module 702 includes the energy feature vector and the spectrum feature vector. Referring to Fig. 10, the source feature vector acquisition module 702 includes:
an envelope spectrum acquiring unit 1001, configured to obtain the energy envelope spectrum corresponding to the source voice data from its frequency according to the Mel formula;
a scale acquiring unit 1002, configured to feed the energy envelope spectrum into the Mel filter bank to obtain the Mel frequency scale;
an MFCC acquiring unit 1003, configured to apply a logarithmic transform and a discrete cosine transform to the Mel frequency scale to obtain the Mel-frequency cepstral coefficients (MFCC), which serve as the energy feature vector and the spectrum feature vector of the source voice data.
In one embodiment, referring to Fig. 11, the accent conversion device 700 further includes:
a sample data pair acquisition module 1101, configured to obtain a set number of sample voice data pairs, each pair including voice data obtained by reading the same content aloud in the first accent and in the second accent respectively;
a conversion model training module 1102, configured to train the initial feature conversion model using the sample voice data pairs, and to stop training when the termination condition is met, to obtain the trained feature conversion model;
wherein the termination condition includes: the loss value between the speech feature vector output by the initial feature conversion model and the speech feature vector corresponding to the voice data with the second accent is less than or equal to a loss threshold.
Regarding the accent conversion device in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the accent conversion method and will not be elaborated here.
An embodiment of the present invention further provides electronic equipment. Referring to Fig. 12, the electronic equipment includes:
a receiver 1201;
a loudspeaker 1202;
a processor 1203;
a memory 1204 for storing instructions executable by the processor 1203;
wherein the processor 1203 is configured to execute the executable instructions in the memory 1204 to implement the steps of each accent conversion method described above.
It should be noted that the electronic equipment 1200 in this embodiment may be any product or component with voice data input and voice data output, such as a telephone, a television, electronic paper, a mobile phone, a tablet computer, a laptop computer, a digital photo frame, or a navigator. A user may use the electronic equipment for voice calls, language learning, and so on.
In the present invention, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance. The term "multiple" means two or more, unless expressly limited otherwise.
Those skilled in the art will readily conceive of other embodiments of the present invention after considering the specification and practicing the disclosure herein. The present application is intended to cover any variations, uses, or adaptations of the present invention that follow its general principles and include common knowledge or conventional techniques in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present invention indicated by the following claims.
It should be understood that the present invention is not limited to the precise constructions described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims (15)

1. An accent conversion method, characterized in that the method includes:
obtaining source voice data with a first accent;
obtaining a source speech feature vector corresponding to the source voice data;
calling a feature conversion model to convert the source speech feature vector into a target speech feature vector;
synthesizing target speech data with a second accent based on the target speech feature vector.
2. The accent conversion method according to claim 1, characterized in that the source speech feature vector includes at least one of the following: a fundamental frequency feature vector, a speech rate feature vector, an energy feature vector, and a spectrum feature vector of the source voice data.
3. The accent conversion method according to claim 2, characterized in that the source speech feature vector includes the fundamental frequency feature vector of the source voice data, and obtaining the source speech feature vector corresponding to the source voice data includes:
obtaining the fundamental frequency feature vector of the source voice data using an autocorrelation method.
4. The accent conversion method according to claim 2, characterized in that the source speech feature vector includes the speech rate feature vector of the source voice data, and obtaining the source speech feature vector corresponding to the source voice data includes:
obtaining the boundaries of syllables in the source voice data using a voice visualization tool;
determining the duration of the source voice data and the number of words it contains according to the boundaries;
obtaining the speech rate feature vector of the source voice data according to the duration and the number.
5. The accent conversion method according to claim 2, characterized in that the source speech feature vector includes the energy feature vector and the spectrum feature vector, and obtaining the source speech feature vector corresponding to the source voice data includes:
encoding the source voice data using linear predictive coding (LPC) to obtain the LPC coefficients a_k;
computing the cepstrum of the coefficients a_k to obtain the linear prediction cepstral coefficients C_n, the C_n serving as the energy feature vector and the spectrum feature vector of the source voice data.
6. The accent conversion method according to claim 2, characterized in that the source speech feature vector includes the energy feature vector and the spectrum feature vector, and obtaining the source speech feature vector corresponding to the source voice data includes:
obtaining the energy envelope spectrum corresponding to the source voice data from its frequency according to the Mel formula;
feeding the energy envelope spectrum into a Mel filter bank to obtain the Mel frequency scale;
applying a logarithmic transform and a discrete cosine transform to the Mel frequency scale to obtain Mel-frequency cepstral coefficients (MFCC), the MFCC serving as the energy feature vector and the spectrum feature vector of the source voice data.
7. The accent conversion method according to claim 1, characterized in that the method further includes:
obtaining a set number of sample voice data pairs, each pair including voice data obtained by reading the same content aloud in the first accent and in the second accent respectively;
training an initial feature conversion model using the sample voice data pairs, and stopping training when a termination condition is met, to obtain the trained feature conversion model;
wherein the termination condition includes: the loss value between the speech feature vector output by the initial feature conversion model and the speech feature vector corresponding to the voice data with the second accent is less than or equal to a loss threshold.
8. An accent conversion device, characterized in that the device includes:
a source voice data acquisition module, configured to obtain source voice data with a first accent;
a source feature vector acquisition module, configured to obtain a source speech feature vector corresponding to the source voice data;
a target feature vector acquisition module, configured to call a feature conversion model to convert the source speech feature vector into a target speech feature vector;
a target voice synthesis module, configured to synthesize target speech data with a second accent based on the target speech feature vector.
9. The accent conversion device according to claim 8, characterized in that the source speech feature vector obtained by the source feature vector acquisition module includes at least one of the following: a fundamental frequency feature vector, a speech rate feature vector, an energy feature vector, and a spectrum feature vector of the source voice data.
10. The accent conversion device according to claim 9, characterized in that the source speech feature vector obtained by the source feature vector acquisition module includes the fundamental frequency feature vector of the source voice data, and the source feature vector acquisition module is configured to obtain the fundamental frequency feature vector of the source voice data using an autocorrelation method.
11. The accent conversion device according to claim 9, characterized in that the source speech feature vector obtained by the source feature vector acquisition module includes the speech rate feature vector of the source voice data, and the source feature vector acquisition module includes:
a syllable boundary acquiring unit, configured to obtain the boundaries of syllables in the source voice data using a voice visualization tool;
a duration and word count acquiring unit, configured to determine the duration of the source voice data and the number of words it contains according to the boundaries;
a speech rate feature vector acquiring unit, configured to obtain the speech rate feature vector of the source voice data according to the duration and the number.
12. The accent conversion device according to claim 9, characterized in that the source speech feature vector obtained by the source feature vector acquisition module includes the energy feature vector and the spectrum feature vector, and the source feature vector acquisition module includes:
an LPC coefficient acquiring unit, configured to encode the source voice data using linear predictive coding (LPC) to obtain the LPC coefficients a_k;
a feature vector acquiring unit, configured to compute the cepstrum of the coefficients a_k to obtain the linear prediction cepstral coefficients C_n, the C_n serving as the energy feature vector and the spectrum feature vector of the source voice data.
13. The accent conversion device according to claim 9, characterized in that the source speech feature vector obtained by the source feature vector acquisition module includes the energy feature vector and the spectrum feature vector, and the source feature vector acquisition module includes:
an envelope spectrum acquiring unit, configured to obtain the energy envelope spectrum corresponding to the source voice data from its frequency according to the Mel formula;
a scale acquiring unit, configured to feed the energy envelope spectrum into a Mel filter bank to obtain the Mel frequency scale;
an MFCC acquiring unit, configured to apply a logarithmic transform and a discrete cosine transform to the Mel frequency scale to obtain Mel-frequency cepstral coefficients (MFCC), the MFCC serving as the energy feature vector and the spectrum feature vector of the source voice data.
14. The accent conversion device according to claim 8, characterized in that the device further includes:
a sample data pair acquisition module, configured to obtain a set number of sample voice data pairs, each pair including voice data obtained by reading the same content aloud in the first accent and in the second accent respectively;
a conversion model training module, configured to train an initial feature conversion model using the sample voice data pairs, and to stop training when a termination condition is met, to obtain the trained feature conversion model;
wherein the termination condition includes: the loss value between the speech feature vector output by the initial feature conversion model and the speech feature vector corresponding to the voice data with the second accent is less than or equal to a loss threshold.
15. Electronic equipment, characterized in that the electronic equipment includes:
a receiver;
a loudspeaker;
a processor; and
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the executable instructions in the memory to implement the steps of the method according to any one of claims 1 to 7.
CN201810029495.9A 2018-01-12 2018-01-12 Accent conversion method and device and electronic equipment Active CN108108357B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810029495.9A CN108108357B (en) 2018-01-12 2018-01-12 Accent conversion method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810029495.9A CN108108357B (en) 2018-01-12 2018-01-12 Accent conversion method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN108108357A true CN108108357A (en) 2018-06-01
CN108108357B CN108108357B (en) 2022-08-09

Family

ID=62219900

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810029495.9A Active CN108108357B (en) 2018-01-12 2018-01-12 Accent conversion method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN108108357B (en)



Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1622195A (en) * 2003-11-28 2005-06-01 株式会社东芝 Speech synthesis method and speech synthesis system
US20080201150A1 (en) * 2007-02-20 2008-08-21 Kabushiki Kaisha Toshiba Voice conversion apparatus and speech synthesis apparatus
CN102227767A (en) * 2008-11-12 2011-10-26 Scti控股公司 System and method for automatic speach to text conversion
CN102982809A (en) * 2012-12-11 2013-03-20 中国科学技术大学 Conversion method for sound of speaker
CN104050965A (en) * 2013-09-02 2014-09-17 广东外语外贸大学 English phonetic pronunciation quality evaluation system with emotion recognition function and method thereof
CN104123933A (en) * 2014-08-01 2014-10-29 中国科学院自动化研究所 Self-adaptive non-parallel training based voice conversion method
US20160171974A1 (en) * 2014-12-15 2016-06-16 Baidu Usa Llc Systems and methods for speech transcription
US20160279414A1 (en) * 2015-03-26 2016-09-29 Med-El Elektromedizinische Geraete Gmbh Rate and Place of Stimulation Matched to Instantaneous Frequency
US20170278513A1 (en) * 2016-03-23 2017-09-28 Google Inc. Adaptive audio enhancement for multichannel speech recognition
CN106057192A (en) * 2016-07-07 2016-10-26 Tcl集团股份有限公司 Real-time voice conversion method and apparatus
CN106531157A (en) * 2016-10-28 2017-03-22 中国科学院自动化研究所 Regularization accent adapting method for speech recognition
CN107545903A (en) * 2017-07-19 2018-01-05 南京邮电大学 A kind of phonetics transfer method based on deep learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
DONGLAI ZHU et al., "Product of Power Spectrum and Group Delay Function for Speech Recognition", IEEE, 30 June 2004 (2004-06-30) *
DING Yao'e et al., "High-naturalness voice conversion using spectral envelope and suprasegmental prosody adjustment", Journal of Soochow University (Engineering Science Edition) (《苏州大学学报(工科版)》), no. 04, 20 August 2009 (2009-08-20) *
HUI Lin et al., "Age voice conversion using short-time-spectrum universal background model groups combined with prosody", Acta Acustica (《声学学报》), no. 06, 15 November 2017 (2017-11-15) *
LI Weilin et al., "Research on speech recognition systems based on deep neural networks", Computer Science (《计算机科学》), 15 November 2016 (2016-11-15) *
WANG Chengliang et al., "Research on a speech CAPTCHA method based on formant synthesis and prosody adjustment", Application Research of Computers (《计算机应用研究》), no. 07, 15 July 2011 (2011-07-15) *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108922556B (en) * 2018-07-16 2019-08-27 百度在线网络技术(北京)有限公司 Sound processing method, device and equipment
CN108922556A (en) * 2018-07-16 2018-11-30 百度在线网络技术(北京)有限公司 sound processing method, device and equipment
CN109741744A (en) * 2019-01-14 2019-05-10 博拉网络股份有限公司 AI robot dialog control method and system based on big data search
CN109741744B (en) * 2019-01-14 2021-03-09 博拉网络股份有限公司 AI robot conversation control method and system based on big data search
CN110136690A (en) * 2019-05-22 2019-08-16 平安科技(深圳)有限公司 Phoneme synthesizing method, device and computer readable storage medium
WO2020232860A1 (en) * 2019-05-22 2020-11-26 平安科技(深圳)有限公司 Speech synthesis method and apparatus, and computer readable storage medium
CN112308101A (en) * 2019-07-30 2021-02-02 杭州海康威视数字技术股份有限公司 Method and device for object recognition
CN112308101B (en) * 2019-07-30 2023-08-22 杭州海康威视数字技术股份有限公司 Method and device for identifying object
CN110910865A (en) * 2019-11-25 2020-03-24 秒针信息技术有限公司 Voice conversion method and device, storage medium and electronic device
CN110910865B (en) * 2019-11-25 2022-12-13 秒针信息技术有限公司 Voice conversion method and device, storage medium and electronic device
CN111223475B (en) * 2019-11-29 2022-10-14 北京达佳互联信息技术有限公司 Voice data generation method and device, electronic equipment and storage medium
CN111223475A (en) * 2019-11-29 2020-06-02 北京达佳互联信息技术有限公司 Voice data generation method and device, electronic equipment and storage medium
CN111462769A (en) * 2020-03-30 2020-07-28 深圳市声希科技有限公司 End-to-end accent conversion method
CN111462769B (en) * 2020-03-30 2023-10-27 深圳市达旦数生科技有限公司 End-to-end accent conversion method
CN113223542A (en) * 2021-04-26 2021-08-06 北京搜狗科技发展有限公司 Audio conversion method and device, storage medium and electronic equipment
CN113223542B (en) * 2021-04-26 2024-04-12 北京搜狗科技发展有限公司 Audio conversion method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN108108357B (en) 2022-08-09

Similar Documents

Publication Publication Date Title
CN108108357A (en) Accent conversion method and device, electronic equipment
Shrawankar et al. Techniques for feature extraction in speech recognition system: A comparative study
US5450522A (en) Auditory model for parametrization of speech
US20170301343A1 (en) Method and system for generating advanced feature discrimination vectors for use in speech recognition
GB1569990A (en) Frequency compensation method for use in speech analysis apparatus
CN102543073A (en) Shanghai dialect phonetic recognition information processing method
JP4516157B2 (en) Speech analysis device, speech analysis / synthesis device, correction rule information generation device, speech analysis system, speech analysis method, correction rule information generation method, and program
Tanaka et al. A hybrid approach to electrolaryngeal speech enhancement based on noise reduction and statistical excitation generation
CN111833843A (en) Speech synthesis method and system
CN112489629A (en) Voice transcription model, method, medium, and electronic device
RU2427044C1 (en) Text-dependent voice conversion method
Shanthi Therese et al. Review of feature extraction techniques in automatic speech recognition
Prasad et al. Speech features extraction techniques for robust emotional speech analysis/recognition
Dalmiya et al. An efficient method for Tamil speech recognition using MFCC and DTW for mobile applications
KR100827097B1 (en) Method for determining variable length of frame for preprocessing of a speech signal and method and apparatus for preprocessing a speech signal using the same
CN112116909A (en) Voice recognition method, device and system
Oura et al. Deep neural network based real-time speech vocoder with periodic and aperiodic inputs
Hasija et al. Recognition of Children Punjabi Speech using Tonal Non-Tonal Classifier
CN113539239B (en) Voice conversion method and device, storage medium and electronic equipment
Kameoka et al. Speech spectrum modeling for joint estimation of spectral envelope and fundamental frequency
Park et al. Improving pitch detection through emphasized harmonics in time-domain
Singh et al. A comparative study on feature extraction techniques for language identification
Trivedi A survey on English digit speech recognition using HMM
CN111862931A (en) Voice generation method and device
Mittal et al. An impulse sequence representation of the excitation source characteristics of nonverbal speech sounds

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant