CN109473091A - Speech sample generation method and device - Google Patents
Speech sample generation method and device
- Publication number
- CN109473091A CN109473091A CN201811593971.6A CN201811593971A CN109473091A CN 109473091 A CN109473091 A CN 109473091A CN 201811593971 A CN201811593971 A CN 201811593971A CN 109473091 A CN109473091 A CN 109473091A
- Authority
- CN
- China
- Prior art keywords
- voice
- variable
- voice variable
- mel
- characteristic value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0635—Training updating or merging of old and new templates; Mean values; Weighting
- G10L2015/0636—Threshold criteria for the updating
Abstract
The present invention provides a speech sample generation method and device. The method comprises: after a first voice variable is obtained, extracting the mel-frequency feature values of the first voice variable; using a neural network to compute a loss function between the mel-frequency feature values of the first voice variable and the mel-frequency feature values of a target voice; and optimizing the loss function with an optimization algorithm of the neural network by adjusting the values of the sampled points in the first voice variable, until the value of the optimized loss function is less than a preset threshold, the voice variable whose loss-function value is less than the preset threshold being the target voice sample. The method thus solves the inverse mel transform of the voice variable with a neural network and uses the network to minimize the error between the mel-frequency feature values of the voice variable and those of the target voice; the voice variable obtained when the error falls below the preset threshold serves as an adversarial sample, thereby enriching the speech sample set of a speech recognition system.
Description
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a speech sample generation method and device.
Background art
In existing speech recognition systems based on deep learning models, the robustness of the system is often insufficient because the corpus is not comprehensive enough and the speech sample set is scarce, which makes the system vulnerable to interference from adversarial samples.
Summary of the invention
The present invention provides a speech sample generation method and device to solve the problem of a scarce speech sample set in speech recognition systems.
To achieve the above goal, the technical solutions provided by the embodiments of the present invention are as follows.
In a first aspect, an embodiment of the present invention provides a speech sample generation method, comprising: after a first voice variable is obtained, extracting the mel-frequency feature values of the first voice variable, wherein the characteristic parameters of the first voice variable are identical to those of a target voice, the characteristic parameters including length, sample rate, and number of channels; using a neural network to compute a loss function between the mel-frequency feature values of the first voice variable and the mel-frequency feature values of the target voice; and optimizing the loss function with an optimization algorithm of the neural network by adjusting the values of the sampled points in the first voice variable until the value of the optimized loss function is less than a preset threshold, the voice variable whose loss-function value is less than the preset threshold being the target voice sample. The method thus solves the inverse mel transform of the voice variable with a neural network and uses the network to minimize the error between the mel-frequency feature values of the voice variable and those of the target voice; the voice variable obtained when the error falls below the preset threshold serves as an adversarial sample, thereby enriching the speech sample set of a speech recognition system.
In an optional embodiment of the present invention, extracting the mel-frequency feature values of the first voice variable comprises: applying a Fourier transform to each frame of the first voice variable to obtain a second voice variable; applying mel filtering to the second voice variable to obtain a third voice variable; and applying a discrete cosine transform to the third voice variable to obtain a mel-scale cepstrum, which is used as the mel-frequency feature values of the first voice variable. The extraction process can therefore be: Fourier transform, mel filtering, then discrete cosine transform; using the resulting mel-scale cepstrum as the mel-frequency feature values gives the voice variable a better representation.
In an optional embodiment of the present invention, after the discrete cosine transform is applied to the third voice variable to obtain the mel-scale cepstrum, the method further comprises: performing a difference operation on the mel-scale cepstrum; and using the mel-scale cepstrum as the mel-frequency feature values of the first voice variable comprises: inserting the result of the difference operation into the mel-scale cepstrum to obtain the mel-frequency feature values of the first voice variable. The differences between the mel-scale cepstra of adjacent frames, which represent the inter-frame dynamic information of the voice variable, are thus appended to the mel-scale cepstrum and together serve as the mel-frequency feature values, so that a speech recognition system trained on this voice variable has a wider range of application.
In an optional embodiment of the present invention, before the Fourier transform is applied to each frame of the first voice variable to obtain the second voice variable, the method further comprises: applying high-pass filtering to the first voice variable, dividing the filtered first voice variable into consecutive frames, and applying a window to each frame. Before the mel-frequency feature values are computed, the voice variable can therefore first undergo pre-processing such as filtering, framing, and windowing, which makes the processed voice variable better suited to computing the mel-frequency feature values.
In an optional embodiment of the present invention, obtaining the first voice variable comprises: generating a speech segment; and formatting the speech segment to obtain the first voice variable, so that the characteristic parameters of the first voice variable are identical to those of the target voice. The voice variable can therefore be a randomly generated speech segment whose characteristic parameters, such as length, sample rate, and number of channels, match those of the target voice, which ensures that the voice variable obtained by the final optimization can serve as a sample for the speech recognition system.
In an optional embodiment of the present invention, before the mel-frequency feature values of the first voice variable are extracted, the method further comprises: obtaining the target voice; and extracting the mel-frequency feature values of the target voice. Before the voice variable is processed, a target voice segment can therefore be obtained first; this target voice is the optimization target of the voice variable.
In an optional embodiment of the present invention, after the voice variable whose loss-function value is less than the preset threshold is taken as the target voice sample, the method further comprises: training a speech recognition system with the neural network, using the target voice sample as a training sample. Once a voice variable meeting the criterion has been obtained with the neural network, it can therefore be used as a training sample to train the speech recognition system and improve its robustness.
In a second aspect, an embodiment of the present invention provides a speech sample generation device, comprising: a first extraction module configured to extract, after a first voice variable is obtained, the mel-frequency feature values of the first voice variable, wherein the characteristic parameters of the first voice variable are identical to those of a target voice, the characteristic parameters including length, sample rate, and number of channels; a first computation module configured to use a neural network to compute a loss function between the mel-frequency feature values of the first voice variable and the mel-frequency feature values of the target voice; and an optimization module configured to optimize the loss function with an optimization algorithm of the neural network by adjusting the values of the sampled points in the first voice variable until the value of the optimized loss function is less than a preset threshold, the voice variable whose loss-function value is less than the preset threshold being the target voice sample. The first extraction module thus solves the inverse mel transform of the voice variable with the neural network, and the optimization module uses the network to minimize the error between the mel-frequency feature values of the voice variable and those of the target voice; the voice variable obtained when the error falls below the preset threshold serves as an adversarial sample, thereby enriching the speech sample set of a speech recognition system.
In an optional embodiment of the present invention, the first extraction module comprises: a first transform module configured to apply a Fourier transform to each frame of the first voice variable to obtain a second voice variable; a first filter module configured to apply mel filtering to the second voice variable to obtain a third voice variable; and a second transform module configured to apply a discrete cosine transform to the third voice variable to obtain a mel-scale cepstrum and to use the mel-scale cepstrum as the mel-frequency feature values of the first voice variable. The extraction process of the first extraction module can therefore be: a Fourier transform in the first transform module, mel filtering in the first filter module, and a discrete cosine transform in the second transform module; using the resulting mel-scale cepstrum as the mel-frequency feature values gives the voice variable a better representation.
In an optional embodiment of the present invention, the device further comprises a second computation module configured to perform a difference operation on the mel-scale cepstrum, and the second transform module comprises an insertion module configured to insert the result of the difference operation into the mel-scale cepstrum to obtain the mel-frequency feature values of the first voice variable. The differences between the mel-scale cepstra of adjacent frames, computed by the second computation module and representing the inter-frame dynamic information of the voice variable, are thus appended to the mel-scale cepstrum by the insertion module and together serve as the mel-frequency feature values, so that a speech recognition system trained on this voice variable has a wider range of application.
In an optional embodiment of the present invention, the device further comprises a third filter module configured to apply high-pass filtering to the first voice variable, divide the filtered first voice variable into consecutive frames, and apply a window to each frame. Before the first extraction module computes the mel-frequency feature values, the third filter module can therefore pre-process the voice variable by filtering, framing, and windowing, which makes the processed voice variable better suited to computing the mel-frequency feature values.
In an optional embodiment of the present invention, the first extraction module comprises: a generation module configured to generate a speech segment; and a formatting module configured to format the speech segment to obtain the first voice variable, so that the characteristic parameters of the first voice variable are identical to those of the target voice. The voice variable can therefore be a speech segment randomly generated by the generation module whose characteristic parameters, such as length, sample rate, and number of channels, match those of the target voice, which ensures that the voice variable obtained by the final optimization can serve as a sample for the speech recognition system.
In an optional embodiment of the present invention, the device further comprises: an obtaining module configured to obtain the target voice; and a second extraction module configured to extract the mel-frequency feature values of the target voice. Before the voice variable is processed, the obtaining module can therefore obtain a target voice segment first; this target voice is the optimization target of the voice variable.
In an optional embodiment of the present invention, the device further comprises a training module configured to train a speech recognition system with the neural network, using the target voice sample as a training sample. Once a voice variable meeting the criterion has been obtained with the neural network, the training module can therefore use it as a training sample to train the speech recognition system and improve its robustness.
In a third aspect, an embodiment of the present invention provides an electronic device comprising a processor, a memory, and a bus. The memory stores machine-readable instructions executable by the processor. When the electronic device runs, the processor and the memory communicate over the bus, and when the machine-readable instructions are executed by the processor, the method of any implementation of the first aspect is performed.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when run by a processor, performs the method of any optional implementation of the first aspect.
To make the above objects, features, and advantages of the present invention clearer and easier to understand, embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed for the embodiments are briefly described below. It should be understood that the following drawings illustrate only certain embodiments of the present invention and should therefore not be regarded as limiting its scope; those of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.
Fig. 1 is a kind of flow chart of speech samples generation method provided in an embodiment of the present invention;
Fig. 2 is the flow chart of another speech samples generation method provided in an embodiment of the present invention;
Fig. 3 is the flow chart of another speech samples generation method provided in an embodiment of the present invention;
Fig. 4 is the flow chart of another speech samples generation method provided in an embodiment of the present invention;
Fig. 5 is the flow chart of another speech samples generation method provided in an embodiment of the present invention;
Fig. 6 is a kind of structural block diagram of speech samples generating means provided in an embodiment of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. The components of the embodiments, as generally described and illustrated in the drawings herein, can be arranged and designed in a variety of different configurations. The detailed description of the embodiments provided with the drawings below is therefore not intended to limit the claimed scope of the present invention but merely represents selected embodiments; all other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative work fall within its protection scope.
It should also be noted that similar reference numbers and letters denote similar items in the following drawings; once an item has been defined in one drawing, it does not need to be further defined or explained in subsequent drawings.
In the description of the present invention, it should be noted that orientation or position terms such as "middle", "upper", "lower", "horizontal", "inner", and "outer" are based on the orientations or positions shown in the drawings, or on the orientations in which the product of the invention is usually placed in use. They are used only to simplify the description and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; they should therefore not be understood as limiting the present invention. In addition, terms such as "first" and "second" are used only to distinguish descriptions and should not be understood as indicating or implying relative importance.
In addition, terms such as "horizontal" and "vertical" do not require components to be absolutely horizontal or vertical; they may be slightly inclined. "Horizontal" merely means that a direction is more nearly horizontal than "vertical"; it does not mean the structure must be completely horizontal.
In the description of the present invention, it should also be noted that, unless otherwise expressly specified and limited, the terms "arranged", "connected to", and "connected" are to be understood broadly: a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediary, or internal between two elements. For those of ordinary skill in the art, the specific meanings of the above terms in the present invention can be understood according to the specific situation.
Some embodiments of the present invention are described in detail below with reference to the drawings. Provided there is no conflict, the following embodiments and the features in them can be combined with one another.
First embodiment
An embodiment of the present invention provides a speech sample generation method. Referring to Fig. 1, which is a flow chart of a speech sample generation method provided by an embodiment of the present invention, the method comprises the following steps.
Step S100: after the first voice variable is obtained, extract the mel-frequency feature values of the first voice variable.
Specifically, in the field of acoustic processing, the mel-frequency cepstrum is a linear transform of the log-energy spectrum on a nonlinear mel scale of sound frequency. Compared with the linearly spaced frequency bands of the normal cepstrum, it better approximates the human auditory system, so this nonlinear representation gives the speech signal a better representation in many fields. Extracting the mel-frequency cepstral coefficient (MFCC) feature values of a speech segment is a conventional means for those skilled in the art and can be done in many ways; the embodiments of the present invention impose no specific limitation.
For example, pre-emphasis, framing, and windowing are first applied to the speech; then, for each short-time analysis window, the corresponding spectrum is obtained by a fast Fourier transform (FFT). The spectrum obtained above is then passed through a mel filter bank to obtain the mel spectrum, and finally cepstral analysis is performed on the mel spectrum to obtain the mel-frequency cepstral coefficients, which are the features of that speech frame. The cepstral analysis may include taking the logarithm and applying an inverse transform; in practice the inverse transform is generally realized by a discrete cosine transform (DCT).
It should be noted that, in the embodiments of the invention, the first voice variable on which mel-frequency feature extraction is performed can be a randomly generated speech segment, noise, silence, or any speech. After such a segment is obtained, however, it needs to be formatted so that its characteristic parameters are identical to those of a target voice segment, where the characteristic parameters may include length, sample rate, and number of channels. The target voice is the reference object of the voice variable, which ultimately needs to be as close as possible to the target voice. Besides using formatting to guarantee that the characteristic parameters of the first voice variable match those of the target voice, a speech segment whose characteristic parameters already match those of the target voice can be generated directly, which simplifies the formatting of the segment.
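Obtaining such a first voice variable can be sketched as follows — a minimal NumPy illustration in which the function name, the uniform-noise initialization, and the one-second 16 kHz mono target are assumptions of this sketch, not values fixed by the invention; the length and channel layout are matched implicitly by reusing the target's array shape:

```python
import numpy as np

def make_initial_variable(target: np.ndarray, seed: int = 0) -> np.ndarray:
    """Generate a random first voice variable whose characteristic
    parameters (length and channel layout, taken from the target's
    shape) match the target voice.  Uniform noise is one admissible
    starting point; silence or any other speech segment would also do."""
    rng = np.random.default_rng(seed)
    return rng.uniform(-1.0, 1.0, size=target.shape)

# Usage: a mono target of one second at 16 kHz.
target = np.zeros(16000)
first_variable = make_initial_variable(target)
```

Starting from noise rather than silence tends to give the later optimization more varied sampled-point values to adjust, though the method itself does not require it.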
Step S200: use a neural network to compute the loss function between the mel-frequency feature values of the first voice variable and the mel-frequency feature values of the target voice.
Specifically, after the mel-frequency feature values of the first voice variable have been computed in step S100, a loss function can be used to express the error between the mel-frequency feature values of the first voice variable and those of the target voice, in order to judge the error between the first voice variable and the target voice at this point. The process can be realized by a neural network, using a log loss function, a quadratic loss function, a penalty loss function, or another loss function in the neural network to obtain this error. It should be noted that solving for the error with a loss function in a neural network is a conventional means for those skilled in the art and can be done in many ways; the embodiments of the present invention impose no specific limitation.
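Taking the quadratic loss mentioned above as a concrete choice, the error between the two sets of mel-frequency feature values can be sketched as follows; the function name and the (frames × coefficients) matrix shapes are illustrative assumptions, and in a neural-network framework the same expression would be built from differentiable operations:

```python
import numpy as np

def mfcc_loss(mfcc_variable: np.ndarray, mfcc_target: np.ndarray) -> float:
    """Quadratic loss: mean squared error between the mel-frequency
    feature matrix of the first voice variable and that of the target."""
    return float(np.mean((mfcc_variable - mfcc_target) ** 2))

# Usage: two (frames x coefficients) MFCC matrices.
a = np.ones((10, 13))
b = np.zeros((10, 13))
print(mfcc_loss(a, b))  # 1.0
```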
Step S300: optimize the loss function with the optimization algorithm in the neural network by adjusting the values of the sampled points in the first voice variable, until the value of the optimized loss function is less than the preset threshold; the voice variable whose loss-function value is less than the preset threshold is the target voice sample.
Specifically, after the loss function in the neural network has expressed the error between the mel-frequency feature values of the first voice variable and those of the target voice in step S200, the optimization algorithm in the neural network can be used to optimize this loss function. Each time an error value is computed, it is compared with the preset threshold. When the error value is greater than the preset threshold, the values of several sampled points in the voice variable are changed, the mel-frequency feature values of the new voice variable are computed, the loss function between those feature values and the mel-frequency feature values of the target voice is computed, and the optimized error value is compared with the preset threshold again; as long as the error value remains greater than the preset threshold, this process is repeated. When the error value is less than the preset threshold, the optimization ends and the voice variable corresponding to the current error value is output; this voice variable is the satisfactory voice sample closest to the target voice in the embodiment of the present invention. The preset threshold is a fixed value set in advance by the user according to the actual situation: during the optimization of the loss function by the neural network, the loss function is considered to have reached its minimum once the optimization result is less than this fixed value, which prevents the optimization process from becoming excessively long without ever producing a satisfactory voice variable. It should be noted that using a neural network to optimize a loss function so as to obtain an optimal solution that makes the loss function sufficiently small is a conventional means for those skilled in the art; the loss function can be optimized in many ways, such as gradient descent or the least-squares method, and the embodiments of the present invention impose no specific limitation.
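The iterate-compare-adjust loop described above can be sketched as follows under two simplifying assumptions: the feature extractor is passed in as a generic function, and the gradient over the sampled points is estimated by finite differences (a neural-network framework would instead back-propagate through the mel-frequency pipeline). All names and defaults are choices of this sketch:

```python
import numpy as np

def optimize_variable(x0, target_features, feature_fn,
                      lr=0.5, threshold=1e-4, max_iters=500):
    """Sketch of step S300: repeatedly adjust the sampled points of the
    voice variable until the loss between its features and the target's
    features falls below the preset threshold."""
    x = np.asarray(x0, dtype=float).copy()
    eps = 1e-3
    for _ in range(max_iters):
        loss = np.mean((feature_fn(x) - target_features) ** 2)
        if loss < threshold:          # optimization criterion met
            break
        grad = np.zeros_like(x)       # finite-difference gradient estimate
        for i in range(x.size):
            x_eps = x.copy()
            x_eps.flat[i] += eps
            grad.flat[i] = (np.mean((feature_fn(x_eps) - target_features) ** 2)
                            - loss) / eps
        x -= lr * grad                # adjust the sampled points
    return x

# Usage with a trivial identity "feature extractor".
target = np.array([0.2, -0.1, 0.4])
result = optimize_variable(np.zeros(3), target, lambda z: z)
```

The `max_iters` cap plays the same role as the preset threshold discussed above: it keeps the optimization from running indefinitely when no satisfactory voice variable is reachable.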
In the embodiments of the present invention, the inverse mel transform of the voice variable is thus solved with a neural network, and the network optimizes the error between the mel-frequency feature values of the voice variable and those of the target voice so as to obtain the voice variable at which the error is less than the preset threshold; this voice variable is taken as an adversarial sample, thereby enriching the speech sample set of the speech recognition system.
Further, referring to Fig. 2, which is a flow chart of another speech sample generation method provided by an embodiment of the present invention, step S100 comprises the following steps.
Step S110: apply a Fourier transform to each frame of the first voice variable to obtain the second voice variable.
Specifically, while solving for the mel-frequency feature values of the first voice variable, a Fourier transform can first be applied to the first voice variable: a short-time Fourier transform is applied to each frame of the first voice variable and the power spectrum of each frame is computed to obtain the second voice variable, so that the information of the first voice variable is transformed from the time domain to the frequency domain.
Step S120: apply mel filtering to the second voice variable to obtain the third voice variable.
Specifically, the second voice variable obtained by the Fourier transform is passed through a mel filter for mel filtering. In the embodiment of the present invention, n triangular band-pass filters, which can be equally spaced on the mel-frequency scale, can be used as the mel filter. Multiplying the second voice variable signal by these n triangular band-pass filters yields the log energy output by each filter, which constitutes the third voice variable.
Step S130: apply a discrete cosine transform to the third voice variable to obtain the mel-scale cepstrum, and use the mel-scale cepstrum as the mel-frequency feature values of the first voice variable.
Specifically, a discrete cosine transform is applied to the third voice variable, i.e., the n log energies, to obtain the mel-scale cepstrum of order L; this cepstrum constitutes the mel-frequency feature values of the first voice variable.
In the embodiments of the present invention, the process of extracting the mel-frequency feature values of the voice variable can therefore be: Fourier transform, mel filtering, then discrete cosine transform; using the resulting mel-scale cepstrum as the mel-frequency feature values gives the voice variable a better representation.
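Steps S110 to S130 can be sketched end to end as follows — a simplified NumPy illustration that assumes the input has already been framed, and in which the filter count n = 26, the cepstrum order L = 13, and the unnormalized DCT-II basis are conventional assumptions of this sketch rather than values fixed by the invention:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """n triangular band-pass filters, equally spaced on the mel scale."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):        # rising edge of the triangle
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):       # falling edge of the triangle
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc_from_frames(frames, sr, n_filters=26, n_ceps=13):
    """Steps S110-S130: per-frame FFT power spectrum (second voice
    variable), mel filtering and log energies (third voice variable),
    then a DCT-II giving the order-L mel-scale cepstrum."""
    n_fft = frames.shape[1]
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    fb = mel_filterbank(n_filters, n_fft, sr)
    energies = np.log(power @ fb.T + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2.0 * n_filters)))
    return energies @ dct.T

# Usage: four random 256-sample frames at an assumed 8 kHz sample rate.
frames = np.random.default_rng(1).standard_normal((4, 256))
cepstrum = mfcc_from_frames(frames, sr=8000)
```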
Further, referring to Fig. 3, which is a flow chart of another speech sample generation method provided by an embodiment of the present invention, steps S110 to S130 can be replaced with the following steps.
Step S110: apply a Fourier transform to each frame of the first voice variable to obtain the second voice variable.
Step S120: apply mel filtering to the second voice variable to obtain the third voice variable.
Step S131: apply a discrete cosine transform to the third voice variable to obtain the mel-scale cepstrum.
Step S140: A difference operation is performed on the Mel-scale cepstrum.
Specifically, after the Mel-scale cepstrum of the first voice variable has been obtained by the discrete cosine transform in step S131, a discrete difference operation may be performed on the Mel-scale cepstrum: a discrete first-order difference may be calculated, a discrete second-order difference may be calculated, or both may be calculated, yielding the differenced values.
Step S150: The result of the difference operation is inserted into the Mel-scale cepstrum to obtain the Mel-frequency characteristic value of the first voice variable.
Specifically, the values obtained by the discrete difference operation in step S140 are inserted into the Mel-scale cepstrum of the first voice variable, as dynamic information between frames of the first voice variable, to obtain the Mel-frequency characteristic value of the first voice variable. It should be noted that only the values obtained by the discrete first-order difference may be inserted, only the values obtained by the discrete second-order difference may be inserted, or both may be inserted at the same time.
In the embodiment of the present invention, the differences between the Mel-scale cepstra of adjacent frames of the voice variable are extracted as parameters representing the dynamic information between frames of the voice variable and are appended to the Mel-scale cepstrum, so that together they form the Mel-frequency characteristic value of the voice variable; a speech recognition system trained with such voice variables therefore has a wider range of application.
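The first- and second-order difference calculations, and their insertion alongside the static cepstrum as inter-frame dynamic information, might look like this (the edge padding and the simple two-point central difference are assumptions of the sketch):

```python
import numpy as np

def delta(cepstra):
    """First-order difference along the frame axis (frames x coefficients)."""
    padded = np.pad(cepstra, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

# 100 frames of 13 static Mel-scale cepstral coefficients.
mfcc = np.random.randn(100, 13)
d1 = delta(mfcc)   # discrete first-order difference
d2 = delta(d1)     # discrete second-order difference

# Append either or both differences to the static cepstrum.
features = np.concatenate([mfcc, d1, d2], axis=1)
print(features.shape)  # (100, 39)
```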
Further, the following step is included before step S110:
Step S160: High-pass filtering is performed on the first voice variable, the filtered first voice variable is divided into consecutive frames, and windowing is performed on each frame.
Specifically, before the Mel-frequency characteristic value of the first voice variable is computed, a series of preprocessing operations may first be applied to it. First, pre-emphasis is applied to the first voice variable, i.e. the first voice signal is passed through a high-pass filter to eliminate the effect of the vocal cords and lips on the voice signal during its generation, thereby compensating the high-frequency part of the first voice signal that is suppressed by the articulatory system.
Second, framing is applied to the pre-emphasized first voice signal, i.e. the continuous first voice signal is divided into multiple consecutive frames. The length of each frame may be kept within the range of 20-50 milliseconds, and the corresponding number of sampling points equals the product of the sampling rate of the first voice signal and the frame length.
Finally, in order to keep the two endpoints of each frame smooth and continuous, windowing may be applied to every frame of the framed first voice signal. This is because, when the Fourier transform is applied to the first voice signal in a subsequent step, the signal within a frame is assumed to represent a periodic signal; if this periodicity does not hold, the Fourier transform has to accommodate the discontinuity between the left and right ends of the frame and produces energy distributions that do not exist in the original signal, causing errors in the analysis. In the embodiment of the present invention, each frame of the framed first voice signal is multiplied by a Hamming window of the same length, so as to keep the left and right ends of the frame continuous.
It should be noted that the above specific ways of processing the first voice signal and its data are only some of the schemes provided by the embodiments of the present invention; on this basis, those skilled in the art can easily conceive of other signal-processing approaches, which also fall within the protection scope of the embodiments of the present invention.
In the embodiment of the present invention, before the Mel-frequency characteristic value of the voice variable is computed, the voice variable may first undergo preprocessing such as pre-emphasis filtering, framing and windowing, so that the processed voice variable is better suited to computing the Mel-frequency characteristic value.
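The pre-emphasis, framing and Hamming-window steps above can be sketched as follows; the 25-millisecond frame length and pre-emphasis coefficient 0.97 are illustrative choices within the ranges the text describes:

```python
import numpy as np

def preprocess(signal, sr, frame_ms=25, alpha=0.97):
    # Pre-emphasis: first-order high-pass filter y[t] = x[t] - alpha * x[t-1].
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing: samples per frame = sample rate * frame length in seconds.
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(emphasized) // frame_len
    frames = emphasized[: n_frames * frame_len].reshape(n_frames, frame_len)
    # Windowing: multiply every frame by a Hamming window of the same length.
    return frames * np.hamming(frame_len)

sr = 16000
signal = np.random.randn(sr)   # one second of audio
frames = preprocess(signal, sr)
print(frames.shape)  # (40, 400)
```

A production implementation would typically also overlap adjacent frames; the sketch keeps them disjoint for brevity.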
Further, referring to Fig. 4, which is a flow chart of another speech sample generation method provided by an embodiment of the present invention, step S100 further includes the following steps:
Step S170: A speech fragment is generated.
Specifically, the generation of the speech fragment is a random process; it may be obtained by recording a segment of audio, downloading a segment of speech, and so on, and the generated speech fragment may be a segment of noise, silence, or any speech.
Step S180: The speech fragment is formatted to obtain the first voice variable, so that the characteristic parameters of the first voice variable are identical to the characteristic parameters of the target voice.
Specifically, after the speech fragment is generated in step S170, it may be formatted so that the characteristic parameters of the formatted first voice variable are identical to those of a segment of target voice; these characteristic parameters may include length, sample rate, channels and the like. Here, the target voice is the original speech fragment used as a sample for training in the neural network.
In the embodiment of the present invention, the voice variable may be a randomly generated segment of speech whose characteristic parameters, such as length, sample rate and channels, should be identical to those of the target voice; this ensures that the voice variable obtained by the final optimization can serve as a sample for the speech recognition system.
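The formatting step might be sketched as follows for the length parameter; handling sample rate and channels analogously would require a real resampler and channel mixer, which this sketch deliberately omits:

```python
import numpy as np

def format_to_target(snippet, target_len):
    """Format a randomly generated fragment so its length matches the target.
    A sketch: truncate or zero-pad to the target length."""
    if len(snippet) >= target_len:
        return snippet[:target_len]
    return np.pad(snippet, (0, target_len - len(snippet)))

# Random fragment (could equally be recorded noise, silence or any speech).
snippet = np.random.uniform(-1.0, 1.0, size=12000)
target_len = 32000   # e.g. two seconds of 16 kHz target speech
first_voice_variable = format_to_target(snippet, target_len)
print(first_voice_variable.shape)  # (32000,)
```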
Further, referring to Fig. 5, which is a flow chart of another speech sample generation method provided by an embodiment of the present invention, the following steps are included before step S100:
Step S400: The target voice is obtained.
Step S500: The Mel-frequency characteristic value of the target voice is extracted.
Specifically, the target voice is the original speech fragment used as a sample for training in the neural network. The Mel-frequency characteristic value of the target voice may be extracted in the same way as in step S100, so that the optimization algorithm of the neural network can make the first voice variable approach the target voice as closely as possible; this guarantees that the first voice variable obtained by the final optimization can be used as a sample for training the neural network, thereby increasing the robustness of the speech recognition system.
In the embodiment of the present invention, before the voice variable is processed, a segment of target voice may first be obtained; this target voice is the target of the optimization of the voice variable.
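The optimization of the first voice variable toward the target voice can be illustrated with a deliberately simplified stand-in: a fixed linear map plays the role of the differentiable Mel-frequency feature extractor, and plain gradient descent on the sampling points plays the role of the network's optimization algorithm. The shapes, learning rate and threshold are all assumptions of the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((32, 256))   # stand-in linear "feature extractor"
target = rng.standard_normal(256)    # sampling points of the target voice
target_feat = A @ target             # its feature values

x = rng.uniform(-1.0, 1.0, 256)      # first voice variable, randomly initialised
lr, threshold = 5e-3, 1e-4
for step in range(2000):
    residual = A @ x - target_feat
    loss = np.mean(residual ** 2)    # loss between the two feature values
    if loss < threshold:             # stop once below the preset threshold
        break
    # Gradient of the loss with respect to the sampling points of x.
    x -= lr * (2.0 / len(residual)) * (A.T @ residual)
print(loss < threshold)
```

The variable `x` whose loss falls below the threshold is then kept as the generated sample, mirroring steps S200/S300 of the method.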
Further, the following step is included after step S300:
Step S600: The target voice sample is used as a sample to train the speech recognition system with the neural network.
Specifically, once a suitable first voice variable has been obtained, the first voice variable is a new adversarial sample for the speech recognition system. The first voice variable may be saved as speech that has the same length, sample rate, channels and other features as the target voice, with the amplitude of the speech waveform generally taken within the normal range, i.e. between -2^15 and 2^15 - 1. This speech is then added to the original speech recognition system for adversarial training, so as to enhance the robustness of the existing speech recognition system.
In the embodiment of the present invention, after a voice variable meeting the criterion has been obtained with the neural network, it can be used as a training sample to train the speech recognition system, so as to improve the robustness of that speech system.
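Saving the optimized variable within the normal amplitude range described above (between -2^15 and 2^15 - 1, i.e. 16-bit PCM) might look like this:

```python
import numpy as np

def to_pcm16(x):
    """Clip the optimized waveform to the normal amplitude range
    [-2**15, 2**15 - 1] and store it as 16-bit integer samples."""
    clipped = np.clip(np.round(x), -2**15, 2**15 - 1)
    return clipped.astype(np.int16)

waveform = np.random.randn(16000) * 40000.0   # some samples out of range
pcm = to_pcm16(waveform)
print(pcm.dtype)  # int16
```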
Second embodiment
An embodiment of the present invention provides a speech sample generation apparatus 600. Referring to Fig. 6, which is a structural block diagram of a speech sample generation apparatus provided by an embodiment of the present invention, the apparatus includes: a first extraction module 610, configured to extract the Mel-frequency characteristic value of a first voice variable after the first voice variable is obtained, wherein the characteristic parameters of the first voice variable are identical to those of a target voice, and the characteristic parameters include length, sample rate and channels; a first computing module 620, configured to use a neural network to compute a loss function between the Mel-frequency characteristic value of the first voice variable and the Mel-frequency characteristic value of the target voice; and an optimization module 630, configured to use the optimization algorithm in the neural network to optimize the loss function by adjusting the values of the sampling points in the first voice variable, until the value of the optimized loss function is less than a preset threshold; the voice variable for which the value of the loss function is less than the preset threshold is the target voice sample.
In the embodiment of the present invention, the first extraction module 610 solves the inverse Mel transform of the voice variable based on the neural network, and the optimization module 630 uses the neural network to optimize the error between the Mel-frequency characteristic value of the voice variable and that of the target voice, so as to obtain the voice variable whose error is less than the preset threshold; that voice variable is then taken as an adversarial sample, thereby enriching the speech sample set of the speech recognition system.
Further, the first extraction module 610 includes: a first transform module, configured to perform a Fourier transform on each frame of the first voice variable to obtain a second voice variable; a first filtering module, configured to perform Mel filtering on the second voice variable to obtain a third voice variable; and a second transform module, configured to perform a discrete cosine transform on the third voice variable to obtain a Mel-scale cepstrum, and to use the Mel-scale cepstrum as the Mel-frequency characteristic value of the first voice variable.
In the embodiment of the present invention, the process by which the first extraction module 610 extracts the Mel-frequency characteristic value of the voice variable may be: performing the Fourier transform with the first transform module, performing Mel filtering with the first filtering module, and performing the discrete cosine transform with the second transform module, so that the resulting Mel-scale cepstrum serves as the Mel-frequency characteristic value of the voice variable, giving the voice variable a better representation.
Further, the apparatus also includes a second computing module, configured to perform a difference operation on the Mel-scale cepstrum; the second transform module includes an insertion module, configured to insert the result of the difference operation into the Mel-scale cepstrum to obtain the Mel-frequency characteristic value of the first voice variable.
In the embodiment of the present invention, the differences between the Mel-scale cepstra of adjacent frames of the voice variable, computed by the second computing module, are extracted as parameters representing the dynamic information between frames of the voice variable and are appended to the Mel-scale cepstrum by the insertion module, so that together they form the Mel-frequency characteristic value of the voice variable; a speech recognition system trained with such voice variables therefore has a wider range of application.
Further, the apparatus also includes a third filtering module, configured to perform high-pass filtering on the first voice variable, divide the filtered first voice variable into consecutive frames, and perform windowing on each frame.
In the embodiment of the present invention, before the first extraction module 610 computes the Mel-frequency characteristic value of the voice variable, the third filtering module may first apply preprocessing such as pre-emphasis filtering, framing and windowing to the voice variable, so that the processed voice variable is better suited to computing the Mel-frequency characteristic value.
Further, the first extraction module 610 includes: a generation module, configured to generate a speech fragment; and a formatting module, configured to format the speech fragment to obtain the first voice variable, so that the characteristic parameters of the first voice variable are identical to the characteristic parameters of the target voice.
In the embodiment of the present invention, the voice variable may be a segment of speech randomly generated by the generation module, whose characteristic parameters, such as length, sample rate and channels, should be identical to those of the target voice, so as to ensure that the voice variable obtained by the final optimization can serve as a sample for the speech recognition system.
Further, the apparatus also includes: an obtaining module, configured to obtain the target voice; and a second extraction module, configured to extract the Mel-frequency characteristic value of the target voice.
In the embodiment of the present invention, before the voice variable is processed, a segment of target voice may first be obtained with the obtaining module; this target voice is the target of the optimization of the voice variable.
Further, the apparatus also includes a training module, configured to use the target voice sample as a sample to train the speech recognition system with the neural network.
In the embodiment of the present invention, after a voice variable meeting the criterion has been obtained with the neural network, the training module can use it as a training sample to train the speech recognition system, so as to improve the robustness of that speech system.
3rd embodiment
An embodiment of the present invention provides an electronic device, comprising a processor, a memory and a bus. The memory stores machine-readable instructions executable by the processor. When the electronic device runs, the processor and the memory communicate over the bus, and when the machine-readable instructions are executed by the processor, any of the methods described in the first embodiment is performed.
The memory may include, but is not limited to, random access memory (Random Access Memory, RAM), read-only memory (Read Only Memory, ROM), programmable read-only memory (Programmable Read-Only Memory, PROM), erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), and the like.
The processor may be an integrated circuit chip with signal processing capability. The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and can implement or execute the methods, steps and logic diagrams disclosed in the embodiments of the present invention. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, and so on.
Fourth embodiment
An embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, any of the methods of any optional implementation of the first embodiment is performed.
In summary, the present invention provides a speech sample generation method and apparatus. The method comprises: after a first voice variable is obtained, extracting the Mel-frequency characteristic value of the first voice variable, wherein the characteristic parameters of the first voice variable are identical to those of a target voice, and the characteristic parameters include length, sample rate and channels; using a neural network to compute a loss function between the Mel-frequency characteristic value of the first voice variable and the Mel-frequency characteristic value of the target voice; and using the optimization algorithm in the neural network to optimize the loss function by adjusting the values of the sampling points in the first voice variable, until the value of the optimized loss function is less than a preset threshold, the voice variable for which the value of the loss function is less than the preset threshold being the target voice sample. Thus, the inverse Mel transform of the voice variable is solved based on the neural network, and the neural network is used to optimize the error between the Mel-frequency characteristic value of the voice variable and that of the target voice, so as to obtain the voice variable whose error is less than the preset threshold; that voice variable is then taken as an adversarial sample, thereby enriching the speech sample set of the speech recognition system.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the present invention; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.
The above is only a specific implementation, but the protection scope of the present invention is not limited thereto; any person familiar with the art can, within the technical scope disclosed by the present invention, easily think of changes or replacements, which shall all be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or also includes elements inherent to such a process, method, article or device. In the absence of further restrictions, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.
Claims (10)
1. A speech sample generation method, characterized by comprising:
after obtaining a first voice variable, extracting a Mel-frequency characteristic value of the first voice variable, wherein characteristic parameters of the first voice variable are identical to characteristic parameters of a target voice, and the characteristic parameters include: length, sample rate and channels;
using a neural network to compute a loss function between the Mel-frequency characteristic value of the first voice variable and a Mel-frequency characteristic value of the target voice;
using an optimization algorithm in the neural network to optimize the loss function by adjusting values of sampling points in the first voice variable, until a value of the optimized loss function is less than a preset threshold, the voice variable for which the value of the loss function is less than the preset threshold being a target voice sample.
2. The speech sample generation method according to claim 1, characterized in that extracting the Mel-frequency characteristic value of the first voice variable comprises:
performing a Fourier transform on each frame of the first voice variable to obtain a second voice variable;
performing Mel filtering on the second voice variable to obtain a third voice variable;
performing a discrete cosine transform on the third voice variable to obtain a Mel-scale cepstrum, and using the Mel-scale cepstrum as the Mel-frequency characteristic value of the first voice variable.
3. The speech sample generation method according to claim 2, characterized in that, after performing the discrete cosine transform on the third voice variable to obtain the Mel-scale cepstrum, the method further comprises:
performing a difference operation on the Mel-scale cepstrum;
and using the Mel-scale cepstrum as the Mel-frequency characteristic value of the first voice variable comprises:
inserting a result of the difference operation into the Mel-scale cepstrum to obtain the Mel-frequency characteristic value of the first voice variable.
4. The speech sample generation method according to claim 2, characterized in that, before performing the Fourier transform on each frame of the first voice variable to obtain the second voice variable, the method further comprises:
performing high-pass filtering on the first voice variable, dividing the filtered first voice variable into consecutive frames, and performing windowing on each frame.
5. The speech sample generation method according to claim 1, characterized in that obtaining the first voice variable comprises:
generating a speech fragment;
formatting the speech fragment to obtain the first voice variable, so that the characteristic parameters of the first voice variable are identical to the characteristic parameters of the target voice.
6. The speech sample generation method according to claim 1, characterized in that, before extracting the Mel-frequency characteristic value of the first voice variable, the method further comprises:
obtaining the target voice;
extracting the Mel-frequency characteristic value of the target voice.
7. The speech sample generation method according to any one of claims 1-6, characterized in that, after the voice variable for which the value of the loss function is less than the preset threshold is taken as the target voice sample, the method further comprises:
using the target voice sample as a sample to train a speech recognition system with the neural network.
8. A speech sample generation apparatus, characterized by comprising:
a first extraction module, configured to extract a Mel-frequency characteristic value of a first voice variable after the first voice variable is obtained, wherein characteristic parameters of the first voice variable are identical to characteristic parameters of a target voice, and the characteristic parameters include: length, sample rate and channels;
a first computing module, configured to use a neural network to compute a loss function between the Mel-frequency characteristic value of the first voice variable and a Mel-frequency characteristic value of the target voice;
an optimization module, configured to use an optimization algorithm in the neural network to optimize the loss function by adjusting values of sampling points in the first voice variable, until a value of the optimized loss function is less than a preset threshold, the voice variable for which the value of the loss function is less than the preset threshold being a target voice sample.
9. The speech sample generation apparatus according to claim 8, characterized in that the first extraction module comprises:
a first transform module, configured to perform a Fourier transform on each frame of the first voice variable to obtain a second voice variable;
a first filtering module, configured to perform Mel filtering on the second voice variable to obtain a third voice variable;
a second transform module, configured to perform a discrete cosine transform on the third voice variable to obtain a Mel-scale cepstrum, and to use the Mel-scale cepstrum as the Mel-frequency characteristic value of the first voice variable.
10. The speech sample generation apparatus according to claim 9, characterized in that the apparatus further comprises:
a second computing module, configured to perform a difference operation on the Mel-scale cepstrum;
and the second transform module comprises:
an insertion module, configured to insert a result of the difference operation into the Mel-scale cepstrum to obtain the Mel-frequency characteristic value of the first voice variable.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811593971.6A CN109473091B (en) | 2018-12-25 | 2018-12-25 | Voice sample generation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109473091A true CN109473091A (en) | 2019-03-15 |
CN109473091B CN109473091B (en) | 2021-08-10 |
Family
ID=65676987
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811593971.6A Active CN109473091B (en) | 2018-12-25 | 2018-12-25 | Voice sample generation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109473091B (en) |
- 2018-12-25: CN CN201811593971.6A patent granted as CN109473091B (en), status active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180342258A1 (en) * | 2017-05-24 | 2018-11-29 | Modulate, LLC | System and Method for Creating Timbres |
CN107293289A (en) * | 2017-06-13 | 2017-10-24 | 南京医科大学 | A kind of speech production method that confrontation network is generated based on depth convolution |
CN108182936A (en) * | 2018-03-14 | 2018-06-19 | 百度在线网络技术(北京)有限公司 | Voice signal generation method and device |
CN108597496A (en) * | 2018-05-07 | 2018-09-28 | 广州势必可赢网络科技有限公司 | A kind of speech production method and device for fighting network based on production |
CN108899032A (en) * | 2018-06-06 | 2018-11-27 | 平安科技(深圳)有限公司 | Method for recognizing sound-groove, device, computer equipment and storage medium |
CN109036389A (en) * | 2018-08-28 | 2018-12-18 | 出门问问信息科技有限公司 | The generation method and device of a kind of pair of resisting sample |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020232860A1 (en) * | 2019-05-22 | 2020-11-26 | 平安科技(深圳)有限公司 | Speech synthesis method and apparatus, and computer readable storage medium |
WO2021137754A1 (en) * | 2019-12-31 | 2021-07-08 | National University Of Singapore | Feedback-controlled voice conversion |
CN111292766A (en) * | 2020-02-07 | 2020-06-16 | 北京字节跳动网络技术有限公司 | Method, apparatus, electronic device, and medium for generating speech samples |
CN111292766B (en) * | 2020-02-07 | 2023-08-08 | 抖音视界有限公司 | Method, apparatus, electronic device and medium for generating voice samples |
CN111477247A (en) * | 2020-04-01 | 2020-07-31 | 宁波大学 | GAN-based voice countermeasure sample generation method |
CN111477247B (en) * | 2020-04-01 | 2023-08-11 | 宁波大学 | Speech countermeasure sample generation method based on GAN |
CN112216296A (en) * | 2020-09-25 | 2021-01-12 | 脸萌有限公司 | Audio anti-disturbance testing method and device and storage medium |
CN112216296B (en) * | 2020-09-25 | 2023-09-22 | 脸萌有限公司 | Audio countermeasure disturbance testing method, device and storage medium |
CN112201227A (en) * | 2020-09-28 | 2021-01-08 | 海尔优家智能科技(北京)有限公司 | Voice sample generation method and device, storage medium and electronic device |
CN112466298A (en) * | 2020-11-24 | 2021-03-09 | 网易(杭州)网络有限公司 | Voice detection method and device, electronic equipment and storage medium |
CN112466298B (en) * | 2020-11-24 | 2023-08-11 | 杭州网易智企科技有限公司 | Voice detection method, device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN109473091B (en) | 2021-08-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109473091A (en) | A kind of speech samples generation method and device | |
RU2685391C1 (en) | Method, device and system for noise rejection | |
EP2178082B1 (en) | Cyclic signal processing method, cyclic signal conversion method, cyclic signal processing device, and cyclic signal analysis method | |
JP4177755B2 (en) | Utterance feature extraction system | |
CN102054480B (en) | Method for separating monaural overlapping speeches based on fractional Fourier transform (FrFT) | |
CN109767783A (en) | Sound enhancement method, device, equipment and storage medium | |
CN109256127B (en) | Robust voice feature extraction method based on nonlinear power transformation Gamma chirp filter | |
KR20120090086A (en) | Determining an upperband signal from a narrowband signal | |
Kesarkar et al. | Feature extraction for speech recognition | |
US6701291B2 (en) | Automatic speech recognition with psychoacoustically-based feature extraction, using easily-tunable single-shape filters along logarithmic-frequency axis | |
CN103258537A (en) | Method utilizing characteristic combination to identify speech emotions and device thereof | |
JP2013512475A (en) | Complex acoustic resonance speech analysis system | |
JP2002507776A (en) | Signal processing method for analyzing transients in audio signals | |
CN103778914A (en) | Anti-noise voice identification method and device based on signal-to-noise ratio weighing template characteristic matching | |
RU2013119828A (en) | METHOD FOR DETERMINING THE RISK OF DEVELOPMENT OF INDIVIDUAL DISEASES BY ITS VOICE AND HARDWARE AND SOFTWARE COMPLEX FOR IMPLEMENTING THE METHOD | |
JP2006215228A (en) | Speech signal analysis method and device for implementing this analysis method, speech recognition device using this device for analyzing speech signal, program for implementing this analysis method, and recording medium thereof | |
JP4166405B2 (en) | Drive signal analyzer | |
CN112863517A (en) | Speech recognition method based on perceptual spectrum convergence rate | |
CN107993666B (en) | Speech recognition method, speech recognition device, computer equipment and readable storage medium | |
Singh et al. | A comparative study of recognition of speech using improved MFCC algorithms and Rasta filters | |
Zouhir et al. | Speech Signals Parameterization Based on Auditory Filter Modeling | |
Flynn et al. | A comparative study of auditory-based front-ends for robust speech recognition using the Aurora 2 database | |
JP4537821B2 (en) | Audio signal analysis method, audio signal recognition method using the method, audio signal section detection method, apparatus, program and recording medium thereof | |
Singh et al. | A novel algorithm using MFCC and ERB gammatone filters in speech recognition | |
Bhore et al. | Comparison of Formant Estimation Techniques |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||