CN109473091A - Speech sample generation method and device - Google Patents
Speech sample generation method and device
- Publication number
- CN109473091A CN109473091A CN201811593971.6A CN201811593971A CN109473091A CN 109473091 A CN109473091 A CN 109473091A CN 201811593971 A CN201811593971 A CN 201811593971A CN 109473091 A CN109473091 A CN 109473091A
- Authority
- CN
- China
- Prior art keywords
- voice
- variable
- voice variable
- mel
- characteristic value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0635—Training updating or merging of old and new templates; Mean values; Weighting
- G10L2015/0636—Threshold criteria for the updating
Abstract
The present invention provides a speech sample generation method and device. The method comprises: after a first voice variable is obtained, extracting the mel-frequency feature values of the first voice variable; using a neural network to compute a loss function between the mel-frequency feature values of the first voice variable and the mel-frequency feature values of a target voice; and optimizing the loss function with an optimization algorithm of the neural network by adjusting the values of the sampled points in the first voice variable, until the value of the optimized loss function is less than a preset threshold, the voice variable whose loss-function value is less than the preset threshold being the target voice sample. The method thus solves the inverse mel transform of the voice variable with a neural network and uses the network to minimize the error between the mel-frequency feature values of the voice variable and those of the target voice; the voice variable obtained when the error falls below the preset threshold serves as an adversarial sample, thereby enriching the speech sample set of a speech recognition system.
Description
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a speech sample generation method and device.
Background art
In existing speech recognition systems based on deep learning models, the robustness of the system is often insufficient because the corpus is not comprehensive enough and the speech sample set is scarce, which makes the system vulnerable to interference from adversarial samples.
Summary of the invention
The present invention provides a speech sample generation method and device to solve the problem of a scarce speech sample set in speech recognition systems.
To achieve the above goal, the technical solutions provided by the embodiments of the present invention are as follows.
In a first aspect, an embodiment of the present invention provides a speech sample generation method, comprising: after a first voice variable is obtained, extracting the mel-frequency feature values of the first voice variable, wherein the characteristic parameters of the first voice variable are identical to those of a target voice, the characteristic parameters including length, sample rate, and number of channels; using a neural network to compute a loss function between the mel-frequency feature values of the first voice variable and the mel-frequency feature values of the target voice; and optimizing the loss function with an optimization algorithm of the neural network by adjusting the values of the sampled points in the first voice variable until the value of the optimized loss function is less than a preset threshold, the voice variable whose loss-function value is less than the preset threshold being the target voice sample. The method thus solves the inverse mel transform of the voice variable with a neural network and uses the network to minimize the error between the mel-frequency feature values of the voice variable and those of the target voice; the voice variable obtained when the error falls below the preset threshold serves as an adversarial sample, thereby enriching the speech sample set of a speech recognition system.
In an optional embodiment of the present invention, extracting the mel-frequency feature values of the first voice variable comprises: applying a Fourier transform to each frame of the first voice variable to obtain a second voice variable; applying mel filtering to the second voice variable to obtain a third voice variable; and applying a discrete cosine transform to the third voice variable to obtain a mel-scale cepstrum, which is used as the mel-frequency feature values of the first voice variable. The extraction process can therefore be: Fourier transform, mel filtering, then discrete cosine transform; using the resulting mel-scale cepstrum as the mel-frequency feature values gives the voice variable a better representation.
In an optional embodiment of the present invention, after the discrete cosine transform is applied to the third voice variable to obtain the mel-scale cepstrum, the method further comprises: performing a difference operation on the mel-scale cepstrum; and using the mel-scale cepstrum as the mel-frequency feature values of the first voice variable comprises: inserting the result of the difference operation into the mel-scale cepstrum to obtain the mel-frequency feature values of the first voice variable. The differences between the mel-scale cepstra of adjacent frames, which represent the inter-frame dynamic information of the voice variable, are thus appended to the mel-scale cepstrum and together serve as the mel-frequency feature values, so that a speech recognition system trained on this voice variable has a wider range of application.
In an optional embodiment of the present invention, before the Fourier transform is applied to each frame of the first voice variable to obtain the second voice variable, the method further comprises: applying high-pass filtering to the first voice variable, dividing the filtered first voice variable into consecutive frames, and applying a window to each frame. Before the mel-frequency feature values are computed, the voice variable can therefore first undergo pre-processing such as filtering, framing, and windowing, which makes the processed voice variable better suited to computing the mel-frequency feature values.
In an optional embodiment of the present invention, obtaining the first voice variable comprises: generating a speech segment; and formatting the speech segment to obtain the first voice variable, so that the characteristic parameters of the first voice variable are identical to those of the target voice. The voice variable can therefore be a randomly generated speech segment whose characteristic parameters, such as length, sample rate, and number of channels, match those of the target voice, which ensures that the voice variable obtained by the final optimization can serve as a sample for the speech recognition system.
In an optional embodiment of the present invention, before the mel-frequency feature values of the first voice variable are extracted, the method further comprises: obtaining the target voice; and extracting the mel-frequency feature values of the target voice. Before the voice variable is processed, a target voice segment can therefore be obtained first; this target voice is the optimization target of the voice variable.
In an optional embodiment of the present invention, after the voice variable whose loss-function value is less than the preset threshold is taken as the target voice sample, the method further comprises: training a speech recognition system with the neural network, using the target voice sample as a training sample. Once a voice variable meeting the criterion has been obtained with the neural network, it can therefore be used as a training sample to train the speech recognition system and improve its robustness.
In a second aspect, an embodiment of the present invention provides a speech sample generation device, comprising: a first extraction module configured to extract, after a first voice variable is obtained, the mel-frequency feature values of the first voice variable, wherein the characteristic parameters of the first voice variable are identical to those of a target voice, the characteristic parameters including length, sample rate, and number of channels; a first computation module configured to use a neural network to compute a loss function between the mel-frequency feature values of the first voice variable and the mel-frequency feature values of the target voice; and an optimization module configured to optimize the loss function with an optimization algorithm of the neural network by adjusting the values of the sampled points in the first voice variable until the value of the optimized loss function is less than a preset threshold, the voice variable whose loss-function value is less than the preset threshold being the target voice sample. The first extraction module thus solves the inverse mel transform of the voice variable with the neural network, and the optimization module uses the network to minimize the error between the mel-frequency feature values of the voice variable and those of the target voice; the voice variable obtained when the error falls below the preset threshold serves as an adversarial sample, thereby enriching the speech sample set of a speech recognition system.
In an optional embodiment of the present invention, the first extraction module comprises: a first transform module configured to apply a Fourier transform to each frame of the first voice variable to obtain a second voice variable; a first filter module configured to apply mel filtering to the second voice variable to obtain a third voice variable; and a second transform module configured to apply a discrete cosine transform to the third voice variable to obtain a mel-scale cepstrum and to use the mel-scale cepstrum as the mel-frequency feature values of the first voice variable. The extraction process of the first extraction module can therefore be: a Fourier transform in the first transform module, mel filtering in the first filter module, and a discrete cosine transform in the second transform module; using the resulting mel-scale cepstrum as the mel-frequency feature values gives the voice variable a better representation.
In an optional embodiment of the present invention, the device further comprises a second computation module configured to perform a difference operation on the mel-scale cepstrum, and the second transform module comprises an insertion module configured to insert the result of the difference operation into the mel-scale cepstrum to obtain the mel-frequency feature values of the first voice variable. The differences between the mel-scale cepstra of adjacent frames, computed by the second computation module and representing the inter-frame dynamic information of the voice variable, are thus appended to the mel-scale cepstrum by the insertion module and together serve as the mel-frequency feature values, so that a speech recognition system trained on this voice variable has a wider range of application.
In an optional embodiment of the present invention, the device further comprises a third filter module configured to apply high-pass filtering to the first voice variable, divide the filtered first voice variable into consecutive frames, and apply a window to each frame. Before the first extraction module computes the mel-frequency feature values, the third filter module can therefore pre-process the voice variable by filtering, framing, and windowing, which makes the processed voice variable better suited to computing the mel-frequency feature values.
In an optional embodiment of the present invention, the first extraction module comprises: a generation module configured to generate a speech segment; and a formatting module configured to format the speech segment to obtain the first voice variable, so that the characteristic parameters of the first voice variable are identical to those of the target voice. The voice variable can therefore be a speech segment randomly generated by the generation module whose characteristic parameters, such as length, sample rate, and number of channels, match those of the target voice, which ensures that the voice variable obtained by the final optimization can serve as a sample for the speech recognition system.
In an optional embodiment of the present invention, the device further comprises: an obtaining module configured to obtain the target voice; and a second extraction module configured to extract the mel-frequency feature values of the target voice. Before the voice variable is processed, the obtaining module can therefore obtain a target voice segment first; this target voice is the optimization target of the voice variable.
In an optional embodiment of the present invention, the device further comprises a training module configured to train a speech recognition system with the neural network, using the target voice sample as a training sample. Once a voice variable meeting the criterion has been obtained with the neural network, the training module can therefore use it as a training sample to train the speech recognition system and improve its robustness.
In a third aspect, an embodiment of the present invention provides an electronic device comprising a processor, a memory, and a bus. The memory stores machine-readable instructions executable by the processor. When the electronic device runs, the processor and the memory communicate over the bus, and when the machine-readable instructions are executed by the processor, the method of any implementation of the first aspect is performed.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when run by a processor, performs the method of any optional implementation of the first aspect.
To make the above objects, features, and advantages of the present invention clearer and easier to understand, embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed for the embodiments are briefly described below. It should be understood that the following drawings illustrate only certain embodiments of the present invention and should therefore not be regarded as limiting its scope; those of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.
Fig. 1 is a kind of flow chart of speech samples generation method provided in an embodiment of the present invention;
Fig. 2 is the flow chart of another speech samples generation method provided in an embodiment of the present invention;
Fig. 3 is the flow chart of another speech samples generation method provided in an embodiment of the present invention;
Fig. 4 is the flow chart of another speech samples generation method provided in an embodiment of the present invention;
Fig. 5 is the flow chart of another speech samples generation method provided in an embodiment of the present invention;
Fig. 6 is a kind of structural block diagram of speech samples generating means provided in an embodiment of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. The components of the embodiments, as generally described and illustrated in the drawings herein, can be arranged and designed in a variety of different configurations. The detailed description of the embodiments provided with the drawings below is therefore not intended to limit the claimed scope of the present invention but merely represents selected embodiments; all other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative work fall within its protection scope.
It should also be noted that similar reference numbers and letters denote similar items in the following drawings; once an item has been defined in one drawing, it does not need to be further defined or explained in subsequent drawings.
In the description of the present invention, it should be noted that orientation or position terms such as "middle", "upper", "lower", "horizontal", "inner", and "outer" are based on the orientations or positions shown in the drawings, or on the orientations in which the product of the invention is usually placed in use. They are used only to simplify the description and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; they should therefore not be understood as limiting the present invention. In addition, terms such as "first" and "second" are used only to distinguish descriptions and should not be understood as indicating or implying relative importance.
In addition, terms such as "horizontal" and "vertical" do not require components to be absolutely horizontal or vertical; they may be slightly inclined. "Horizontal" merely means that a direction is more nearly horizontal than "vertical"; it does not mean the structure must be completely horizontal.
In the description of the present invention, it should also be noted that, unless otherwise expressly specified and limited, the terms "arranged", "connected to", and "connected" are to be understood broadly: a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediary, or internal between two elements. For those of ordinary skill in the art, the specific meanings of the above terms in the present invention can be understood according to the specific situation.
Some embodiments of the present invention are described in detail below with reference to the drawings. Provided there is no conflict, the following embodiments and the features in them can be combined with one another.
First embodiment
An embodiment of the present invention provides a speech sample generation method. Referring to Fig. 1, which is a flow chart of a speech sample generation method provided by an embodiment of the present invention, the method comprises the following steps.
Step S100: after the first voice variable is obtained, extract the mel-frequency feature values of the first voice variable.
Specifically, in the field of acoustic processing, the mel-frequency cepstrum is a linear transform of the log-energy spectrum on a nonlinear mel scale of sound frequency. Compared with the linearly spaced frequency bands of the normal cepstrum, it better approximates the human auditory system, so this nonlinear representation gives the speech signal a better representation in many fields. Extracting the mel-frequency cepstral coefficient (MFCC) feature values of a speech segment is a conventional means for those skilled in the art and can be done in many ways; the embodiments of the present invention impose no specific limitation.
For example, pre-emphasis, framing, and windowing are first applied to the speech; then, for each short-time analysis window, the corresponding spectrum is obtained by a fast Fourier transform (FFT). The spectrum obtained above is then passed through a mel filter bank to obtain the mel spectrum, and finally cepstral analysis is performed on the mel spectrum to obtain the mel-frequency cepstral coefficients, which are the features of that speech frame. The cepstral analysis may include taking the logarithm and applying an inverse transform; in practice the inverse transform is generally realized by a discrete cosine transform (DCT).
It should be noted that, in the embodiments of the invention, the first voice variable on which mel-frequency feature extraction is performed can be a randomly generated speech segment, noise, silence, or any speech. After such a segment is obtained, however, it needs to be formatted so that its characteristic parameters are identical to those of a target voice segment, where the characteristic parameters may include length, sample rate, and number of channels. The target voice is the reference object of the voice variable, which ultimately needs to be as close as possible to the target voice. Besides using formatting to guarantee that the characteristic parameters of the first voice variable match those of the target voice, a speech segment whose characteristic parameters already match those of the target voice can be generated directly, which simplifies the formatting of the segment.
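Obtaining such a first voice variable can be sketched as follows — a minimal NumPy illustration in which the function name, the uniform-noise initialization, and the one-second 16 kHz mono target are assumptions of this sketch, not values fixed by the invention; the length and channel layout are matched implicitly by reusing the target's array shape:

```python
import numpy as np

def make_initial_variable(target: np.ndarray, seed: int = 0) -> np.ndarray:
    """Generate a random first voice variable whose characteristic
    parameters (length and channel layout, taken from the target's
    shape) match the target voice.  Uniform noise is one admissible
    starting point; silence or any other speech segment would also do."""
    rng = np.random.default_rng(seed)
    return rng.uniform(-1.0, 1.0, size=target.shape)

# Usage: a mono target of one second at 16 kHz.
target = np.zeros(16000)
first_variable = make_initial_variable(target)
```

Starting from noise rather than silence tends to give the later optimization more varied sampled-point values to adjust, though the method itself does not require it.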
Step S200: use a neural network to compute the loss function between the mel-frequency feature values of the first voice variable and the mel-frequency feature values of the target voice.
Specifically, after the mel-frequency feature values of the first voice variable have been computed in step S100, a loss function can be used to express the error between the mel-frequency feature values of the first voice variable and those of the target voice, in order to judge the error between the first voice variable and the target voice at this point. The process can be realized by a neural network, using a log loss function, a quadratic loss function, a penalty loss function, or another loss function in the neural network to obtain this error. It should be noted that solving for the error with a loss function in a neural network is a conventional means for those skilled in the art and can be done in many ways; the embodiments of the present invention impose no specific limitation.
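Taking the quadratic loss mentioned above as a concrete choice, the error between the two sets of mel-frequency feature values can be sketched as follows; the function name and the (frames × coefficients) matrix shapes are illustrative assumptions, and in a neural-network framework the same expression would be built from differentiable operations:

```python
import numpy as np

def mfcc_loss(mfcc_variable: np.ndarray, mfcc_target: np.ndarray) -> float:
    """Quadratic loss: mean squared error between the mel-frequency
    feature matrix of the first voice variable and that of the target."""
    return float(np.mean((mfcc_variable - mfcc_target) ** 2))

# Usage: two (frames x coefficients) MFCC matrices.
a = np.ones((10, 13))
b = np.zeros((10, 13))
print(mfcc_loss(a, b))  # 1.0
```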
Step S300: optimize the loss function with the optimization algorithm in the neural network by adjusting the values of the sampled points in the first voice variable, until the value of the optimized loss function is less than the preset threshold; the voice variable whose loss-function value is less than the preset threshold is the target voice sample.
Specifically, after the loss function in the neural network has expressed the error between the mel-frequency feature values of the first voice variable and those of the target voice in step S200, the optimization algorithm in the neural network can be used to optimize this loss function. Each time an error value is computed, it is compared with the preset threshold. When the error value is greater than the preset threshold, the values of several sampled points in the voice variable are changed, the mel-frequency feature values of the new voice variable are computed, the loss function between those feature values and the mel-frequency feature values of the target voice is computed, and the optimized error value is compared with the preset threshold again; as long as the error value remains greater than the preset threshold, this process is repeated. When the error value is less than the preset threshold, the optimization ends and the voice variable corresponding to the current error value is output; this voice variable is the satisfactory voice sample closest to the target voice in the embodiment of the present invention. The preset threshold is a fixed value set in advance by the user according to the actual situation: during the optimization of the loss function by the neural network, the loss function is considered to have reached its minimum once the optimization result is less than this fixed value, which prevents the optimization process from becoming excessively long without ever producing a satisfactory voice variable. It should be noted that using a neural network to optimize a loss function so as to obtain an optimal solution that makes the loss function sufficiently small is a conventional means for those skilled in the art; the loss function can be optimized in many ways, such as gradient descent or the least-squares method, and the embodiments of the present invention impose no specific limitation.
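The iterate-compare-adjust loop described above can be sketched as follows under two simplifying assumptions: the feature extractor is passed in as a generic function, and the gradient over the sampled points is estimated by finite differences (a neural-network framework would instead back-propagate through the mel-frequency pipeline). All names and defaults are choices of this sketch:

```python
import numpy as np

def optimize_variable(x0, target_features, feature_fn,
                      lr=0.5, threshold=1e-4, max_iters=500):
    """Sketch of step S300: repeatedly adjust the sampled points of the
    voice variable until the loss between its features and the target's
    features falls below the preset threshold."""
    x = np.asarray(x0, dtype=float).copy()
    eps = 1e-3
    for _ in range(max_iters):
        loss = np.mean((feature_fn(x) - target_features) ** 2)
        if loss < threshold:          # optimization criterion met
            break
        grad = np.zeros_like(x)       # finite-difference gradient estimate
        for i in range(x.size):
            x_eps = x.copy()
            x_eps.flat[i] += eps
            grad.flat[i] = (np.mean((feature_fn(x_eps) - target_features) ** 2)
                            - loss) / eps
        x -= lr * grad                # adjust the sampled points
    return x

# Usage with a trivial identity "feature extractor".
target = np.array([0.2, -0.1, 0.4])
result = optimize_variable(np.zeros(3), target, lambda z: z)
```

The `max_iters` cap plays the same role as the preset threshold discussed above: it keeps the optimization from running indefinitely when no satisfactory voice variable is reachable.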
In the embodiments of the present invention, the inverse mel transform of the voice variable is thus solved with a neural network, and the network optimizes the error between the mel-frequency feature values of the voice variable and those of the target voice so as to obtain the voice variable at which the error is less than the preset threshold; this voice variable is taken as an adversarial sample, thereby enriching the speech sample set of the speech recognition system.
Further, referring to Fig. 2, which is a flow chart of another speech sample generation method provided by an embodiment of the present invention, step S100 comprises the following steps.
Step S110: apply a Fourier transform to each frame of the first voice variable to obtain the second voice variable.
Specifically, while solving for the mel-frequency feature values of the first voice variable, a Fourier transform can first be applied to the first voice variable: a short-time Fourier transform is applied to each frame of the first voice variable and the power spectrum of each frame is computed to obtain the second voice variable, so that the information of the first voice variable is transformed from the time domain to the frequency domain.
Step S120: apply mel filtering to the second voice variable to obtain the third voice variable.
Specifically, the second voice variable obtained by the Fourier transform is passed through a mel filter for mel filtering. In the embodiment of the present invention, n triangular band-pass filters, which can be equally spaced on the mel-frequency scale, can be used as the mel filter. Multiplying the second voice variable signal by these n triangular band-pass filters yields the log energy output by each filter, which constitutes the third voice variable.
Step S130: apply a discrete cosine transform to the third voice variable to obtain the mel-scale cepstrum, and use the mel-scale cepstrum as the mel-frequency feature values of the first voice variable.
Specifically, a discrete cosine transform is applied to the third voice variable, i.e., the n log energies, to obtain the mel-scale cepstrum of order L; this cepstrum constitutes the mel-frequency feature values of the first voice variable.
In the embodiments of the present invention, the process of extracting the mel-frequency feature values of the voice variable can therefore be: Fourier transform, mel filtering, then discrete cosine transform; using the resulting mel-scale cepstrum as the mel-frequency feature values gives the voice variable a better representation.
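Steps S110 to S130 can be sketched end to end as follows — a simplified NumPy illustration that assumes the input has already been framed, and in which the filter count n = 26, the cepstrum order L = 13, and the unnormalized DCT-II basis are conventional assumptions of this sketch rather than values fixed by the invention:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """n triangular band-pass filters, equally spaced on the mel scale."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):        # rising edge of the triangle
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):       # falling edge of the triangle
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc_from_frames(frames, sr, n_filters=26, n_ceps=13):
    """Steps S110-S130: per-frame FFT power spectrum (second voice
    variable), mel filtering and log energies (third voice variable),
    then a DCT-II giving the order-L mel-scale cepstrum."""
    n_fft = frames.shape[1]
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    fb = mel_filterbank(n_filters, n_fft, sr)
    energies = np.log(power @ fb.T + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2.0 * n_filters)))
    return energies @ dct.T

# Usage: four random 256-sample frames at an assumed 8 kHz sample rate.
frames = np.random.default_rng(1).standard_normal((4, 256))
cepstrum = mfcc_from_frames(frames, sr=8000)
```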
Further, referring to Fig. 3, which is a flow chart of another speech sample generation method provided by an embodiment of the present invention, steps S110 to S130 can be replaced with the following steps.
Step S110: apply a Fourier transform to each frame of the first voice variable to obtain the second voice variable.
Step S120: apply mel filtering to the second voice variable to obtain the third voice variable.
Step S131: apply a discrete cosine transform to the third voice variable to obtain the mel-scale cepstrum.
Step S140: A difference operation is performed on the Mel-scale cepstrum.
Specifically, after the Mel-scale cepstrum of the first voice variable has been obtained by the discrete cosine transform in step S131, a discrete difference operation may be performed on the Mel-scale cepstrum: a discrete first-order difference may be calculated, a discrete second-order difference may be calculated, or both may be calculated, yielding the differenced values.
Step S150: The result of the difference operation is inserted into the Mel-scale cepstrum to obtain the Mel-frequency characteristic value of the first voice variable.
Specifically, the values obtained by the discrete difference operation in step S140 are inserted into the Mel-scale cepstrum of the first voice variable, as dynamic information between frames of the first voice variable, to obtain the Mel-frequency characteristic value of the first voice variable. It should be noted that only the values obtained by the discrete first-order difference may be inserted, only the values obtained by the discrete second-order difference may be inserted, or both may be inserted at the same time.
In the embodiment of the present invention, the differences between the Mel-scale cepstra of adjacent frames of the voice variable are extracted as parameters representing the dynamic information between frames of the voice variable and are appended to the Mel-scale cepstrum, so that together they form the Mel-frequency characteristic value of the voice variable; a speech recognition system trained with such voice variables therefore has a wider range of application.
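The first- and second-order difference calculations, and their insertion alongside the static cepstrum as inter-frame dynamic information, might look like this (the edge padding and the simple two-point central difference are assumptions of the sketch):

```python
import numpy as np

def delta(cepstra):
    """First-order difference along the frame axis (frames x coefficients)."""
    padded = np.pad(cepstra, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

# 100 frames of 13 static Mel-scale cepstral coefficients.
mfcc = np.random.randn(100, 13)
d1 = delta(mfcc)   # discrete first-order difference
d2 = delta(d1)     # discrete second-order difference

# Append either or both differences to the static cepstrum.
features = np.concatenate([mfcc, d1, d2], axis=1)
print(features.shape)  # (100, 39)
```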
Further, the following step is included before step S110:
Step S160: High-pass filtering is performed on the first voice variable, the filtered first voice variable is divided into consecutive frames, and windowing is performed on each frame.
Specifically, before the Mel-frequency characteristic value of the first voice variable is computed, a series of preprocessing operations may first be applied to it. First, pre-emphasis is applied to the first voice variable, i.e. the first voice signal is passed through a high-pass filter to eliminate the effect of the vocal cords and lips on the voice signal during its generation, thereby compensating the high-frequency part of the first voice signal that is suppressed by the articulatory system.
Second, framing is applied to the pre-emphasized first voice signal, i.e. the continuous first voice signal is divided into multiple consecutive frames. The length of each frame may be kept within the range of 20-50 milliseconds, and the corresponding number of sampling points equals the product of the sampling rate of the first voice signal and the frame length.
Finally, in order to keep the two endpoints of each frame smooth and continuous, windowing may be applied to every frame of the framed first voice signal. This is because, when the Fourier transform is applied to the first voice signal in a subsequent step, the signal within a frame is assumed to represent a periodic signal; if this periodicity does not hold, the Fourier transform has to accommodate the discontinuity between the left and right ends of the frame and produces energy distributions that do not exist in the original signal, causing errors in the analysis. In the embodiment of the present invention, each frame of the framed first voice signal is multiplied by a Hamming window of the same length, so as to keep the left and right ends of the frame continuous.
It should be noted that the above specific ways of processing the first voice signal and its data are only some of the schemes provided by the embodiments of the present invention; on this basis, those skilled in the art can easily conceive of other signal-processing approaches, which also fall within the protection scope of the embodiments of the present invention.
In the embodiment of the present invention, before the Mel-frequency characteristic value of the voice variable is computed, the voice variable may first undergo preprocessing such as pre-emphasis filtering, framing and windowing, so that the processed voice variable is better suited to computing the Mel-frequency characteristic value.
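The pre-emphasis, framing and Hamming-window steps above can be sketched as follows; the 25-millisecond frame length and pre-emphasis coefficient 0.97 are illustrative choices within the ranges the text describes:

```python
import numpy as np

def preprocess(signal, sr, frame_ms=25, alpha=0.97):
    # Pre-emphasis: first-order high-pass filter y[t] = x[t] - alpha * x[t-1].
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing: samples per frame = sample rate * frame length in seconds.
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(emphasized) // frame_len
    frames = emphasized[: n_frames * frame_len].reshape(n_frames, frame_len)
    # Windowing: multiply every frame by a Hamming window of the same length.
    return frames * np.hamming(frame_len)

sr = 16000
signal = np.random.randn(sr)   # one second of audio
frames = preprocess(signal, sr)
print(frames.shape)  # (40, 400)
```

A production implementation would typically also overlap adjacent frames; the sketch keeps them disjoint for brevity.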
Further, referring to Fig. 4, which is a flow chart of another speech sample generation method provided by an embodiment of the present invention, step S100 further includes the following steps:
Step S170: A speech fragment is generated.
Specifically, the generation of the speech fragment is a random process; it may be obtained by recording a segment of audio, downloading a segment of speech, and so on, and the generated speech fragment may be a segment of noise, silence, or any speech.
Step S180: The speech fragment is formatted to obtain the first voice variable, so that the characteristic parameters of the first voice variable are identical to the characteristic parameters of the target voice.
Specifically, after the speech fragment is generated in step S170, it may be formatted so that the characteristic parameters of the formatted first voice variable are identical to those of a segment of target voice; these characteristic parameters may include length, sample rate, channels and the like. Here, the target voice is the original speech fragment used as a sample for training in the neural network.
In the embodiment of the present invention, the voice variable may be a randomly generated segment of speech whose characteristic parameters, such as length, sample rate and channels, should be identical to those of the target voice; this ensures that the voice variable obtained by the final optimization can serve as a sample for the speech recognition system.
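The formatting step might be sketched as follows for the length parameter; handling sample rate and channels analogously would require a real resampler and channel mixer, which this sketch deliberately omits:

```python
import numpy as np

def format_to_target(snippet, target_len):
    """Format a randomly generated fragment so its length matches the target.
    A sketch: truncate or zero-pad to the target length."""
    if len(snippet) >= target_len:
        return snippet[:target_len]
    return np.pad(snippet, (0, target_len - len(snippet)))

# Random fragment (could equally be recorded noise, silence or any speech).
snippet = np.random.uniform(-1.0, 1.0, size=12000)
target_len = 32000   # e.g. two seconds of 16 kHz target speech
first_voice_variable = format_to_target(snippet, target_len)
print(first_voice_variable.shape)  # (32000,)
```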
Further, referring to Fig. 5, which is a flow chart of another speech sample generation method provided by an embodiment of the present invention, the following steps are included before step S100:
Step S400: The target voice is obtained.
Step S500: The Mel-frequency characteristic value of the target voice is extracted.
Specifically, the target voice is the original speech fragment used as a sample for training in the neural network. The Mel-frequency characteristic value of the target voice may be extracted in the same way as in step S100, so that the optimization algorithm of the neural network can make the first voice variable approach the target voice as closely as possible; this guarantees that the first voice variable obtained by the final optimization can be used as a sample for training the neural network, thereby increasing the robustness of the speech recognition system.
In the embodiment of the present invention, before the voice variable is processed, a segment of target voice may first be obtained; this target voice is the target of the optimization of the voice variable.
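The optimization of the first voice variable toward the target voice can be illustrated with a deliberately simplified stand-in: a fixed linear map plays the role of the differentiable Mel-frequency feature extractor, and plain gradient descent on the sampling points plays the role of the network's optimization algorithm. The shapes, learning rate and threshold are all assumptions of the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((32, 256))   # stand-in linear "feature extractor"
target = rng.standard_normal(256)    # sampling points of the target voice
target_feat = A @ target             # its feature values

x = rng.uniform(-1.0, 1.0, 256)      # first voice variable, randomly initialised
lr, threshold = 5e-3, 1e-4
for step in range(2000):
    residual = A @ x - target_feat
    loss = np.mean(residual ** 2)    # loss between the two feature values
    if loss < threshold:             # stop once below the preset threshold
        break
    # Gradient of the loss with respect to the sampling points of x.
    x -= lr * (2.0 / len(residual)) * (A.T @ residual)
print(loss < threshold)
```

The variable `x` whose loss falls below the threshold is then kept as the generated sample, mirroring steps S200/S300 of the method.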
Further, the following step is included after step S300:
Step S600: The target voice sample is used as a sample to train the speech recognition system with the neural network.
Specifically, once a suitable first voice variable has been obtained, the first voice variable is a new adversarial sample for the speech recognition system. The first voice variable may be saved as speech that has the same length, sample rate, channels and other features as the target voice, with the amplitude of the speech waveform generally taken within the normal range, i.e. between -2^15 and 2^15 - 1. This speech is then added to the original speech recognition system for adversarial training, so as to enhance the robustness of the existing speech recognition system.
In the embodiment of the present invention, after a voice variable meeting the criterion has been obtained with the neural network, it can be used as a training sample to train the speech recognition system, so as to improve the robustness of that speech system.
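Saving the optimized variable within the normal amplitude range described above (between -2^15 and 2^15 - 1, i.e. 16-bit PCM) might look like this:

```python
import numpy as np

def to_pcm16(x):
    """Clip the optimized waveform to the normal amplitude range
    [-2**15, 2**15 - 1] and store it as 16-bit integer samples."""
    clipped = np.clip(np.round(x), -2**15, 2**15 - 1)
    return clipped.astype(np.int16)

waveform = np.random.randn(16000) * 40000.0   # some samples out of range
pcm = to_pcm16(waveform)
print(pcm.dtype)  # int16
```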
Second embodiment
An embodiment of the present invention provides a speech sample generation apparatus 600. Referring to Fig. 6, which is a structural block diagram of a speech sample generation apparatus provided by an embodiment of the present invention, the apparatus includes: a first extraction module 610, configured to extract the Mel-frequency characteristic value of a first voice variable after the first voice variable is obtained, wherein the characteristic parameters of the first voice variable are identical to those of a target voice, and the characteristic parameters include length, sample rate and channels; a first computing module 620, configured to use a neural network to compute a loss function between the Mel-frequency characteristic value of the first voice variable and the Mel-frequency characteristic value of the target voice; and an optimization module 630, configured to use the optimization algorithm in the neural network to optimize the loss function by adjusting the values of the sampling points in the first voice variable, until the value of the optimized loss function is less than a preset threshold; the voice variable for which the value of the loss function is less than the preset threshold is the target voice sample.
In the embodiment of the present invention, the first extraction module 610 solves the inverse Mel transform of the voice variable based on the neural network, and the optimization module 630 uses the neural network to optimize the error between the Mel-frequency characteristic value of the voice variable and that of the target voice, so as to obtain the voice variable whose error is less than the preset threshold; that voice variable is then taken as an adversarial sample, thereby enriching the speech sample set of the speech recognition system.
Further, the first extraction module 610 includes: a first transform module, configured to perform a Fourier transform on each frame of the first voice variable to obtain a second voice variable; a first filtering module, configured to perform Mel filtering on the second voice variable to obtain a third voice variable; and a second transform module, configured to perform a discrete cosine transform on the third voice variable to obtain a Mel-scale cepstrum, and to use the Mel-scale cepstrum as the Mel-frequency characteristic value of the first voice variable.
In the embodiment of the present invention, the process by which the first extraction module 610 extracts the Mel-frequency characteristic value of the voice variable may be: performing the Fourier transform with the first transform module, performing Mel filtering with the first filtering module, and performing the discrete cosine transform with the second transform module, so that the resulting Mel-scale cepstrum serves as the Mel-frequency characteristic value of the voice variable, giving the voice variable a better representation.
Further, the apparatus also includes a second computing module, configured to perform a difference operation on the Mel-scale cepstrum; the second transform module includes an insertion module, configured to insert the result of the difference operation into the Mel-scale cepstrum to obtain the Mel-frequency characteristic value of the first voice variable.
In the embodiment of the present invention, the differences between the Mel-scale cepstra of adjacent frames of the voice variable, computed by the second computing module, are extracted as parameters representing the dynamic information between frames of the voice variable and are appended to the Mel-scale cepstrum by the insertion module, so that together they form the Mel-frequency characteristic value of the voice variable; a speech recognition system trained with such voice variables therefore has a wider range of application.
Further, the apparatus also includes a third filtering module, configured to perform high-pass filtering on the first voice variable, divide the filtered first voice variable into consecutive frames, and perform windowing on each frame.
In the embodiment of the present invention, before the first extraction module 610 computes the Mel-frequency characteristic value of the voice variable, the third filtering module may first apply preprocessing such as pre-emphasis filtering, framing and windowing to the voice variable, so that the processed voice variable is better suited to computing the Mel-frequency characteristic value.
Further, the first extraction module 610 includes: a generation module, configured to generate a speech fragment; and a formatting module, configured to format the speech fragment to obtain the first voice variable, so that the characteristic parameters of the first voice variable are identical to the characteristic parameters of the target voice.
In the embodiment of the present invention, the voice variable may be a segment of speech randomly generated by the generation module, whose characteristic parameters, such as length, sample rate and channels, should be identical to those of the target voice, so as to ensure that the voice variable obtained by the final optimization can serve as a sample for the speech recognition system.
Further, the apparatus also includes: an obtaining module, configured to obtain the target voice; and a second extraction module, configured to extract the Mel-frequency characteristic value of the target voice.
In the embodiment of the present invention, before the voice variable is processed, a segment of target voice may first be obtained with the obtaining module; this target voice is the target of the optimization of the voice variable.
Further, the apparatus also includes a training module, configured to use the target voice sample as a sample to train the speech recognition system with the neural network.
In the embodiment of the present invention, after a voice variable meeting the criterion has been obtained with the neural network, the training module can use it as a training sample to train the speech recognition system, so as to improve the robustness of that speech system.
3rd embodiment
An embodiment of the present invention provides an electronic device, comprising a processor, a memory and a bus. The memory stores machine-readable instructions executable by the processor. When the electronic device runs, the processor and the memory communicate over the bus, and when the machine-readable instructions are executed by the processor, any of the methods described in the first embodiment is performed.
The memory may include, but is not limited to, random access memory (Random Access Memory, RAM), read-only memory (Read Only Memory, ROM), programmable read-only memory (Programmable Read-Only Memory, PROM), erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), and the like.
The processor may be an integrated circuit chip with signal processing capability. The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and can implement or execute the methods, steps and logic diagrams disclosed in the embodiments of the present invention. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, and so on.
Fourth embodiment
An embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, any of the methods of any optional implementation of the first embodiment is performed.
In summary, the present invention provides a speech sample generation method and apparatus. The method comprises: after a first voice variable is obtained, extracting the Mel-frequency characteristic value of the first voice variable, wherein the characteristic parameters of the first voice variable are identical to those of a target voice, and the characteristic parameters include length, sample rate and channels; using a neural network to compute a loss function between the Mel-frequency characteristic value of the first voice variable and the Mel-frequency characteristic value of the target voice; and using the optimization algorithm in the neural network to optimize the loss function by adjusting the values of the sampling points in the first voice variable, until the value of the optimized loss function is less than a preset threshold, the voice variable for which the value of the loss function is less than the preset threshold being the target voice sample. Thus, the inverse Mel transform of the voice variable is solved based on the neural network, and the neural network is used to optimize the error between the Mel-frequency characteristic value of the voice variable and that of the target voice, so as to obtain the voice variable whose error is less than the preset threshold; that voice variable is then taken as an adversarial sample, thereby enriching the speech sample set of the speech recognition system.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the present invention; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.
The above is only a specific implementation, but the protection scope of the present invention is not limited thereto; any person familiar with the art can, within the technical scope disclosed by the present invention, easily think of changes or replacements, which shall all be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or also includes elements inherent to such a process, method, article or device. In the absence of further restrictions, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.
Claims (10)
1. A speech sample generation method, characterized by comprising:
after obtaining a first voice variable, extracting a Mel-frequency characteristic value of the first voice variable, wherein characteristic parameters of the first voice variable are identical to characteristic parameters of a target voice, and the characteristic parameters include: length, sample rate and channels;
using a neural network to compute a loss function between the Mel-frequency characteristic value of the first voice variable and a Mel-frequency characteristic value of the target voice;
using an optimization algorithm in the neural network to optimize the loss function by adjusting values of sampling points in the first voice variable, until a value of the optimized loss function is less than a preset threshold, the voice variable for which the value of the loss function is less than the preset threshold being a target voice sample.
2. The speech sample generation method according to claim 1, characterized in that extracting the Mel-frequency characteristic value of the first voice variable comprises:
performing a Fourier transform on each frame of the first voice variable to obtain a second voice variable;
performing Mel filtering on the second voice variable to obtain a third voice variable;
performing a discrete cosine transform on the third voice variable to obtain a Mel-scale cepstrum, and using the Mel-scale cepstrum as the Mel-frequency characteristic value of the first voice variable.
3. The speech sample generation method according to claim 2, characterized in that, after performing the discrete cosine transform on the third voice variable to obtain the Mel-scale cepstrum, the method further comprises:
performing a difference operation on the Mel-scale cepstrum;
and using the Mel-scale cepstrum as the Mel-frequency characteristic value of the first voice variable comprises:
inserting a result of the difference operation into the Mel-scale cepstrum to obtain the Mel-frequency characteristic value of the first voice variable.
4. The speech sample generation method according to claim 2, characterized in that, before performing the Fourier transform on each frame of the first voice variable to obtain the second voice variable, the method further comprises:
performing high-pass filtering on the first voice variable, dividing the filtered first voice variable into consecutive frames, and performing windowing on each frame.
5. The speech sample generation method according to claim 1, characterized in that obtaining the first voice variable comprises:
generating a speech fragment;
formatting the speech fragment to obtain the first voice variable, so that the characteristic parameters of the first voice variable are identical to the characteristic parameters of the target voice.
6. The speech sample generation method according to claim 1, characterized in that, before extracting the Mel-frequency characteristic value of the first voice variable, the method further comprises:
obtaining the target voice;
extracting the Mel-frequency characteristic value of the target voice.
7. The speech sample generation method according to any one of claims 1-6, characterized in that, after the voice variable for which the value of the loss function is less than the preset threshold is taken as the target voice sample, the method further comprises:
using the target voice sample as a sample to train a speech recognition system with the neural network.
8. A speech sample generation apparatus, characterized by comprising:
a first extraction module, configured to extract a Mel-frequency characteristic value of a first voice variable after the first voice variable is obtained, wherein characteristic parameters of the first voice variable are identical to characteristic parameters of a target voice, and the characteristic parameters include: length, sample rate and channels;
a first computing module, configured to use a neural network to compute a loss function between the Mel-frequency characteristic value of the first voice variable and a Mel-frequency characteristic value of the target voice;
an optimization module, configured to use an optimization algorithm in the neural network to optimize the loss function by adjusting values of sampling points in the first voice variable, until a value of the optimized loss function is less than a preset threshold, the voice variable for which the value of the loss function is less than the preset threshold being a target voice sample.
9. The speech sample generation apparatus according to claim 8, characterized in that the first extraction module comprises:
a first transform module, configured to perform a Fourier transform on each frame of the first voice variable to obtain a second voice variable;
a first filtering module, configured to perform Mel filtering on the second voice variable to obtain a third voice variable;
a second transform module, configured to perform a discrete cosine transform on the third voice variable to obtain a Mel-scale cepstrum, and to use the Mel-scale cepstrum as the Mel-frequency characteristic value of the first voice variable.
10. The speech sample generation apparatus according to claim 9, characterized in that the apparatus further comprises:
a second computing module, configured to perform a difference operation on the Mel-scale cepstrum;
and the second transform module comprises:
an insertion module, configured to insert a result of the difference operation into the Mel-scale cepstrum to obtain the Mel-frequency characteristic value of the first voice variable.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811593971.6A CN109473091B (en) | 2018-12-25 | 2018-12-25 | Voice sample generation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109473091A true CN109473091A (en) | 2019-03-15 |
CN109473091B CN109473091B (en) | 2021-08-10 |
Family
ID=65676987
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811593971.6A Active CN109473091B (en) | 2018-12-25 | 2018-12-25 | Voice sample generation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109473091B (en) |
- 2018-12-25: CN CN201811593971.6A patent granted as CN109473091B (en), status active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180342258A1 (en) * | 2017-05-24 | 2018-11-29 | Modulate, LLC | System and Method for Creating Timbres |
CN107293289A (en) * | 2017-06-13 | 2017-10-24 | 南京医科大学 | A kind of speech production method that confrontation network is generated based on depth convolution |
CN108182936A (en) * | 2018-03-14 | 2018-06-19 | 百度在线网络技术(北京)有限公司 | Voice signal generation method and device |
CN108597496A (en) * | 2018-05-07 | 2018-09-28 | 广州势必可赢网络科技有限公司 | A kind of speech production method and device for fighting network based on production |
CN108899032A (en) * | 2018-06-06 | 2018-11-27 | 平安科技(深圳)有限公司 | Method for recognizing sound-groove, device, computer equipment and storage medium |
CN109036389A (en) * | 2018-08-28 | 2018-12-18 | 出门问问信息科技有限公司 | The generation method and device of a kind of pair of resisting sample |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020232860A1 (en) * | 2019-05-22 | 2020-11-26 | 平安科技(深圳)有限公司 | Speech synthesis method and apparatus, and computer readable storage medium |
WO2021137754A1 (en) * | 2019-12-31 | 2021-07-08 | National University Of Singapore | Feedback-controlled voice conversion |
CN111292766A (en) * | 2020-02-07 | 2020-06-16 | 北京字节跳动网络技术有限公司 | Method, apparatus, electronic device, and medium for generating speech samples |
CN111292766B (en) * | 2020-02-07 | 2023-08-08 | 抖音视界有限公司 | Method, apparatus, electronic device and medium for generating voice samples |
CN111477247A (en) * | 2020-04-01 | 2020-07-31 | 宁波大学 | GAN-based voice countermeasure sample generation method |
CN111477247B (en) * | 2020-04-01 | 2023-08-11 | 宁波大学 | Speech countermeasure sample generation method based on GAN |
CN112216296A (en) * | 2020-09-25 | 2021-01-12 | 脸萌有限公司 | Audio anti-disturbance testing method and device and storage medium |
CN112216296B (en) * | 2020-09-25 | 2023-09-22 | 脸萌有限公司 | Audio countermeasure disturbance testing method, device and storage medium |
CN112201227A (en) * | 2020-09-28 | 2021-01-08 | 海尔优家智能科技(北京)有限公司 | Voice sample generation method and device, storage medium and electronic device |
CN112466298A (en) * | 2020-11-24 | 2021-03-09 | 网易(杭州)网络有限公司 | Voice detection method and device, electronic equipment and storage medium |
CN112466298B (en) * | 2020-11-24 | 2023-08-11 | 杭州网易智企科技有限公司 | Voice detection method, device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN109473091B (en) | 2021-08-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109473091A (en) | A kind of speech samples generation method and device | |
RU2685391C1 (en) | Method, device and system for noise rejection | |
EP2178082B1 (en) | Cyclic signal processing method, cyclic signal conversion method, cyclic signal processing device, and cyclic signal analysis method | |
JP4177755B2 (en) | Utterance feature extraction system | |
CN102054480B (en) | Method for separating monaural overlapping speeches based on fractional Fourier transform (FrFT) | |
CN109767783A (en) | Sound enhancement method, device, equipment and storage medium | |
CN109256127B (en) | Robust voice feature extraction method based on nonlinear power transformation Gamma chirp filter | |
KR20120090086A (en) | Determining an upperband signal from a narrowband signal | |
Kesarkar et al. | Feature extraction for speech recognition | |
US6701291B2 (en) | Automatic speech recognition with psychoacoustically-based feature extraction, using easily-tunable single-shape filters along logarithmic-frequency axis | |
CN103258537A (en) | Method utilizing characteristic combination to identify speech emotions and device thereof | |
JP2013512475A (en) | Complex acoustic resonance speech analysis system | |
JP2002507776A (en) | Signal processing method for analyzing transients in audio signals | |
CN103778914A (en) | Anti-noise voice identification method and device based on signal-to-noise ratio weighing template characteristic matching | |
RU2013119828A (en) | METHOD FOR DETERMINING THE RISK OF DEVELOPMENT OF INDIVIDUAL DISEASES BY ITS VOICE AND HARDWARE AND SOFTWARE COMPLEX FOR IMPLEMENTING THE METHOD | |
JP2006215228A (en) | Speech signal analysis method and device for implementing this analysis method, speech recognition device using this device for analyzing speech signal, program for implementing this analysis method, and recording medium thereof | |
JP4166405B2 (en) | Drive signal analyzer | |
CN112863517A (en) | Speech recognition method based on perceptual spectrum convergence rate | |
CN107993666B (en) | Speech recognition method, speech recognition device, computer equipment and readable storage medium | |
Singh et al. | A comparative study of recognition of speech using improved MFCC algorithms and Rasta filters | |
Zouhir et al. | Speech Signals Parameterization Based on Auditory Filter Modeling | |
Flynn et al. | A comparative study of auditory-based front-ends for robust speech recognition using the Aurora 2 database | |
JP4537821B2 (en) | Audio signal analysis method, audio signal recognition method using the method, audio signal section detection method, apparatus, program and recording medium thereof | |
Singh et al. | A novel algorithm using MFCC and ERB gammatone filters in speech recognition | |
Bhore et al. | Comparison of Formant Estimation Techniques |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||