CN108389577B - Method, system, device and storage medium for optimizing a speech recognition acoustic model

Info

Publication number: CN108389577B
Application number: CN201810146221.8A
Authority: CN
Inventors: 雷延强
Original assignee: Guangzhou Shiyuan Electronics Technology Co Ltd
Current assignee: Guangzhou Shiyuan Electronics Technology Co Ltd
Priority date: 2018-02-12
Filing date: 2018-02-12
Publication date: 2019-05-31
Anticipated expiration: 2038-02-12
Also published as: CN108389577A

Abstract

Embodiments of the invention disclose a method, system, device and storage medium for optimizing a speech recognition acoustic model. The method comprises: obtaining the annotation text of a sample speech, and obtaining the recognition text produced for the sample speech by the current acoustic model; comparing the annotation text with the recognition text and, when the comparison result is a mismatch, determining the erroneous-annotation information of the annotation text relative to the recognition text; updating the annotation text of the sample speech according to the text update decision condition corresponding to the erroneous-annotation information; and retraining on a set quantity of sample speech and their currently corresponding annotation texts to optimize the current acoustic model. The method can effectively improve the annotation quality of the annotation texts corresponding to the sample speech, thereby optimizing the acoustic model.

Description

Method, system, device and storage medium for optimizing a speech recognition acoustic model

Technical field

The present invention relates to the field of computer technology, and in particular to a method, system, device and storage medium for optimizing a speech recognition acoustic model.

Background art

As the range of applications of speech recognition keeps expanding, speech recognition technology has become an emerging high-tech industry and has attracted the attention of more and more engineers. The acoustic model is one of the key components of a speech recognition system, and its quality largely determines the quality of the recognition result. The speech recognition acoustic model therefore needs to be optimized continually.

Training an acoustic model generally requires a large amount of sample data, which usually consists of speech data and the annotation text corresponding to the speech data (the textual content spoken in the speech data). Annotation text is typically produced by large-scale manual labeling or obtained from a third-party recognition system, but annotation text obtained in either way often contains some errors, which degrades the annotation quality.

For a speech recognition acoustic model, improving the annotation quality of the annotation text is itself one means of optimizing the acoustic model. At present, however, no technical solution has been found that optimizes the acoustic model by improving the quality of the annotation text.

Summary of the invention

Embodiments of the invention provide a method, system, device and storage medium for optimizing a speech recognition acoustic model, which improve the annotation quality of the annotation text and thereby optimize the acoustic model.

In a first aspect, an embodiment of the invention provides a method for optimizing a speech recognition acoustic model, comprising:

obtaining the annotation text of a sample speech, and obtaining the recognition text produced for the sample speech by the current acoustic model;

comparing the annotation text with the recognition text and, when the comparison result is a mismatch, determining the erroneous-annotation information of the annotation text relative to the recognition text;

updating the annotation text of the sample speech according to the text update decision condition corresponding to the erroneous-annotation information;

retraining on a set quantity of sample speech and their currently corresponding annotation texts to optimize the current acoustic model.

In a second aspect, an embodiment of the invention provides an apparatus for optimizing a speech recognition acoustic model, comprising:

a text obtaining module, configured to obtain the annotation text of a sample speech and to obtain the recognition text produced for the sample speech by the current acoustic model;

an erroneous-annotation determining module, configured to compare the annotation text with the recognition text and, when the comparison result is a mismatch, to determine the erroneous-annotation information of the annotation text relative to the recognition text;

an annotation text update module, configured to update the annotation text of the sample speech according to the text update decision condition corresponding to the erroneous-annotation information;

an acoustic model optimization module, configured to retrain on a set quantity of sample speech and their currently corresponding annotation texts to optimize the current acoustic model.

In a third aspect, an embodiment of the invention provides a computer device, comprising:

one or more processors;

a storage device for storing one or more programs;

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for optimizing a speech recognition acoustic model provided by the embodiment of the first aspect.

In a fourth aspect, an embodiment of the invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for optimizing a speech recognition acoustic model provided by the embodiment of the first aspect.

In the above method, system, device and storage medium for optimizing a speech recognition acoustic model, the annotation text of a sample speech is first obtained, together with the recognition text produced for the sample speech by the current acoustic model. The annotation text and the recognition text are then compared, and when the comparison result is a mismatch, the erroneous-annotation information of the annotation text relative to the recognition text is determined. Next, the annotation text of the sample speech is updated according to the erroneous-annotation information and the pronunciation probabilities of the sample speech under the annotation text and under the recognition text. Finally, the current acoustic model is retrained and optimized on a set quantity of sample speech and their currently corresponding annotation texts. The method can effectively improve the annotation quality of the annotation texts corresponding to the sample speech, which improves the quality of the training data required by the acoustic model, optimizes the acoustic model, and improves speech recognition accuracy to a certain extent.

Brief description of the drawings

Fig. 1 is a flow diagram of a method for optimizing a speech recognition acoustic model provided by Embodiment one of the invention;

Fig. 2 is a flow diagram of a method for optimizing a speech recognition acoustic model provided by Embodiment two of the invention;

Fig. 3 is a structural block diagram of an apparatus for optimizing a speech recognition acoustic model provided by Embodiment three of the invention;

Fig. 4 is a hardware structure diagram of a computer device provided by Embodiment four of the invention.

Detailed description of the embodiments

The invention is described in further detail below with reference to the drawings and embodiments. The specific embodiments described here serve only to explain the invention and do not limit it. Note also that, for ease of description, the drawings show only the parts related to the invention rather than the entire structure.

Embodiment one

Fig. 1 is a flow diagram of a method for optimizing a speech recognition acoustic model provided by Embodiment one of the invention. The method applies to the case of optimizing an acoustic model used for speech recognition. It can be executed by an apparatus for optimizing a speech recognition acoustic model, which can be implemented in hardware and/or software and is typically integrated in a computer device with a speech recognition function.

As shown in Fig. 1, the method for optimizing a speech recognition acoustic model provided by Embodiment one of the invention includes the following operations:

S101: obtain the annotation text of a sample speech, and obtain the recognition text produced for the sample speech by the current acoustic model.

The sample speech is one item of speech data in the speech data set required for acoustic model training, and during training every sample speech has one corresponding annotation text. The current acoustic model can be understood as the acoustic model trained on the sample speech in the speech data set and their currently corresponding annotation texts.

This step obtains the annotation text of one sample speech in the speech data set, and obtains the recognition text produced when that sample speech is passed through a speech recognition system. The speech recognition system is taken to contain the current acoustic model, and its recognition is carried out by means of the current acoustic model.

S102: compare the annotation text with the recognition text and, when the comparison result is a mismatch, determine the erroneous-annotation information of the annotation text relative to the recognition text.

In this embodiment, once the recognition text of the sample speech has been obtained, the annotation text and the recognition text are compared to determine whether the characters in the two texts match one by one. If all characters match one by one, the comparison result is a match; otherwise the comparison result is a mismatch. For a mismatched pair, the annotation text contains characters that do not match the recognition text. This may be because the annotation text contains wrongly annotated characters, because the recognition text contains wrongly recognized characters, or because both texts contain errors. This step does not consider which of these cases caused the mismatch. It directly determines the characters in the annotation text that do not match the recognition text, then determines the mismatch information of each such character (for example, its position in the annotation text and its mismatch type), and finally aggregates the mismatch information of all mismatched characters into the erroneous-annotation information of the annotation text relative to the recognition text.

S103: update the annotation text of the sample speech according to the text update decision condition corresponding to the erroneous-annotation information.

In this embodiment, the text update decision condition can be understood as the decision rule that determines how the annotation text is to be updated. Mismatches between annotation text and recognition text take many forms: the number of mismatched characters varies, and so do their mismatch types. The erroneous-annotation information of the annotation text relative to the recognition text therefore differs widely from one sample speech to another.

This embodiment sets a corresponding text update decision condition in advance for each form of erroneous-annotation information. This step determines the applicable text update decision condition from the content of the erroneous-annotation information, and then updates the annotation text of the sample speech according to the update rule of that condition. The update works as follows: depending on the determined text update decision condition, either the recognition text just obtained is selected as the new annotation text, or the original annotation text is kept as the new annotation text.

S104: retrain on a set quantity of sample speech and their currently corresponding annotation texts to optimize the current acoustic model.

The operations S101 to S103 improve the quality of the annotation text of the sample data. Before this step is carried out, every sample speech in the training data set can have its annotation text updated by the above steps. Specifically, the steps can be applied in parallel: the recognition text of every sample speech is determined at the same time, and the sample speech whose recognition text does not match the corresponding annotation text are then selected for annotation update. Alternatively, the annotation texts can be updated serially, one after another. This embodiment does not limit the form of implementation.

After the currently corresponding annotation text of each sample speech has been obtained by the above steps (that is, the annotation text after quality improvement), the acoustic model can be trained again on the sample speech and their currently corresponding annotation texts, yielding a new current acoustic model. The set quantity in this step can be understood as the total number of sample speech items in the training data set. The method provided by this embodiment is in effect iterative: after one pass through the operations, it can return to S101 and start the next round, and the termination condition can be a manually set number of iterations. This embodiment holds that a current acoustic model obtained by such retraining can improve the recognition accuracy of the speech recognition system to a certain extent.
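The S101 to S104 cycle can be sketched as follows. This is a minimal illustration, not the patented implementation: the callables `train`, `recognize` and `should_replace` are hypothetical placeholders for acoustic model training, decoding with the current model, and the text update decision condition, none of which are specified at this level of detail here.

```python
from typing import Callable, Dict, List, Tuple


def optimize_acoustic_model(
    samples: List[str],
    annotations: Dict[str, str],
    train: Callable[[List[str], Dict[str, str]], object],
    recognize: Callable[[object, str], str],
    should_replace: Callable[[object, str, str, str], bool],
    num_rounds: int,
) -> Tuple[object, Dict[str, str]]:
    """Run the S101-S104 cycle for a manually set number of rounds."""
    annotations = dict(annotations)        # leave the caller's labels untouched
    model = train(samples, annotations)    # the "current acoustic model"
    for _ in range(num_rounds):
        for sample in samples:
            recognized = recognize(model, sample)       # S101
            annotated = annotations[sample]
            if recognized == annotated:                 # S102: texts match
                continue
            # S103: hypothetical decision hook; keep or replace the label
            if should_replace(model, sample, annotated, recognized):
                annotations[sample] = recognized
        model = train(samples, annotations)             # S104: retrain
    return model, annotations
```

The serial inner loop corresponds to the one-by-one update variant; the parallel variant would compute all recognition texts first and then filter the mismatched samples.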

In the method for optimizing a speech recognition acoustic model provided by Embodiment one of the invention, the annotation text of a sample speech is first obtained, together with the recognition text produced for the sample speech by the current acoustic model. The annotation text and the recognition text are then compared, and when the comparison result is a mismatch, the erroneous-annotation information of the annotation text relative to the recognition text is determined. Next, the annotation text of the sample speech is updated according to the erroneous-annotation information and the pronunciation probabilities of the sample speech under the annotation text and under the recognition text. Finally, the current acoustic model is retrained and optimized on a set quantity of sample speech and their currently corresponding annotation texts. The method can effectively improve the annotation quality of the annotation texts corresponding to the sample speech, which improves the quality of the training data required by the acoustic model, optimizes the acoustic model, and improves speech recognition accuracy to a certain extent.

Embodiment two

Fig. 2 is a flow diagram of a method for optimizing a speech recognition acoustic model provided by Embodiment two of the invention. This embodiment refines the embodiment above. Here, the operation of comparing the annotation text with the recognition text and, when the comparison result is a mismatch, determining the erroneous-annotation information of the annotation text relative to the recognition text, is embodied as: comparing the annotation text with the recognition text to obtain the edit distance between them, and determining that the comparison result is a mismatch when the edit distance is non-zero; when the comparison result is a mismatch, determining from the edit distance the total number of erroneous annotations in the annotation text relative to the recognition text, the position of each erroneous annotation, and the error type of each erroneous annotation; and recording the total number of erroneous annotations, together with the position and error type of each erroneous annotation, as the erroneous-annotation information.

Likewise, the operation of updating the annotation text of the sample speech according to the text update decision condition corresponding to the erroneous-annotation information is embodied as: looking up, on the basis of the erroneous-annotation information, the text update decision condition corresponding to the sample speech in a preset multi-level information relation table, where the text update decision condition is a comparison-based judgment between the pronunciation probability information of the sample speech under the annotation text and its pronunciation probability information under the recognition text; determining the first pronunciation probability information of the sample speech after alignment with the annotation text, and the second pronunciation probability information after alignment with the recognition text; when the first and second pronunciation probability information show that the text update decision condition holds, taking the recognition text as the new annotation text of the sample speech; otherwise, keeping the annotation text as the new annotation text of the sample speech.

Specifically, the method for optimizing a speech recognition acoustic model provided by Embodiment two of the invention includes the following operations:

S201: obtain the annotation text of a sample speech, and obtain the recognition text produced for the sample speech by the current acoustic model.

For example, the annotation text corresponding to a sample speech can be read directly from the speech data set. The sample speech can also be decoded by a speech recognition system containing the current acoustic model: the speech features of the sample speech are extracted, and recognition is performed on the extracted features to obtain the recognition text of the sample speech.

S202: compare the annotation text with the recognition text to obtain the edit distance between them, and determine that the comparison result is a mismatch when the edit distance is non-zero.

Steps S202 to S204 of this embodiment give the specific operations for text comparison and for determining the erroneous-annotation information. This step uses an edit distance algorithm: the two texts are compared by computing the edit distance between the annotation text string and the recognition text string. The edit distance between two strings is the minimum number of edit operations needed to convert one into the other, where the permitted edit operations are replacing one character with another, inserting a character, and deleting a character.

When the two texts are compared by edit distance in this step, the operation is to determine the minimum number of edit operations (the edit distance) needed to convert the annotation text into the recognition text. If that number is 0, the edit distance of the two texts is 0 and the characters in the two texts match. If it is 1, the edit distance is 1 and the two texts contain one mismatched character. If it is 2, the edit distance is 2 and the two texts contain two mismatched characters. Similarly, if it is greater than 2, the two texts contain multiple mismatched characters.

It follows that whenever the edit distance of the two texts is non-zero, the two texts contain mismatched characters, and the comparison result is determined to be a mismatch.
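The comparison in S202 can be illustrated with the standard dynamic-programming (Levenshtein) edit distance, which counts substitutions, insertions and deletions at unit cost. The function names below are illustrative and not taken from the patent.

```python
def edit_distance(annotation: str, recognition: str) -> int:
    """Minimum number of single-character substitutions, insertions and
    deletions needed to turn `annotation` into `recognition`."""
    n = len(recognition)
    prev = list(range(n + 1))              # distances from the empty prefix
    for i in range(1, len(annotation) + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if annotation[i - 1] == recognition[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,               # delete a character from annotation
                curr[j - 1] + 1,           # insert a character into annotation
                prev[j - 1] + cost,        # keep or substitute
            )
        prev = curr
    return prev[n]


def texts_match(annotation: str, recognition: str) -> bool:
    """S202: the comparison result is a match only when the distance is 0."""
    return edit_distance(annotation, recognition) == 0
```

Only two rows of the table are kept because S202 needs just the distance value; locating the individual errors requires the full table, as in S203.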

S203: when the comparison result is a mismatch, determine from the edit distance the total number of erroneous annotations in the annotation text relative to the recognition text, the position of each erroneous annotation, and the error type of each erroneous annotation.

In this embodiment, two texts whose comparison result is a mismatch differ from each other, so the annotation text is taken to contain wrongly annotated characters relative to the recognition text. While the edit distance is being determined, this step can determine how many wrongly annotated characters the annotation text contains relative to the recognition text, where each of them is located in the annotation text, and what error type each of them has. Specifically, the edit distance value itself can be taken as the total number of erroneous annotations in the annotation text. During the conversion of the annotation text into the recognition text, if a replacement operation is performed on a character, the position of that character is known (and is taken as the position of the erroneous annotation), and the conversion type for that character is character substitution (so the error type of the erroneous annotation is character substitution).

The other error types of an erroneous annotation are character insertion (compared with the recognition text, the annotation text lacks a character at the corresponding position) and character deletion (compared with the recognition text, the annotation text has an extra character at the corresponding position). The error type is determined by the conversion operation actually performed during the conversion. For example, when the annotation text lacks a character at a given position compared with the recognition text, the conversion operation is a character insertion at that position of the annotation text. When the annotation text has an extra character at a given position compared with the recognition text, the conversion operation is a character deletion at that position of the annotation text.
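One way to recover the total, positions and error types described in S203 and S204 is to backtrace through the full edit-distance table. The sketch below assumes unit costs and reports positions as 0-based indices into the annotation text; when several minimal alignments exist, it prefers substitution, then deletion, then insertion, which is an arbitrary tie-break chosen for illustration.

```python
from typing import Dict, List, Tuple


def error_annotation_info(annotation: str, recognition: str) -> Dict[str, object]:
    """Return the total number of erroneous annotations plus the position
    (index in `annotation`) and error type of each one."""
    m, n = len(annotation), len(recognition)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = int(annotation[i - 1] != recognition[j - 1])
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)

    errors: List[Tuple[int, str]] = []
    i, j = m, n
    while i > 0 or j > 0:
        cost = int(i > 0 and j > 0 and annotation[i - 1] != recognition[j - 1])
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + cost:
            if cost:
                errors.append((i - 1, "substitution"))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            errors.append((i - 1, "deletion"))    # annotation has an extra character
            i -= 1
        else:
            errors.append((i, "insertion"))       # annotation lacks a character here
            j -= 1
    errors.reverse()
    return {"total": dist[m][n], "errors": errors}
```

For instance, `error_annotation_info("ac", "abc")` reports one insertion at position 1, because the annotation text lacks the character that the recognition text has between `a` and `c`.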

S204: record the total number of erroneous annotations, together with the position and error type of each erroneous annotation, as the erroneous-annotation information.

S205: on the basis of the erroneous-annotation information, look up the text update decision condition corresponding to the sample speech in the preset multi-level information relation table.

This embodiment selects different text update decision conditions for different erroneous-annotation information. Specifically, the text update decision condition that fits the erroneous-annotation information is looked up directly in the preset multi-level information relation table and used as the text update decision condition of the current sample speech.

Further, looking up the text update decision condition corresponding to the sample speech in the preset multi-level information relation table on the basis of the erroneous-annotation information comprises:

obtaining the total number of erroneous annotations and the error types of the erroneous annotations from the erroneous-annotation information; in the multi-level information relation table, using the total number of erroneous annotations as the index, finding the set error type that matches the error types of the erroneous annotations; and taking the update decision condition corresponding to that set error type as the text update decision condition of the sample speech.

Specifically, when the text update decision condition is determined, the total number of erroneous annotations in the erroneous-annotation information is first used as the index to locate the entries belonging to that total. Within those entries, the set error type that matches the error types of the erroneous annotations is then found, which gives the update decision condition corresponding to that set error type. This embodiment takes that update decision condition as the text update decision condition of the sample speech.

Note that the text update decision condition is in effect a comparison-based judgment between the pronunciation probability information of the sample speech under the annotation text and its pronunciation probability information under the recognition text. Pronunciation probability information is obtained as follows: the sample speech is divided into a number of speech signal frames, the pronunciation unit aligned with each speech signal frame is determined, and the pronunciation probability that each speech signal frame belongs to its corresponding pronunciation unit is obtained. The sample speech thus has one set of pronunciation probability information formed from the pronunciation units corresponding to the annotation text, and another formed from the pronunciation units corresponding to the recognition text. The text update decision condition in this embodiment amounts to a comparison-based judgment between these two sets of pronunciation probability information.
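The notion of pronunciation probability information can be made concrete with a small sketch. It assumes, hypothetically, that the acoustic model exposes per-frame posteriors as a mapping from pronunciation unit to probability and that a frame-level alignment is already available; the `condition` argument is a placeholder for whichever update decision condition applies, whose concrete formula is not reproduced here.

```python
from typing import Callable, Dict, List, Sequence


def pronunciation_probabilities(
    frame_posteriors: Sequence[Dict[str, float]],
    aligned_units: Sequence[str],
) -> List[float]:
    """For each speech signal frame, pick the probability of the
    pronunciation unit that the frame is aligned to."""
    if len(frame_posteriors) != len(aligned_units):
        raise ValueError("one aligned pronunciation unit is required per frame")
    return [post[unit] for post, unit in zip(frame_posteriors, aligned_units)]


def update_annotation(
    annotation: str,
    recognition: str,
    first_info: List[float],     # probabilities under the annotation text
    second_info: List[float],    # probabilities under the recognition text
    condition: Callable[[List[float], List[float]], bool],
) -> str:
    """Adopt the recognition text when the decision condition holds,
    otherwise keep the original annotation text."""
    return recognition if condition(first_info, second_info) else annotation
```

A toy call might pass `lambda p1, p2: sum(p2) > sum(p1)` as `condition`; that lambda is only a stand-in for demonstration and is not one of the decision conditions defined in this document.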

Further, the multi-level information relation table is constructed by the following steps:

initializing a multi-level information relation table containing a first-level information column, a second-level information column and a third-level information column; storing set totals of erroneous annotations in the first-level information column, the set totals being 1-character error, 2-character error and multi-character error; storing in the second-level information column the set error types corresponding to the 1-character error and the 2-character error, and setting the second-level information cell corresponding to the multi-character error to empty; storing in the third-level information column the update decision condition corresponding to each set error type, and storing a set standard update decision condition in the third-level information cell corresponding to the multi-character error.

Because determining the text update decision condition relies mainly on the preset multi-level information relation table, constructing that table is a key step. Following the construction steps above, a multi-level information relation table of the form shown in Table 1 below can be built.

As shown in Table 1, the first-level information column holds the set totals of erroneous annotations, which fall into three cases: 1-character error, 2-character error and multi-character error. The second-level information column holds the set error types. From the way the edit distance is determined, each conversion operation can be one of 3 conversion types: character substitution, character insertion and character deletion. With only 1 erroneous character there are therefore three error types, and with 2 erroneous characters there are six error types. With multiple erroneous characters the number of error types is larger still, and this embodiment does not consider them one by one. The third-level information column holds the update decision condition corresponding to each set error type. Because the error types are not distinguished for multi-character errors, this embodiment sets a single standard update decision condition for that case.

Table 1: Multi-level information relation table
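The three-level structure and the S205 lookup can be sketched as a nested mapping. The condition labels (1_1 to 1_3, 2_1 to 2_6, and the standard condition) follow the naming used in this document; the dictionary layout, the `"multi"` key and the function names are illustrative assumptions. The six 2-character entries arise as the unordered pairs drawn from the three error types.

```python
from itertools import combinations_with_replacement
from typing import Dict, Sequence

ERROR_TYPES = ("substitution", "insertion", "deletion")


def build_relation_table() -> Dict[object, Dict[object, str]]:
    """First level: set total of erroneous annotations; second level:
    set error type(s); third level: label of the update decision condition."""
    table: Dict[object, Dict[object, str]] = {
        1: {},
        2: {},
        # Second-level cell is empty (None) for multi-character errors.
        "multi": {None: "standard update decision condition"},
    }
    for idx, error_type in enumerate(ERROR_TYPES, start=1):
        table[1][(error_type,)] = f"update decision condition 1_{idx}"
    pairs = combinations_with_replacement(ERROR_TYPES, 2)   # 6 unordered pairs
    for idx, pair in enumerate(pairs, start=1):
        table[2][pair] = f"update decision condition 2_{idx}"
    return table


def lookup_condition(table, total: int, error_types: Sequence[str]) -> str:
    """S205: index by the total first, then match the set error type."""
    if total > 2:
        return table["multi"][None]
    key = tuple(sorted(error_types, key=ERROR_TYPES.index))
    return table[total][key]
```

Sorting the error types into a canonical order before the lookup means that a deletion followed by a substitution resolves to the same entry as a substitution followed by a deletion.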

This embodiment gives, by way of example, the update decision conditions corresponding to the set error types above. For a 1-character error, when the error type is character substitution, the two expressions [formulas not reproduced in this text] can be taken together as the content of update decision condition 1_1; when the error type is character insertion, the two expressions [formulas not reproduced in this text] can be taken together as the content of update decision condition 1_2; when the error type is character deletion, the two expressions [formulas not reproduced in this text] can be taken together as the content of update decision condition 1_3.

In the formulas above, p₁(q1_t/o_t) denotes, after the sample speech has been divided into M speech signal frames, the pronunciation probability that the t-th speech signal frame o_t belongs to the pronunciation unit q1_t of frame t in the annotation text; p₂(q2_t/o_t) denotes the pronunciation probability that the t-th speech signal frame o_t belongs to the pronunciation unit q2_t of frame t in the recognition text. Here t ranges from frame 1 to frame M, that is, t ∈ [1, M]; t₁ ∈ [x1, x2] denotes the start-to-end frame range covered by the pronunciation units of the erroneously annotated character in the annotation text; t₁ ∈ [y1, y2] denotes the start-to-end frame range covered by the pronunciation units of the calibration character in the recognition text, where the calibration character is the character in the recognition text that corresponds to the erroneously annotated character in the annotation text. In addition, T_I is a preset insertion threshold and T_D is a preset deletion threshold; their specific values can be set manually from historical experience.

Meanwhile in the case where 2 character error,

1) it when the type of error of 2 character errors is respectively text replacement and text replacement, can incite somebody to action:

WithAs the specific of update decision condition 2_1 Content；

2) it when the type of error of 2 words mistake is respectively text replacement and text insertion, can incite somebody to action:

WithAs the particular content for updating decision condition 2_2；

3) it when the type of error of 2 character errors is respectively text replacement and text is deleted, can incite somebody to action:

WithUpdate the particular content of decision condition 2_3；

4) it when the type of error of 2 character errors is respectively text insertion and text insertion, can incite somebody to action:

AndAs the particular content for updating decision condition 2_4；

5) it when the type of error of 2 character errors is respectively text insertion and text is deleted, can incite somebody to action:

AndRegard the particular content for updating decision condition 2_5 as；

6) it when the type of error of 2 character errors is respectively that text is deleted with text deletion, can incite somebody to action:

AndRegard the particular content for updating decision condition 2_6 as.

It should be noted that, in each of the above formulas, p₁(q1_t/o_t) and p₂(q2_t/o_t) have the same meanings as described above. The four frame-range expressions used in these formulas denote, respectively: the start-stop frame number range corresponding to the pronunciation units of the 1st error label text in the mark text; the start-stop frame number range corresponding to the pronunciation units of the 2nd error label text in the mark text; the start-stop frame number range corresponding to the pronunciation units of the 1st calibration text in the identification text; and the start-stop frame number range corresponding to the pronunciation units of the 2nd calibration text in the identification text. The 1st calibration text and the 2nd calibration text are the texts in the identification text that correspond, respectively, to the 1st error label text and the 2nd error label text in the mark text. In addition, T_I and T_D have the same meanings as described above.

In addition, in the case of a multi-character error, the present embodiment sets [formula] and [formula] as the standard update decision condition, where p₁(q1_t/o_t) and p₂(q2_t/o_t) again have the same meanings as described above, and k denotes the k-th text in the identification text, with k greater than 2. The frame-range expression indexed by k denotes the start-stop frame number range corresponding to the pronunciation units of the k-th text in the identification text. T_M is a preset multi-character detection threshold whose specific value can be set manually; generally, to prevent the mark text from being updated wrongly, the value of T_M in the present embodiment is determined through a series of tests.

Based on the above multi-level information relation table and the determined error label information, this step can accurately find the corresponding text update decision condition.

S206: determine the first pronunciation probability information after the sample voice is aligned with the mark text, and the second pronunciation probability information after the sample voice is aligned with the identification text.

It can be understood that the text update decision condition determined above amounts to a comparison between the pronunciation probability information of the sample voice under the mark text and the pronunciation probability information of the sample voice under the identification text. To judge whether the determined text update decision condition holds, this step therefore needs to determine the first pronunciation probability information after the sample voice is aligned with the mark text, and the second pronunciation probability information after the sample voice is aligned with the identification text.

Further, the first pronunciation probability information is determined based on the speech signal frames formed by dividing the sample voice frame by frame and on the first pronunciation unit sequence formed by modeling the mark text; the second pronunciation probability information is determined based on the same speech signal frames and on the second pronunciation unit sequence formed by modeling the identification text.

Specifically, the operations for determining the first pronunciation probability information and the second pronunciation probability information can be described as follows:

1) taking the frame as the unit and considering the actual pronunciation duration of the sample voice, the sample voice is divided into a set number of speech signal frames, and the phonetic feature of each speech signal frame is determined;

2) based on the set pronunciation modeling rule, a first pronunciation unit sequence corresponding to the mark text and a second pronunciation unit sequence corresponding to the identification text are obtained, where the two pronunciation unit sequences contain the pronunciation units that make up the mark text and the identification text respectively;

3) using a dynamic programming algorithm, the first pronunciation unit aligned with each phonetic feature is determined from the first pronunciation unit sequence, and the second pronunciation unit aligned with each phonetic feature is determined from the second pronunciation unit sequence; once the aligned pronunciation units are determined, the first pronunciation probability that each phonetic feature belongs to its first pronunciation unit and the second pronunciation probability that each phonetic feature belongs to its second pronunciation unit are also obtained;

4) the first pronunciation unit combination that constitutes each error label text in the mark text is then determined, together with the first start frame number and first end frame number corresponding to that combination (that is, the start-stop frame number range of the error label text);

5) the second pronunciation unit combination that constitutes each calibration text in the identification text is likewise determined, together with the second start frame number and second end frame number corresponding to that combination (that is, the start-stop frame number range of the calibration text), where each calibration text is determined mainly from the position of the corresponding error label text in the mark text;

6) finally, the first pronunciation probabilities, the first pronunciation unit combinations and their first start and end frame numbers are taken as the first pronunciation probability information, and the second pronunciation probabilities, the second pronunciation unit combinations and their second start and end frame numbers are taken as the second pronunciation probability information.
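The alignment in step 3) can be illustrated with a short sketch. The text only states that a dynamic programming algorithm is used, so what follows is a simplified Viterbi-style forced alignment under stated assumptions: the function name `force_align`, the input layout `frame_logp` (one dictionary of log-probabilities per speech signal frame, keyed by pronunciation unit) and the rule that every pronunciation unit covers at least one run of consecutive frames are all hypothetical choices made for illustration, not details taken from the embodiment.

```python
import math
from typing import Dict, List, Sequence


def force_align(frame_logp: Sequence[Dict[str, float]],
                units: Sequence[str]) -> List[int]:
    """Return, for each frame, the index in `units` that the frame is aligned to.

    Assumption: units are visited in order and each one covers >= 1 frame.
    """
    n_frames, n_units = len(frame_logp), len(units)
    if n_units == 0 or n_frames < n_units:
        raise ValueError("need at least one frame per pronunciation unit")
    neg_inf = -math.inf
    # best[t][u]: best total log-probability with frame t assigned to unit u
    best = [[neg_inf] * n_units for _ in range(n_frames)]
    # stay[t][u]: True if frame t-1 was assigned to the same unit u
    stay = [[True] * n_units for _ in range(n_frames)]
    best[0][0] = frame_logp[0][units[0]]
    for t in range(1, n_frames):
        for u in range(min(t + 1, n_units)):
            keep = best[t - 1][u]
            move = best[t - 1][u - 1] if u > 0 else neg_inf
            stay[t][u] = keep >= move
            best[t][u] = max(keep, move) + frame_logp[t][units[u]]
    # Trace back from the last frame, which must belong to the last unit.
    path = [0] * n_frames
    u = n_units - 1
    for t in range(n_frames - 1, -1, -1):
        path[t] = u
        if t > 0 and not stay[t][u]:
            u -= 1
    return path


# Toy data: four frames, two pronunciation units.
frames = [
    {"zh": -0.1, "ong": -3.0},
    {"zh": -0.2, "ong": -2.0},
    {"zh": -2.5, "ong": -0.1},
    {"zh": -3.0, "ong": -0.2},
]
alignment = force_align(frames, ["zh", "ong"])  # [0, 0, 1, 1]
```

Running the same routine once with the first pronunciation unit sequence and once with the second gives the two alignments; the per-frame pronunciation probability is then `math.exp(frame_logp[t][units[path[t]]])`.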

It should be noted that the pronunciation unit combination mentioned above can be understood as follows: when pronunciation units are determined by initials and finals, for the text "中" the pronunciation units that form the text are the two units "zh" and "ong", so the pronunciation unit combination corresponding to "中" can be regarded as "zh" and "ong". In actual pronunciation, however, "zh" and "ong" may each occupy a pronunciation time of several frames, so the pronunciation unit combination corresponds to a start frame number and an end frame number when it is pronounced.
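As a small illustration of how a pronunciation unit combination maps to a start-stop frame number range, the sketch below assumes a frame-level alignment is already available as a list giving, for each frame, the index of the pronunciation unit it is aligned to. The helper name `unit_frame_range` and the example alignment are hypothetical; frames are numbered from 1, matching t ∈ [1, M] in the description.

```python
from typing import Sequence, Tuple


def unit_frame_range(path: Sequence[int],
                     first_unit: int,
                     last_unit: int) -> Tuple[int, int]:
    """Start and end frame numbers (1-based, inclusive) covered by the
    pronunciation units with indices first_unit..last_unit."""
    frames = [t for t, u in enumerate(path, start=1)
              if first_unit <= u <= last_unit]
    if not frames:
        raise ValueError("no frame is aligned to the requested units")
    return frames[0], frames[-1]


# Hypothetical alignment of 9 frames to the unit sequence
# ["zh", "ong", "g", "uo"]: units 0 and 1 form one text, units 2 and 3 the next.
path = [0, 0, 0, 1, 1, 2, 3, 3, 3]
first_text_range = unit_frame_range(path, 0, 1)   # (1, 5)
second_text_range = unit_frame_range(path, 2, 3)  # (6, 9)
```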

In addition, the calibration text in the identification text can be understood as follows: suppose that, when the two texts are compared, the x-th text in the mark text does not match the x-th text in the identification text, so that a text replacement operation is needed on the x-th text in the mark text. That text is then the error label text in the mark text, and the text at the same position in the identification text can be regarded as the calibration text.

S207: when it is determined, based on the first pronunciation probability information and the second pronunciation probability information, that the text update decision condition holds, the identification text is determined as the new mark text of the sample voice; otherwise, the mark text continues to be used as the new mark text of the sample voice.

It can be understood that the text update decision condition is formed mainly from pronunciation unit information. Therefore, once the actual pronunciation unit information of the sample voice under the mark text and under the identification text has been determined, it can be substituted into the formulas of the selected text update decision condition to determine whether the condition holds. If it holds, the recognized identification text is determined as the new mark text of the sample voice; if it does not hold, the original mark text continues to be used as the new mark text of the sample voice.
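The selection in S207 can be sketched as a function that takes the chosen condition as a pluggable predicate. The data class `PronInfo`, the function `select_label` and, in particular, the predicate `toy_condition` are hypothetical: `toy_condition` is only a stand-in that compares mean probabilities over the start-stop frame ranges, and it is not the formula of any update decision condition in this description.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PronInfo:
    """Pronunciation probability information for one alignment (assumed layout)."""
    frame_probs: List[float]      # pronunciation probability of frames 1..M
    frame_range: Tuple[int, int]  # start/end frame numbers, 1-based inclusive


Condition = Callable[[PronInfo, PronInfo], bool]


def select_label(mark_text: str, recog_text: str, condition: Condition,
                 first: PronInfo, second: PronInfo) -> str:
    """Return the identification text if the condition holds, else the mark text."""
    return recog_text if condition(first, second) else mark_text


def mean_prob(info: PronInfo) -> float:
    start, end = info.frame_range
    segment = info.frame_probs[start - 1:end]
    return sum(segment) / len(segment)


def toy_condition(first: PronInfo, second: PronInfo) -> bool:
    # Placeholder rule for illustration only, NOT the patent's formula.
    return mean_prob(second) > mean_prob(first)


first_info = PronInfo([0.9, 0.2, 0.3, 0.8], (2, 3))
second_info = PronInfo([0.9, 0.7, 0.6, 0.8], (2, 3))
new_label = select_label("mark", "recognized", toy_condition,
                         first_info, second_info)
```

Passing the condition as a callable keeps the selection logic independent of which update decision condition was looked up for the sample.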

S208: retrain and optimize the current acoustic model based on a set amount of sample voices and their current corresponding mark texts.

Illustratively, after the mark quality of the sample voices in the training data set has been improved by the above operations, the current acoustic model can be retrained from each sample voice and its current, improved mark text.
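Taken over a whole training set, the steps above amount to a label-refinement pass followed by retraining. The following is a minimal sketch under stated assumptions: `refine_labels`, `recognize` and `should_update` are hypothetical names, `recognize` stands for decoding with the current acoustic model, and `should_update` stands for evaluating the selected text update decision condition. Retraining itself is left to the caller.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[object, str]  # (sample voice, mark text)


def refine_labels(samples: Sequence[Sample],
                  recognize: Callable[[object], str],
                  should_update: Callable[[object, str, str], bool]) -> List[Sample]:
    """Return the samples with mark texts updated where the condition holds."""
    refined: List[Sample] = []
    for audio, mark_text in samples:
        recog_text = recognize(audio)
        # A mismatch corresponds to a non-zero edit distance between the texts.
        if recog_text != mark_text and should_update(audio, mark_text, recog_text):
            mark_text = recog_text
        refined.append((audio, mark_text))
    return refined


# Stub recognizer and stub decisions, for illustration only.
data = [("voice_a", "x"), ("voice_b", "y")]
decoded = {"voice_a": "x", "voice_b": "z"}
updated = refine_labels(data, decoded.__getitem__, lambda a, m, r: True)
kept = refine_labels(data, decoded.__getitem__, lambda a, m, r: False)
```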

The method for optimizing a voice recognition acoustic model provided by embodiment two of the present invention specifies the operation of determining the error label information and the operation of deciding whether to update the mark text. With this method, the mark quality of the mark text corresponding to the sample voice can be effectively improved, which improves the training data required by the acoustic model, achieves the purpose of optimizing the acoustic model, and improves the accuracy of speech recognition.

Embodiment three

Fig. 3 is a structural block diagram of a device for optimizing a voice recognition acoustic model provided by embodiment three of the present invention. The device is suitable for optimizing and improving an acoustic model used for speech recognition; it can be implemented in hardware and/or software and is typically integrated in computer equipment having a speech recognition function. As shown in Fig. 3, the device includes: a text obtaining module 31, an error label determining module 32, a mark text update module 33 and an acoustic model optimization module 34.

The text obtaining module 31 is configured to obtain the mark text of a sample voice, and to obtain the identification text that the sample voice yields based on the current acoustic model;

the error label determining module 32 is configured to compare the mark text with the identification text, and to determine the error label information of the mark text relative to the identification text when the comparison result is a mismatch;

the mark text update module 33 is configured to update the mark text of the sample voice according to the text update decision condition corresponding to the error label information;

the acoustic model optimization module 34 is configured to retrain and optimize the current acoustic model based on a set amount of sample voices and their current corresponding mark texts.

In the present embodiment, the device first obtains, through the text obtaining module 31, the mark text of the sample voice and the identification text that the sample voice yields based on the current acoustic model. The error label determining module 32 then compares the mark text with the identification text and, when the comparison result is a mismatch, determines the error label information of the mark text relative to the identification text. Next, the mark text update module 33 updates the mark text of the sample voice according to the text update decision condition corresponding to the error label information. Finally, the acoustic model optimization module 34 retrains and optimizes the current acoustic model based on a set amount of sample voices and their current corresponding mark texts.

The device for optimizing a voice recognition acoustic model provided by embodiment three of the present invention can effectively improve the mark quality of the mark text corresponding to the sample voice, which improves the training data required by the acoustic model, achieves the purpose of optimizing the acoustic model, and improves the accuracy of speech recognition.

Further, the error label determining module 32 is specifically configured to:

compare the mark text with the identification text to obtain the edit distance between the mark text and the identification text, and determine that the comparison result is a mismatch when the edit distance is non-zero; when the comparison result is a mismatch, determine from the edit distance the total number of error labels of the mark text relative to the identification text, the position of each error label and the error type of each error label; and record the total number of error labels together with the position and error type of each error label as the error label information.
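The comparison described above can be sketched with a standard Levenshtein edit distance plus a traceback that recovers the position and type of each error. The function name `diff_labels` and the naming convention for the types are assumptions: here "insertion" means the mark text contains a text that the identification text lacks, "deletion" means the mark text lacks a text that the identification text contains, and positions are 0-based indices into the mark text. The embodiment may define these differently.

```python
from typing import List, Tuple


def diff_labels(mark: str, recog: str) -> Tuple[int, List[Tuple[int, str]]]:
    """Return (edit distance, [(position in mark text, error type), ...])."""
    n, m = len(mark), len(recog)
    # dist[i][j]: edit distance between mark[:i] and recog[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if mark[i - 1] == recog[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match / replacement
                             dist[i - 1][j] + 1,         # extra text in mark
                             dist[i][j - 1] + 1)         # missing text in mark
    # Trace back through the table to label each error.
    errors: List[Tuple[int, str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        cost = 1 if i > 0 and j > 0 and mark[i - 1] != recog[j - 1] else 0
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + cost:
            if cost:
                errors.append((i - 1, "replacement"))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            errors.append((i - 1, "insertion"))
            i -= 1
        else:
            errors.append((i, "deletion"))
            j -= 1
    errors.reverse()
    return dist[n][m], errors


distance, error_info = diff_labels("abcd", "abd")  # (1, [(2, "insertion")])
```

The length of the returned list is the total number of error labels, and a distance of zero means the two texts match.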

Further, the mark text update module 33 includes:

a decision condition determination unit, configured to look up, based on the error label information, the text update decision condition corresponding to the sample voice in a preset multi-level information relation table, where the text update decision condition is a comparison between the pronunciation probability information of the sample voice under the mark text and the pronunciation probability information of the sample voice under the identification text;

a probability information determination unit, configured to determine the first pronunciation probability information after the sample voice is aligned with the mark text, and the second pronunciation probability information after the sample voice is aligned with the identification text;

a new text determination unit, configured to determine the identification text as the new mark text of the sample voice when it is determined, based on the first pronunciation probability information and the second pronunciation probability information, that the text update decision condition holds; and otherwise to continue using the mark text as the new mark text of the sample voice.

On the basis of the above optimization, the decision condition determination unit is specifically configured to:

obtain the total number of error labels and the error types of the error labels from the error label information;

in the multi-level information relation table, using the total number of error labels as the index, look up the set error type that matches the error types of the error labels;

determine the update decision condition corresponding to the set error type as the text update decision condition of the sample voice.

Initialize a multi-level information relation table containing a primary information column, a secondary information column and a tertiary information column;

store the set error label totals in the primary information column, the set error label totals including 1 character error, 2 character errors and multi-character error;

store in the secondary information column the set error types corresponding respectively to the 1 character error and the 2 character errors, and set the secondary information cell corresponding to the multi-character error to empty;

store in the tertiary information column the update decision condition corresponding to each set error type, and store the set standard update decision condition in the tertiary information cell corresponding to the multi-character error.
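The three-column table described above can be sketched as nested dictionaries: the outer key plays the role of the primary information column, the inner key the secondary column, and the value the tertiary column. The names `RELATION_TABLE` and `lookup_condition`, the type labels, and the identifiers "1_1" to "1_3" for the 1 character error conditions are assumptions for illustration; only the identifiers 2_1 to 2_6 and the standard update decision condition follow the description. Conditions are represented by their names only, since the formulas are not modeled here.

```python
from typing import Dict, Optional, Sequence, Tuple

# Outer key: set error label total; inner key: set error types
# (None models the empty secondary cell); value: update decision condition.
RELATION_TABLE: Dict[str, Dict[Optional[Tuple[str, ...]], str]] = {
    "1 character error": {
        ("replacement",): "1_1",  # assumed identifier
        ("insertion",): "1_2",    # assumed identifier
        ("deletion",): "1_3",     # assumed identifier
    },
    "2 character errors": {
        ("replacement", "replacement"): "2_1",
        ("replacement", "insertion"): "2_2",
        ("replacement", "deletion"): "2_3",
        ("insertion", "insertion"): "2_4",
        ("insertion", "deletion"): "2_5",
        ("deletion", "deletion"): "2_6",
    },
    "multi-character error": {None: "standard"},
}

_TYPE_ORDER = {"replacement": 0, "insertion": 1, "deletion": 2}


def lookup_condition(error_types: Sequence[str]) -> str:
    """Look up the update decision condition for the given error types."""
    total = len(error_types)
    if total == 0:
        raise ValueError("texts match; no update decision is needed")
    if total == 1:
        level1 = "1 character error"
    elif total == 2:
        level1 = "2 character errors"
    else:
        level1 = "multi-character error"
    level2 = RELATION_TABLE[level1]
    if level1 == "multi-character error":
        return level2[None]
    # Normalize the order so ("deletion", "replacement") finds condition 2_3.
    key = tuple(sorted(error_types, key=_TYPE_ORDER.__getitem__))
    return level2[key]


condition = lookup_condition(["deletion", "replacement"])  # "2_3"
```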

Embodiment four

Fig. 4 is a hardware structural diagram of a computer equipment provided by embodiment four of the present invention. As shown in Fig. 4, the computer equipment provided by embodiment four of the present invention includes a processor 41 and a storage device 42. The computer equipment may have one or more processors; Fig. 4 takes one processor 41 as an example. The processor 41 and the storage device 42 in the computer equipment can be connected by a bus or in other ways; Fig. 4 takes connection by a bus as an example.

The storage device 42 in the computer equipment serves as a computer readable storage medium and can be used to store one or more programs. The programs can be software programs, computer executable programs and modules, such as the program instructions/modules corresponding to the method for optimizing a voice recognition acoustic model provided by embodiment one or embodiment two of the present invention (for example, the modules in the device for optimizing a voice recognition acoustic model shown in Fig. 3, including the text obtaining module 31, the error label determining module 32, the mark text update module 33 and the acoustic model optimization module 34). By running the software programs, instructions and modules stored in the storage device 42, the processor 41 executes the various functional applications and data processing of the computer equipment, that is, implements the method for optimizing a voice recognition acoustic model in the above method embodiments.

The storage device 42 may include a program storage area and a data storage area, where the program storage area can store an operating system and the application programs required by at least one function, and the data storage area can store data created through use of the equipment. In addition, the storage device 42 may include high-speed random access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage device. In some examples, the storage device 42 may further include memory located remotely from the processor 41, and such remote memory can be connected to the equipment through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks and combinations thereof.

Moreover, when the one or more programs included in the above computer equipment are executed by the one or more processors 41, the programs perform the following operations:

obtaining the mark text of a sample voice, and obtaining the identification text that the sample voice yields based on the current acoustic model; comparing the mark text with the identification text, and determining the error label information of the mark text relative to the identification text when the comparison result is a mismatch; updating the mark text of the sample voice according to the text update decision condition corresponding to the error label information; and retraining and optimizing the current acoustic model based on a set amount of sample voices and their current corresponding mark texts.

In addition, an embodiment of the present invention also provides a computer readable storage medium on which a computer program is stored. When executed by a processor, the program implements the method for optimizing a voice recognition acoustic model provided by embodiment one or embodiment two of the present invention. The method includes: obtaining the mark text of a sample voice, and obtaining the identification text that the sample voice yields based on the current acoustic model; comparing the mark text with the identification text, and determining the error label information of the mark text relative to the identification text when the comparison result is a mismatch; updating the mark text of the sample voice according to the text update decision condition corresponding to the error label information; and retraining and optimizing the current acoustic model based on a set amount of sample voices and their current corresponding mark texts.

From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software together with the necessary general-purpose hardware, and can of course also be implemented by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer readable storage medium, such as a computer floppy disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), flash memory (FLASH), hard disk or optical disc, and includes a number of instructions that cause a computer equipment (which can be a personal computer, a server, a network device or the like) to execute the methods described in the embodiments of the present invention.

Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made by those skilled in the art without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, it is not limited to them; without departing from the inventive concept, it may include further equivalent embodiments, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A method for optimizing a voice recognition acoustic model, characterized by comprising:

obtaining the mark text of a sample voice, and obtaining the identification text that the sample voice yields based on the current acoustic model;

comparing the mark text with the identification text, and determining the error label information of the mark text relative to the identification text when the comparison result is a mismatch;

updating the mark text of the sample voice according to the text update decision condition corresponding to the error label information;

retraining and optimizing the current acoustic model based on a set amount of sample voices and their current corresponding mark texts;

wherein the comparing the mark text with the identification text, and determining the error label information of the mark text relative to the identification text when the comparison result is a mismatch, comprises: comparing the mark text with the identification text to obtain the edit distance between the mark text and the identification text, and determining that the comparison result is a mismatch when the edit distance is non-zero;

when the comparison result is a mismatch, determining, according to the edit distance, the total number of error labels of the mark text relative to the identification text, the position of each error label and the error type of each error label;

recording the total number of error labels together with the position and error type of each error label as the error label information.

2. The method according to claim 1, characterized in that the updating the mark text of the sample voice according to the text update decision condition corresponding to the error label information comprises:

looking up, based on the error label information, the text update decision condition corresponding to the sample voice in a preset multi-level information relation table, wherein the text update decision condition is a comparison between the pronunciation probability information of the sample voice under the mark text and the pronunciation probability information of the sample voice under the identification text;

determining the first pronunciation probability information after the sample voice is aligned with the mark text, and the second pronunciation probability information after the sample voice is aligned with the identification text;

when it is determined, based on the first pronunciation probability information and the second pronunciation probability information, that the text update decision condition holds, determining the identification text as the new mark text of the sample voice; otherwise, continuing to use the mark text as the new mark text of the sample voice.

3. The method according to claim 2, characterized in that the looking up, based on the error label information, the text update decision condition corresponding to the sample voice in a preset multi-level information relation table comprises:

in the multi-level information relation table, using the total number of error labels as the index, looking up the set error type that matches the error types of the error labels;

determining the update decision condition corresponding to the set error type as the text update decision condition of the sample voice.

4. The method according to claim 2 or 3, characterized in that the multi-level information relation table is built by the following steps:

storing the set error label totals in a primary information column, the set error label totals including 1 character error, 2 character errors and multi-character error;

storing in a secondary information column the set error types corresponding respectively to the 1 character error and the 2 character errors, and setting the secondary information cell corresponding to the multi-character error to empty;

storing in a tertiary information column the update decision condition corresponding to each set error type, and storing the set standard update decision condition in the tertiary information cell corresponding to the multi-character error.

5. The method according to claim 2, characterized in that the first pronunciation probability information is determined based on the speech signal frames formed by dividing the sample voice frame by frame and on the first pronunciation unit sequence formed by modeling the mark text;

the second pronunciation probability information is determined based on the speech signal frames and on the second pronunciation unit sequence formed by modeling the identification text.

6. A device for optimizing a voice recognition acoustic model, characterized by comprising:

a text obtaining module, configured to obtain the mark text of a sample voice, and to obtain the identification text that the sample voice yields based on the current acoustic model;

an error label determining module, configured to compare the mark text with the identification text, and to determine the error label information of the mark text relative to the identification text when the comparison result is a mismatch;

a mark text update module, configured to update the mark text of the sample voice according to the text update decision condition corresponding to the error label information;

an acoustic model optimization module, configured to retrain and optimize the current acoustic model based on a set amount of sample voices and their current corresponding mark texts;

the error label determining module being specifically configured to:

compare the mark text with the identification text to obtain the edit distance between the mark text and the identification text, and determine that the comparison result is a mismatch when the edit distance is non-zero;

7. A computer equipment, characterized by comprising:

one or more processors;

a storage device, configured to store one or more programs;

the one or more programs being executed by the one or more processors, so that the one or more processors implement the method for optimizing a voice recognition acoustic model according to any one of claims 1 to 5.

8. A computer readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method for optimizing a voice recognition acoustic model according to any one of claims 1 to 5.