WO2010024052A1

WO2010024052A1 - Device for verifying speech recognition hypothesis, speech recognition device, and method and program used for same

Info

Publication number: WO2010024052A1
Application number: PCT/JP2009/062611
Authority: WO
Inventors: 仁山本; 健花沢; 清一三木
Original assignee: 日本電気株式会社
Priority date: 2008-08-27
Filing date: 2009-07-10
Publication date: 2010-03-04
Also published as: JP5447382B2; JPWO2010024052A1

Abstract

A device for verifying a speech recognition hypothesis is provided with a verification unit conversion section (1), which sets, for an inputted speech recognition hypothesis, one or more verification units expressing the time segments that serve as processing units for verification, and a unit determination section (2), which verifies the correctness of the recognition hypothesis in the time segment of each verification unit in accordance with the verification units set by the verification unit conversion section (1). The verification unit conversion section (1) sets the one or more verification units so that they include a verification unit to which a time segment different from the time segment of a word included in the speech recognition hypothesis is assigned.

Description

Speech recognition hypothesis verification device, speech recognition device, method and program used therefor

The present invention relates to a speech recognition hypothesis verification device, a speech recognition device, a speech recognition hypothesis verification method, and a speech recognition method used for verifying a speech recognition hypothesis obtained by speech recognition technology that converts speech into electronic data such as text data. The present invention also relates to a speech recognition hypothesis verification program and a speech recognition program.

Along with advances in speech recognition technology, speech recognition systems are increasingly being built for practical applications, such as support for creating records of telephone calls and multi-person conferences, and voice UI (User Interface) applications for devices such as mobile phones.

However, it is difficult to obtain sufficient speech recognition accuracy because of the various acoustic and linguistic phenomena found in spontaneous utterances (spoken language) on the telephone and in conferences, and because of various outdoor noises. When an error occurs in speech recognition, error correction costs are incurred or the system malfunctions. To suppress such adverse effects of speech recognition errors, detecting those errors is important.

One conceivable method for detecting speech recognition errors is to have a speech recognition hypothesis verification device determine whether the speech recognition hypothesis is correct. To verify a speech recognition hypothesis, methods using a reliability measure for each word in the hypothesis have been proposed.

For example, Patent Document 1 describes a verification device that obtains a generalized word posterior probability of each word as a reliability measure used for verification of a speech recognition result, and determines the correctness of each utterance or word based on the value.

Further, for example, Patent Document 2 describes a system including a determination unit that determines, with reference to a word dictionary prepared in advance, whether a character string and a word string generated by a speech recognition unit are correct, and a rewriting means that, when a misrecognition is determined, generates a new word string by speech recognition using a different method.

Patent Document 1: JP 2005-164837 A
Patent Document 2: JP 2001-134288 A

However, the verification device described in Patent Document 1 and the method described in Patent Document 2 have a problem that the detection accuracy of recognition errors based on the verification of the speech recognition hypothesis is not sufficient. In the verification device described in Patent Document 1, since the verification of the speech recognition hypothesis is performed in units of words in the hypothesis, the recognition error section can be obtained only by the combination of the units of words in the hypothesis. That is, since only the few word boundaries included in the speech recognition hypothesis are used to detect which section in speech is erroneously recognized, the detection accuracy of the speech recognition error section is not sufficient.

Also, the system described in Patent Document 2 determines whether the speech recognition hypothesis is correct using a word dictionary and replaces a word string determined to be incorrect with a correct word string. As is clear from the fact that a word dictionary is used for the correct / incorrect determination, the verification is performed in units of words, and the detection accuracy of the speech recognition error section is not sufficient, as in Patent Document 1.

The present invention has been made in view of the above problems. An object of the present invention is to provide a speech recognition hypothesis verification device that, in verifying a speech recognition hypothesis, improves the detection accuracy of speech recognition error sections within speech, a speech recognition device using the same, a speech recognition hypothesis verification method, a speech recognition method, a speech recognition hypothesis verification program, and a speech recognition program.

The speech recognition hypothesis verification device according to the present invention includes a verification unit conversion unit that sets, for an input speech recognition hypothesis, one or more verification units each representing a time interval that serves as a verification processing unit, and a unit determination unit that verifies the correctness of the recognition hypothesis in the time interval of each verification unit according to the verification units set by the verification unit conversion unit. The verification unit conversion unit sets the one or more verification units so as to include a verification unit in which a time interval different from the time interval of a word included in the speech recognition hypothesis is set.

The speech recognition apparatus according to the present invention includes a first speech recognition unit that performs speech recognition on input speech and generates a speech recognition hypothesis, a speech recognition hypothesis verification unit that verifies the speech recognition hypothesis generated by the first speech recognition unit, and a second speech recognition unit that performs speech recognition again with reference to the verification result of the speech recognition hypothesis produced by the speech recognition hypothesis verification unit. The speech recognition hypothesis verification unit includes a verification unit conversion unit that sets, for the speech recognition hypothesis, one or more verification units each representing a time interval that serves as a verification processing unit, and a unit determination unit that verifies the correctness of the recognition hypothesis in the time interval of each verification unit according to the verification units set by the verification unit conversion unit. The verification unit conversion unit sets the one or more verification units so as to include a verification unit in which a time interval different from the time interval of a word included in the speech recognition hypothesis is set.

The speech recognition hypothesis verification method according to the present invention is a method for verifying a speech recognition hypothesis, in which one or more verification units, each representing a time interval serving as a verification processing unit, are set for the input speech recognition hypothesis so as to include a verification unit in which a time interval different from the time interval of a word included in the speech recognition hypothesis is set, and the correctness of the recognition hypothesis in the time interval of each verification unit is verified according to the set verification units.

The speech recognition method according to the present invention generates a speech recognition hypothesis by performing speech recognition on input speech; sets, for the generated speech recognition hypothesis, one or more verification units each representing a time interval serving as a verification processing unit, so as to include at least a verification unit in which a time interval different from the time interval of a word included in the speech recognition hypothesis is set; verifies the correctness of the recognition hypothesis in the time interval of each verification unit according to the set verification units; and, referring to the verification result of the speech recognition hypothesis, performs speech recognition again using an acoustic model or a language model selected based on the recognition hypothesis of a time interval determined to be correctly recognized.

In addition, the speech recognition hypothesis verification program according to the present invention causes a computer to execute a procedure for setting, for an input speech recognition hypothesis, one or more verification units each representing a time interval serving as a verification processing unit, so as to include a verification unit in which a time interval different from the time interval of a word included in the speech recognition hypothesis is set, and a procedure for verifying the correctness of the recognition hypothesis in the time interval of each verification unit according to the set verification units.

The speech recognition program according to the present invention causes a computer to execute a procedure for performing speech recognition on input speech to generate a speech recognition hypothesis; a procedure for setting, for the generated speech recognition hypothesis, one or more verification units each representing a time interval serving as a verification processing unit, so as to include at least a verification unit in which a time interval different from the time interval of a word included in the speech recognition hypothesis is set; a procedure for verifying the correctness of the recognition hypothesis in the time interval of each verification unit according to the set verification units; and a procedure for referring to the verification result of the speech recognition hypothesis and performing speech recognition again using an acoustic model or a language model selected based on the recognition hypothesis of a time interval determined to be correctly recognized.

According to the present invention, it is possible to improve the accuracy of detecting a speech recognition error section during speech.

A block diagram showing a configuration example of the speech recognition hypothesis verification device of the present invention.
A block diagram showing another configuration example of the speech recognition hypothesis verification device of the present invention.
A block diagram showing a configuration example of the speech recognition hypothesis verification device according to the first embodiment of the present invention.
An explanatory diagram showing examples of verification units.
An explanatory diagram showing an example of the correspondence between characters, syllables, phonemes, HMM states, and speech feature values.
An explanatory diagram showing an example of how features are expressed in a CRF, which is an example of a verification model.
An explanatory diagram showing an example of a change method defined in a section determination rule.
A flowchart showing an example of the operation of the speech recognition hypothesis verification device shown in FIG.
A block diagram showing a configuration example of the speech recognition device according to the second embodiment of the present invention.
An explanatory diagram showing an example of speech, the speech recognition hypothesis produced by the first speech recognition unit, and the verification result produced by the speech recognition hypothesis verification unit.

Hereinafter, embodiments for carrying out the present invention will be described in detail with reference to the drawings.

FIG. 1 is a block diagram showing a configuration example of the speech recognition hypothesis verification device of the present invention.

The speech recognition hypothesis verification device shown in FIG. 1 includes a verification unit conversion unit 1 and a unit determination unit 2.

The verification unit conversion unit 1 sets one or more verification units representing a time interval as a verification processing unit for the input speech recognition hypothesis. The verification unit conversion unit 1 sets one or more verification units including a verification unit in which a time interval different from the time interval of words included in the input speech recognition hypothesis is set. For example, the verification unit conversion unit 1 may set one or more verification units including a verification unit in which a time interval smaller than the time interval of words included in the speech recognition hypothesis is set. For example, one or more verification units may be set based on the voice analysis frame unit.

The unit determination unit 2 verifies the correctness of the recognition hypothesis in the time interval of each verification unit according to the verification units set by the verification unit conversion unit 1. For example, the unit determination unit 2 may verify the correctness of the recognition hypothesis in the time interval of each verification unit based on a verification model, which is a probability model having a plurality of types of features including features related to speech recognition errors in the time interval of a verification unit, and on the features extracted for each verification unit from the speech recognition hypothesis to be processed. For example, the unit determination unit 2 may calculate, for each verification unit, a verification score indicating how plausible the recognition hypothesis is for the time interval of that verification unit, based on the verification model and the features extracted for that verification unit, and verify the correctness of the recognition hypothesis in the time interval of each verification unit using the score. A CRF model may be used as the verification model.

Thus, the verification unit conversion unit 1 sets one or more verification units including a verification unit in which a time interval different from the time interval of the word included in the speech recognition hypothesis is set, and the unit determination unit 2 By verifying the correctness of the recognition hypothesis in the time interval of each verification unit in accordance with the set verification unit, the detection accuracy of the speech recognition error interval during speech can be increased. This is because verification can be performed based on features that are not word-based features by making the verification unit not dependent on the time interval of words in the recognition hypothesis.

FIG. 2 is a block diagram showing another configuration example of the speech recognition hypothesis verification device of the present invention.

As shown in FIG. 2, the speech recognition hypothesis verification apparatus shown in FIG. 1 may further include a section determination unit 3. The section determination unit 3 determines the error section of the speech recognition hypothesis to be processed based on the verification result for each verification unit by the unit determination unit 2. At that time, the section determination unit 3 refers to the verification results of a plurality of verification units (including a verification score, if any), changes the verification result by the unit determination unit 2, and determines an error section.

(First embodiment)
Hereinafter, a more specific embodiment of the above-described speech recognition hypothesis verification device will be described.

FIG. 3 is a block diagram showing a configuration example of the speech recognition hypothesis verification device according to the first exemplary embodiment of the present invention.

The speech recognition hypothesis verification device 101 shown in FIG. 3 includes a speech recognition hypothesis input unit 12, a verification unit conversion unit 13, a unit determination unit 14, a section determination unit 15, a verification model storage unit 16, and a section determination rule storage unit 17.

The speech recognition hypothesis verification apparatus 101 is realized as a whole by, for example, an information processing apparatus such as a personal computer (PC) or a server apparatus that processes input data with a computer. In this embodiment, a speech recognition hypothesis as a speech recognition result output from a speech recognition device or the like is input, and a verification result of the input speech recognition hypothesis is output.

Further, the speech recognition hypothesis input unit 12 is realized by various data input devices for inputting data. Specifically, it is realized by a data input device and a control unit that receives the input. The verification unit conversion unit 13, the unit determination unit 14, and the section determination unit 15 are realized by a CPU or the like that operates according to a program. The verification model storage unit 16 and the section determination rule storage unit 17 are realized by a storage unit that stores data.

Each component of the speech recognition hypothesis verification device 101 is realized by an arbitrary combination of hardware and software, centering on the CPU of an arbitrary computer, a memory, a program loaded into the memory, and a storage unit such as a hard disk that stores the program. In addition, various interfaces such as a network connection interface may be included.

The speech recognition hypothesis input unit 12 receives a speech recognition hypothesis from an external speech recognition device (not shown) and provides (outputs) it to the verification unit conversion unit 13. The speech recognition hypothesis is expressed, for example, in the form of a word graph or an N-best word sequence including one or more word sequences to which a recognition score (likelihood) or time information associated with the recognition target speech is assigned.

The verification unit conversion unit 13 converts the speech recognition hypothesis input via the speech recognition hypothesis input unit 12 into a data set of verification units. Here, a verification unit is the unit of verification performed by the unit determination unit 14 at the subsequent stage. The verification unit conversion unit 13 does not need to actually generate a data set of verification units; it sets, for the speech recognition hypothesis, the range of each verification unit (a time interval in the speech data to be recognized). Hereinafter, the expression "determining the verification unit" means defining one or more verification units for the speech recognition hypothesis.

The verification unit conversion unit 13 determines the verification unit without depending on the time information of the speech recognition hypothesis (the time interval of each word indicated by the speech recognition hypothesis). Specifically, the verification unit may be determined so that at least one of the time intervals as the verification unit includes a time interval different from the time interval of the word indicated by the speech recognition hypothesis. For example, an analysis frame unit of recognition target speech or a segment unit obtained by collecting a plurality of analysis frames may be used as one verification unit. In such a case, the range of each verification unit is obtained by dividing the speech data to be recognized into one analysis frame or one segment time interval. Further, a unit such as a character / syllable / phoneme / HMM state obtained by dividing a word of a speech recognition hypothesis into fine units and a unit based on an analysis frame (an analysis frame unit or a segment unit) can be used together. Note that the time interval used as one verification unit in the speech data does not necessarily have to be constant, such as when used in combination with units such as character, syllable, phoneme, and HMM states.

The verification unit conversion unit 13 may generate, as information indicating the verification units in the speech data to be recognized, information that associates an identifier identifying each verification unit with information indicating which section of the time interval of the recognition hypothesis that verification unit corresponds to.

FIGS. 4a to 4d are explanatory diagrams showing examples of setting of verification units.

For example, as shown in FIG. 4a, assume that the speech recognition hypothesis corresponding to analysis frames 1 to 100 of the speech to be recognized indicates the word “end of month”.

Here, when the analysis frame unit is used as the verification unit, the verification unit may be determined in correspondence with each analysis frame of the recognition target speech as shown in FIG. 4b. In the case of this example, the verification unit conversion unit 13 may generate information indicating 100 verification units each covering the time interval of the analysis frames 1 to 100.

Also, for example, when a segment unit combining 10 analysis frames is used as the verification unit, the verification units may be determined, as shown in FIG. 4c, in correspondence with the segments of the recognition target speech, each of which groups 10 analysis frames into one unit. In this example, the verification unit conversion unit 13 may generate information indicating 10 verification units, each covering the time interval of one of segments 1 to 10, where segment 1 combines analysis frames 1 to 10, segment 2 combines analysis frames 11 to 20, and so on.

Further, for example, when using units related to a word, such as the head, middle, and tail of the word, the verification units may be determined, as shown in FIG. 4d, in correspondence with the head part, the middle part, and the tail part of the word delimited by the analysis frame boundaries in the speech recognition hypothesis. In this example, the verification unit conversion unit 13 may generate information indicating three verification units whose time intervals are, respectively, the head part, the middle part, and the tail part of the word delimited by the analysis frame boundaries.
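The three ways of setting verification units described for FIGS. 4b to 4d can be sketched as follows. This is only an illustrative sketch: the class `VerificationUnit`, the function names, and the identifier strings are hypothetical, and the text does not state where the head, middle, and tail of a word are cut, so the equal-thirds split used here is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerificationUnit:
    """Hypothetical record: an identifier plus an inclusive analysis-frame range."""
    unit_id: str
    start_frame: int
    end_frame: int


def frame_units(first: int, last: int) -> list[VerificationUnit]:
    """One verification unit per analysis frame (as in FIG. 4b)."""
    return [VerificationUnit(f"frame-{i}", i, i) for i in range(first, last + 1)]


def segment_units(first: int, last: int, frames_per_segment: int = 10) -> list[VerificationUnit]:
    """One verification unit per fixed-length segment of frames (as in FIG. 4c)."""
    units = []
    starts = range(first, last + 1, frames_per_segment)
    for n, start in enumerate(starts, start=1):
        end = min(start + frames_per_segment - 1, last)  # last segment may be shorter
        units.append(VerificationUnit(f"segment-{n}", start, end))
    return units


def word_part_units(first: int, last: int) -> list[VerificationUnit]:
    """Head / middle / tail of one word (as in FIG. 4d).

    Assumption: the word is cut into roughly equal thirds.
    """
    length = last - first + 1
    if length < 3:
        raise ValueError("word must span at least three frames")
    cut1 = first + length // 3
    cut2 = first + (2 * length) // 3
    return [
        VerificationUnit("head", first, cut1 - 1),
        VerificationUnit("middle", cut1, cut2 - 1),
        VerificationUnit("tail", cut2, last),
    ]
```

For the word spanning analysis frames 1 to 100, `frame_units(1, 100)` yields 100 units, `segment_units(1, 100)` yields 10 units, and `word_part_units(1, 100)` yields 3 units.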

Also, when determining the verification units, characters, syllables, phonemes, and HMM states may be used in combination.

FIG. 5 shows an example of correspondence between characters, syllables, phonemes, HMM states, and speech feature values.

As shown in FIG. 5, the verification units may be determined in correspondence with the characters, syllables, phonemes, and HMM states that constitute a word delimited by the analysis frame boundaries in the speech recognition hypothesis. For example, a range corresponding to “the head of the character “now”” is specified based on the time intervals of the syllables, phonemes, and HMM states, and is determined as one verification unit. In FIG. 5, the speech data is shown as a time series of speech feature values. In this case, one analysis frame corresponds to a feature value (vector) calculated at every fixed interval (for example, 25 milliseconds) of the speech signal.

The unit determination unit 14 receives the information indicating the verification units and the speech recognition hypothesis from the verification unit conversion unit 13, extracts predetermined verification features for each verification unit, and determines whether the recognition hypothesis is correct for each verification unit using the extracted verification feature values and the verification model stored in the verification model storage unit 16. The unit determination unit 14 calculates, for example, a verification score indicating how plausible the recognition hypothesis is for the time interval of the verification unit, and determines whether the recognition hypothesis is correct for each verification unit based on the calculated verification score.

The unit determination unit 14 may include a target unit selection unit 141, a feature extraction unit 142, a score calculation unit 143, and a target unit determination unit 144, for example, as illustrated in FIG.

The target unit selection unit 141 receives the information indicating the verification unit and the speech recognition hypothesis from the verification unit conversion unit 13, and provides the speech recognition hypothesis to the feature extraction unit 142. Further, each verification unit included in the speech data to be recognized is sequentially specified as a verification unit to be processed and provided to the feature extraction unit 142 and the target unit determination unit 144.

The feature extraction unit 142 receives the speech recognition hypothesis and the information indicating the verification unit to be processed from the target unit selection unit 141, extracts predetermined verification features related to the verification unit to be processed, and provides them to the score calculation unit 143.

A verification feature is a feature used when the speech recognition hypothesis is verified, and is extracted for each verification unit. Features whose values are related to the correctness or incorrectness of the speech recognition hypothesis are used as verification features. Using a variety of verification features can increase verification accuracy. For example, structural information of the speech recognition hypothesis, linguistic information of the speech recognition hypothesis, and information related to the recognition calculation may be used. The features related to the verification unit to be processed may be extracted not only from the data in the time interval of that verification unit (hereinafter simply referred to as verification unit data) but also from data in the time intervals before and after it and from data in the time interval of the word that contains it.

The structural information of the speech recognition hypothesis includes, for example, the number of arcs competing in the time interval of the verification unit to be processed indicated by the word graph, the number of nodes included in the same time interval, and the like. When the segment unit of the analysis frame is used as the verification unit, if the number of arcs existing in the segment section is large, there is a possibility that the recognition error probability in the section is high. In addition, if there are a large number of nodes included in the section, the section may be a word boundary in the original utterance, and the possibility of recognition error may be different before and after the section.
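As a rough illustration of these structural features, the sketch below counts arcs and nodes of a word graph that fall within one verification unit. The `Arc` type, the function names, and the placeholder words are hypothetical; representing a node by the first frame after a word boundary is an assumption made for this example, not a detail given in the text.

```python
from typing import NamedTuple


class Arc(NamedTuple):
    """Hypothetical word-graph arc covering an inclusive range of analysis frames."""
    start_frame: int
    end_frame: int
    word: str


def count_arcs(arcs: list[Arc], unit_start: int, unit_end: int) -> int:
    """Number of arcs whose time span overlaps the verification unit."""
    return sum(
        1 for a in arcs
        if a.start_frame <= unit_end and a.end_frame >= unit_start
    )


def count_nodes(arcs: list[Arc], unit_start: int, unit_end: int) -> int:
    """Number of distinct nodes (word boundaries) inside the verification unit.

    Assumption: a node is placed at the first frame of an arc, or at the
    frame immediately after the last frame of an arc.
    """
    boundaries = {a.start_frame for a in arcs} | {a.end_frame + 1 for a in arcs}
    return sum(1 for b in boundaries if unit_start <= b <= unit_end)


# Placeholder word graph over frames 1..100 with three competing paths.
graph = [
    Arc(1, 100, "A"),
    Arc(1, 40, "B"), Arc(41, 100, "C"),
    Arc(1, 55, "D"), Arc(56, 100, "E"),
]

arcs_in_segment_6 = count_arcs(graph, 51, 60)    # arcs A, C, D, E overlap frames 51..60
nodes_in_segment_6 = count_nodes(graph, 51, 60)  # only the boundary at frame 56
```

The resulting counts can then serve as verification feature values for the segment covering frames 51 to 60.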

Linguistic information includes, for example, the surface form and part of speech of the words in the hypothesis. Using the word surface form as a feature makes it possible to handle frequently occurring recognition error expressions (recognition error patterns of the speech recognition apparatus). In particular, extracting these features in units smaller than a word makes it possible to detect cases where, for example, a long word such as “end of month” appears as a recognition hypothesis and its latter part in particular is likely to be a recognition error.

Features related to the recognition calculation include, for example, values representing the plausibility of a hypothesis, such as the acoustic likelihood and the language likelihood. When the section of a verification unit is a recognition error, such a value may be relatively low, or its difference from the value of a competing hypothesis may be small. Using a value such as the acoustic likelihood obtained in units of frames within the verification unit makes it possible to refer to finer detail than when the value is averaged in units of words. In addition, a reliability score in units of words obtained by the verification device described in Patent Document 1 can be used as a verification feature.
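The point about frame-level values carrying more detail than a word-level average can be illustrated with a small sketch. The function name, the choice of mean and minimum as summary values, and the log-likelihood numbers are all made up for illustration.

```python
from statistics import mean


def likelihood_features(frame_scores: list[float], start_frame: int,
                        end_frame: int, first_frame: int = 1) -> dict[str, float]:
    """Summarize per-frame log-likelihoods inside one verification unit.

    frame_scores[0] holds the score of analysis frame `first_frame`;
    start_frame and end_frame are inclusive.
    """
    window = frame_scores[start_frame - first_frame:end_frame - first_frame + 1]
    return {"mean": mean(window), "min": min(window)}


# Made-up acoustic log-likelihoods for a word spanning frames 1..6.
scores = [-2.0, -2.0, -2.0, -2.0, -8.0, -8.0]

word_level = likelihood_features(scores, 1, 6)   # one value for the whole word
first_half = likelihood_features(scores, 1, 3)   # verification unit: frames 1..3
second_half = likelihood_features(scores, 4, 6)  # verification unit: frames 4..6
```

Here the word-level mean hides the fact that the low scores are concentrated in the second half of the word, whereas the two smaller verification units separate the well-matched part from the poorly matched part.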

The score calculation unit 143 receives the information indicating the verification unit to be processed and the verification features related to that verification unit from the feature extraction unit 142, calculates the verification score using the verification model stored in the verification model storage unit 16, and provides the score to the target unit determination unit 144.

The verification model storage unit 16 stores information on a verification model, which is a model representing the strength of the relationship between the verification features found in the verification unit data and the correctness or incorrectness of the recognition hypothesis.

The score calculation unit 143 may calculate the verification score using, for example, identification processing by CRF (Conditional Random Fields) which is a type of identification model. Here, CRF is described as the following equation (1).

P(Y|X) = exp(Λ · Φ(X, Y)) / Z    (1)

In Expression (1), “X” indicates an input to be subjected to identification processing. “Y” is an identification result associated with the input. “Φ(X, Y)” is the set of features used for identification, and “Λ” is the set of CRF model parameters (weight values), one for each feature. “Z” is a normalization term. Note that “exp()” denotes the exponential function with base e.
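For reference, in the standard CRF formulation the inner product and the normalization term of Expression (1) expand as shown below. The feature index k and the symbol Y' for a candidate output are notation introduced here for illustration and do not appear in the original text.

```latex
\Lambda \cdot \Phi(X, Y) = \sum_{k} \lambda_k \, \phi_k(X, Y),
\qquad
Z = \sum_{Y'} \exp\bigl(\Lambda \cdot \Phi(X, Y')\bigr)
```

With Z defined in this way, P(Y|X) sums to 1 over all candidate outputs for a given input.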

In the embodiment using identification processing by CRF, the input “X” is the verification unit data converted from the speech recognition hypothesis to be verified. The output “Y” is the verification result associated with each item of input verification unit data. For the features “Φ(X, Y)”, the values taken by verification features such as the number of arcs, the number of nodes, and the appearance frequency are used. During identification processing, the output that maximizes the left side P(Y|X) of Expression (1) is selected for the input. The CRF model parameters may be optimized (learned) by an iterative calculation method based on the criterion of maximizing the log likelihood of Expression (1), using pairs of an input (X: verification unit data) and an output (Y: identification result) associated in advance as learning data. For details on identification processing and model parameter learning using CRF, see, for example, J. Lafferty, A. McCallum, F. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data", Proceedings of 18th International Conference of Machine Learning, 2001, p.282-289.

The verification model storage unit 16 may hold, for example, feature Φ information and model parameter Λ (weight value) information as CRF information.

The target unit determination unit 144 determines the correctness of the recognition hypothesis for each verification unit by comparing the verification score obtained for the processing target verification unit specified by the target unit selection unit 141 with a predetermined criterion. This determination result corresponds to the verification result in the verification unit for the recognition hypothesis. The target unit determination unit 144 provides the determination result (that is, the verification result of each verification unit) to the section determination unit 15. A verification score may be provided along with the verification result.

Hereinafter, the verification score calculation method and the correctness determination method using identification processing by CRF will be described more specifically.

For example, consider one of the verification units set for speech data of a certain length, and suppose that, when the recognition hypothesis is incorrect (or correct), the part of the speech recognition hypothesis corresponding to the time interval of that verification unit has verification features such as number of arcs = 4 and number of nodes = 7. In such a case, these features may be expressed as features used in the verification model, as shown in FIG. 6.

FIG. 6 is an explanatory diagram showing an example of the CRF feature Φ.

FIG. 6 shows an example of features of F (number of arcs = 4) = 1 and F (number of nodes = 7) = 1.

The score calculation unit 143 may obtain a score for each verification result (for example, correct and error) by multiplying these features by the weight values Λ associated with that verification result. The target unit determination unit 144 may then determine the verification result for the verification unit by selecting the verification result with the highest score.
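The score calculation described above can be sketched as follows. This is only an illustrative sketch: the weight values, the feature names such as `arcs=4`, and the function names are hypothetical and are not taken from this description. It also treats each verification unit independently, whereas an actual CRF may additionally model dependencies between neighboring units.

```python
import math

# Hypothetical weights (Lambda), one per (feature, verification result) pair.
# The numeric values are made up for illustration only.
WEIGHTS = {
    ("arcs=4", "error"): 1.2,
    ("arcs=4", "correct"): 0.1,
    ("nodes=7", "error"): 0.8,
    ("nodes=7", "correct"): 0.4,
}
LABELS = ("correct", "error")


def label_scores(active_features, weights=WEIGHTS):
    """Sum the weights of the active indicator features for each label.

    An indicator feature such as F(number of arcs = 4) takes the value 1
    when it is active, so multiplying by it reduces to adding its weight.
    """
    return {
        label: sum(weights.get((feat, label), 0.0) for feat in active_features)
        for label in LABELS
    }


def verify_unit(active_features, weights=WEIGHTS):
    """Return (best label, normalized score) for one verification unit."""
    scores = label_scores(active_features, weights)
    best = max(scores, key=scores.get)
    # Softmax normalization so the score can be compared with a threshold.
    z = sum(math.exp(s) for s in scores.values())
    return best, math.exp(scores[best]) / z


label, confidence = verify_unit(["arcs=4", "nodes=7"])
print(label, round(confidence, 3))
```

With these hypothetical weights, the unit having both features active is labeled as an error, and the normalized score can serve as the verification score passed on with the verification result.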

The section determination unit 15 receives the verification result for each verification unit from the target unit determination unit 144 of the unit determination unit 14 and determines the recognition error section included in the speech recognition hypothesis. In the present embodiment, the section determination unit 15 changes the verification result for each verification unit as necessary according to the section determination rule stored in the section determination rule storage unit 17, thereby determining the recognition error section included in the speech recognition hypothesis. The section determination rule is a rule (information defining a change method or the like) for changing the verification result obtained for each verification unit by the unit determination unit 14 in accordance with the intended use. For example, a method of changing the result based on the reliability of the verification result and its relationship with the verification results of other verification units (for example, the preceding and following verification units) may be defined.

FIGS. 7a to 7d are explanatory diagrams showing examples of changing methods defined in the section determination rules. FIG. 7a shows an example of the verification result by the unit determination unit 14 of the verification units 1 to 32 set for the input speech recognition hypothesis.

In the example shown in FIG. 7a, as the verification result by the unit determination unit 14, the recognition hypotheses for verification units 1 to 5, 9, 12 to 15, 17 to 19, 25 to 28, and 30 to 32 are labeled “○”, indicating that they are correct. The recognition hypotheses for verification units 6 to 8, 10 to 11, 16, 20 to 24, and 29 are labeled “x”, indicating that they are incorrect. For such a verification result, as shown in FIG. 7b, the verification result is first confirmed for each section in which the same label, with a verification score equal to or higher than a predetermined score, continues for a predetermined number of units or more. In the example shown in FIG. 7b, the verification results are confirmed for the sections of verification units 1 to 5, 6 to 8, 12 to 15, 17 to 19, 20 to 24, 25 to 28, and 30 to 32, which are enclosed in squares. As a result, the sections of verification units 6 to 8 and 20 to 24 are confirmed as error sections. Even where the verification result labels are the same, a section is not confirmed if the verification score attached to the verification result is below the predetermined score or if the section does not continue for the predetermined number of units or more.

Next, as shown in FIG. 7c, for each unconfirmed section, a predetermined number of confirmed units before and after it are referred to, and the verification result is confirmed by majority vote. In the example shown in FIG. 7c, for verification units 9 to 11, 16, and 29, which are unconfirmed sections, a majority vote based on the confirmed sections consisting of the three verification units before and the three after is attempted. As a result, as shown by the underline, the verification results of verification units 16 and 29 have been changed to results in which the recognition hypothesis is correct. For the unconfirmed section of verification units 9 to 11, the confirmed sections before and after it are verification units 6 to 8 and verification units 12 to 14; the number of units determined to be correct and the number determined to be errors are both 3, so the result cannot be decided by majority vote. Besides this, there may also be cases where the predetermined number of consecutive confirmed units cannot be obtained.

Finally, as shown in FIG. 7d, the remaining unconfirmed section is confirmed with the verification result in which the recognition hypothesis is an error. In the example shown in FIG. 7d, the verification units 9 to 11, which are unconfirmed sections, are changed to verification results in which the recognition hypothesis is erroneous as indicated by the underline.

The section determination unit 15 may detect the time section of the verification unit finally determined as the error section as a recognition error section in the recognition hypothesis as a result of changing the verification result for each verification unit according to the section determination rule. In the example shown in FIGS. 7a to 7d, the time interval corresponding to the intervals of the verification units 6 to 11 and 20 to 24 is detected as the recognition error interval.
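The three-stage change method of FIGS. 7a to 7d can be sketched as below. This is a minimal sketch under stated assumptions: the function name and parameters are hypothetical, the verification score threshold is omitted (every unit is assumed to satisfy the predetermined score), and the minimum run length and the number of reference units are both assumed to be 3, which is consistent with the example.

```python
from itertools import groupby


def determine_error_sections(labels, min_run=3, context=3):
    """labels[i] is True if verification unit i+1 was judged correct.

    Returns (final labels, list of (first_unit, last_unit) error sections),
    with units numbered from 1.
    """
    n = len(labels)
    confirmed = [None] * n  # None means "not yet confirmed"

    # Stage 1 (FIG. 7b): confirm runs of identical labels of sufficient length.
    pos = 0
    for label, run in groupby(labels):
        length = len(list(run))
        if length >= min_run:
            confirmed[pos:pos + length] = [label] * length
        pos += length

    # Stage 2 (FIG. 7c): majority vote over confirmed units on both sides.
    final = list(confirmed)
    i = 0
    while i < n:
        if confirmed[i] is not None:
            i += 1
            continue
        j = i
        while j < n and confirmed[j] is None:
            j += 1  # [i, j) is one contiguous unconfirmed section
        before = confirmed[max(0, i - context):i]
        after = confirmed[j:j + context]
        refs = before + after
        if len(before) == context and len(after) == context and None not in refs:
            n_correct = sum(refs)
            n_error = len(refs) - n_correct
            if n_correct != n_error:  # a tie leaves the section unconfirmed
                final[i:j] = [n_correct > n_error] * (j - i)
        i = j

    # Stage 3 (FIG. 7d): whatever is still unconfirmed becomes an error.
    final = [False if v is None else v for v in final]

    # Merge consecutive error units into sections.
    sections, start = [], None
    for unit, ok in enumerate(final, start=1):
        if not ok and start is None:
            start = unit
        elif ok and start is not None:
            sections.append((start, unit - 1))
            start = None
    if start is not None:
        sections.append((start, n))
    return final, sections


# Labels of FIG. 7a: the units listed here are "x" (error), the rest are correct.
ERROR_UNITS = {6, 7, 8, 10, 11, 16, 20, 21, 22, 23, 24, 29}
fig7a = [unit not in ERROR_UNITS for unit in range(1, 33)]
print(determine_error_sections(fig7a)[1])  # -> [(6, 11), (20, 24)]
```

Applied to the labels of FIG. 7a, the sketch yields the sections of verification units 6 to 11 and 20 to 24, matching the recognition error sections stated for the example.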

The section determination rule storage unit 17 may store, as the section determination rule, information specifying what logic is used for the change processing and the parameters used by each logic (for example, the number of units used for the continuity determination, a threshold value, and the like).

Next, the operation of this embodiment will be described.

FIG. 8 is a flowchart showing an example of the operation of the speech recognition hypothesis verification apparatus 101 shown in FIG. 3.

As shown in FIG. 8, when activated, the speech recognition hypothesis verification device 101 reads the verification model and the section determination rule from the storage devices that implement the verification model storage unit 16 and the section determination rule storage unit 17, respectively, and performs initialization processing such as loading them so that they can be referred to by the unit determination unit 14 and the section determination unit 15 (step 11).

Meanwhile, the speech recognition hypothesis input unit 12 receives (inputs) a speech recognition hypothesis, for example in response to a notification of the end of speech recognition processing from an external speech recognition device, and provides (outputs) it to the verification unit conversion unit 13 (step 12). Note that the speech recognition hypothesis input unit 12 may also input a speech recognition hypothesis in response to, for example, an instruction from the user.

When a speech recognition hypothesis to be verified is input via the speech recognition hypothesis input unit 12, the verification unit conversion unit 13 converts the input speech recognition hypothesis into a data set of one or more verification units and provides it to the unit determination unit 14 (step 13). For example, the verification unit conversion unit 13 provides information indicating the one or more verification units to the unit determination unit 14 using information on time intervals in the speech data.
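As one possible illustration of step 13, the sketch below divides the speech data into verification units of a fixed number of analysis frames, independently of the word boundaries in the hypothesis. The function name, the half-open `(start_frame, end_frame)` representation, and the fixed unit length are assumptions made for illustration; other ways of setting verification units are equally possible.

```python
def to_verification_units(total_frames, frames_per_unit):
    """Split frames [0, total_frames) into consecutive verification units.

    Each unit is a half-open frame interval (start, end). The last unit
    may be shorter when total_frames is not a multiple of frames_per_unit.
    """
    if frames_per_unit <= 0:
        raise ValueError("frames_per_unit must be positive")
    return [
        (start, min(start + frames_per_unit, total_frames))
        for start in range(0, total_frames, frames_per_unit)
    ]


print(to_verification_units(10, 4))  # -> [(0, 4), (4, 8), (8, 10)]
```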

The unit determination unit 14 obtains a verification score for each verification unit and verifies the recognition hypothesis (determines its correctness) (step 14). In the unit determination unit 14, first, the target unit selection unit 141 sequentially designates each verification unit set for the recognition hypothesis as the processing target. Then, the feature extraction unit 142 extracts the verification features of the verification unit designated as the processing target. Next, the score calculation unit 143 calculates a verification score for the verification unit designated as the processing target with reference to the extracted verification features and the verification model. Finally, the target unit determination unit 144 determines the correctness of the recognition hypothesis for the time interval of the verification unit designated as the processing target, based on the calculated verification score. The verification result (correctness determination result) for each verification unit determined in this way is provided to the section determination unit 15 together with the verification score.
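The four stages of step 14 (selection, feature extraction, score calculation, determination) can be sketched as a simple loop. Everything named here is hypothetical: the `Arc` type, the way nodes are counted as arc boundary frames, the threshold, and in particular `toy_score`, which merely stands in for the verification model and is not the CRF-based scoring of this embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    """One word arc of a word graph (hypothetical representation)."""
    word: str
    start: int  # first frame (inclusive)
    end: int    # last frame (exclusive)


def extract_features(arcs, unit):
    """Count arcs overlapping the unit, and their distinct boundary frames."""
    u_start, u_end = unit
    overlapping = [a for a in arcs if a.start < u_end and a.end > u_start]
    nodes = {a.start for a in overlapping} | {a.end for a in overlapping}
    return {"arcs": len(overlapping), "nodes": len(nodes)}


def toy_score(features):
    """Placeholder model: more competing arcs gives a lower score."""
    return 1.0 / features["arcs"] if features["arcs"] else 0.0


def verify_units(arcs, units, threshold=0.5):
    """Return a list of (unit, is_correct, score), one entry per unit."""
    results = []
    for unit in units:                          # target unit selection
        features = extract_features(arcs, unit)  # feature extraction
        score = toy_score(features)              # score calculation
        results.append((unit, score >= threshold, score))  # determination
    return results


arcs = [Arc("a", 0, 10), Arc("b", 0, 6), Arc("c", 6, 10), Arc("d", 10, 20)]
units = [(0, 5), (5, 10), (10, 15), (15, 20)]
for unit, ok, score in verify_units(arcs, units):
    print(unit, "correct" if ok else "error", round(score, 2))
```

In this toy input, only the unit (5, 10) overlaps three competing arcs, so it is the only unit judged to be an error.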

The section determination unit 15 detects the recognition error section in the speech recognition hypothesis input as the verification target, based on the verification result for each verification unit (step 15). The section determination unit 15 changes the verification result assigned to each verification unit as appropriate in accordance with the section determination rule, and outputs the time section corresponding to the verification units finally determined to be errors as the recognition error section in the speech recognition hypothesis. This completes the series of speech recognition hypothesis verification processing.

As described above, according to the present embodiment, a unit smaller than the words in the hypothesis, or a unit based on analysis frames that does not depend on the words in the hypothesis, is used as the verification unit of the speech recognition hypothesis. The speech recognition hypothesis can therefore be verified with reference to features that are not word-based, and as a result, speech recognition error sections can be detected with higher accuracy.

Also, since the section determination unit 15 has a function of adjusting (changing) the verification results of the verification units, it is possible to detect a recognition error section suited to the intended use. For example, when the speech of a recognition error section is cut out and speech recognition is performed again, a time section of a certain length is required. In such a case, a section of a predetermined length or more can be ensured. In addition, based on the verification score, it is possible to take measures such as putting on hold a section in which “correct” and “error” are at the same level, which improves robustness against determination errors in the unit determination unit 14. Further, determining an unconfirmed section based on the preceding and following confirmed sections corresponds to a kind of smoothing process, which makes it possible, for example, to correct a single unit that differs from its surroundings.

When the speech recognition hypothesis is expressed in the form of N-best word strings, a common verification unit can be set for the N word strings using segment units or the like, or different verification units can be set for each word string by using units related to the words that the word string indicates. Likewise, when the speech recognition hypothesis is expressed in the form of a word graph, a common verification unit can be set for the entire word graph using segment units or the like, or different verification units can be set in combination.

In the above description, for each word string indicated by the speech recognition hypothesis, only one type of verification unit is defined using a single criterion such as a segment unit, and verification is performed based on the features extracted for each verification unit. However, it is also possible, for example, to define a plurality of types of verification units, perform verification for each type, and determine the recognition error section after integrating the results. In such a case, a plurality of verification unit conversion units 13 and unit determination units 14 may be provided, and the section determination unit 15 may integrate the verification results from the plurality of unit determination units 14 to determine the error section.

(Second Embodiment)
Next, a second embodiment of the present invention will be described.

FIG. 9 is a block diagram showing a configuration example of the speech recognition apparatus according to the second embodiment of the present invention.

The speech recognition apparatus 201 shown in FIG. 9 includes a first speech recognition unit 21, a speech recognition hypothesis verification unit 22, a second speech recognition unit 23, a first model storage unit 24, and a second model storage unit 25.

The speech recognition apparatus 201 as a whole is realized by, for example, an information processing apparatus that processes input information by computer, such as a personal computer (PC) or a server apparatus.

The first speech recognition unit 21 performs speech recognition processing on the speech input to the speech recognition device 201 to obtain word string candidates corresponding to the speech, and outputs, for example, a word graph as the speech recognition hypothesis. The first speech recognition unit 21 may perform ordinary speech recognition processing, such as searching for a word string that matches the speech data according to the scores given by the first model stored in the first model storage unit 24 (a model for speech recognition, such as an acoustic model, a language model, and a word dictionary). For example, a hidden Markov model is used as the acoustic model, and a word trigram model is used as the language model.

The speech recognition hypothesis verification unit 22 is a processing unit corresponding to the speech recognition hypothesis verification device 101 shown in FIG. 3, and for the speech recognition hypothesis output by the first speech recognition unit 21, a verification unit setting process, A verification process for each verification unit and an error interval determination process are performed, and the result is output as a verification result. As the verification result, for example, information (frame number or the like) indicating a speech recognition error section in the speech data is output.

Based on the verification result by the speech recognition hypothesis verification unit 22, the second speech recognition unit 23 performs speech recognition processing again on the section of the input speech determined to be a speech recognition error section, or on a section including that section and the parts before and after it. The second speech recognition unit 23 performs speech recognition processing using the second model stored in the second model storage unit 25. Here, it is assumed that the second model storage unit 25 stores a model different from the first model stored in the first model storage unit 24. In the case of an acoustic model, information indicating the appearance probability distribution of speech feature values may be stored for each unit such as a phoneme. For example, when a hidden Markov model is used as the second model, parameters defining the hidden Markov model (such as the coefficients used in the calculation) may be stored such that predetermined values (values different from those of the first model) are derived as the appearance probability distribution of the speech feature values for each unit such as a phoneme. In the case of a language model, information indicating the appearance probability and the connection probability may be stored for each unit such as a word. For example, when a word trigram model is used as the second model, parameters defining the word trigram model (such as the coefficients used in the calculation) may be stored such that predetermined values (values different from those of the first model) are derived as the appearance probability or connection probability for each unit such as a word.

FIG. 10 is an explanatory diagram showing an example of an utterance, a speech recognition hypothesis by the first speech recognition unit 21, and a verification result by the speech recognition hypothesis verification unit 22.

As shown in FIG. 10, suppose, for example, that for the utterance “Matsui comes out this month”, the first speech recognition unit 21 outputs the speech recognition hypothesis “<End of the month> <Tue> <No> <Out> <Game>”. Note that “<>” indicates a word boundary in the speech recognition hypothesis. Suppose further that, when the speech recognition hypothesis verification unit 22 extracts and verifies the features of each verification unit for this speech recognition hypothesis, the section from the latter half of “End of the month” to the end of “Tue”, that is, the section corresponding to “Matsui” in the utterance, is determined to be the recognition error section.

For example, the second speech recognition unit 23 may perform speech recognition processing on the section that the speech recognition hypothesis verification unit 22 determined to be the recognition error section (the section from the latter half of “End of the month” to the end of “Tue”), using as a linguistic constraint the word string “<No> <Out> <Game>” indicated by the recognition hypothesis in the section determined to be correct. In this example, “<No> <Out> <Game>” is fixed, and the section before it is the recognition target. As the linguistic constraint, for example, a language model representing how easily words connect is used as the second model, so that, in the latter part of the target section, words that connect easily to “No” “Out” rank high among the candidates. In the speech recognition processing by the first speech recognition unit 21, “No” and “Out” had not yet been determined, so all possibilities had to be taken into consideration; adding the constraint can improve recognition accuracy.

Also, for example, it may be estimated from the word string “<No> <Out> <Game>” in the section where the recognition hypothesis is determined to be correct that a person's name is likely to appear in the utterance, and speech recognition processing may be performed using, as the second model, a model that readily recognizes person names. In this example, having obtained the information that a person's name is likely to appear before the “<No> <Out> <Game>” section, it suffices to raise the appearance probability of words used as person names in that section. Regarding the selection of the second model, when a model different from the first model is stored in advance in the second model storage unit 25 as the second model, the stored second model may be used as it is. When a plurality of types of models are stored in the second model storage unit 25, a model different from the first model may be selected as the second model. Note that even a model of the same type as the first model can be used as the second model by giving it parameter values different from those given to the first model.

In this way, recognition accuracy can be increased by performing speech recognition with an added temporal constraint indicating which section of the speech (utterance) is erroneous, together with a linguistic or acoustic constraint indicating what kind of linguistic or acoustic information exists before and after that section.
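The effect of the linguistic constraint can be illustrated with a toy rescoring sketch. All values are invented for illustration: the candidate words, their acoustic log scores, the bigram probabilities, and the function name `rescore` are hypothetical, and an actual second speech recognition unit would search over the speech of the error section rather than over a fixed candidate list.

```python
import math

# Hypothetical acoustic log scores of candidates for the error section.
CANDIDATES = {"Tue": -10.0, "Matsui": -10.5}

# Hypothetical connection probabilities P(next word | candidate).
BIGRAM = {("Matsui", "No"): 0.20, ("Tue", "No"): 0.01}


def rescore(candidates, right_context, bigram, lm_weight=1.0, floor=1e-6):
    """Pick the candidate that best fits the confirmed following word.

    right_context is the first word of the section already determined to
    be correct; unseen word pairs fall back to a small floor probability.
    """
    def total(word):
        p = bigram.get((word, right_context), floor)
        return candidates[word] + lm_weight * math.log(p)

    return max(candidates, key=total)


print(max(CANDIDATES, key=CANDIDATES.get))  # acoustic score only
print(rescore(CANDIDATES, "No", BIGRAM))    # with the linguistic constraint
```

With these made-up numbers, the acoustic score alone prefers the erroneous candidate, while adding the connection probability to the confirmed word changes the choice to the person's name.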

In the present invention, the processing in the speech recognition hypothesis verification device and the speech recognition device may be realized not only by the dedicated hardware described above but also by recording a program for realizing the functions on a recording medium readable by the speech recognition hypothesis verification device or the speech recognition device, and causing the speech recognition hypothesis verification device or the speech recognition device to read and execute the program recorded on the recording medium. The recording medium readable by the speech recognition hypothesis verification device or the speech recognition device refers to a transferable recording medium such as an IC card, a memory card, a floppy disk (registered trademark), a magneto-optical disk, a DVD, or a CD, as well as an HDD or the like built into the speech recognition hypothesis verification device or the speech recognition device. The program recorded on this recording medium is read by, for example, a control block, and the same processing as described above is performed under the control of the control block.

Although the present invention has been described with reference to the embodiments, the present invention is not limited to the above embodiments. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.

This application claims priority based on Japanese Patent Application No. 2008-218605 filed on August 27, 2008, the entire disclosure of which is incorporated herein.

The present invention can be suitably applied to a system that uses voice recognition technology.

Claims

A speech recognition hypothesis verification device comprising:
a verification unit conversion unit that sets, for an input speech recognition hypothesis, one or more verification units each representing a time interval serving as a processing unit of verification; and
a unit determination unit that verifies the correctness of the recognition hypothesis in the time interval of each verification unit in accordance with the verification units set by the verification unit conversion unit,
wherein the verification unit conversion unit sets the one or more verification units so as to include a verification unit whose time interval differs from the time interval of a word included in the speech recognition hypothesis.
The speech recognition hypothesis verification device according to claim 1,
wherein the verification unit conversion unit sets the one or more verification units so as to include a verification unit whose time interval is smaller than the time interval of a word included in the speech recognition hypothesis.
The speech recognition hypothesis verification device according to claim 1 or 2,
wherein the unit determination unit verifies the correctness of the recognition hypothesis in the time interval of each verification unit based on a verification model, which is a probabilistic model having as features a plurality of types of features including at least features related to speech recognition errors in the time interval of a verification unit, and on features extracted for each verification unit from the speech recognition hypothesis to be processed.
The speech recognition hypothesis verification device according to claim 3,
wherein a CRF model is used as the verification model.
The speech recognition hypothesis verification device according to any one of claims 1 to 4,
further comprising a section determination unit that determines an error section of the speech recognition hypothesis to be processed based on the verification result for each verification unit by the unit determination unit,
wherein the section determination unit determines the error section after changing the verification results by the unit determination unit with reference to the verification results of a plurality of verification units.
The speech recognition hypothesis verification device according to any one of claims 1 to 5,
wherein the verification unit conversion unit sets the one or more verification units based on speech analysis frame units.
A speech recognition device comprising:
a first speech recognition unit that performs speech recognition on input speech and generates a speech recognition hypothesis;
a speech recognition hypothesis verification unit that verifies the speech recognition hypothesis generated by the first speech recognition unit; and
a second speech recognition unit that performs speech recognition again with reference to the verification result of the speech recognition hypothesis by the speech recognition hypothesis verification unit,
wherein the speech recognition hypothesis verification unit includes a verification unit conversion unit that sets, for the input speech recognition hypothesis, one or more verification units each representing a time interval serving as a processing unit of verification,
and a unit determination unit that verifies the correctness of the recognition hypothesis in the time interval of each verification unit in accordance with the verification units set by the verification unit conversion unit, and
wherein the verification unit conversion unit sets the one or more verification units so as to include a verification unit whose time interval differs from the time interval of a word included in the speech recognition hypothesis.
The speech recognition device according to claim 7,
wherein the second speech recognition unit refers to the verification result of the speech recognition hypothesis by the speech recognition hypothesis verification unit and performs speech recognition using an acoustic model or a language model selected based on the recognition hypothesis of the time interval determined to be correct.
A speech recognition hypothesis verification method for verifying a speech recognition hypothesis, comprising:
setting, for an input speech recognition hypothesis, one or more verification units each representing a time interval serving as a processing unit of verification, so as to include at least a verification unit whose time interval differs from the time interval of a word included in the speech recognition hypothesis; and
verifying the correctness of the recognition hypothesis in the time interval of each verification unit in accordance with the set verification units.
The speech recognition hypothesis verification method according to claim 9,
wherein the correctness of the recognition hypothesis in the time interval of each verification unit is verified based on a verification model, which is a probabilistic model having as features a plurality of types of features including at least features related to speech recognition errors in the time interval of a verification unit, and on features extracted for each verification unit from the speech recognition hypothesis to be processed.
The speech recognition hypothesis verification method according to claim 9 or 10,
wherein, when the error section of the speech recognition hypothesis to be processed is determined based on the verification result for each verification unit, the error section is determined after the verification results by the unit determination unit are changed with reference to the verification results of a plurality of verification units.
A speech recognition method comprising:
generating a speech recognition hypothesis by performing speech recognition on input speech;
setting, for the generated speech recognition hypothesis, one or more verification units each representing a time interval serving as a processing unit of verification, so as to include at least a verification unit whose time interval differs from the time interval of a word included in the speech recognition hypothesis;
verifying the correctness of the recognition hypothesis in the time interval of each verification unit in accordance with the set verification units; and
performing speech recognition again using an acoustic model or a language model selected based on the recognition hypothesis of a time interval determined to be correct, with reference to the verification result of the speech recognition hypothesis.
A speech recognition hypothesis verification program for causing a computer to execute:
a procedure for setting, for an input speech recognition hypothesis, one or more verification units each representing a time interval serving as a processing unit of verification, so as to include a verification unit whose time interval differs from the time interval of a word included in the speech recognition hypothesis; and
a procedure for verifying the correctness of the recognition hypothesis in the time interval of each verification unit in accordance with the set verification units.
The speech recognition hypothesis verification program according to claim 13,
for causing the computer to execute
a procedure for verifying the correctness of the recognition hypothesis in the time interval of each verification unit based on a verification model, which is a probabilistic model having as features a plurality of types of features including at least features related to speech recognition errors in the time interval of a verification unit, and on features extracted for each verification unit from the speech recognition hypothesis to be processed.
The speech recognition hypothesis verification program according to claim 13 or 14,
for causing the computer to execute
a procedure for determining, when the error section of the speech recognition hypothesis to be processed is determined based on the verification result for each verification unit, the error section after changing the verification results by the unit determination unit with reference to the verification results of a plurality of verification units.
A speech recognition program for causing a computer to execute:
a procedure for performing speech recognition on input speech to generate a speech recognition hypothesis;
a procedure for setting, for the generated speech recognition hypothesis, one or more verification units each representing a time interval serving as a processing unit of verification, so as to include at least a verification unit whose time interval differs from the time interval of a word included in the speech recognition hypothesis;
a procedure for verifying the correctness of the recognition hypothesis in the time interval of each verification unit in accordance with the set verification units; and
a procedure for performing speech recognition again using an acoustic model or a language model selected based on the recognition hypothesis of a time interval determined to be correct, with reference to the verification result of the speech recognition hypothesis.