WO2010024052A1 - Device for verifying speech recognition hypothesis, speech recognition device, and method and program used for same - Google Patents

Info

Publication number
WO2010024052A1
Authority
WO
WIPO (PCT)
Prior art keywords
verification
speech recognition
unit
recognition hypothesis
hypothesis
Application number
PCT/JP2009/062611
Other languages
French (fr)
Japanese (ja)
Inventor
仁 山本
健 花沢
清一 三木
Original Assignee
日本電気株式会社
Application filed by 日本電気株式会社
Priority to JP2010526623A (patent JP5447382B2)
Publication of WO2010024052A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search

Definitions

  • The present invention relates to a speech recognition hypothesis verification device, a speech recognition device, a speech recognition hypothesis verification method, and a speech recognition method used for verifying a speech recognition hypothesis obtained by speech recognition technology, which converts speech into electronic data such as text data.
  • The present invention also relates to a speech recognition hypothesis verification program and a speech recognition program.
  • Patent Document 1 describes a verification device that obtains a generalized word posterior probability for each word as a reliability measure for verifying a speech recognition result, and determines the correctness of each utterance or word based on that value.
  • Patent Document 2 discloses a determination unit that determines, with reference to a word dictionary prepared in advance, whether a character string and a word string generated by a speech recognition unit are correct, and a rewriting means that, when misrecognition is determined, generates a new word string by speech recognition using a different method.
  • The verification device described in Patent Document 1 and the method described in Patent Document 2 have the problem that the accuracy of detecting recognition errors through verification of the speech recognition hypothesis is insufficient.
  • In Patent Document 1, the recognition error section can be obtained only as a combination of word units in the hypothesis. That is, because only the few word boundaries included in the speech recognition hypothesis are used to detect which section of the speech is misrecognized, the detection accuracy of the speech recognition error section is insufficient.
  • Patent Document 2 determines whether the speech recognition hypothesis is correct using a word dictionary and replaces a word string determined to be incorrect with a correct word string.
  • Because a word dictionary is used for the correct/incorrect determination, the verification is performed in units of words, and the detection accuracy of the speech recognition error section is insufficient, as in Patent Document 1.
  • The present invention has been made in view of the above problems. Its object is to provide a speech recognition hypothesis verification device that improves the detection accuracy of speech recognition error sections in speech when verifying a speech recognition hypothesis, a speech recognition device using that verification, a speech recognition hypothesis verification method, a speech recognition method, a speech recognition hypothesis verification program, and a speech recognition program.
  • The speech recognition hypothesis verification device includes a verification unit conversion unit that sets, for an input speech recognition hypothesis, one or more verification units each representing a time interval that serves as a unit of verification processing, and a unit determination unit that verifies the correctness of the recognition hypothesis in the time interval of each verification unit according to the verification units set by the verification unit conversion unit.
  • The verification unit conversion unit sets the one or more verification units so that they include a verification unit whose time interval differs from the time interval of a word included in the speech recognition hypothesis.
  • The speech recognition apparatus includes a first speech recognition unit that performs speech recognition on input speech and generates a speech recognition hypothesis, a speech recognition hypothesis verification unit that verifies the speech recognition hypothesis generated by the first speech recognition unit, and a second speech recognition unit that performs speech recognition again with reference to the verification result produced by the speech recognition hypothesis verification unit.
  • The speech recognition hypothesis verification unit includes a verification unit conversion unit that sets, for the speech recognition hypothesis, one or more verification units each representing a time interval that serves as a unit of verification processing, and a unit determination unit that verifies the correctness of the recognition hypothesis in the time interval of each verification unit according to the verification units set by the verification unit conversion unit.
  • The verification unit conversion unit sets the one or more verification units so that they include a verification unit whose time interval differs from the time interval of a word included in the speech recognition hypothesis.
  • The speech recognition hypothesis verification method is a method for verifying a speech recognition hypothesis. For an input speech recognition hypothesis, one or more verification units, each representing a time interval that serves as a unit of verification processing, are set so as to include a verification unit whose time interval differs from the time interval of a word included in the speech recognition hypothesis, and the correctness of the recognition hypothesis in the time interval of each verification unit is verified according to the set verification units.
  • The speech recognition method generates a speech recognition hypothesis by performing speech recognition on input speech. For the generated speech recognition hypothesis, one or more verification units, each representing a time interval that serves as a unit of verification processing, are set so as to include at least a verification unit whose time interval differs from the time interval of a word included in the speech recognition hypothesis.
  • The correctness of the recognition hypothesis in the time interval of each verification unit is then verified according to the set verification units, and, with reference to the verification result of the speech recognition hypothesis, speech recognition is performed again using an acoustic model or a language model selected based on the recognition hypothesis of the time intervals determined to be correctly recognized.
  • The speech recognition hypothesis verification program causes a computer to execute a procedure for setting, for an input speech recognition hypothesis, one or more verification units each representing a time interval that serves as a unit of verification processing, so as to include a verification unit whose time interval differs from the time interval of a word included in the speech recognition hypothesis, and a procedure for verifying the correctness of the recognition hypothesis in the time interval of each verification unit according to the set verification units.
  • The speech recognition program causes a computer to execute a procedure for performing speech recognition on input speech to generate a speech recognition hypothesis; a procedure for setting, for the generated speech recognition hypothesis, one or more verification units each representing a time interval that serves as a unit of verification processing, so as to include at least a verification unit whose time interval differs from the time interval of a word included in the speech recognition hypothesis; a procedure for verifying the correctness of the recognition hypothesis in the time interval of each verification unit according to the set verification units; and a procedure for performing speech recognition again, with reference to the verification result of the speech recognition hypothesis, using an acoustic model or a language model selected based on the recognition hypothesis of the time intervals determined to be correctly recognized.
  • FIG. 1 is a block diagram showing a configuration example of the speech recognition hypothesis verification device of the present invention.
  • The speech recognition hypothesis verification device shown in FIG. 1 includes a verification unit conversion unit 1 and a unit determination unit 2.
  • The verification unit conversion unit 1 sets, for the input speech recognition hypothesis, one or more verification units each representing a time interval that serves as a unit of verification processing.
  • In doing so, the verification unit conversion unit 1 sets the verification units so that they include a verification unit whose time interval differs from the time interval of a word included in the input speech recognition hypothesis.
  • For example, the verification unit conversion unit 1 may set the verification units so that they include a verification unit whose time interval is smaller than the time interval of a word included in the speech recognition hypothesis.
  • One or more verification units may also be set based on the analysis frame of the speech.
  • The unit determination unit 2 verifies the correctness of the recognition hypothesis in the time interval of each verification unit according to the verification units set by the verification unit conversion unit 1.
  • For example, the unit determination unit 2 may verify the correctness of the recognition hypothesis in the time interval of each verification unit based on a verification model, composed of a probability model over a plurality of types of features including features related to speech recognition errors in the time interval of a verification unit, and on the features extracted for each verification unit from the speech recognition hypothesis to be processed.
  • The unit determination unit 2 may also calculate, for each verification unit, a verification score indicating how plausible the recognition hypothesis for the time interval of that verification unit is, based on the verification model and the features extracted for each verification unit, and verify the correctness of the recognition hypothesis in the time interval of each verification unit based on that score.
  • A CRF model may be used as the verification model.
  • Because the verification unit conversion unit 1 sets one or more verification units that include a verification unit whose time interval differs from the time interval of a word included in the speech recognition hypothesis, and the unit determination unit 2 verifies the correctness of the recognition hypothesis in the time interval of each verification unit according to the set verification units, the detection accuracy of speech recognition error sections in speech can be increased. This is because making the verification unit independent of the time intervals of the words in the recognition hypothesis allows verification based on features that are not word-based.
  • FIG. 2 is a block diagram showing another configuration example of the speech recognition hypothesis verification device of the present invention.
  • The speech recognition hypothesis verification device shown in FIG. 1 may further include a section determination unit 3.
  • The section determination unit 3 determines the error section of the speech recognition hypothesis to be processed based on the verification result that the unit determination unit 2 produces for each verification unit. In doing so, the section determination unit 3 refers to the verification results of a plurality of verification units (including verification scores, if any), changes the verification results produced by the unit determination unit 2 where necessary, and determines the error section.
  • FIG. 3 is a block diagram showing a configuration example of the speech recognition hypothesis verification device according to the first exemplary embodiment of the present invention.
  • The speech recognition hypothesis verification device 101 shown in FIG. 3 includes a speech recognition hypothesis input unit 12, a verification unit conversion unit 13, a unit determination unit 14, a section determination unit 15, a verification model storage unit 16, and a section determination rule storage unit 17.
  • The speech recognition hypothesis verification device 101 is realized as a whole by, for example, an information processing apparatus such as a personal computer (PC) or a server apparatus that processes input data with a computer.
  • The device receives as input a speech recognition hypothesis, which is a speech recognition result output from a speech recognition device or the like, and outputs a verification result for the input speech recognition hypothesis.
  • The speech recognition hypothesis input unit 12 is realized by any of various data input devices; specifically, by a data input device and a control unit that receives the input.
  • The verification unit conversion unit 13, the unit determination unit 14, and the section determination unit 15 are realized by a CPU or the like that operates according to a program.
  • The verification model storage unit 16 and the section determination rule storage unit 17 are realized by a storage unit that stores data.
  • Each component of the speech recognition hypothesis verification device 101 is realized by an arbitrary combination of hardware and software, centered on the CPU of any computer, a memory, a program loaded into the memory, and a storage unit such as a hard disk that stores the program.
  • Various interfaces, such as a network connection interface, may also be included.
  • The speech recognition hypothesis input unit 12 receives a speech recognition hypothesis from an external speech recognition device (not shown) and provides (outputs) it to the verification unit conversion unit 13.
  • The speech recognition hypothesis is expressed, for example, as a word graph or an N-best word sequence containing one or more word sequences to which a recognition score (likelihood) or time information associated with the recognition target speech is assigned.
  • The verification unit conversion unit 13 converts the speech recognition hypothesis input via the speech recognition hypothesis input unit 12 into a data set of verification units.
  • The verification unit is the unit of the verification performed by the unit determination unit 14 at the subsequent stage.
  • The verification unit conversion unit 13 need not actually generate a data set of verification units; it may instead set, for the speech recognition hypothesis, the range of each verification unit (a time interval in the speech data to be recognized).
  • In the following, "determining the verification unit" means that one or more verification units are defined for the speech recognition hypothesis.
  • The verification unit conversion unit 13 determines the verification units without depending on the time information of the speech recognition hypothesis (the time interval of each word indicated by the speech recognition hypothesis). Specifically, the verification units may be determined so that at least one of them has a time interval different from the time interval of a word indicated by the speech recognition hypothesis. For example, an analysis frame of the recognition target speech, or a segment obtained by grouping a plurality of analysis frames, may be used as one verification unit. In that case, the range of each verification unit is obtained by dividing the speech data to be recognized into time intervals of one analysis frame or one segment each.
  • A unit such as a character, syllable, phoneme, or HMM state, obtained by dividing a word of the speech recognition hypothesis into finer units, can also be used together with a unit based on the analysis frame (an analysis frame unit or a segment unit).
  • The time interval used as one verification unit in the speech data does not have to be constant, for example when units such as characters, syllables, phonemes, and HMM states are used in combination.
  • The verification unit conversion unit 13 may generate, as information indicating the verification units in the speech data to be recognized, information that includes an identifier for each verification unit and indicates which section of the time interval of the recognition hypothesis that verification unit corresponds to.
  • FIGS. 4a to 4d are explanatory diagrams showing examples of setting verification units.
  • In these examples, the speech recognition hypothesis corresponding to analysis frames 1 to 100 of the speech to be recognized indicates the word "end of month".
  • The verification unit may be determined in correspondence with each analysis frame of the recognition target speech, as shown in FIG. 4b.
  • In that case, the verification unit conversion unit 13 may generate information indicating 100 verification units, each covering the time interval of one of analysis frames 1 to 100.
  • When segments are used, the verification unit conversion unit 13 may generate information indicating 10 verification units, each covering the time interval of one of segments 1 to 10, where segment 1 combines analysis frames 1 to 10, segment 2 combines analysis frames 11 to 20, and so on.
  • The verification unit may also be determined in correspondence with each of the head part, the middle part, and the tail part of the word delimited by analysis frame boundaries in the speech recognition hypothesis.
  • In that case, the verification unit conversion unit 13 may generate information indicating three verification units whose time intervals are, respectively, the head part, the middle part, and the tail part of the word delimited by the analysis frame boundaries.
  • Units such as characters, syllables, phonemes, and HMM states may also be used in combination.
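The three frame-based settings described above (one unit per analysis frame, one unit per segment, and head/middle/tail units for a word) can be sketched as follows. This is a minimal illustration, not the device's actual implementation: the `VerificationUnit` record, the function names, and the equal three-way split of a word are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VerificationUnit:
    """Hypothetical record: an identifier plus an inclusive analysis-frame range."""
    unit_id: int
    start_frame: int
    end_frame: int


def units_per_frame(first: int, last: int) -> List[VerificationUnit]:
    """One verification unit per analysis frame."""
    return [VerificationUnit(i + 1, f, f) for i, f in enumerate(range(first, last + 1))]


def units_per_segment(first: int, last: int, frames_per_segment: int) -> List[VerificationUnit]:
    """One verification unit per segment of consecutive analysis frames."""
    units = []
    for i, start in enumerate(range(first, last + 1, frames_per_segment)):
        end = min(start + frames_per_segment - 1, last)
        units.append(VerificationUnit(i + 1, start, end))
    return units


def units_head_middle_tail(first: int, last: int) -> List[VerificationUnit]:
    """Three units for the head, middle, and tail of one word.

    Assumption: the word is split into roughly equal thirds; the source does
    not say how the three parts are delimited.
    """
    n = last - first + 1
    if n < 3:
        raise ValueError("need at least 3 frames to form three parts")
    third = n // 3
    bounds = [
        (first, first + third - 1),
        (first + third, last - third),
        (last - third + 1, last),
    ]
    return [VerificationUnit(i + 1, s, e) for i, (s, e) in enumerate(bounds)]


# The word hypothesis spans analysis frames 1 to 100, as in the example above.
frame_units = units_per_frame(1, 100)          # 100 units
segment_units = units_per_segment(1, 100, 10)  # 10 units of 10 frames each
part_units = units_head_middle_tail(1, 100)    # 3 units
```

Keeping only an identifier and a frame range per unit mirrors the idea that the conversion step may merely mark ranges on the hypothesis rather than copy data.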
  • FIG. 5 shows an example of correspondence between characters, syllables, phonemes, HMM states, and speech feature values.
  • The verification unit may be determined in correspondence with the characters, syllables, phonemes, and HMM states that constitute a word delimited by analysis frame boundaries in the speech recognition hypothesis. For example, a range corresponding to the head of the character "now" is specified based on the time intervals of the syllables, phonemes, and HMM states, and is determined as one verification unit.
  • Here, the speech data is shown as a time series of speech feature values. One analysis frame corresponds to a feature value (vector) calculated at every fixed interval (for example, 25 milliseconds) of the speech signal.
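To make the relation between analysis frames and time concrete, the sketch below converts frame indices into millisecond intervals. It assumes 1-based, non-overlapping frames computed at a fixed interval; the function names and the non-overlap assumption are introduced here and are not stated in the source.

```python
from typing import Tuple


def frame_to_time_ms(frame_index: int, frame_interval_ms: float = 25.0) -> Tuple[float, float]:
    """Return the (start, end) time in ms covered by a 1-based analysis frame.

    Assumption: frames do not overlap and are spaced frame_interval_ms apart.
    """
    if frame_index < 1:
        raise ValueError("frame_index is 1-based")
    start = (frame_index - 1) * frame_interval_ms
    return (start, start + frame_interval_ms)


def unit_to_time_ms(start_frame: int, end_frame: int,
                    frame_interval_ms: float = 25.0) -> Tuple[float, float]:
    """Time interval covered by a verification unit spanning start_frame..end_frame inclusive."""
    start, _ = frame_to_time_ms(start_frame, frame_interval_ms)
    _, end = frame_to_time_ms(end_frame, frame_interval_ms)
    return (start, end)


# Segment 2 (analysis frames 11 to 20) with the 25 ms example interval.
segment_2_ms = unit_to_time_ms(11, 20)
```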
  • The unit determination unit 14 receives the information indicating the verification units and the speech recognition hypothesis from the verification unit conversion unit 13, extracts predetermined verification features for each verification unit, and determines whether the recognition hypothesis is correct for each verification unit using the extracted verification feature values and the verification model stored in the verification model storage unit 16.
  • The unit determination unit 14 calculates, for example, a verification score indicating how plausible the recognition hypothesis for the time interval of the verification unit is, and determines whether the recognition hypothesis is correct for each verification unit based on the calculated verification score.
  • The unit determination unit 14 may include a target unit selection unit 141, a feature extraction unit 142, a score calculation unit 143, and a target unit determination unit 144, for example, as illustrated in FIG.
  • The target unit selection unit 141 receives the information indicating the verification units and the speech recognition hypothesis from the verification unit conversion unit 13, and provides the speech recognition hypothesis to the feature extraction unit 142. It also sequentially designates each verification unit included in the speech data to be recognized as the verification unit to be processed and provides it to the feature extraction unit 142 and the target unit determination unit 144.
  • The feature extraction unit 142 receives the speech recognition hypothesis and the information indicating the verification unit to be processed from the target unit selection unit 141, extracts predetermined verification features related to that verification unit, and provides them to the score calculation unit 143.
  • The verification feature is a feature used when the speech recognition hypothesis is verified, and is extracted for each verification unit.
  • A feature that is related to the correctness or erroneousness of the speech recognition hypothesis is used as the verification feature; this allows verification accuracy to be increased.
  • For example, structure information of the speech recognition hypothesis, linguistic information of the speech recognition hypothesis, and information related to the recognition calculation may be used.
  • The features related to the verification unit to be processed may be extracted not only from the data of the time interval of that verification unit (hereinafter simply referred to as verification unit data) but also from the data of the time intervals before and after it and from the data of the time interval of the word that contains it.
  • The structural information of the speech recognition hypothesis includes, for example, the number of arcs competing in the time interval of the verification unit to be processed, as indicated by the word graph, and the number of nodes included in the same time interval.
  • When a segment of analysis frames is used as the verification unit, a large number of arcs in the segment section may indicate a high probability of recognition error in that section.
  • The section may also be a word boundary in the original utterance, in which case the likelihood of a recognition error may differ before and after the section.
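The structural features mentioned above can be sketched as simple counts over a word graph. The `Arc` record below is a hypothetical, simplified representation of a word-graph edge, and treating arc start frames as the graph's nodes is an assumption made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Arc:
    """Hypothetical word-graph arc: a word hypothesis over an inclusive frame range."""
    word: str
    start_frame: int
    end_frame: int


def count_competing_arcs(arcs: List[Arc], unit: Tuple[int, int]) -> int:
    """Number of arcs whose frame range overlaps the verification unit."""
    u_start, u_end = unit
    return sum(1 for a in arcs if a.start_frame <= u_end and a.end_frame >= u_start)


def count_nodes(arcs: List[Arc], unit: Tuple[int, int]) -> int:
    """Number of distinct arc start points (treated here as nodes) inside the unit."""
    u_start, u_end = unit
    nodes = {a.start_frame for a in arcs}
    return sum(1 for n in nodes if u_start <= n <= u_end)


# A small illustrative graph: three hypotheses compete over frames 1 to 40.
graph = [Arc("a", 1, 40), Arc("b", 1, 25), Arc("c", 26, 40), Arc("d", 41, 100)]

busy_unit = (21, 30)   # overlaps arcs a, b, and c; contains the node at frame 26
quiet_unit = (51, 60)  # overlaps arc d only; contains no node
features_busy = {"num_arcs": count_competing_arcs(graph, busy_unit),
                 "num_nodes": count_nodes(graph, busy_unit)}
features_quiet = {"num_arcs": count_competing_arcs(graph, quiet_unit),
                  "num_nodes": count_nodes(graph, quiet_unit)}
```

Because the counts are computed per verification unit rather than per word, a segment that straddles a hypothesised word boundary still receives its own feature values.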
  • Linguistic information includes, for example, the surface form and part of speech of a word in the hypothesis.
  • By using the word surface form as a feature, frequent speech recognition error expressions (recognition error patterns of the speech recognition apparatus) can be handled.
  • For example, cases in which the latter part is particularly likely to be a recognition error can be detected.
  • As information related to the recognition calculation, values representing the plausibility of a hypothesis, such as acoustic likelihood and language likelihood, can be cited.
  • In a section containing a recognition error, such a value may be relatively low, or its difference from the value of a competing hypothesis may be small.
  • A word-level reliability score obtained by the verification device described in Patent Document 1 can also be used as a verification feature.
  • The score calculation unit 143 receives the information indicating the verification unit to be processed and the verification features related to that verification unit from the feature extraction unit 142, calculates the verification score using the verification model stored in the verification model storage unit 16, and provides it to the target unit determination unit 144.
  • The verification model storage unit 16 stores information on the verification model, which is a model representing the strength of the relationship between the verification features found in the verification unit data and the correctness or erroneousness of the recognition hypothesis.
  • The score calculation unit 143 may calculate the verification score using, for example, identification processing by a CRF (Conditional Random Field), which is a type of discriminative model.
  • A CRF is described by the following equation (1):

    $$P(Y \mid X) = \frac{\exp\left(\Lambda \cdot \Phi(X, Y)\right)}{Z} \tag{1}$$

  • Here, X is the input to be subjected to identification processing.
  • Y is the identification result associated with the input.
  • Φ(X, Y) is the feature used for identification, and Λ is the CRF model parameter (weight value) corresponding to each feature.
  • Z is a normalization term. Note that exp() denotes the exponential function with base e.
  • The input X is the verification unit data converted from the speech recognition hypothesis to be verified.
  • The output Y is the verification result associated with each input verification unit data.
  • As the feature Φ, values taken by verification features such as the number of arcs, the number of nodes, and the appearance frequency are used.
  • In identification, the Y that maximizes the conditional probability P(Y | X) of the above equation (1) is selected for the input.
  • The CRF model parameters may be optimized (learned) by an iterative calculation method based on the criterion of maximizing the log likelihood of the above equation (1), using pairs of input (X: verification unit data) and output (Y: identification result) associated in advance as learning data.
  • The verification model storage unit 16 may hold, for example, information on the features Φ and the model parameters Λ (weight values) as the CRF information.
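The log-linear form of equation (1) can be illustrated with a tiny linear-chain model over a sequence of verification units. This is a sketch under stated assumptions: the label set, the feature names (`bias`, `num_arcs`), the label-transition feature, and all weight values are hypothetical, and the normalization term Z is computed by brute-force enumeration of every label sequence, which is only feasible for very short inputs (practical CRF implementations use dynamic programming instead).

```python
import math
from itertools import product
from typing import Dict, Sequence, Tuple

LABELS = ("correct", "error")


def sequence_score(x: Sequence[Dict[str, float]], y: Sequence[str],
                   weights: Dict[Tuple, float]) -> float:
    """Weighted feature sum (the exponent in equation (1)) for one label sequence."""
    total = 0.0
    for t, (feats, label) in enumerate(zip(x, y)):
        # Observation features: verification feature value paired with the label.
        for name, value in feats.items():
            total += weights.get((name, label), 0.0) * value
        # Transition feature between neighbouring verification units (assumed).
        if t > 0:
            total += weights.get(("transition", y[t - 1], label), 0.0)
    return total


def crf_posterior(x: Sequence[Dict[str, float]],
                  weights: Dict[Tuple, float]) -> Dict[Tuple[str, ...], float]:
    """Return P(Y | X) for every label sequence Y, normalised by Z."""
    unnormalised = {y: math.exp(sequence_score(x, y, weights))
                    for y in product(LABELS, repeat=len(x))}
    z = sum(unnormalised.values())  # normalisation term Z
    return {y: s / z for y, s in unnormalised.items()}


def decode(x: Sequence[Dict[str, float]], weights: Dict[Tuple, float]) -> Tuple[str, ...]:
    """Select the label sequence that maximises P(Y | X)."""
    posterior = crf_posterior(x, weights)
    return max(posterior, key=posterior.get)


# Three verification units with a hypothetical arc-count feature.
x = [{"bias": 1.0, "num_arcs": 1.0},
     {"bias": 1.0, "num_arcs": 4.0},
     {"bias": 1.0, "num_arcs": 5.0}]
weights = {
    ("bias", "correct"): 2.0,
    ("num_arcs", "correct"): -0.5,
    ("num_arcs", "error"): 0.5,
    ("transition", "correct", "correct"): 0.5,
    ("transition", "error", "error"): 0.5,
}
best = decode(x, weights)
```

With these illustrative weights, a unit with many competing arcs is pushed toward the `error` label, and the transition weights favour neighbouring units sharing a label.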
  • The target unit determination unit 144 determines the correctness of the recognition hypothesis for each verification unit by comparing the verification score obtained for the verification unit to be processed, as designated by the target unit selection unit 141, with a predetermined criterion. This determination result corresponds to the verification result of the recognition hypothesis for that verification unit.
  • The target unit determination unit 144 provides the determination result (that is, the verification result of each verification unit) to the section determination unit 15. A verification score may be provided along with the verification result.
  • FIG. 6 is an explanatory diagram showing an example of the CRF feature Φ.
  • The score calculation unit 143 may obtain the score of each verification result (for example, correct and error) by multiplying these features by the respective weight values Λ for that verification result. The target unit determination unit 144 may then adopt the verification result with the highest score as the verification result for the verification unit.
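The per-unit scoring and selection just described can be sketched as a weighted sum per verification result followed by picking the highest score. The feature names and weight values below are hypothetical and chosen only for illustration.

```python
from typing import Dict, Tuple


def label_scores(features: Dict[str, float],
                 weights: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Score each verification result as the sum of feature value times weight."""
    return {label: sum(w.get(name, 0.0) * value for name, value in features.items())
            for label, w in weights.items()}


def decide(features: Dict[str, float],
           weights: Dict[str, Dict[str, float]]) -> Tuple[str, float]:
    """Return the verification result with the highest score, plus that score."""
    scores = label_scores(features, weights)
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical weights per verification result and features for one unit.
unit_weights = {
    "correct": {"num_arcs": -0.5, "acoustic_likelihood": 1.0},
    "error": {"num_arcs": 0.5, "acoustic_likelihood": -1.0},
}
unit_features = {"num_arcs": 4.0, "acoustic_likelihood": 0.5}
result, score = decide(unit_features, unit_weights)
```

Returning the score together with the label matches the option of passing a verification score on to the section determination step.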
  • The section determination unit 15 receives the verification result for each verification unit from the target unit determination unit 144 of the unit determination unit 14 and determines the recognition error section included in the speech recognition hypothesis. In the present embodiment, the section determination unit 15 changes the verification result for each verification unit as necessary according to the section determination rule stored in the section determination rule storage unit 17, and thereby determines the recognition error section included in the speech recognition hypothesis.
  • The section determination rule is a rule (information defining a change method or the like) for changing the verification result that the unit determination unit 14 produces for each verification unit, according to the intended use. For example, it may define a method of changing a result based on the reliability of the verification result and its relationship with the verification results of other verification units (for example, the preceding and following verification units).
  • FIGS. 7a to 7d are explanatory diagrams showing examples of changing methods defined in the section determination rules.
  • FIG. 7a shows an example of the verification result by the unit determination unit 14 of the verification units 1 to 32 set for the input speech recognition hypothesis.
  • The verification results are confirmed for the sections of verification units 1 to 5, 6 to 8, 12 to 15, 17 to 19, 20 to 24, 25 to 28, and 30 to 32, which are enclosed in rectangles.
  • Of these, the sections of verification units 6 to 8 and 20 to 24 are determined to be error sections. Even if the verification result labels are the same, a section is not confirmed unless units whose verification score is equal to or higher than the predetermined score continue for at least the predetermined number of units.
  • A remaining unconfirmed section may then be confirmed with the verification result that the recognition hypothesis is an error.
  • For example, verification units 9 to 11, which form an unconfirmed section, are changed to the verification result that the recognition hypothesis is erroneous, as indicated by the underline.
  • The section determination unit 15 may detect, as the recognition error section in the recognition hypothesis, the time interval of the verification units finally determined to be an error section after the verification result for each verification unit has been changed according to the section determination rule.
  • In this example, the time intervals corresponding to verification units 6 to 11 and 20 to 24 are detected as recognition error sections.
  • The section determination rule storage unit 17 may store, as the section determination rule, information specifying what logic is used for the change process and the parameters used by each logic (for example, the number of units used for the continuity determination, a threshold value, and the like).
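One possible section determination rule, consistent with the example above, can be sketched as follows. The exact rule is not fully recoverable from the text, so this sketch makes explicit assumptions: a run of units is confirmed when they share a label, each has a score at or above `min_score`, and the run is at least `min_run` units long; an unconfirmed run becomes `error` if it touches a confirmed error section and `correct` otherwise. The per-unit labels and scores used for units 9 to 11, 16, and 29 are also invented for illustration.

```python
from itertools import groupby
from typing import List, Tuple


def apply_section_rule(labels: List[str], scores: List[float],
                       min_score: float = 0.8, min_run: int = 3) -> List[str]:
    """Change per-unit verification results according to the assumed rule."""
    n = len(labels)
    # Units with a low score cannot take part in a confirmed run.
    keys = [lab if sc >= min_score else None for lab, sc in zip(labels, scores)]
    confirmed = [False] * n
    pos = 0
    for key, group in groupby(keys):
        length = len(list(group))
        if key is not None and length >= min_run:
            confirmed[pos:pos + length] = [True] * length
        pos += length

    result = list(labels)
    i = 0
    while i < n:
        if confirmed[i]:
            i += 1
            continue
        j = i
        while j < n and not confirmed[j]:
            j += 1
        # Neighbours of an unconfirmed run are confirmed units (or the sequence edge).
        left_error = i > 0 and labels[i - 1] == "error"
        right_error = j < n and labels[j] == "error"
        result[i:j] = ["error" if (left_error or right_error) else "correct"] * (j - i)
        i = j
    return result


def error_sections(labels: List[str]) -> List[Tuple[int, int]]:
    """Merge consecutive error units into inclusive, 1-based (start, end) sections."""
    sections, start = [], None
    for idx, lab in enumerate(labels, start=1):
        if lab == "error" and start is None:
            start = idx
        elif lab != "error" and start is not None:
            sections.append((start, idx - 1))
            start = None
    if start is not None:
        sections.append((start, len(labels)))
    return sections


# 32 verification units; the values for units 9-11, 16, and 29 are hypothetical.
labels = (["correct"] * 5 + ["error"] * 3 + ["correct", "error", "correct"]
          + ["correct"] * 4 + ["error"] + ["correct"] * 3 + ["error"] * 5
          + ["correct"] * 4 + ["error"] + ["correct"] * 3)
scores = ([0.9] * 8 + [0.5] * 3 + [0.9] * 4 + [0.9] + [0.9] * 3 + [0.9] * 5
          + [0.9] * 4 + [0.5] + [0.9] * 3)
final_labels = apply_section_rule(labels, scores)
detected = error_sections(final_labels)
```

Storing `min_score` and `min_run` as parameters corresponds to keeping the threshold and the number of units for the continuity determination alongside the rule itself.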
  • FIG. 8 is a flowchart showing an example of the operation of the speech recognition hypothesis verification apparatus 101 shown in FIG.
  • When activated, the speech recognition hypothesis verification device 101 reads the verification model and the section determination rule from the storage devices that implement the verification model storage unit 16 and the section determination rule storage unit 17, respectively, and performs initialization processing, such as loading them so that they can be referenced from the unit determination unit 14 and the section determination unit 15 (step 11).
  • The speech recognition hypothesis input unit 12 then receives (inputs) a speech recognition hypothesis, for example in response to a notification of the end of speech recognition processing from an external speech recognition device, and provides (outputs) it to the verification unit conversion unit 13 (step 12).
  • The speech recognition hypothesis input unit 12 may also input a speech recognition hypothesis in response to an instruction from the user.
  • The verification unit conversion unit 13 converts the input speech recognition hypothesis into a data set of one or more verification units and provides it to the unit determination unit 14 (step 13). For example, the verification unit conversion unit 13 provides information indicating one or more verification units to the unit determination unit 14 using information on time intervals in the speech data.
  • the unit determination unit 14 obtains a verification score for each verification unit, and verifies the recognition hypothesis (determines correctness) (step 14).
  • the target unit selection unit 141 sequentially designates each verification unit set for the recognition hypothesis as a processing target.
  • the feature extraction unit 142 extracts the verification feature of the verification unit designated as the processing target.
  • the score calculation unit 143 calculates a verification score for the verification unit designated as the processing target with reference to the extracted verification feature and the verification model.
  • the target unit determination unit 144 determines the correctness of the recognition hypothesis for the time interval of the verification unit designated as the processing target, based on the calculated verification score.
  • The verification result (correctness determination result) for each verification unit determined in this way is provided to the section determination unit 15 together with the verification score.
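The per-unit loop of step 14 (target unit selection, score calculation, correctness determination) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `VerificationUnit` class, the feature name `acoustic_margin`, the weights, and the logistic scoring function are all assumptions that stand in for the verification model, and features are assumed to have been extracted already.

```python
from dataclasses import dataclass
from math import exp


@dataclass
class VerificationUnit:
    start_frame: int   # first frame of the unit's time interval
    end_frame: int     # one past the last frame
    features: dict     # verification features already extracted for this unit


def verification_score(features, weights, bias=0.0):
    """Stand-in verification model (assumption): logistic score in (0, 1)."""
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + exp(-z))


def judge_units(units, weights, threshold=0.5):
    """Visit each unit in turn, score it, and label it 'correct' or 'error'."""
    results = []
    for unit in units:                                        # target unit selection
        score = verification_score(unit.features, weights)    # score calculation
        label = "correct" if score >= threshold else "error"  # target unit determination
        results.append((unit, score, label))
    return results


# Hypothetical example: two 10-frame units with one invented feature each.
units = [
    VerificationUnit(0, 10, {"acoustic_margin": 1.5}),
    VerificationUnit(10, 20, {"acoustic_margin": -1.5}),
]
results = judge_units(units, weights={"acoustic_margin": 2.0})
labels = [label for _, _, label in results]
```

Each result keeps the score next to the label, mirroring the text's statement that the verification result is handed to the section determination unit 15 together with the verification score.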
  • the section determination unit 15 detects a recognition error section in the speech recognition hypothesis input as the verification target based on the verification result for each verification unit (step 15).
  • The section determination unit 15 changes the verification result assigned to each verification unit as appropriate in accordance with the section determination rule, and outputs the time intervals corresponding to the verification units finally determined to be errors as the recognition error sections in the speech recognition hypothesis. This completes the series of speech recognition hypothesis verification processing.
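As a rough sketch of how the final per-unit results could be turned into recognition error sections, the function below merges adjacent units labeled as errors into frame ranges. The representation of units as `(start_frame, end_frame)` pairs with an exclusive end, and the label strings, are assumptions made for illustration.

```python
def error_sections(units, labels):
    """Merge consecutive 'error' units into (start_frame, end_frame) sections.

    units  : list of (start_frame, end_frame) pairs, end exclusive, in time order
    labels : list of 'correct' / 'error' strings, one per unit
    """
    sections = []
    for (start, end), label in zip(units, labels):
        if label != "error":
            continue
        if sections and sections[-1][1] == start:
            # This unit directly follows the previous error unit: extend the section.
            sections[-1] = (sections[-1][0], end)
        else:
            sections.append((start, end))
    return sections


units = [(0, 10), (10, 20), (20, 30), (30, 40)]
merged = error_sections(units, ["correct", "error", "error", "correct"])
split = error_sections(units, ["error", "correct", "error", "error"])
```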
  • As the verification unit of the speech recognition hypothesis, a unit smaller than the word units in the hypothesis, or a unit based on analysis frames that does not depend on the words recognized in the hypothesis, is used.
  • The speech recognition hypothesis can therefore be verified with reference to features that are not word-level features, and as a result speech recognition error sections can be detected with higher accuracy.
  • Because the section determination unit 15 has a function of adjusting (changing) the verification results of the verification units, a recognition error section suited to the intended use can be detected. For example, when the speech in a recognition error section is cut out and recognized again, a time section of a certain length is required; in such a case, a section of at least a predetermined length can be ensured.
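One way such a minimum length could be enforced is sketched below: every run of error units that is shorter than a required number of units is widened, as evenly as possible on both sides, by relabeling neighboring units as errors. The function names and the symmetric widening policy are assumptions; the text only states that a predetermined length can be ensured.

```python
def error_runs(labels):
    """Return (start, end) index pairs, end exclusive, for each run of 'error'."""
    runs, start = [], None
    for i, label in enumerate(labels):
        if label == "error" and start is None:
            start = i
        elif label != "error" and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(labels)))
    return runs


def ensure_min_length(labels, min_units):
    """Widen every error run shorter than min_units by relabeling its neighbors."""
    out = list(labels)
    for start, end in error_runs(labels):
        missing = min_units - (end - start)
        if missing <= 0:
            continue
        left = missing // 2
        right = missing - left
        # Clipping at the utterance edges may leave a run shorter than min_units.
        for i in range(max(0, start - left), min(len(labels), end + right)):
            out[i] = "error"
    return out


labels = ["correct"] * 3 + ["error"] + ["correct"] * 3
widened = ensure_min_length(labels, min_units=3)
```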
  • By using the verification score, it is possible to take measures such as putting on hold a section for which “correct” and “error” are about equally likely, which improves robustness against determination errors in the unit determination unit 14.
  • determining the unconfirmed section based on the preceding and following determined sections corresponds to a kind of smoothing process, and for example, it is possible to correct only one unit that differs from
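The hold-and-smooth idea can be sketched as follows: scores near the middle are labeled "hold", and each held run is then settled from the decided units before and after it. The score band (0.4 to 0.6), the label strings, and the fallback used when the neighbors disagree are assumptions for illustration only.

```python
def three_way(score, low=0.4, high=0.6):
    """Label a verification score as 'correct', 'error', or 'hold' (undecided)."""
    if score >= high:
        return "correct"
    if score <= low:
        return "error"
    return "hold"


def resolve_holds(labels, fallback="error"):
    """Settle each run of 'hold' units from the decided units around it."""
    out = list(labels)
    i = 0
    while i < len(out):
        if out[i] != "hold":
            i += 1
            continue
        j = i
        while j < len(out) and out[j] == "hold":
            j += 1
        left = out[i - 1] if i > 0 else None
        right = out[j] if j < len(out) else None
        if left is not None and (right is None or left == right):
            fill = left          # neighbors agree, or only a left neighbor exists
        elif left is None and right is not None:
            fill = right         # only a right neighbor exists
        else:
            fill = fallback      # neighbors disagree, or everything is on hold
        out[i:j] = [fill] * (j - i)
        i = j
    return out


scores = [0.9, 0.5, 0.8, 0.1, 0.45, 0.2]
raw = [three_way(s) for s in scores]
smoothed = resolve_holds(raw)
```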
  • A common verification unit can be set for the N word strings using a segment unit or the like. It is also possible to set different verification units for each word string by using units related to the words that the word string indicates. Note that even when the speech recognition hypothesis is expressed in the form of a word graph, a common verification unit can be set for the entire word graph using segment units or the like, or different verification units can be set in combination.
  • For each word string indicated by the speech recognition hypothesis, only one type of verification unit is defined using a single criterion such as a segment unit, and verification is performed based on features extracted for each verification unit. For example, it is also possible to define a plurality of types of verification units, perform verification for each type, and determine the recognition error section after integrating the results. In such a case, a plurality of verification unit conversion units 13 and unit determination units 14 may be provided, and the section determination unit 15 may integrate the verification results from the plurality of unit determination units 14 to determine the error section.
  • FIG. 9 is a block diagram showing a configuration example of the speech recognition apparatus according to the second embodiment of the present invention.
  • The speech recognition apparatus 201 shown in FIG. 9 includes a first speech recognition unit 21, a speech recognition hypothesis verification unit 22, a second speech recognition unit 23, a first model storage unit 24, and a second model storage unit 25.
  • The speech recognition apparatus 201 is realized as a whole by, for example, an information processing apparatus such as a personal computer (PC) or a server apparatus that processes input data by computer.
  • The first speech recognition unit 21 performs speech recognition processing on the speech input to the speech recognition device 201 to obtain word string candidates corresponding to the speech, and outputs, for example, a word graph as the speech recognition hypothesis.
  • Using the first model stored in the first model storage unit 24 (a model for speech recognition, such as an acoustic model, a language model, and a word dictionary), ordinary speech recognition processing may be performed, such as searching for the word string that matches the speech data according to the scores given by the model.
  • a hidden Markov model is used as the acoustic model
  • a word trigram model is used as the language model.
  • The speech recognition hypothesis verification unit 22 is a processing unit corresponding to the speech recognition hypothesis verification device 101 shown in FIG. 3. For the speech recognition hypothesis output by the first speech recognition unit 21, it performs a verification unit setting process, a verification process for each verification unit, and an error section determination process, and outputs the outcome as a verification result.
  • As the verification result, for example, information (frame numbers or the like) indicating the speech recognition error section in the speech data is output.
  • Based on the verification result from the speech recognition hypothesis verification unit 22, the second speech recognition unit 23 performs speech recognition processing again on the section of the input speech determined to be a speech recognition error section, or on a section that also includes the portions before and after it.
  • the second speech recognition unit 23 performs speech recognition processing using the second model stored in the second model storage unit 25.
  • the second model storage unit 25 stores a model different from the first model stored in the first model storage unit 24.
  • information indicating the appearance probability distribution of speech feature values may be stored for each unit such as phonemes.
  • For example, when a hidden Markov model is used as the second model, a hidden Markov model from which predetermined values (values different from those of the first model) are derived as the appearance probability distribution of the speech feature values for each unit such as a phoneme may be stored (for example, as information on the coefficients used in the calculation).
  • information indicating the appearance probability and the connection probability may be stored for each unit such as a word.
  • When a word trigram model is used as the second model, a word trigram from which predetermined values (values different from those of the first model) are derived as the appearance probabilities or connection probabilities for each unit such as a word may be stored.
  • FIG. 10 is an explanatory diagram showing an example of an utterance, a speech recognition hypothesis by the first speech recognition unit 21, and a verification result by the speech recognition hypothesis verification unit 22.
  • Assume that the first speech recognition unit 21 outputs the speech recognition hypothesis “<End of the month> <Tue> <No> <Out> <Game>”. Note that “< >” marks a word break in the speech recognition hypothesis.
  • Assume further that the speech recognition hypothesis verification unit 22 extracts and verifies the features of each verification unit for this hypothesis and determines that the section from the latter half of “Month” in “End of the month” to the end of “Tue”, that is, the section corresponding to “Matsui” in the utterance, is the recognition error section.
  • In this case, the second speech recognition unit 23 performs speech recognition again on the section that the speech recognition hypothesis verification unit 22 determined to be the recognition error section (the section from the latter half of “Month” in “End of the month” to the end of “Tue”).
  • The speech recognition process may be performed using, as a linguistic constraint, the word string (“No” “Out” “Game”) indicated by the recognition hypothesis in the section in which the hypothesis was determined to be correct.
  • To apply a linguistic constraint, for example, a language model representing how readily words connect is used as the second model.
  • words that are easy to connect with “no” “out” should be placed at the top.
  • In the speech recognition process in the first speech recognition unit 21, “no” and “out” have not yet been determined, so all possibilities must be taken into consideration; once they are fixed, adding them as constraints can improve recognition accuracy.
  • Alternatively, it may be judged from the word string in the section determined to be correct (“No” “Out” “Game”) that a person's name is likely to appear in the utterance, and speech recognition processing may be performed using, as the second model, a model that readily recognizes person names.
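The effect of such a constraint can be illustrated with a small ranking sketch: candidates for the error section are re-scored by combining an acoustic log score with the log probability that the already confirmed following word (“no”) connects to them. All numbers, the candidate strings, and the bigram table are invented for illustration; the document does not specify how the second model combines scores.

```python
import math


def rank_candidates(candidates, next_word, bigram_prob, lm_weight=1.0, floor=1e-6):
    """Rank candidate words for the error section, best first.

    candidates  : {word: acoustic log score} (hypothetical values)
    next_word   : first word of the following section judged correct
    bigram_prob : {(word, next_word): probability that next_word follows word}
    """
    scored = []
    for word, acoustic_logp in candidates.items():
        p = bigram_prob.get((word, next_word), floor)  # floor for unseen pairs
        scored.append((acoustic_logp + lm_weight * math.log(p), word))
    scored.sort(reverse=True)
    return [word for _, word in scored]


# Invented numbers: the misrecognized string has the better acoustic score,
# but "Matsui" connects far more readily to the confirmed word "no".
candidates = {"End of the month Tue": -11.5, "Matsui": -12.0}
bigram_prob = {("Matsui", "no"): 0.2, ("End of the month Tue", "no"): 0.001}
ranking = rank_candidates(candidates, "no", bigram_prob)
unconstrained = rank_candidates(candidates, "no", bigram_prob, lm_weight=0.0)
```

With the constraint switched off (`lm_weight=0.0`) the acoustically better but wrong candidate stays on top, which is the situation the text describes for the first recognition pass.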
  • When a model different from the first model is stored in advance in the second model storage unit 25 as the second model, the stored second model may be used as it is.
  • a model different from the first model may be selected as the second model.
  • a model of the same type as the first model can be used as the second model by giving a value different from the parameter given to the first model.
  • Speech recognition accuracy can be increased by adding a temporal constraint (which section of the speech is erroneous) and a linguistic or acoustic constraint (what linguistic or acoustic information exists before and after that section).
  • Besides being realized by the dedicated hardware described above, the processing in the speech recognition hypothesis verification device and the speech recognition device may be implemented by recording a program for realizing those functions on a recording medium readable by the speech recognition hypothesis verification device or the speech recognition device, and having the device read and execute the program recorded on the recording medium.
  • Recording media that can be read by the speech recognition hypothesis verification device or the speech recognition device include transferable recording media such as IC cards, memory cards, floppy disks (registered trademark), magneto-optical disks, DVDs, and CDs, as well as an HDD or the like built into the speech recognition hypothesis verification device or the speech recognition device.
  • the program recorded on this recording medium is read by a control block, for example, and the same processing as described above is performed under the control of the control block.
  • the present invention can be suitably applied to a system that uses voice recognition technology.

Abstract

A device for verifying a speech recognition hypothesis is provided with a verification unit conversion section (1) which sets, for an inputted speech recognition hypothesis, one or more verification units expressing time segments which become processing units for verification, and a unit determination section (2) which verifies the correctness of the recognition hypothesis in the time segment of each verification unit in accordance with the verification units set by the verification unit conversion section (1). The verification unit conversion section (1) sets the one or more verification units including a verification unit to which a time segment different from the time segment of a word included in the speech recognition hypothesis is set.

Description

Speech recognition hypothesis verification device, speech recognition device, and method and program used therefor
 The present invention relates to a speech recognition hypothesis verification device that verifies a speech recognition hypothesis obtained by speech recognition technology for converting speech into electronic data such as text data, to a speech recognition device, and to a speech recognition hypothesis verification method, a speech recognition method, a speech recognition hypothesis verification program, and a speech recognition program used therefor.
 With advances in speech recognition technology, speech recognition systems are increasingly being built for practical applications, such as supporting the creation of records of telephone calls and multi-party conferences, and voice UIs (User Interfaces) for mobile phones and the like.
 However, it is difficult to obtain sufficient speech recognition accuracy because of the diverse acoustic and linguistic phenomena characteristic of so-called spontaneous speech (spoken language) in telephone calls and meetings, and because of the wide variety of outdoor noise. When a speech recognition error occurs, problems arise such as the cost of correcting the error or a system malfunction. To suppress such adverse effects of speech recognition errors, detecting those errors is important.
 One conceivable way to detect such errors is to have a verification device judge whether a speech recognition hypothesis is correct. To verify a speech recognition hypothesis, methods that use a confidence measure for each word in the hypothesis have been proposed.
 For example, Patent Document 1 describes a verification device that obtains the generalized word posterior probability of each word as a confidence measure for verifying a speech recognition result, and judges from that value whether each utterance or word is correct.
 Patent Document 2, for example, describes a system that includes determination means for judging, with reference to a word dictionary prepared in advance, whether the character string and word string generated by speech recognition means are correct, and rewrite means for generating a new word string by speech recognition using a different method when a misrecognition is determined.
Patent Document 1: JP 2005-164837 A
Patent Document 2: JP 2001-134288 A
 However, the verification device described in Patent Document 1 and the method described in Patent Document 2 have the problem that recognition errors are not detected accurately enough when detection is based on verification of the speech recognition hypothesis. The verification device of Patent Document 1 verifies the speech recognition hypothesis word by word, so a recognition error section can be obtained only as a combination of word units in the hypothesis. In other words, because only the few word boundaries contained in the hypothesis are used to detect which section of the utterance was misrecognized, the speech recognition error section is not detected with sufficient accuracy.
 The system described in Patent Document 2 judges whether the speech recognition hypothesis is correct by using a word dictionary and replaces a word string judged to be erroneous with a correct word string. As is clear from its use of a word dictionary for this judgment, verification is performed word by word, so, as with Patent Document 1, the speech recognition error section is not detected with sufficient accuracy.
 The present invention has been made in view of the above problems. Its object is to provide a speech recognition hypothesis verification device that detects speech recognition error sections in an utterance with higher accuracy when verifying a speech recognition hypothesis, a speech recognition device that uses it, a speech recognition hypothesis verification method, a speech recognition method, a speech recognition hypothesis verification program, and a speech recognition program.
 A speech recognition hypothesis verification device according to the present invention includes a verification unit conversion unit that sets, for an input speech recognition hypothesis, one or more verification units each representing a time interval that serves as a processing unit for verification, and a unit determination unit that verifies, in accordance with the verification units set by the verification unit conversion unit, whether the recognition hypothesis is correct in the time interval of each verification unit. The verification unit conversion unit sets one or more verification units that include a verification unit whose time interval differs from the time interval of a word included in the speech recognition hypothesis.
 A speech recognition device according to the present invention includes a first speech recognition unit that performs speech recognition on input speech and generates a speech recognition hypothesis, a speech recognition hypothesis verification unit that verifies the speech recognition hypothesis generated by the first speech recognition unit, and a second speech recognition unit that performs speech recognition again with reference to the result of that verification. The speech recognition hypothesis verification unit has a verification unit conversion unit that sets, for the input speech recognition hypothesis, one or more verification units each representing a time interval that serves as a processing unit for verification, and a unit determination unit that verifies, in accordance with the verification units set by the verification unit conversion unit, whether the recognition hypothesis is correct in the time interval of each verification unit. The verification unit conversion unit sets one or more verification units that include a verification unit whose time interval differs from the time interval of a word included in the speech recognition hypothesis.
 A speech recognition hypothesis verification method according to the present invention is a method for verifying a speech recognition hypothesis in which, for an input speech recognition hypothesis, one or more verification units each representing a time interval that serves as a processing unit for verification are set so as to include at least a verification unit whose time interval differs from the time interval of a word included in the speech recognition hypothesis, and whether the recognition hypothesis is correct in the time interval of each verification unit is verified in accordance with the set verification units.
 A speech recognition method according to the present invention performs speech recognition on input speech to generate a speech recognition hypothesis; sets, for the generated hypothesis, one or more verification units each representing a time interval that serves as a processing unit for verification, so as to include at least a verification unit whose time interval differs from the time interval of a word included in the speech recognition hypothesis; verifies, in accordance with the set verification units, whether the recognition hypothesis is correct in the time interval of each verification unit; and, with reference to the verification result, performs speech recognition again using an acoustic model or language model selected on the basis of the recognition hypothesis in the time intervals judged to be correctly recognized.
 A speech recognition hypothesis verification program according to the present invention causes a computer to execute a procedure of setting, for an input speech recognition hypothesis, one or more verification units each representing a time interval that serves as a processing unit for verification, so as to include a verification unit whose time interval differs from the time interval of a word included in the speech recognition hypothesis, and a procedure of verifying, in accordance with the set verification units, whether the recognition hypothesis is correct in the time interval of each verification unit.
 A speech recognition program according to the present invention causes a computer to execute a procedure of performing speech recognition on input speech to generate a speech recognition hypothesis; a procedure of setting, for the generated hypothesis, one or more verification units each representing a time interval that serves as a processing unit for verification, so as to include at least a verification unit whose time interval differs from the time interval of a word included in the speech recognition hypothesis; a procedure of verifying, in accordance with the set verification units, whether the recognition hypothesis is correct in the time interval of each verification unit; and a procedure of performing speech recognition again, with reference to the verification result, using an acoustic model or language model selected on the basis of the recognition hypothesis in the time intervals judged to be correctly recognized.
 According to the present invention, speech recognition error sections in an utterance can be detected with higher accuracy.
FIG. 1 is a block diagram showing a configuration example of the speech recognition hypothesis verification device of the present invention.
FIG. 2 is a block diagram showing another configuration example of the speech recognition hypothesis verification device of the present invention.
FIG. 3 is a block diagram showing a configuration example of the speech recognition hypothesis verification device according to the first embodiment of the present invention.
FIG. 4 is an explanatory diagram showing examples of verification units.
FIG. 5 is an explanatory diagram showing an example of the correspondence among characters, syllables, phonemes, HMM states, and speech feature values.
FIG. 6 is an explanatory diagram showing an example of how features are expressed in a CRF, which is one example of a verification model.
FIG. 7 is an explanatory diagram showing examples of the change methods defined in the section determination rule.
FIG. 8 is a flowchart showing an example of the operation of the speech recognition hypothesis verification device shown in FIG. 3.
FIG. 9 is a block diagram showing a configuration example of the speech recognition device according to the second embodiment of the present invention.
FIG. 10 is an explanatory diagram showing an example of an utterance, the speech recognition hypothesis from the first speech recognition unit, and the verification result from the speech recognition hypothesis verification unit.
 Embodiments for carrying out the present invention are described in detail below with reference to the drawings.
 FIG. 1 is a block diagram showing a configuration example of the speech recognition hypothesis verification device of the present invention.
 The speech recognition hypothesis verification device shown in FIG. 1 includes a verification unit conversion unit 1 and a unit determination unit 2.
 The verification unit conversion unit 1 sets, for the input speech recognition hypothesis, one or more verification units each representing a time interval that serves as a processing unit for verification. The units it sets include a verification unit whose time interval differs from the time interval of a word included in the input hypothesis. For example, the verification unit conversion unit 1 may set one or more verification units that include a verification unit whose time interval is shorter than that of a word included in the hypothesis, or it may set the verification units on the basis of speech analysis frames.
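A frame-based setting of verification units might be sketched as follows: the utterance is cut into fixed-length blocks of analysis frames, so unit boundaries fall wherever the block length puts them rather than on word boundaries. The function name, the fixed block length, and the example word boundaries are assumptions for illustration.

```python
def frame_based_units(total_frames, frames_per_unit):
    """Split [0, total_frames) into consecutive units of frames_per_unit frames.

    Returns (start_frame, end_frame) pairs with an exclusive end; the last
    unit may be shorter when total_frames is not a multiple of the unit length.
    """
    if total_frames <= 0 or frames_per_unit <= 0:
        raise ValueError("total_frames and frames_per_unit must be positive")
    return [
        (start, min(start + frames_per_unit, total_frames))
        for start in range(0, total_frames, frames_per_unit)
    ]


verification_units = frame_based_units(total_frames=25, frames_per_unit=10)

# Hypothetical word intervals from a recognition hypothesis, for comparison.
word_intervals = [(0, 13), (13, 25)]
differs_from_words = any(unit not in word_intervals for unit in verification_units)
```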
 The unit determination unit 2 verifies, in accordance with the verification units set by the verification unit conversion unit 1, whether the recognition hypothesis is correct in the time interval of each verification unit. For example, it may do so on the basis of a verification model, which is a probabilistic model whose features are multiple types of characteristics including characteristics related to speech recognition errors in the time interval of a verification unit, and of the characteristics extracted for each verification unit from the speech recognition hypothesis being processed. For example, the unit determination unit 2 may calculate, from the verification model and the characteristics extracted for each verification unit, a verification score indicating how plausible the recognition hypothesis is for the time interval of that verification unit, and thereby verify whether the recognition hypothesis is correct in the time interval of each verification unit. A CRF (conditional random field) model may be used as the verification model.
 In this way, the verification unit conversion unit 1 sets one or more verification units that include a verification unit whose time interval differs from the time interval of a word included in the speech recognition hypothesis, and the unit determination unit 2 verifies, in accordance with those verification units, whether the recognition hypothesis is correct in the time interval of each verification unit. This raises the accuracy with which speech recognition error sections in an utterance are detected, because making the verification units independent of the time intervals of the words in the recognition hypothesis allows verification to draw on characteristics that are not word-level characteristics.
 FIG. 2 is a block diagram showing another configuration example of the speech recognition hypothesis verification device of the present invention.
 As shown in FIG. 2, the speech recognition hypothesis verification device shown in FIG. 1 may further include a section determination unit 3. The section determination unit 3 determines the error section of the speech recognition hypothesis being processed on the basis of the per-unit verification results from the unit determination unit 2. In doing so, it refers to the verification results of multiple verification units (including verification scores, if available), changes the verification results from the unit determination unit 2, and then determines the error section.
(First embodiment)
A more specific embodiment of the speech recognition hypothesis verification device described above is described below.
FIG. 3 is a block diagram showing a configuration example of the speech recognition hypothesis verification device according to the first embodiment of the present invention.
The speech recognition hypothesis verification device 101 shown in FIG. 3 includes a speech recognition hypothesis input unit 12, a verification unit conversion unit 13, a unit determination unit 14, a section determination unit 15, a verification model storage unit 16, and a section determination rule storage unit 17.
As a whole, the speech recognition hypothesis verification device 101 is realized by an information processing device, such as a personal computer (PC) or a server, that processes input data by computer. In this embodiment, it takes as input a speech recognition hypothesis output as a recognition result by a speech recognition device or the like, and outputs the verification result for that hypothesis.
The speech recognition hypothesis input unit 12 is realized by any of various data input devices; specifically, by a data input device and a control unit that accepts its input. The verification unit conversion unit 13, the unit determination unit 14, and the section determination unit 15 are realized by a CPU or the like operating according to a program. The verification model storage unit 16 and the section determination rule storage unit 17 are realized by storage units that store data.
Each component of the speech recognition hypothesis verification device 101 is realized by any combination of hardware and software, centered on the CPU and memory of any computer, a program loaded into the memory, and a storage unit such as a hard disk that stores the program. The device may also include various interfaces, such as a network connection interface.
The speech recognition hypothesis input unit 12 receives a speech recognition hypothesis from an external speech recognition device (not shown) and provides (outputs) it to the verification unit conversion unit 13. The speech recognition hypothesis is expressed, for example, as a word graph or an N-best list of word sequences, containing one or more word sequences annotated with recognition scores (likelihoods) and time information aligned to the recognized speech.
The verification unit conversion unit 13 converts the speech recognition hypothesis input via the speech recognition hypothesis input unit 12 into a data set of verification units. Here, a verification unit is the unit of processing for the verification performed by the downstream unit determination unit 14. The verification unit conversion unit 13 need not actually generate a data set of verification units; it suffices to set, for each verification unit, its range over the speech recognition hypothesis (a time interval in the speech data that was recognized). Hereinafter, "defining verification units" means defining, for the speech recognition hypothesis, the time intervals that serve as one or more verification units.
The verification unit conversion unit 13 defines verification units without depending on the time information of the speech recognition hypothesis (the time interval of each word in the hypothesis). Specifically, the verification units are defined so that at least one of them covers a time interval different from the time intervals of the words in the hypothesis. For example, one verification unit may be a single analysis frame of the recognized speech, or a segment that groups several analysis frames. In that case, the range of each verification unit is obtained by dividing the recognized speech data into time intervals of one analysis frame or one segment. Units obtained by subdividing the hypothesized words, such as characters, syllables, phonemes, or HMM states, can also be combined with frame-based units (analysis frames or segments). The time interval of a verification unit within the speech data need not be constant, for example when such sub-word units are combined with frame-based units.
As information indicating the verification units in the recognized speech data, the verification unit conversion unit 13 may generate, for example, information that associates an identifier for each verification unit with information indicating which part of the hypothesis's time span that unit corresponds to.
FIGS. 4a to 4d are explanatory diagrams showing examples of how verification units are set.
For example, as shown in FIG. 4a, suppose that the speech recognition hypothesis for analysis frames 1 to 100 of the recognized speech is the word "今月末" ("end of this month").
When analysis frames are used as verification units, a verification unit is defined for each analysis frame of the recognized speech, as shown in FIG. 4b. In this example, the verification unit conversion unit 13 generates information indicating 100 verification units, each covering the time interval of one of analysis frames 1 to 100.
When segments of 10 analysis frames are used as verification units, a verification unit is defined for each 10-frame segment of the recognized speech, as shown in FIG. 4c. In this example, the verification unit conversion unit 13 generates information indicating 10 verification units, each covering the time interval of one of segments 1 to 10, where segment 1 groups analysis frames 1 to 10, segment 2 groups analysis frames 11 to 20, and so on.
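The frame-based and segment-based settings of FIGS. 4b and 4c can be sketched as follows. This is a minimal illustration, not the device's implementation: the names `VerificationUnit`, `frame_units`, and `segment_units` are hypothetical, and letting the last segment be shorter when the frame count is not a multiple of the segment size is an assumption the text does not address.

```python
from typing import List, NamedTuple


class VerificationUnit(NamedTuple):
    unit_id: int      # identifier of the verification unit (hypothetical field)
    first_frame: int  # first analysis frame covered, inclusive
    last_frame: int   # last analysis frame covered, inclusive


def segment_units(first_frame: int, last_frame: int,
                  frames_per_segment: int) -> List[VerificationUnit]:
    """One verification unit per fixed-size segment of analysis frames.

    Assumption: the final segment is simply shorter if the frame count
    is not a multiple of frames_per_segment.
    """
    units: List[VerificationUnit] = []
    start = first_frame
    while start <= last_frame:
        end = min(start + frames_per_segment - 1, last_frame)
        units.append(VerificationUnit(len(units) + 1, start, end))
        start = end + 1
    return units


def frame_units(first_frame: int, last_frame: int) -> List[VerificationUnit]:
    """One verification unit per analysis frame (a segment of size 1)."""
    return segment_units(first_frame, last_frame, frames_per_segment=1)


per_frame = frame_units(1, 100)          # setting of FIG. 4b
per_segment = segment_units(1, 100, 10)  # setting of FIG. 4c
```

Each unit pairs an identifier with a frame range, which is one way to hold the association between identifiers and time intervals that the verification unit conversion unit 13 may generate.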
When word-related units such as the head, middle, and tail of a word are used in combination, a verification unit is defined for each of the head, middle, and tail of the word, delimited at analysis frame boundaries in the speech recognition hypothesis, as shown in FIG. 4d. In this example, the verification unit conversion unit 13 generates information indicating three verification units covering the time intervals of the head, middle, and tail of the word, respectively.
Characters, syllables, phonemes, and HMM states may also be used in combination when defining verification units.
FIG. 5 shows an example of the correspondence among characters, syllables, phonemes, HMM states, and speech features.
As shown in FIG. 5, verification units may be defined in correspondence with the characters, syllables, phonemes, and HMM states that make up a word delimited at analysis frame boundaries in the speech recognition hypothesis. For example, the range corresponding to "the head of the character 今" is identified from the time intervals of the syllables, phonemes, and HMM states, and is defined as one verification unit. In FIG. 5, the speech data is shown as a time series of speech features. In this case, one analysis frame corresponds to the feature (vector) computed for each fixed interval (for example, 25 milliseconds) of the speech signal.
The unit determination unit 14 receives the information indicating the verification units and the speech recognition hypothesis from the verification unit conversion unit 13, extracts predetermined verification features for each verification unit, and determines whether the recognition hypothesis is correct for each verification unit using the extracted verification feature values and the verification model stored in the verification model storage unit 16. For example, the unit determination unit 14 computes a verification score indicating how plausible the recognition hypothesis is for the time interval of the verification unit, and makes the determination for each verification unit based on that score.
As shown in FIG. 3, for example, the unit determination unit 14 may include a target unit selection unit 141, a feature extraction unit 142, a score calculation unit 143, and a target unit determination unit 144.
The target unit selection unit 141 receives the information indicating the verification units and the speech recognition hypothesis from the verification unit conversion unit 13 and provides the speech recognition hypothesis to the feature extraction unit 142. It also designates each verification unit in the recognized speech data, in turn, as the verification unit to be processed, and provides it to the feature extraction unit 142 and the target unit determination unit 144.
The feature extraction unit 142 receives the speech recognition hypothesis and the information indicating the verification unit to be processed from the target unit selection unit 141, extracts the predetermined verification features for that verification unit, and provides them to the score calculation unit 143.
Verification features are the features used when verifying a speech recognition hypothesis, and are extracted for each verification unit. Features whose properties relate to how likely the hypothesis is to be correct or incorrect are used. Using many kinds of verification features can improve verification accuracy. For example, structural information about the hypothesis, linguistic information about the hypothesis, and information about the recognition computation may be used. The features for the verification unit being processed can be extracted not only from the data in that unit's time interval (hereinafter simply "verification unit data") but also from the data in the preceding and following time intervals and from the data in the time interval of the word that contains the unit's time interval.
Structural information about the speech recognition hypothesis includes, for example, the number of arcs competing in the word graph within the time interval of the verification unit being processed, and the number of nodes contained in that interval. When segments of analysis frames are used as verification units, a large number of arcs in a segment suggests that the segment is more likely to contain a recognition error. A large number of nodes in the segment suggests that it may have been a word boundary in the original utterance, so the likelihood of a recognition error may differ before and after it.
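As an illustration of these structural features, the sketch below counts the arcs and nodes that fall in a verification unit's frame range. The word-graph representation (a list of `Arc` tuples plus a list of node frame positions) and the function names are assumptions made for this example; the text does not specify a data format.

```python
from typing import List, NamedTuple, Tuple


class Arc(NamedTuple):
    word: str
    start_frame: int  # inclusive
    end_frame: int    # inclusive


def count_arcs(arcs: List[Arc], unit: Tuple[int, int]) -> int:
    """Number of word-graph arcs whose frame range overlaps the unit."""
    first, last = unit
    return sum(1 for a in arcs
               if a.start_frame <= last and a.end_frame >= first)


def count_nodes(node_frames: List[int], unit: Tuple[int, int]) -> int:
    """Number of word-graph nodes located inside the unit's frame range."""
    first, last = unit
    return sum(1 for t in node_frames if first <= t <= last)


# Toy word graph (hypothetical data).
toy_arcs = [Arc("a", 1, 10), Arc("b", 1, 25), Arc("c", 11, 25), Arc("d", 26, 40)]
toy_nodes = [0, 10, 25, 40]

arcs_in_unit = count_arcs(toy_arcs, (21, 30))     # arcs b, c, d overlap
nodes_in_unit = count_nodes(toy_nodes, (21, 30))  # only the node at frame 25
```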
Linguistic information includes, for example, the surface form and part of speech of the words in the hypothesis. Using the surface form as a feature makes it possible to handle frequently occurring misrecognized expressions (the recognizer's error patterns). In particular, extracting these features at a granularity finer than the word makes it possible to detect cases such as a long hypothesized word like "今月末" ("end of this month") whose latter half is especially prone to recognition errors.
Features related to the recognition computation include, for example, values that express the plausibility of the hypothesis, such as the acoustic likelihood and the language likelihood. When the interval of a verification unit is a recognition error, such a value may be relatively low, or its difference from that of competing hypotheses may be small. Using a value obtained per frame, such as the acoustic likelihood, at the verification unit level allows it to be examined in more detail than when it is averaged over a word. The word-level confidence score computed by, for example, the verification device described in Patent Document 1 can also be used as a verification feature.
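The benefit of examining a frame-level value per verification unit rather than per word can be seen in a small numeric sketch. The scores below are invented for illustration, and `mean_log_likelihood` is a hypothetical helper, not a function described in the source.

```python
from statistics import mean
from typing import List, Tuple


def mean_log_likelihood(frame_scores: List[float], unit: Tuple[int, int]) -> float:
    """Mean per-frame score over frames first..last (1-based, inclusive)."""
    first, last = unit
    return mean(frame_scores[first - 1:last])


# Invented per-frame acoustic log-likelihoods for a four-frame word.
frame_scores = [-1.0, -1.0, -5.0, -5.0]

word_level = mean_log_likelihood(frame_scores, (1, 4))   # average over the whole word
first_half = mean_log_likelihood(frame_scores, (1, 2))   # first verification unit
second_half = mean_log_likelihood(frame_scores, (3, 4))  # second verification unit
```

The word-level average sits between the two halves and hides the fact that only the second half scores poorly, which is the kind of detail that finer verification units expose.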
The score calculation unit 143 receives the information indicating the verification unit being processed and the verification features for that unit from the feature extraction unit 142, computes a verification score using the verification model stored in the verification model storage unit 16, and provides the score to the target unit determination unit 144.
The verification model storage unit 16 holds information on the verification model, a model that represents the strength of the association between the verification features observed in verification unit data and the likelihood that the recognition hypothesis is correct or incorrect.
The score calculation unit 143 may compute the verification score using, for example, classification with a CRF (Conditional Random Field), which is a kind of discriminative model. A CRF is described by the following Equation (1).
P(Y|X) = exp(Λ·Φ(X,Y)) / Z   ... Equation (1)
In Equation (1), X is the input to be classified and Y is the classification result associated with the input. Φ(X,Y) denotes the features used for classification, and Λ denotes the CRF model parameters (weights) corresponding to each feature. Z is the normalization term, which makes P(Y|X) sum to 1 over all candidate outputs Y, and exp() is the exponential function with base e.
In an embodiment that uses CRF-based classification, the input X is the verification unit data converted from the speech recognition hypothesis under verification, and the output Y is the verification result associated with each input verification unit. The features Φ(X,Y) take the values of verification features such as the number of arcs, the number of nodes, and occurrence frequency. At classification time, the output that maximizes the left-hand side P(Y|X) of Equation (1) for the input is selected. The CRF model parameters may be optimized (trained) using pre-associated pairs of input (X: verification unit data) and output (Y: classification result) as training data, for example by an iterative method that maximizes the log-likelihood of Equation (1). Details of CRF-based classification and parameter training are given, for example, in J. Lafferty, A. McCallum, F. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data", Proceedings of the 18th International Conference on Machine Learning, 2001, pp. 282-289.
The verification model storage unit 16 may hold, as the CRF information, for example, information on the features Φ and on the model parameters Λ (weights).
The target unit determination unit 144 compares the verification score computed for the verification unit designated by the target unit selection unit 141 against a predetermined criterion to determine whether the recognition hypothesis is correct for each verification unit. This determination corresponds to the per-unit verification result for the recognition hypothesis. The target unit determination unit 144 provides the determination (that is, the verification result for each verification unit) to the section determination unit 15, and may provide the verification score along with it.
The following describes more concretely how the verification score is computed and how correctness is determined using CRF-based classification.
For example, consider one of the verification units set over a given length of speech data, and suppose it is known that, when the recognition hypothesis is incorrect (or correct), the part of the hypothesis corresponding to that unit's time interval exhibits verification features such as number of arcs = 4 or number of nodes = 7. In that case, these features are expressed as features used by the verification model, as shown in FIG. 6.
FIG. 6 is an explanatory diagram showing an example of the CRF features Φ.
FIG. 6 shows two example features: F(number of arcs = 4) = 1 and F(number of nodes = 7) = 1.
The score calculation unit 143 may obtain a score for each possible verification result (for example, the two results "correct" and "error") by multiplying these features by the weights Λ for that result. The target unit determination unit 144 then treats results with larger scores as more likely, and fixes the verification result for the verification unit accordingly.
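The scoring just described can be sketched for a single verification unit as a log-linear model in the form of Equation (1). This is a simplified illustration under stated assumptions: each unit is scored independently, without the label-transition features a linear-chain CRF would normally include, and the weight values and the name `label_probabilities` are invented for the example.

```python
import math
from typing import Dict


def label_probabilities(features: Dict[str, float],
                        weights: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """P(Y|X) = exp(weights[Y] . features) / Z for each candidate label Y."""
    # Weighted sum of feature values for each label.
    scores = {label: sum(w.get(name, 0.0) * value
                         for name, value in features.items())
              for label, w in weights.items()}
    # Subtract the maximum before exponentiating for numerical stability;
    # this does not change the normalized result.
    top = max(scores.values())
    exp_scores = {label: math.exp(s - top) for label, s in scores.items()}
    z = sum(exp_scores.values())  # normalization term Z
    return {label: e / z for label, e in exp_scores.items()}


# Binary indicator features as in FIG. 6.
unit_features = {"arcs=4": 1.0, "nodes=7": 1.0}
# Invented weights: both features are assumed to point toward "error".
toy_weights = {
    "correct": {"arcs=4": -0.5, "nodes=7": -0.25},
    "error": {"arcs=4": 0.5, "nodes=7": 0.25},
}

probs = label_probabilities(unit_features, toy_weights)
best_label = max(probs, key=probs.get)  # label with the highest P(Y|X)
```

The resulting probability of the chosen label can serve as the verification score passed on with the verification result.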
The section determination unit 15 receives the verification result for each verification unit from the target unit determination unit 144 of the unit determination unit 14 and determines the recognition error sections contained in the speech recognition hypothesis. In this embodiment, it does so by changing the per-unit verification results as necessary according to the section determination rules stored in the section determination rule storage unit 17. A section determination rule is a rule (information specifying, for example, how to make changes) for adapting the per-unit verification results from the unit determination unit 14 to the intended use. For example, a rule may specify changes based on the reliability of a verification result or on its relation to the verification results of other verification units (for example, the preceding and following units).
FIGS. 7a to 7d are explanatory diagrams showing examples of the change methods specified by section determination rules. FIG. 7a shows an example of the verification results produced by the unit determination unit 14 for verification units 1 to 32 set over the input speech recognition hypothesis.
In the example of FIG. 7a, the unit determination unit 14 has labeled verification units 1 to 5, 9, 12 to 15, 17 to 19, 25 to 28, and 30 to 32 with "○", indicating that the recognition hypothesis is correct, and verification units 6 to 8, 10 to 11, 16, 20 to 24, and 29 with "×", indicating that it is incorrect. Given these results, as shown in FIG. 7b, the verification results are first confirmed for every section in which the same label, with at least a predetermined score, continues for at least a predetermined number of units. In FIG. 7b, the results are confirmed for the boxed sections of verification units 1 to 5, 6 to 8, 12 to 15, 17 to 19, 20 to 24, 25 to 28, and 30 to 32. As a result, the sections of verification units 6 to 8 and 20 to 24 are confirmed as error sections. Even where adjacent units carry the same label, a section is not confirmed if some of its verification scores fall below the predetermined score so that the run of qualifying units is shorter than the predetermined number.
Next, as shown in FIG. 7c, each unconfirmed section is resolved by majority vote over a predetermined number of units from the confirmed sections before and after it. In FIG. 7c, a majority vote over the three confirmed verification units on each side is attempted for the unconfirmed verification units 9 to 11, 16, and 29; as a result, the verification results of units 16 and 29 are changed to "correct", as indicated by the underlines. For the unconfirmed section of verification units 9 to 11, the neighboring confirmed units 6 to 8 and 12 to 14 contain three units judged correct and three judged incorrect, so the majority vote cannot decide. Other cases that cannot be decided include those in which the predetermined number of consecutive confirmed units is not available.
Finally, as shown in FIG. 7d, any section still unconfirmed is confirmed with the verification result that the recognition hypothesis is incorrect. In FIG. 7d, the results for the unconfirmed verification units 9 to 11 are changed to "incorrect", as indicated by the underlines.
After changing the per-unit verification results according to the section determination rules, the section determination unit 15 detects the time intervals of the verification units finally confirmed as errors as the recognition error sections of the recognition hypothesis. In the example of FIGS. 7a to 7d, the time intervals corresponding to verification units 6 to 11 and 20 to 24 are detected as recognition error sections.
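The three rules of FIGS. 7b to 7d can be sketched as follows and checked against the 32-unit example. This is one possible reading, with simplifications that are assumptions of the sketch: the score threshold of the first rule is omitted (only run length is checked), the run length and vote width are both set to 3 to match the figures, and a tie or a lack of confirmed neighbors falls through to "error". The function names are hypothetical.

```python
from typing import List, Optional, Tuple


def apply_section_rules(labels: List[bool], min_run: int = 3,
                        context: int = 3) -> List[bool]:
    """labels[i] is True if unit i+1 was judged correct. Returns revised labels."""
    n = len(labels)
    confirmed = [False] * n
    # Rule 1 (FIG. 7b): confirm runs of the same label of length >= min_run.
    i = 0
    while i < n:
        j = i
        while j < n and labels[j] == labels[i]:
            j += 1
        if j - i >= min_run:
            confirmed[i:j] = [True] * (j - i)
        i = j
    result = list(labels)
    i = 0
    while i < n:
        if confirmed[i]:
            i += 1
            continue
        j = i
        while j < n and not confirmed[j]:
            j += 1  # units i..j-1 form one unconfirmed section
        # Rule 2 (FIG. 7c): majority vote over confirmed neighbors on each side.
        votes = [labels[k] for k in range(max(0, i - context), i) if confirmed[k]]
        votes += [labels[k] for k in range(j, min(n, j + context)) if confirmed[k]]
        n_correct = sum(votes)
        n_error = len(votes) - n_correct
        # Rule 3 (FIG. 7d): anything the vote cannot decide becomes "error".
        result[i:j] = [n_correct > n_error] * (j - i)
        i = j
    return result


def error_sections(labels: List[bool]) -> List[Tuple[int, int]]:
    """Maximal runs of error units as (first_unit, last_unit), 1-based."""
    sections: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for unit, ok in enumerate(labels, start=1):
        if not ok and start is None:
            start = unit
        elif ok and start is not None:
            sections.append((start, unit - 1))
            start = None
    if start is not None:
        sections.append((start, len(labels)))
    return sections


# Units labeled correct in FIG. 7a: 1-5, 9, 12-15, 17-19, 25-28, 30-32.
correct_units = (set(range(1, 6)) | {9} | set(range(12, 16))
                 | set(range(17, 20)) | set(range(25, 29)) | set(range(30, 33)))
fig7a = [unit in correct_units for unit in range(1, 33)]
detected = error_sections(apply_section_rules(fig7a))
```

With these settings, `detected` covers units 6 to 11 and 20 to 24, the same sections the text reports for FIG. 7d.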
The section determination rule storage unit 17 may store, as section determination rules, for example, information specifying which logic to use for the change processing and the parameters used by each logic (for example, the number of units and the threshold used for the run-length determination).
Next, the operation of this embodiment is described.
FIG. 8 is a flowchart showing an example of the operation of the speech recognition hypothesis verification device 101 shown in FIG. 3.
As shown in FIG. 8, when started, the speech recognition hypothesis verification device 101 performs initialization (step 11), such as reading the verification model and the section determination rules from the storage devices that implement the verification model storage unit 16 and the section determination rule storage unit 17, and loading them so that the unit determination unit 14 and the section determination unit 15, respectively, can refer to them.
Meanwhile, the speech recognition hypothesis input unit 12 receives (inputs) a speech recognition hypothesis, for example in response to a notification from an external speech recognition device that speech recognition has finished, and provides (outputs) it to the verification unit conversion unit 13 (step 12). The speech recognition hypothesis input unit 12 may instead input a speech recognition hypothesis in response to, for example, a user instruction.
When the speech recognition hypothesis to be verified is input via the speech recognition hypothesis input unit 12, the verification unit conversion unit 13 converts it into a data set of one or more verification units and provides it to the unit determination unit 14 (step 13). For example, the verification unit conversion unit 13 provides the unit determination unit 14 with information that indicates the one or more verification units in terms of time intervals in the speech data.
The unit determination unit 14 computes a verification score for each verification unit and verifies the recognition hypothesis, that is, determines whether it is correct (step 14). Within the unit determination unit 14, the target unit selection unit 141 first designates each verification unit set over the recognition hypothesis, in turn, as the processing target. The feature extraction unit 142 then extracts the verification features of the designated verification unit. Next, the score calculation unit 143 computes a verification score for the designated verification unit by referring to the extracted verification features and the verification model. Finally, the target unit determination unit 144 determines, based on the computed verification score, whether the recognition hypothesis is correct for the time interval of the designated verification unit. The per-unit verification results (correct/incorrect determinations) obtained in this way are provided to the section determination unit 15 together with the verification scores.
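The per-unit loop of step 14 can be outlined as below. The sketch only mirrors the order of the four sub-steps; `extract` and `score` stand in for the feature extraction and score calculation, and the fixed `threshold` is an assumed form of the predetermined criterion. All names and toy values are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Unit = Tuple[int, int]       # (first_frame, last_frame), inclusive
Features = Dict[str, float]


def verify_units(units: List[Unit],
                 extract: Callable[[Unit], Features],
                 score: Callable[[Features], float],
                 threshold: float = 0.5) -> List[Tuple[Unit, bool, float]]:
    """Return (unit, judged_correct, verification_score) for every unit."""
    results = []
    for unit in units:                   # target unit selection (141)
        features = extract(unit)         # feature extraction (142)
        s = score(features)              # score calculation (143)
        judged_correct = s >= threshold  # target unit determination (144)
        results.append((unit, judged_correct, s))
    return results


# Toy stand-ins: a unit starting at frame 1 is scored high, others low.
def toy_extract(unit: Unit) -> Features:
    return {"first_frame": float(unit[0])}


def toy_score(features: Features) -> float:
    return 0.9 if features["first_frame"] == 1.0 else 0.2


unit_results = verify_units([(1, 10), (11, 20)], toy_extract, toy_score)
```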
The section determination unit 15 detects recognition error sections in the input speech recognition hypothesis based on the verification result for each verification unit (step 15). Following the section determination rules, the section determination unit 15 changes the verification results assigned to the individual verification units as appropriate, outputs the time intervals corresponding to the verification units finally determined to be erroneous as the recognition error sections of the speech recognition hypothesis, and ends the series of speech recognition hypothesis verification processes.
As described above, the present embodiment uses, as the verification unit of the speech recognition hypothesis, a unit finer than the words in the hypothesis or a unit based on analysis frames that does not depend on how words are identified in the hypothesis. The speech recognition hypothesis can therefore be verified with reference to features that are unavailable at the word level, and as a result speech recognition error sections can be detected with higher accuracy.
Furthermore, because the section determination unit 15 has a function for adjusting (changing) the per-unit verification results, recognition error sections suited to the intended use can be detected. For example, when the speech of a recognition error section is cut out and recognized again, a time interval of a certain length is required; in such a case, at least a predetermined length can be secured. In addition, based on the verification score, sections that are about equally likely to be "correct" or "incorrect" can be put on hold, which improves robustness against determination errors in the unit determination unit 14. Determining an undetermined section from the determined sections before and after it also amounts to a kind of smoothing; for example, a single unit whose result differs from its surroundings can be corrected.
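The adjustments described above (putting ambiguous units on hold, resolving them from their neighbours, smoothing isolated units, and securing a minimum length) can be sketched as follows. The thresholds, the neighbour rule, and all function names are assumptions made for this illustration; the specification leaves the concrete section determination rules open.

```python
def label(score, low=0.4, high=0.6):
    """Map a verification score to correct ("C"), error ("E"), or pending ("?")."""
    if score >= high:
        return "C"
    if score <= low:
        return "E"
    return "?"


def resolve_pending(labels):
    """Decide each pending unit from the nearest determined units on both sides."""
    out = list(labels)
    for i, lab in enumerate(labels):
        if lab != "?":
            continue
        left = next((l for l in reversed(labels[:i]) if l != "?"), None)
        right = next((l for l in labels[i + 1:] if l != "?"), None)
        neighbours = [n for n in (left, right) if n is not None]
        # Assumed rule: error only when every determined neighbour is an error.
        out[i] = "E" if neighbours and all(n == "E" for n in neighbours) else "C"
    return out


def smooth_isolated(labels):
    """Flip a single unit whose two neighbours agree with each other but not with it."""
    out = list(labels)
    for i in range(1, len(labels) - 1):
        if labels[i - 1] == labels[i + 1] != labels[i]:
            out[i] = labels[i - 1]
    return out


def error_sections(labels):
    """Collect maximal runs of "E" as half-open (start, end) unit index ranges."""
    sections, start = [], None
    for i, lab in enumerate(list(labels) + ["C"]):
        if lab == "E" and start is None:
            start = i
        elif lab != "E" and start is not None:
            sections.append((start, i))
            start = None
    return sections


def pad_section(start, end, total, min_units):
    """Widen (start, end) to at least min_units units while staying inside [0, total)."""
    short = max(0, min_units - (end - start))
    start = max(0, start - short // 2)
    end = min(total, max(end, start + min_units))
    start = max(0, min(start, end - min_units))
    return start, end


# Made-up verification scores for ten consecutive verification units.
scores = [0.9, 0.8, 0.2, 0.5, 0.1, 0.9, 0.9, 0.3, 0.9, 0.95]
labels = smooth_isolated(resolve_pending([label(s) for s in scores]))
sections = [pad_section(s, e, len(labels), min_units=3)
            for s, e in error_sections(labels)]
print("".join(labels), sections)
```

In this example the pending unit at index 3 is absorbed into the surrounding error run, and the isolated low-score unit at index 7 is corrected to match its neighbours. Merging padded sections that overlap is omitted for brevity.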
When the speech recognition hypothesis is expressed as N-best word strings, a common verification unit may be set for the N word strings using segment units or the like, or separate verification units may be set by also using units related to the words in each of the N word strings. Likewise, when the speech recognition hypothesis is expressed as a word graph, a verification unit common to the entire word graph may be set using segment units or the like, or separate verification units may be set by also using units related to the words in the word graph.
In addition, for a single word string indicated by the speech recognition hypothesis, verification need not be limited to defining one type of verification unit by a single criterion (for example, segment units) and verifying based on the features extracted for each such unit. For example, multiple types of verification units may be defined, verification performed for each type, and the error section determined after combining the results. In that case, a plurality of verification unit conversion units 13 and unit determination units 14 may be provided, with the section determination unit 15 integrating the verification results from the plurality of unit determination units 14 to determine the error section.
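One possible way to integrate results from several unit determination units is to project each result onto analysis frames and count votes per frame. The sketch below assumes this voting scheme, which is not prescribed by the specification; `frame_votes`, `integrate`, and `min_votes` are hypothetical names, and the intervals are made up.

```python
def frame_votes(results, n_frames):
    """Count, per frame, how many units of one unit type flag that frame as an error.

    results: list of (start_frame, end_frame, is_error) with half-open intervals.
    """
    votes = [0] * n_frames
    for start, end, is_error in results:
        if is_error:
            for frame in range(start, min(end, n_frames)):
                votes[frame] += 1
    return votes


def integrate(result_sets, n_frames, min_votes=2):
    """Mark a frame as erroneous when at least min_votes unit types agree (assumed rule)."""
    totals = [0] * n_frames
    for results in result_sets:
        for frame, vote in enumerate(frame_votes(results, n_frames)):
            totals[frame] += vote
    return [total >= min_votes for total in totals]


# Two unit types over the same 30 frames: fixed segments and sub-word units.
segment_results = [(0, 10, False), (10, 20, True), (20, 30, False)]
subword_results = [(0, 8, False), (8, 18, True), (18, 30, False)]
is_error = integrate([segment_results, subword_results], n_frames=30)
error_frames = [f for f, flag in enumerate(is_error) if flag]
print(error_frames)
```

Requiring agreement between both unit types narrows the error section to the frames where the two results overlap; a lower `min_votes` would instead take their union.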
(Second Embodiment)
Next, a second embodiment of the present invention will be described.
FIG. 9 is a block diagram showing a configuration example of the speech recognition device according to the second embodiment of the present invention.
The speech recognition device 201 shown in FIG. 9 includes a first speech recognition unit 21, a speech recognition hypothesis verification unit 22, a second speech recognition unit 23, a first model storage unit 24, and a second model storage unit 25.
The speech recognition device 201 as a whole is realized by, for example, an information processing device such as a personal computer (PC) or a server device that processes input data by computer.
The first speech recognition unit 21 performs speech recognition processing on the speech input to the speech recognition device 201 to obtain word string candidates corresponding to the speech, and outputs, for example, a word graph as the speech recognition hypothesis. The first speech recognition unit 21 may perform ordinary speech recognition processing over the entire utterance, searching for the word string that matches the speech data according to the scores given by the first model stored in the first model storage unit 24 (a model for speech recognition that includes an acoustic model, a language model, a word dictionary, and the like). For example, a hidden Markov model is used as the acoustic model and a word trigram model as the language model.
The speech recognition hypothesis verification unit 22 is a processing unit corresponding to the speech recognition hypothesis verification device 101 shown in FIG. 3. For the speech recognition hypothesis output by the first speech recognition unit 21, it performs the verification unit setting process, the verification process for each verification unit, and the error section determination process, and outputs the outcome as the verification result. As the verification result it outputs, for example, information (frame numbers or the like) indicating the speech recognition error sections in the speech data.
Based on the verification result from the speech recognition hypothesis verification unit 22, the second speech recognition unit 23 performs speech recognition processing again on the section of the input speech determined to be a speech recognition error section, or on a section that also includes the speech before and after it. The second speech recognition unit 23 performs this processing using the second model stored in the second model storage unit 25. Here, the second model storage unit 25 is assumed to store a model different from the first model stored in the first model storage unit 24. For an acoustic model, information indicating the occurrence probability distribution of speech feature values may be stored for each unit such as a phoneme. For example, when a hidden Markov model is used as the second model, the stored information may be the parameters (such as coefficients used in the calculation) that define a hidden Markov model from which predetermined values, different from those of the first model, are derived as the occurrence probability distribution of the speech feature values for each unit such as a phoneme. For a language model, information indicating occurrence probabilities and connection probabilities may be stored for each unit such as a word. For example, when a word trigram model is used as the second model, the stored information may be the parameters (such as coefficients used in the calculation) that define a word trigram model from which predetermined values, different from those of the first model, are derived as the occurrence or connection probabilities for each unit such as a word.
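The flow through the first speech recognition unit 21, the speech recognition hypothesis verification unit 22, and the second speech recognition unit 23 can be outlined as below. The recognizers and the verifier are passed in as plain callables and replaced by stubs in the example; `two_pass_recognize`, `expand`, and `margin` are hypothetical names, and the margin models the option of including the speech before and after the error section.

```python
def expand(section, margin, n_frames):
    """Widen an error section by `margin` frames on both sides, clamped to the audio."""
    start, end = section
    return max(0, start - margin), min(n_frames, end + margin)


def two_pass_recognize(audio, first_pass, verify, second_pass, margin=0):
    """Recognize, verify, then re-recognize only the sections judged erroneous.

    first_pass(audio) -> hypothesis
    verify(audio, hypothesis) -> list of (start_frame, end_frame) error sections
    second_pass(audio_slice) -> re-recognition result for that slice
    """
    hypothesis = first_pass(audio)
    corrections = {}
    for section in verify(audio, hypothesis):
        start, end = expand(section, margin, len(audio))
        corrections[(start, end)] = second_pass(audio[start:end])
    return hypothesis, corrections


# Stub components standing in for real recognizers (assumptions for illustration).
audio = list(range(100))  # 100 dummy frames
hypothesis, corrections = two_pass_recognize(
    audio,
    first_pass=lambda a: ["今月末", "火", "の", "出る", "試合"],
    verify=lambda a, h: [(30, 60)],
    second_pass=lambda frames: "re-recognized %d frames" % len(frames),
    margin=5,
)
print(corrections)
```

Keeping the second pass as a separate callable mirrors the device structure, in which the second speech recognition unit 23 uses a model different from that of the first pass.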
FIG. 10 is an explanatory diagram showing an example of an utterance, the speech recognition hypothesis produced by the first speech recognition unit 21, and the verification result produced by the speech recognition hypothesis verification unit 22.
As shown in FIG. 10, suppose, for example, that for the utterance 「今月松井の出る試合」 ("games in which Matsui plays this month"), the first speech recognition unit 21 outputs the speech recognition hypothesis 「<今月末><火><の><出る><試合>」. Here "<>" marks the word boundaries in the speech recognition hypothesis. Suppose further that the speech recognition hypothesis verification unit 22 extracts the features of each verification unit of this hypothesis, performs verification, and determines that the section from the latter half of 「月」 in 「今月末」 to the end of 「火」, that is, the section corresponding to 「松井」 ("Matsui") in the utterance, is a recognition error section.
For the section that the speech recognition hypothesis verification unit 22 has determined to be a recognition error section (the section from the latter half of 「月」 in 「今月末」 to the end of 「火」), the second speech recognition unit 23 may perform speech recognition processing using, for example, the word string 「の出る試合」 indicated by the recognition hypothesis in the section judged correct as a linguistic constraint. In this example, 「の出る試合」 is fixed and the section before it becomes the recognition target. As the linguistic constraint, for example, a language model representing how readily words connect is used as the second model, so that words that connect readily to 「の」 and 「出る」 rank higher at the end of the recognition target. In the speech recognition processing of the first speech recognition unit 21, neither 「の」 nor 「出る」 has been fixed, so every possibility must be considered; adding the constraint makes it possible to raise recognition accuracy.
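The effect of fixing the right-hand context can be illustrated by rescoring candidates for the error section with a bigram probability toward the confirmed word 「の」. This is a simplified sketch: a real second pass would run a decoder, whereas here the candidates, their acoustic log scores, and the bigram probabilities are all made-up values, and `rescore`, `lm_weight`, and `floor` are hypothetical names.

```python
import math


def rescore(candidates, right_word, bigram, lm_weight=1.0, floor=1e-6):
    """Rank candidates by acoustic log score plus log P(right_word | candidate).

    candidates: list of (word, acoustic_log_score)
    bigram: dict mapping (word, next_word) to a probability
    """
    scored = []
    for word, acoustic in candidates:
        probability = bigram.get((word, right_word), floor)  # floor for unseen pairs
        scored.append((acoustic + lm_weight * math.log(probability), word))
    scored.sort(reverse=True)  # best (largest) combined score first
    return [word for _, word in scored]


# Illustrative values only: "火" is acoustically slightly better than "松井".
candidates = [("火", -9.8), ("松井", -10.2), ("待つ", -10.5)]
bigram = {("松井", "の"): 0.2, ("火", "の"): 0.01}
ranking = rescore(candidates, right_word="の", bigram=bigram)
print(ranking)
```

With these assumed numbers the constraint from the confirmed context moves 「松井」 ahead of the acoustically preferred 「火」, which is the behaviour the paragraph above describes.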
Alternatively, for example, it may be inferred from 「の出る試合」, which was judged to be correct, that a person's name is likely to appear in the utterance, and speech recognition processing may be performed using a model that readily recognizes person names as the second model. In this example, the information that a person's name is likely to appear before the section 「の出る試合」 is used to raise the likelihood of words used as person names in the section that appears to contain a name. Regarding the selection of the second model, when a model different from the first model is stored in advance in the second model storage unit 25 as the second model, that stored model may be used as is. When multiple types of models are stored in the second model storage unit 25, a model different from the first model may be selected from among them as the second model. Even a model of the same type as the first model can be used as the second model by giving it parameter values different from those given to the first model.
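Selecting the second model from the confirmed context can be sketched as a small lookup. The rule format, the model names, and the fallback behaviour are assumptions made for this example; the specification only requires that the second model differ from the first.

```python
def select_second_model(stored_models, first_model, context_words, context_rules):
    """Pick a second model that differs from the first one.

    stored_models: dict of model name -> model object (insertion order is preserved)
    context_rules: list of (trigger_words, model_name); a rule fires when every
                   trigger word occurs in the confirmed context (assumed format).
    """
    for trigger_words, model_name in context_rules:
        fires = all(word in context_words for word in trigger_words)
        if fires and model_name in stored_models and model_name != first_model:
            return model_name
    # Fallback: any stored model other than the first model.
    for model_name in stored_models:
        if model_name != first_model:
            return model_name
    raise ValueError("no model other than the first model is stored")


# Hypothetical model inventory; the values would be real model parameters.
stored_models = {"general": None, "place_names": None, "person_names": None}
rules = [(("出る", "試合"), "person_names")]

chosen = select_second_model(stored_models, "general", ["の", "出る", "試合"], rules)
fallback = select_second_model(stored_models, "general", ["天気"], rules)
print(chosen, fallback)
```

Here the confirmed words 「出る」 and 「試合」 trigger the person-name model, while an unrelated context falls back to the first stored model that is not the first-pass model.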
In this way, speech recognition accuracy can be improved by adding a temporal constraint, namely which section of the utterance (speech) is erroneous, together with linguistic or acoustic constraints, namely what linguistic or acoustic information exists before and after that section.
In the present invention, the processing in the speech recognition hypothesis verification device and the speech recognition device may be realized not only by the dedicated hardware described above but also by recording a program for realizing its functions on a recording medium readable by the speech recognition hypothesis verification device or the speech recognition device, and having the device read and execute the program recorded on that medium. A recording medium readable by the speech recognition hypothesis verification device or the speech recognition device refers to a removable recording medium such as an IC card, a memory card, a floppy disk (registered trademark), a magneto-optical disk, a DVD, or a CD, as well as an HDD or the like built into the device. The program recorded on the recording medium is read by, for example, a control block, and processing similar to that described above is performed under the control of the control block.
The present invention has been described above with reference to the embodiments, but the present invention is not limited to those embodiments. Various changes that those skilled in the art can understand may be made to the configuration and details of the present invention within its scope.
This application claims priority based on Japanese Patent Application No. 2008-218605 filed on August 27, 2008, the entire disclosure of which is incorporated herein.
The present invention is suitably applicable to systems that use speech recognition technology.

Claims (16)

  1.  A speech recognition hypothesis verification device comprising:
     a verification unit conversion unit that sets, for an input speech recognition hypothesis, one or more verification units each representing a time interval serving as a processing unit of verification; and
     a unit determination unit that verifies, in accordance with the verification units set by the verification unit conversion unit, whether the recognition hypothesis is correct in the time interval of each verification unit,
     wherein the verification unit conversion unit sets the one or more verification units so as to include a verification unit whose time interval differs from the time intervals of the words included in the speech recognition hypothesis.
  2.  The speech recognition hypothesis verification device according to claim 1,
     wherein the verification unit conversion unit sets the one or more verification units so as to include a verification unit whose time interval is smaller than the time intervals of the words included in the speech recognition hypothesis.
  3.  The speech recognition hypothesis verification device according to claim 1 or 2,
     wherein the unit determination unit verifies whether the recognition hypothesis is correct in the time interval of each verification unit based on a verification model, which is a probabilistic model whose features are a plurality of types of features including at least a feature related to speech recognition errors in the time interval of a verification unit, and on features extracted for each verification unit from the speech recognition hypothesis to be processed.
  4.  The speech recognition hypothesis verification device according to claim 3,
     wherein a CRF model is used as the verification model.
  5.  The speech recognition hypothesis verification device according to any one of claims 1 to 4, further comprising
     a section determination unit that determines an error section of the speech recognition hypothesis to be processed based on the verification result for each verification unit from the unit determination unit,
     wherein the section determination unit determines the error section after changing the verification results from the unit determination unit with reference to the verification results of a plurality of verification units.
  6.  The speech recognition hypothesis verification device according to any one of claims 1 to 5,
     wherein the verification unit conversion unit sets the one or more verification units based on speech analysis frame units.
  7.  A speech recognition device comprising:
     a first speech recognition unit that performs speech recognition on input speech and generates a speech recognition hypothesis;
     a speech recognition hypothesis verification unit that verifies the speech recognition hypothesis generated by the first speech recognition unit; and
     a second speech recognition unit that performs speech recognition again with reference to the verification result of the speech recognition hypothesis from the speech recognition hypothesis verification unit,
     wherein the speech recognition hypothesis verification unit includes
     a verification unit conversion unit that sets, for the input speech recognition hypothesis, one or more verification units each representing a time interval serving as a processing unit of verification, and
     a unit determination unit that verifies, in accordance with the verification units set by the verification unit conversion unit, whether the recognition hypothesis is correct in the time interval of each verification unit, and
     wherein the verification unit conversion unit sets the one or more verification units so as to include a verification unit whose time interval differs from the time intervals of the words included in the speech recognition hypothesis.
  8.  The speech recognition device according to claim 7,
     wherein the second speech recognition unit refers to the verification result of the speech recognition hypothesis from the speech recognition hypothesis verification unit and performs speech recognition using an acoustic model or a language model selected based on the recognition hypothesis of a time interval determined to be correctly recognized.
  9.  A speech recognition hypothesis verification method for verifying a speech recognition hypothesis, the method comprising:
     setting, for an input speech recognition hypothesis, one or more verification units each representing a time interval serving as a processing unit of verification, so as to include at least a verification unit whose time interval differs from the time intervals of the words included in the speech recognition hypothesis; and
     verifying, in accordance with the set verification units, whether the recognition hypothesis is correct in the time interval of each verification unit.
  10.  The speech recognition hypothesis verification method according to claim 9,
     wherein whether the recognition hypothesis is correct in the time interval of each verification unit is verified based on a verification model, which is a probabilistic model whose features are a plurality of types of features including at least a feature related to speech recognition errors in the time interval of a verification unit, and on features extracted for each verification unit from the speech recognition hypothesis to be processed.
  11.  The speech recognition hypothesis verification method according to claim 9 or 10,
     wherein, when an error section of the speech recognition hypothesis to be processed is determined based on the verification result for each verification unit, the error section is determined after changing the verification results from the unit determination unit with reference to the verification results of a plurality of verification units.
  12.  A speech recognition method comprising:
     performing speech recognition on input speech to generate a speech recognition hypothesis;
     setting, for the generated speech recognition hypothesis, one or more verification units each representing a time interval serving as a processing unit of verification, so as to include at least a verification unit whose time interval differs from the time intervals of the words included in the speech recognition hypothesis;
     verifying, in accordance with the set verification units, whether the recognition hypothesis is correct in the time interval of each verification unit; and
     performing speech recognition again, with reference to the verification result of the speech recognition hypothesis, using an acoustic model or a language model selected based on the recognition hypothesis of a time interval determined to be correctly recognized.
  13.  A speech recognition hypothesis verification program for causing a computer to execute:
     a procedure of setting, for an input speech recognition hypothesis, one or more verification units each representing a time interval serving as a processing unit of verification, so as to include a verification unit whose time interval differs from the time intervals of the words included in the speech recognition hypothesis; and
     a procedure of verifying, in accordance with the set verification units, whether the recognition hypothesis is correct in the time interval of each verification unit.
  14.  The speech recognition hypothesis verification program according to claim 13, for causing the computer to execute
     a procedure of verifying whether the recognition hypothesis is correct in the time interval of each verification unit based on a verification model, which is a probabilistic model whose features are a plurality of types of features including at least a feature related to speech recognition errors in the time interval of a verification unit, and on features extracted for each verification unit from the speech recognition hypothesis to be processed.
  15.  The speech recognition hypothesis verification program according to claim 13 or 14, for causing the computer to execute
     a procedure of determining, when an error section of the speech recognition hypothesis to be processed is determined based on the verification result for each verification unit, the error section after changing the verification results from the unit determination unit with reference to the verification results of a plurality of verification units.
  16.  A speech recognition program for causing a computer to execute:
     a procedure of performing speech recognition on input speech to generate a speech recognition hypothesis;
     a procedure of setting, for the generated speech recognition hypothesis, one or more verification units each representing a time interval serving as a processing unit of verification, so as to include at least a verification unit whose time interval differs from the time intervals of the words included in the speech recognition hypothesis;
     a procedure of verifying, in accordance with the set verification units, whether the recognition hypothesis is correct in the time interval of each verification unit; and
     a procedure of performing speech recognition again, with reference to the verification result of the speech recognition hypothesis, using an acoustic model or a language model selected based on the recognition hypothesis of a time interval determined to be correctly recognized.
PCT/JP2009/062611 2008-08-27 2009-07-10 Device for verifying speech recognition hypothesis, speech recognition device, and method and program used for same WO2010024052A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2010526623A JP5447382B2 (en) 2008-08-27 2009-07-10 Speech recognition hypothesis verification device, speech recognition device, method and program used therefor

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2008218605 2008-08-27
JP2008-218605 2008-08-27

Publications (1)

Publication Number Publication Date
WO2010024052A1 true WO2010024052A1 (en) 2010-03-04

Family

ID=41721226

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2009/062611 WO2010024052A1 (en) 2008-08-27 2009-07-10 Device for verifying speech recognition hypothesis, speech recognition device, and method and program used for same

Country Status (2)

Country Link
JP (1) JP5447382B2 (en)
WO (1) WO2010024052A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014149490A (en) * 2013-02-04 2014-08-21 Nippon Hoso Kyokai <Nhk> Voice recognition error correction device and program of the same
CN109829162A (en) * 2019-01-30 2019-05-31 新华三大数据技术有限公司 A kind of text segmenting method and device
CN111883109A (en) * 2020-07-01 2020-11-03 北京猎户星空科技有限公司 Voice information processing and verification model training method, device, equipment and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10116094A (en) * 1996-10-01 1998-05-06 Lucent Technol Inc Method and device for voice recognition
JPH1185188A (en) * 1997-09-12 1999-03-30 Nippon Telegr & Teleph Corp <Ntt> Speech recognition method and its program recording medium
JP2001175276A (en) * 1999-12-17 2001-06-29 Denso Corp Speech recognizing device and recording medium
JP2005202165A (en) * 2004-01-15 2005-07-28 Advanced Media Inc Voice recognition system
JP2006227628A (en) * 2005-02-18 2006-08-31 Samsung Electronics Co Ltd Speech recognition method based on confidence level of keyword model which is weighted for respective frames and apparatus using the method
WO2008001486A1 (en) * 2006-06-29 2008-01-03 Nec Corporation Voice processing device and program, and voice processing method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11249688A (en) * 1998-03-05 1999-09-17 Mitsubishi Electric Corp Device and method for recognizing voice
US6785650B2 (en) * 2001-03-16 2004-08-31 International Business Machines Corporation Hierarchical transcription and display of input speech

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014149490A (en) * 2013-02-04 2014-08-21 Nippon Hoso Kyokai <Nhk> Voice recognition error correction device and program of the same
CN109829162A (en) * 2019-01-30 2019-05-31 新华三大数据技术有限公司 A kind of text segmenting method and device
CN109829162B (en) * 2019-01-30 2022-04-08 新华三大数据技术有限公司 Text word segmentation method and device
CN111883109A (en) * 2020-07-01 2020-11-03 北京猎户星空科技有限公司 Voice information processing and verification model training method, device, equipment and medium
CN111883109B (en) * 2020-07-01 2023-09-26 北京猎户星空科技有限公司 Voice information processing and verification model training method, device, equipment and medium

Also Published As

Publication number Publication date
JP5447382B2 (en) 2014-03-19
JPWO2010024052A1 (en) 2012-01-26

Similar Documents

Publication Publication Date Title
US6985863B2 (en) Speech recognition apparatus and method utilizing a language model prepared for expressions unique to spontaneous speech
KR101183344B1 (en) Automatic speech recognition learning using user corrections
JP5229478B2 (en) Statistical model learning apparatus, statistical model learning method, and program
US8069042B2 (en) Using child directed speech to bootstrap a model based speech segmentation and recognition system
US8849668B2 (en) Speech recognition apparatus and method
US8645139B2 (en) Apparatus and method of extending pronunciation dictionary used for speech recognition
JP2011002656A (en) Device for detection of voice recognition result correction candidate, voice transcribing support device, method, and program
JP3834169B2 (en) Continuous speech recognition apparatus and recording medium
US9251808B2 (en) Apparatus and method for clustering speakers, and a non-transitory computer readable medium thereof
WO2008001486A1 (en) Voice processing device and program, and voice processing method
CN112331229B (en) Voice detection method, device, medium and computing equipment
JP2002132287A (en) Speech recording method and speech recorder as well as memory medium
EP1443495A1 (en) Method of speech recognition using hidden trajectory hidden markov models
JP5180800B2 (en) Recording medium for storing statistical pronunciation variation model, automatic speech recognition system, and computer program
CN115985342A (en) Pronunciation error detection method and device, electronic equipment and storage medium
JP5447382B2 (en) Speech recognition hypothesis verification device, speech recognition device, method and program used therefor
JP2004094257A (en) Method and apparatus for generating question of decision tree for speech processing
US20020184019A1 (en) Method of using empirical substitution data in speech recognition
JP5184467B2 (en) Adaptive acoustic model generation apparatus and program
KR20130126570A (en) Apparatus for discriminative training acoustic model considering error of phonemes in keyword and computer recordable medium storing the method thereof
JP2000352993A (en) Voice recognition system and learning method of hidden markov model
JP4533160B2 (en) Discriminative learning method, apparatus, program, and recording medium on which discriminative learning program is recorded
JP3633254B2 (en) Voice recognition system and recording medium recording the program
CN114299930A (en) End-to-end speech recognition model processing method, speech recognition method and related device
US6438521B1 (en) Speech recognition method and apparatus and computer-readable memory

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application (Ref document number: 09809707; Country of ref document: EP; Kind code of ref document: A1)
WWE WIPO information: entry into national phase (Ref document number: 2010526623; Country of ref document: JP)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 09809707; Country of ref document: EP; Kind code of ref document: A1)