WO2012150658A1 - Voice recognition device and voice recognition method - Google Patents

Voice recognition device and voice recognition method Download PDF

Info

Publication number
WO2012150658A1
WO2012150658A1 (PCT/JP2012/002861, JP2012002861W)
Authority
WO
WIPO (PCT)
Prior art keywords
label
labels
ratio
utterance
speech
Prior art date
Application number
PCT/JP2012/002861
Other languages
French (fr)
Japanese (ja)
Inventor
暁東 王
邦彦 尾和
誠 庄境
Original Assignee
旭化成株式会社 (Asahi Kasei Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 旭化成株式会社 (Asahi Kasei Corporation)
Publication of WO2012150658A1 publication Critical patent/WO2012150658A1/en

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech

Definitions

  • The present invention relates to a voice recognition device and a voice recognition method.
  • In a known approach, word string candidates are selected by pruning, using a correlation probability model generated in a learning phase before speech recognition is executed, and higher-accuracy speech recognition is attempted by weighting each selected word string candidate word by word.
  • This correlation probability model includes statistical information such as the time length and speech power of each recognition word, the speaking speed based on the time-length ratio of vowels to consonants, and the number of words per sentence.
  • The speech recognition apparatus described in Patent Document 2 comprises: speech recognition means that recognizes an input speech and outputs a plurality of recognition result candidates, each annotated with the duration of each syllable; syllable boundary candidate detection means that obtains syllable boundary candidates from the input speech; average syllable length estimation means that obtains an average syllable length from the syllable boundary candidates; and candidate selection means that selects a recognition result from the plurality of recognition result candidates based on the recognition result candidates and the average syllable length.
  • The speech recognition method described in Patent Document 1 is premised mainly on recognizing spoken sentences, and recognizes speech using a correlation probability model that includes statistical information on time length and speech power in units of words.
  • However, the speaking speed of an utterance may change from moment to moment, and speaking speed and timing vary from speaker to speaker.
  • The accuracy of such voice recognition is therefore greatly affected by how the speaker happens to speak each time.
  • The speech recognition apparatus described in Patent Document 2, which compares the duration of each syllable of a recognition result candidate with the average syllable length, is likewise vulnerable to instantaneous changes in speaking speed. The object of the present invention is therefore to provide a high-accuracy speech recognition apparatus and speech recognition method that are fast in processing speed, low in cost, and not easily affected by changes in speaking speed or by each speaker's manner of speaking.
  • According to one aspect, a speech recognition device comprises: a label string output unit that recognizes an input voice and outputs a label string indicating the voice recognition result; an utterance duration output unit that outputs the utterance duration of at least one label included in the label string and of the labels before and after it; and a label correction unit that corrects the label to a correction candidate label based on a first predetermined condition, which is based on the utterance durations of the label and of the preceding and following labels and on statistical information relating to the utterance duration of the label.
  • The label correction unit may correct the label to the correction candidate label based on the first predetermined condition, which relates to the ratio between the utterance duration of the label and the utterance durations of the labels before and after it.
  • With this configuration, speech recognition can be performed sequentially based on utterance durations, using only each label included in the label string that is the recognition result of the input speech and the labels before and after it. Highly accurate speech recognition that is not easily influenced by changes in speaking speed or by each speaker's manner of speaking therefore becomes possible.
  • The device may further comprise a voice power information output unit that outputs voice power information, i.e., information indicating the voice power in the utterance section of at least one label included in the label string and of the labels before and after it.
  • The label correction unit may then correct the label to the correction candidate label based on a second predetermined condition, which is based on the voice power information of the label and of the preceding and following labels and on statistical information relating to the voice power of the label.
  • In addition to the first predetermined condition, the label correction unit may correct the label to the correction candidate label based on the second predetermined condition, which relates to the ratio between the voice power of the label and the voice power of the labels before and after it.
  • With this configuration, label correction is performed based on voice power in addition to utterance duration, again using only each label included in the label string that is the recognition result of the input speech and the labels before and after it, so even more accurate speech recognition can be performed.
  • According to another aspect, a speech recognition device comprises: a label string output unit that recognizes an input voice and outputs a label string indicating the voice recognition result; a voice power information output unit that outputs voice power information, i.e., information indicating the voice power in the utterance section of at least one label included in the label string and of the labels before and after it; and a label correction unit that corrects the label to a correction candidate label based on a second predetermined condition, which is based on the voice power information of the label and of the preceding and following labels and on statistical information relating to the voice power of the label.
  • According to yet another aspect, a speech recognition device comprises: a label string output unit that recognizes an input voice and outputs a label string indicating the voice recognition result; an utterance duration output unit that outputs the utterance duration of each label included in the label string; and a label correction unit that corrects at least one of two adjacent labels based on a first predetermined condition, which is based on the ratio of the utterance durations of the adjacent labels and on statistical information relating to that ratio.
  • With this configuration, label correction (and hence speech recognition) can be performed sequentially based on the ratio of the utterance durations of adjacent labels among the labels included in the label string that is the recognition result of the input voice, so highly accurate speech recognition that is not easily influenced by changes in speaking speed or by each speaker's manner of speaking becomes possible.
  • According to a further aspect, there is provided a speech recognition method executed by a speech recognition apparatus having a label string output unit, an utterance duration output unit, and a label correction unit, the method comprising: a first step in which the label string output unit recognizes input speech and outputs a label string indicating the speech recognition result; a second step in which the utterance duration output unit outputs the utterance duration of at least one label included in the label string and of the labels before and after it; and a third step in which the label correction unit corrects the label to a correction candidate label based on a first predetermined condition, which is based on the utterance durations of the label and of the preceding and following labels and on statistical information relating to the utterance duration of the label.
  • Also provided are the second step of outputting the utterance durations of at least one label and of the preceding and following labels, and the third step of correcting the label to a correction candidate label based on the first predetermined condition, which is based on those utterance durations and on statistical information relating to the utterance duration of the label.
  • FIG. 1 is a diagram showing a configuration example of a speech recognition apparatus according to one embodiment of the present invention. FIG. 2 is a diagram showing a specific example of correct labels and recognition result labels. FIG. 3 is a diagram showing a specific example of the utterance durations, Pre_ratio, and Post_ratio of correct labels and recognition result labels. FIG. 4 is a diagram showing an example of a scatter diagram of correct samples and incorrect samples for the number “5”. FIG. 5 is a diagram showing another specific example of correct labels and recognition result labels. FIG. 6 is a diagram showing a specific example of the average voice power, Pre_ratio, and Post_ratio of correct labels and recognition result labels.
  • FIG. 1 is a diagram illustrating a configuration example of a speech recognition apparatus according to the present embodiment.
  • The speech recognition apparatus 1 includes a label string output unit 110, a label selection unit 120, an utterance duration output unit 130, a voice power information output unit 140, a label correction unit 150, and a result output unit 160.
  • the label string output unit 110 recognizes the input voice and outputs a label string indicating the voice recognition result.
  • Here, a “label” refers to one unit obtained by dividing the input voice into syllables. In the present embodiment, one number corresponds to one label. A “label string” includes at least one label.
  • The “speech recognition result (of the input speech)” means the result of performing speech recognition by acoustically analyzing the speech waveform of the input speech and using a feature quantity such as MFCC (Mel-Frequency Cepstrum Coefficients).
  • In the label string output unit 110, for example, numbers are recognized from the speech waveform of the input speech using MFCCs or the like, and a number string (that is, a label string) including at least one number (that is, a label) is output. At this time, the utterance start time and utterance end time indicating the utterance timing of each number included in the number string (hereinafter referred to as “time boundary information” as appropriate) are also recognized together with the label string.
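The label-plus-time-boundary output described above can be sketched as a small data structure. This is an illustrative sketch only; the field names are assumptions, not taken from the patent.

```python
# Hypothetical sketch of one recognized label with its time boundary
# information, as output by the label string output unit 110.
from dataclasses import dataclass

@dataclass
class Label:
    text: str      # the recognized number, e.g. "5"
    start: float   # utterance start time in seconds
    end: float     # utterance end time in seconds

    @property
    def duration(self) -> float:
        # Utterance duration = utterance end time - utterance start time
        return self.end - self.start

# A label string is simply a list of such labels.
label_string = [Label("0", 0.0, 0.4), Label("3", 0.5, 0.9)]
print([round(lab.duration, 3) for lab in label_string])
```

The later ratio-based checks need only `duration` (and, in the voice-power variant, an average power per label), so this record is the minimal shape the correction stage consumes.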
  • The label selection unit 120 selects at least one of the labels included in the label string output by the label string output unit 110.
  • The label selected by the label selection unit 120 is the label to be corrected by the label correction unit 150 described later. That is, labels included in the label string are sequentially selected by the label selection unit 120, the selected labels are corrected as appropriate by the label correction unit 150, and in this way voice recognition of the entire label string is performed.
  • The labels to be corrected by the label correction unit 150 may be all of the labels included in the label string, or only some of them; for example, only labels that are frequently misrecognized may be subjected to label correction. This reduces the processing load of voice recognition and increases the processing speed.
  • The labels included in the label string may also be selected and corrected in order, starting from the label with the shortest utterance duration or the lowest voice power.
  • In this way, label correction is performed preferentially on the labels most likely to be misrecognized, improving the efficiency of label correction.
  • The utterance duration output unit 130 outputs the utterance duration of at least one of the labels included in the label string output by the label string output unit 110 and of the labels before and after it.
  • the label string output unit 110 outputs a label string by, for example, recognizing a label using an MFCC or the like from the voice waveform of the input voice.
  • the utterance start time and utterance end time indicating the utterance timing of each label included in the label string are also recognized.
  • The utterance duration output unit 130 calculates the utterance duration of each label by subtracting the utterance start time from the utterance end time.
  • The voice power information output unit 140 outputs voice power information, which is information indicating the voice power in the utterance section of at least one of the labels included in the label string output by the label string output unit 110 and of the labels before and after that label.
  • the “information indicating the voice power in the utterance section of the label” may be any information that directly or indirectly indicates the voice power of the label. For example, an average value of voice power between the utterance start time and the utterance end time of the label can be used.
  • The label correction unit 150 corrects the label selected by the label selection unit 120 (hereinafter, the “selected label”) to a correction candidate label based on a first predetermined condition, which is based on the utterance durations of the selected label and of the labels before and after it and on statistical information relating to the utterance duration of the selected label. At this time, the label correction unit 150 may correct the selected label to the correction candidate label based on the first predetermined condition relating to the ratio between the utterance duration of the selected label and the utterance durations of the labels before and after it. The first predetermined condition may also be a condition based on statistical information regarding that ratio.
  • In the present embodiment, the first predetermined condition is stored in a storage device 170, such as a hard disk included in the voice recognition device 1, but this is not limiting; for example, it may be received from an external device each time the processing in the label correction unit 150 is executed.
  • Likewise, the statistical information regarding the utterance duration of each label may be stored in an external device, or in the storage device 170 of the speech recognition device 1.
  • The label correction unit 150 may also correct the selected label to a correction candidate label based on a second predetermined condition, which is based on the voice power information of the selected label and of the labels before and after it and on statistical information relating to the voice power of the selected label. At this time, the label correction unit 150 may correct the selected label based on the second predetermined condition relating to the ratio between the voice power of the selected label and the voice power of the labels before and after it; the second predetermined condition may likewise be a condition based on statistical information regarding that ratio.
  • The second predetermined condition is likewise stored in the storage device 170, such as a hard disk included in the voice recognition device 1, but this is not limiting; it may be received from an external device each time the processing in the label correction unit 150 is executed. Statistical information regarding the voice power of each label may also be stored in an external device or in the storage device 170 of the voice recognition device 1.
  • Although FIG. 1 shows the case where the speech recognition apparatus 1 has both the utterance duration output unit 130 and the voice power information output unit 140, the apparatus may have only one of them.
  • the result output unit 160 outputs the label sequence corrected by the label correction unit 150 to an external device or the like as a final speech recognition result.
  • The functions of the components described above are realized by a CPU (Central Processing Unit), not shown, included in the speech recognition apparatus 1, which reads a program stored in a storage device such as a hard disk or a ROM (Read Only Memory) out onto a memory such as a RAM and executes it.
  • The label string, labels, utterance start times, utterance end times, utterance durations, voice power information, first predetermined condition 10, second predetermined condition 20, and the like are data stored in the storage device, memory, or the like.
  • the process in the speech recognition apparatus will be described with a specific example.
  • First, a case (insertion misrecognition) will be described in which the number of labels included in the recognition result recognized from the speech waveform is larger than the number of labels in the input speech, because extra labels were erroneously recognized.
  • The actual input speech is “0, 3, 6, 4”, but the label string (recognition result) recognized by the label string output unit 110 of the speech recognition device 1 is “0, 3, 6, 5, 4”.
  • the utterance start time and the utterance end time which are time boundary information of each label (number) of the recognition result, are as shown in FIG.
  • The label correction unit 150 calculates the ratio between the utterance duration of the selected label selected by the label selection unit 120 and the utterance durations of the labels before and after it.
  • A threshold (first predetermined condition) for the ratio of each label's utterance duration to those of the labels before and after it is determined in advance based on statistical information (the method for determining the threshold will be described later). The label correction unit 150 then compares the calculated ratio with the threshold, and determines whether or not to correct the selected label to a correction candidate label according to the comparison result.
  • Specifically, the utterance duration of each label is first calculated in the utterance duration output unit 130 according to the following formula:
  • Utterance duration = utterance end time − utterance start time
  • Next, the Pre_ratio and Post_ratio of each label are calculated from the calculated utterance duration of each label (FIG. 3).
  • Pre_ratio is the ratio between the utterance duration of the correction target label and the utterance duration of the immediately preceding label.
  • Post_ratio is the ratio between the utterance duration of the correction target label and the utterance duration of the label immediately after that.
  • Pre_ratio = utterance duration of the numeric label to be corrected / utterance duration of the numeric label immediately before it
  • Post_ratio = utterance duration of the numeric label to be corrected / utterance duration of the numeric label immediately after it
  • The thresholds for the Pre_ratio and Post_ratio of each label, determined in advance based on statistical information, are in general different for each label; in this example, however, they are all assumed to be “0.5” in order to simplify the description.
  • Since the numeric label “5 (wu)” has Pre_ratio < 0.5 and Post_ratio < 0.5, “5 (wu)” is determined to be a numeric label that was erroneously recognized and inserted (hereinafter, an insertion error). “5 (wu)” is therefore deleted by the label correction unit 150, and the final recognition result of the speech recognition apparatus 1 is “0, 3, 6, 4”.
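The insertion-error check just described can be sketched as follows. The time boundaries below are hypothetical stand-ins (FIG. 2 is not reproduced here), chosen so that “5” is much shorter than its neighbours, and the single 0.5 threshold is the simplified value used in this example.

```python
# Hypothetical sketch: delete a label as an insertion error when the ratio
# of its utterance duration to both neighbouring labels' durations falls
# below the threshold.

def duration(boundary):
    """Utterance duration = utterance end time - utterance start time."""
    start, end = boundary
    return end - start

def correct_insertions(labels, boundaries, threshold=0.5):
    """Drop labels whose Pre_ratio and Post_ratio are both below the
    threshold (a single illustrative value here; per-label in the text)."""
    durs = [duration(b) for b in boundaries]
    kept = []
    for i, label in enumerate(labels):
        if 0 < i < len(labels) - 1:
            pre_ratio = durs[i] / durs[i - 1]
            post_ratio = durs[i] / durs[i + 1]
            if pre_ratio < threshold and post_ratio < threshold:
                continue  # judged an insertion error; delete the label
        kept.append(label)
    return kept

# Hypothetical boundaries in seconds: "5" spans only 0.1 s.
labels = ["0", "3", "6", "5", "4"]
boundaries = [(0.0, 0.4), (0.5, 0.9), (1.0, 1.4), (1.45, 1.55), (1.6, 2.0)]
print(correct_insertions(labels, boundaries))  # → ['0', '3', '6', '4']
```

Note that only the label and its immediate neighbours are consulted, which is what makes the check robust to gradual changes in overall speaking speed.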
  • Threshold determination method: the thresholds for Pre_ratio and Post_ratio are determined in a prior learning phase. For example, when determining the Pre_ratio and Post_ratio thresholds for the number “5”, speech recognition of a learning speech signal is performed, and the samples in which the number “5” is the correct answer are separated from those in which it is an incorrect answer. Pre_ratio and Post_ratio are then calculated for each sample.
  • FIG. 4 is a scatter diagram in which the correct samples (○) and incorrect samples (×) for the number “5” are plotted in a two-dimensional space with Pre_ratio on the horizontal axis and Post_ratio on the vertical axis.
  • The coordinates of a point R are varied in fixed steps (for example, 0.05) within a certain range (for example, from 0 to 1), and the coordinates Tx and Ty of the point R at which the difference between the number of incorrect samples and the number of correct samples is maximized are taken as the thresholds for Pre_ratio and Post_ratio, respectively.
  • Next, label correction based on voice power will be described. The label correction unit 150 calculates the ratio between the average voice power of the selected label selected by the label selection unit 120 and the average voice power of the labels before and after it. A threshold (second predetermined condition) for the ratio of each label's average voice power to those of the labels before and after it is determined in advance based on statistical information (the method for determining the threshold will be described later). The label correction unit 150 then compares the calculated ratio with the threshold, and determines whether or not to correct the selected label to a correction candidate label according to the comparison result.
  • Pre_ratio is the ratio between the average voice power of the label to be corrected and the average voice power of the label immediately before it.
  • Post_ratio is the ratio between the average voice power of the label to be corrected and the average voice power of the label immediately after it.
  • Pre_ratio = average voice power of the numeric label to be corrected / average voice power of the numeric label immediately before it
  • Post_ratio = average voice power of the numeric label to be corrected / average voice power of the numeric label immediately after it
  • The thresholds for the Pre_ratio and Post_ratio of each label, determined in advance based on statistical information, are in general different for each label; in this example, however, they are all assumed to be “0.5” in order to simplify the description.
  • Since the numeric label “2 (er)” has Pre_ratio < 0.5 and Post_ratio < 0.5, “2 (er)” is determined to be a numeric label that was erroneously recognized and inserted (an insertion error). “2 (er)” is therefore deleted by the label correction unit 150, and the final recognition result of the speech recognition apparatus 1 is “0, 3, 4”.
  • FIG. 7 is a diagram illustrating an example of the relationship between the time series of voice power and the time boundaries of each label of the recognition result. The average voice power of each numeric label is the average of the voice power time series within the segment between that number's utterance start time and utterance end time.
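Under that definition, the per-label average power can be computed from a sampled power contour and each label's time boundary. The contour and boundaries below are hypothetical, chosen so that a low-power middle label is flagged by the 0.5 threshold used in this example.

```python
# Hypothetical sketch: average voice power per label from a frame-wise
# power time series, then the Pre_ratio / Post_ratio check on it.

def mean_power(times, powers, start, end):
    """Average of the power samples falling inside [start, end]."""
    vals = [p for t, p in zip(times, powers) if start <= t <= end]
    return sum(vals) / len(vals)

# A hypothetical power contour sampled every 10 ms: a low-power dip
# between 0.4 s and 0.6 s.
times = [i / 100 for i in range(100)]
powers = [1.0 if t < 0.4 else 0.2 if t < 0.6 else 1.0 for t in times]

# Hypothetical boundaries of three labels; the middle one sits in the dip.
p0 = mean_power(times, powers, 0.00, 0.39)
p1 = mean_power(times, powers, 0.40, 0.59)
p2 = mean_power(times, powers, 0.60, 0.99)
pre_ratio, post_ratio = p1 / p0, p1 / p2
print(pre_ratio < 0.5 and post_ratio < 0.5)  # middle label is flagged
```

In a real system the power contour would come from the acoustic front end; the ratio test itself is identical to the duration-based one.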
  • When the Pre_ratio and Post_ratio thresholds for the number “2” are determined, the learning speech signal is recognized in the learning phase, and the samples in which the number “2” is the correct answer are separated from those in which it is an incorrect answer. Pre_ratio and Post_ratio are then calculated for each sample.
  • FIG. 8 is a scatter diagram in which the correct samples (○) and incorrect samples (×) for the number “2” are plotted in a two-dimensional space with Pre_ratio on the horizontal axis and Post_ratio on the vertical axis.
  • The coordinates of a point R are varied in fixed steps (for example, 0.05) within a certain range (for example, from 0 to 1), and the coordinates Tx and Ty of the point R at which the difference between the number of incorrect samples and the number of correct samples is maximized are taken as the thresholds for Pre_ratio and Post_ratio, respectively.
  • Next, a case (deletion misrecognition) will be described in which the number of labels included in the recognition result recognized from the speech waveform is smaller than the number of labels in the input speech, because labels that should have been recognized are missing.
  • The actual input speech (correct answer) is “0, 1, 1, 4”, but the label string (recognition result) recognized by the label string output unit 110 of the speech recognition device 1 is “0, 1, 4”.
  • the utterance start time and the utterance end time which are time boundary information of each label (number) of the recognition result, are as shown in FIG.
  • In this example, the label correction unit 150 corrects labels using the utterance duration of each label; the same processing can also be performed when the voice power information of each label is used.
  • First, the utterance duration of each label “0, 1, 4” of the recognition result is calculated from its utterance start time and utterance end time by the formula given earlier.
  • Next, the Pre_ratio and Post_ratio of each label are calculated (FIG. 10). The utterance duration, Pre_ratio, and Post_ratio of each label are calculated in the same manner as in the first specific example.
  • The thresholds (first predetermined condition) for the Pre_ratio and Post_ratio of each label, determined in advance based on statistical information, are in general different for each label, but may also be the same value; in this example they are all assumed to be “1.8” for simplicity. Comparing the Pre_ratio and Post_ratio of each label shown in FIG. 10 with the threshold “1.8”, the numeric label “1 (yi)” has Pre_ratio > 1.8 and Post_ratio > 1.8, so the numeric label “1 (yi)” is determined to have been erroneously recognized.
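The deletion-error case can be sketched symmetrically to the insertion case: a label whose duration greatly exceeds both neighbours' (both ratios above the threshold, here the illustrative 1.8) is assumed to have swallowed an adjacent identical label and is expanded. The durations below are hypothetical, not taken from FIG. 9 or FIG. 10.

```python
# Hypothetical sketch: restore a deleted duplicate label when a label's
# duration ratio to both neighbours exceeds the threshold.

def correct_deletions(labels, durations, threshold=1.8):
    out = []
    for i, label in enumerate(labels):
        if 0 < i < len(labels) - 1:
            pre_ratio = durations[i] / durations[i - 1]
            post_ratio = durations[i] / durations[i + 1]
            if pre_ratio > threshold and post_ratio > threshold:
                # Judged a deletion error: the label's utterance section
                # likely covered an adjacent identical label as well.
                out.extend([label, label])
                continue
        out.append(label)
    return out

labels = ["0", "1", "4"]
durations = [0.4, 0.9, 0.4]  # "1" spans roughly two syllables
print(correct_deletions(labels, durations))  # → ['0', '1', '1', '4']
```

Which correction to apply on detection (here, duplicating the label) follows the high-frequency misrecognition pattern learned beforehand, as the text explains next.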
  • The correction to apply is a frequently occurring misrecognition pattern determined in advance by collating recognition results with the correct answers in the prior learning phase.
  • The thresholds for Pre_ratio and Post_ratio are again determined in the prior learning phase. For example, when the Pre_ratio and Post_ratio thresholds for the number “1” are determined, the result of speech recognition of the learning speech signal is analyzed as follows. First, the recognition results are sorted into three groups: samples in which the number “1” is correct; incorrect samples (insertion errors) in which the number “1” was inserted; and incorrect samples (deletion errors) in which the utterance section of the number “1” was recognized so as to include an adjacent number “1”, with the result that the adjacent number “1” was deleted. Pre_ratio and Post_ratio are then calculated for each sample.
  • FIG. 11 is a scatter diagram, plotted in a two-dimensional space with Pre_ratio on the horizontal axis and Post_ratio on the vertical axis, of the correct samples (○), the incorrect samples of insertion errors (INS), and the incorrect samples of deletion errors (DEL) (×).
  • The coordinates of a point R are varied in fixed steps (for example, 0.05) within a certain range (for example, 1 to 10), and the coordinates Tx and Ty of the point R at which the difference between the number of incorrect samples of deletion errors and the number of correct samples is maximized are taken as the thresholds for Pre_ratio and Post_ratio, respectively.
  • FIG. 11 is an example in which the coordinates (Tx, Ty) of the point R at which the difference between the number of incorrect samples of deletion errors and the number of correct samples is maximum are (1.8, 1.8).
  • the above processing is performed for all labels that can be output by the label string output unit 110, and the Pre_ratio and Post_ratio threshold values for each label are determined.
  • In the above, the insertion error and the deletion error have been described individually, but they can of course be processed simultaneously. That is, both the Pre_ratio and Post_ratio thresholds for correcting insertion errors and those for correcting deletion errors are held in the storage device 170, and the Pre_ratio and Post_ratio of each label of the recognition result are compared with both sets of thresholds, so that insertion errors and deletion errors are corrected at the same time.
  • FIG. 12 is a flowchart showing the flow of processing in the speech recognition apparatus according to this embodiment.
  • the label string output unit 110 outputs the label string, the time boundary information of each label, and the audio power 30 (step S101).
  • Next, one label is selected from the labels included in the label string (step S102).
  • The labels to be corrected (for example, “1 (yi)”, “2 (er)”, “5 (wu)”) are determined in advance as the high-frequency error patterns 40 and stored in the storage device 170, such as a hard disk.
  • When the label selected in step S102 is a verification target of the label correction (step S103), the label immediately preceding and the label immediately following the selected label are acquired (step S104). If the selected label is not a verification target of the label correction (step S103), the process returns to step S101 and the processing is repeated.
  • Next, the utterance duration or the voice power 50 of the selected label and of the labels before and after it is output by the utterance duration output unit 130 or the voice power information output unit 140, respectively (step S105). Then, based on the prosodic information (utterance duration or voice power) 50 output in step S105, the label correction unit 150 calculates the ratio of the selected label's utterance duration to the utterance durations of the preceding and following labels, or the ratio of the selected label's voice power to the voice power of the preceding and following labels (step S106).
  • The ratio calculated in step S106 is compared with the threshold values determined in advance from statistical information (the first predetermined condition 10 and the second predetermined condition 20), and it is determined whether a recognition error (insertion error, deletion error, etc.) has occurred (step S107). If it is a recognition error, the label correction unit 150 corrects the selected label (step S108). After the above processing has been repeated for all labels of the recognition result, the speech recognition of the label string ends (step S109), and the final speech recognition result is output by the result output unit 160.
  • Although the label string is output first in step S101 in the above description, the processing from step S102 onward may be executed sequentially as labels are recognized one by one, without waiting for all the labels in the label string to be output.
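The loop of steps S102 through S108 might be sketched as follows, here for the duration-based check only. Labels are represented as (text, duration) pairs, `targets` plays the role of the high-frequency error patterns 40, and `thresholds` holds the learned (Tx, Ty) per label; the flagged-index output and the deletion-error-style threshold region are illustrative assumptions.

```python
def correct_labels(labels, targets, thresholds):
    """One pass over the recognized label sequence.
    labels: list of (text, duration_in_seconds) pairs.
    Returns the indices of labels flagged as recognition errors."""
    flagged = []
    for i in range(1, len(labels) - 1):      # both neighbours required
        text, dur = labels[i]
        if text not in targets:              # S103: not a target
            continue
        pre = dur / labels[i - 1][1]         # S105-S106: ratios
        post = dur / labels[i + 1][1]
        tx, ty = thresholds[text]
        if pre >= tx and post >= ty:         # S107: compare to thresholds
            flagged.append(i)                # S108: mark for correction
    return flagged
```

In a real implementation the flagged labels would then be replaced by their correction candidate labels rather than merely indexed.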
  • FIG. 13 is a diagram illustrating the recognition performance of Chinese continuous numerals of the speech recognition apparatus according to the present embodiment.
  • “Baseline” is the recognition rate evaluation result when this method is not used (speech recognition from the speech waveform only).
  • “2-Dim_Time Length Ratio” is the recognition rate evaluation result of the correction method using the ratio of the utterance durations of the selected label and the labels before and after it.
  • “2-Dim_PowerRatio” is the recognition rate evaluation result of the correction method using the ratio of the average voice power of the selected label and the labels before and after it.
  • “Long” is the evaluation result for continuous numeric strings of 11 to 15 digits.
  • “Short” is the evaluation result for continuous numeric strings of 1 to 8 digits.
  • The recognition rate was higher than Baseline both when using the ratio of utterance durations and when using the ratio of average voice power. The advantages of performing label correction using the ratio of the selected label's utterance duration or voice power to those of the preceding and following labels are described below.
  • FIG. 14 is a diagram showing the change in the average utterance duration of one digit (in syllable units) by speaker.
  • In the speech recognition method according to the present embodiment, since the ratio of the selected label's utterance duration to those of the labels before and after it is adopted, misrecognition can be determined while reducing the influence of variation in speaking speed from speaker to speaker.
  • FIG. 15 is a diagram showing the change in the average utterance duration of one digit (in syllable units) depending on the length of the digit string. Comparing the shortest one-digit number sequence at the far left with the longest 15-digit number sequence at the far right, the average utterance durations differ by a factor of about 1.6. Therefore, if, for example, the ratio of the selected label's utterance duration to the overall average utterance duration were used as the reference, variation in utterance duration due to the length of the numeric string could not be handled.
  • In contrast, in the speech recognition method according to the present embodiment, since the ratio of the selected label's utterance duration to those of the labels before and after it is adopted, misrecognition can be determined while reducing the influence of variation in utterance duration due to the length of the numeric string.
  • FIG. 16 shows the average syllable time length calculated from data recorded with the typical reading habit for Chinese mobile phone numbers (11 digits), namely “(front) 3 digits + (middle) 3 digits + (rear) 5 digits”. Even for the same three-digit group, the utterance duration differs depending on whether it is the first three digits or the next three digits of the utterance. Thus, even within a single utterance, the utterance duration varies with position, so using a local ratio with the preceding and following labels, as in the speech recognition method according to the present embodiment, is effective.
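The robustness argument can be checked numerically: uniformly rescaling all durations by a speaking-rate factor (whether between speakers or between string lengths) leaves the neighbour ratios unchanged, while the durations themselves shift by the full factor. A tiny illustration with invented values:

```python
def neighbour_ratios(durs):
    """(Pre_ratio, Post_ratio) for every interior label in a duration list."""
    return [(durs[i] / durs[i - 1], durs[i] / durs[i + 1])
            for i in range(1, len(durs) - 1)]

slow = [0.30, 0.60, 0.30]        # seconds per syllable, slow reading
fast = [d / 1.6 for d in slow]   # same digits read about 1.6x faster
# The absolute durations differ by the full factor of 1.6, but the
# neighbour ratios coincide, so a ratio-based check is unaffected.
```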
  • In a method based on an overall average, speech recognition can be performed only after the utterance has finished to the end. In contrast, by using the utterance durations of each label and the labels before and after it, as in the speech recognition method of the present embodiment, processing can proceed sequentially from the speech already uttered, which also has the effect of shortening the processing time. In addition, the processing load is lower, and the method can be implemented with a simpler configuration.
  • FIG. 17 is a diagram showing the recognition performance of Chinese continuous numbers of the speech recognition apparatus according to the present embodiment.
  • Baseline is the recognition rate evaluation result when this method is not used (speech recognition from the speech waveform only).
  • 1-Dim_Time Length Ratio is the recognition rate evaluation result of the correction method using the ratio of the selected label's utterance duration to the overall average time length.
  • 1-Dim_PowerRatio is the recognition rate evaluation result of the correction method using the ratio of the selected label's voice power to the overall average power.
  • 2-Dim_Time Length Ratio is the recognition rate evaluation result of the correction method using the ratio of the utterance durations of the selected label and the labels before and after it.
  • 2-Dim_PowerRatio is the recognition rate evaluation result of the correction method using the ratio of the average voice power of the selected label and the labels before and after it.
  • It can be seen that the speech recognition methods according to the present embodiment (2-Dim_Time Length Ratio, 2-Dim_PowerRatio) outperform the methods based on the overall average time length and overall average power (1-Dim_Time Length Ratio, 1-Dim_PowerRatio).
  • In the above embodiment, the description has centered on the example in which the label correction unit 150 corrects the selected label to the correction candidate label based on the ratio between the utterance duration of the selected label and the utterance durations of the labels before and after it, together with the first predetermined condition regarding that ratio; however, the present invention is not limited to this.
  • For example, the label correction unit 150 may correct labels that are highly likely to fall within the incorrect-sample range based on the first predetermined condition regarding the ratio of the utterance durations of adjacent labels. That is, the label correction unit 150 may correct at least one of two adjacent labels based on the ratio of their utterance durations and the first predetermined condition based on statistical information regarding that ratio. With this configuration, although the correction accuracy is somewhat inferior to correcting the selected label based on the ratio of its utterance duration to those of the labels before and after it, the processing speed can be improved.
  • The processing in the speech recognition apparatus described above may also be realized by a speech recognition program that causes a computer to execute a first step of recognizing input speech and outputting a label string indicating the speech recognition result, a second step of outputting the utterance durations of the labels included in the label string, and a third step of correcting at least one of two adjacent labels based on the ratio of the utterance durations of the adjacent labels and the first predetermined condition based on statistical information regarding that ratio.
  • This speech recognition program can be stored in a computer-readable storage medium, such as a semiconductor storage medium (RAM, ROM, etc.), a magnetic storage medium (FD, HD, etc.), an optically read storage medium (CD, CDV, LD, DVD, etc.), or a magneto-optical storage medium (MO, etc.), regardless of whether the reading method is electronic, magnetic, or optical, and distributed and sold, or it can be downloaded through a network.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The purpose of the present invention is to provide a voice recognition device capable of voice recognition with higher accuracy while reducing the effects caused by changes in speaking speed, in how each speaker speaks, and so forth. The voice recognition device is provided with a label sequence output unit (110) which recognizes an input voice and outputs a label sequence indicating the voice recognition result thereof, a speech duration length output unit (130) which outputs the speech duration length of at least one label among the labels included in said label sequence and of the labels before and after the at least one label, and a label modification unit (150) which, on the basis of said speech duration lengths of said label and the labels before and after it, and a first given condition based on statistical information associated with the speech duration length of said label, modifies said label to a modification candidate label.

Description

Speech recognition apparatus and speech recognition method
 The present invention relates to a speech recognition apparatus and a speech recognition method.
 Conventionally, various methods for performing speech recognition with higher accuracy have been proposed. For example, in the speech recognition method described in Patent Document 1, word string candidates are selected by pruning using information from a correlation probability model generated in a learning phase before speech recognition is executed. By weighting each selected word string candidate for each word, speech recognition with higher accuracy is attempted. This correlation probability model includes statistical information on the time length and speech power of recognized word units, the speech speed based on the time-length ratio of vowels and consonants, the number of words per sentence, and the like.
 The speech recognition apparatus described in Patent Document 2 includes speech recognition means that recognizes input speech and outputs a plurality of recognition result candidates with information on the duration of each syllable attached; syllable boundary candidate detection means that obtains syllable boundary candidates from the input speech; average syllable length estimation means that obtains an average syllable length from the syllable boundary candidates; and candidate selection means that selects a recognition result from the plurality of recognition result candidates based on the recognition result candidates and the average syllable length.
Patent Document 1: JP 2008-176202 A
Patent Document 2: JP-A-9-292899
 Certainly, even the speech recognition method described in Patent Document 1 can perform speech recognition with a certain degree of accuracy.
 However, the method described in Patent Document 1 is premised mainly on recognizing spoken sentences, and performs speech recognition using a correlation probability model that includes statistical information on time length and speech power in units of words.
 When speaking, the speaking speed of the entire utterance may change from occasion to occasion, and the speaking speed, utterance timing, and so on may also differ from speaker to speaker. In such cases, with a method that uses time length or speech power in units of words, as in Patent Document 1, the accuracy of speech recognition is greatly affected by how the speaker happens to speak each time.
 In addition, the speech recognition apparatus described in Patent Document 2 compares the duration information of each syllable of the recognition result candidates with the average syllable length, and therefore has the problem of being vulnerable to instantaneous changes in speech speed.
 The present invention therefore aims to provide a high-accuracy speech recognition apparatus and speech recognition method that are fast in processing, low in cost, and not easily affected by changes in speech speed or differences in how each speaker speaks.
 To solve the above problem, according to one aspect of the present invention, there is provided a speech recognition apparatus including: a label string output unit that recognizes input speech and outputs a label string indicating the speech recognition result; an utterance duration output unit that outputs the utterance durations of at least one label among the labels included in the label string and of the labels before and after it; and a label correction unit that corrects the label to a correction candidate label based on the utterance durations of the label and of the labels before and after it, and on a first predetermined condition based on statistical information regarding the utterance duration of the label.
 According to this configuration, label correction (speech recognition) can be performed sequentially, based on utterance durations, using only each label included in the label string that is the recognition result of the input speech and the labels before and after it. Therefore, highly accurate speech recognition that is not easily affected by changes in speech speed or differences in how each speaker speaks can be performed.
 According to another aspect of the present invention, there is provided a speech recognition apparatus in which the label correction unit corrects the label to the correction candidate label based on the ratio between the utterance duration of the label and the utterance durations of the labels before and after it, and on the first predetermined condition regarding that ratio.
 According to this configuration, label correction (speech recognition) can be performed sequentially, based on utterance durations, using only each label included in the label string that is the recognition result of the input speech and the labels before and after it. Therefore, highly accurate speech recognition that is not easily affected by changes in speech speed or differences in how each speaker speaks can be performed.
 According to another aspect of the present invention, there is provided a speech recognition apparatus further including a voice power information output unit that outputs voice power information, which is information indicating the voice power in the utterance sections of at least one label among the labels included in the label string and of the labels before and after it, wherein the label correction unit corrects the label to the correction candidate label based not only on the utterance durations of the label and of the labels before and after it and the first predetermined condition, but also on the voice power information of the label and of the labels before and after it and a second predetermined condition based on statistical information regarding the voice power of the label.
 According to this configuration, since label correction is performed based not only on utterance durations but also on voice power, using only each label included in the label string that is the recognition result of the input speech and the labels before and after it, more accurate speech recognition can be performed.
 According to another aspect of the present invention, there is provided a speech recognition apparatus in which the label correction unit corrects the label to the correction candidate label based not only on the utterance durations of the label and of the labels before and after it and the first predetermined condition, but also on the ratio between the voice power of the label and the voice power of the labels before and after it and the second predetermined condition regarding that ratio.
 According to this configuration, since label correction is performed based not only on utterance durations but also on voice power, using only each label included in the label string that is the recognition result of the input speech and the labels before and after it, more accurate speech recognition can be performed.
 According to another aspect of the present invention, there is provided a speech recognition apparatus including: a label string output unit that recognizes input speech and outputs a label string indicating the speech recognition result; a voice power information output unit that outputs voice power information, which is information indicating the voice power in the utterance sections of at least one label among the labels included in the label string and of the labels before and after it; and a label correction unit that corrects the label to a correction candidate label based on the voice power information of the label and of the labels before and after it, and on a second predetermined condition based on statistical information regarding the voice power of the label.
 According to this configuration, label correction (speech recognition) can be performed sequentially, based on voice power, using only each label included in the label string that is the recognition result of the input speech and the labels before and after it. Therefore, highly accurate speech recognition that is not easily affected by changes in speech speed or differences in how each speaker speaks can be performed.
 According to another aspect of the present invention, there is provided a speech recognition apparatus including: a label string output unit that recognizes input speech and outputs a label string indicating the speech recognition result; an utterance duration output unit that outputs the utterance durations of the labels included in the label string; and a correction unit that corrects at least one of two adjacent labels based on the ratio of the utterance durations of the adjacent labels and a first predetermined condition based on statistical information regarding that ratio.
 According to this configuration, label correction (speech recognition) can be performed sequentially, using adjacent labels among the labels included in the label string that is the recognition result of the input speech, based on the ratio of their utterance durations. Therefore, highly accurate speech recognition that is not easily affected by changes in speech speed or differences in how each speaker speaks can be performed.
 According to another aspect of the present invention, there is provided a speech recognition method executed by a speech recognition apparatus having a label string output unit, an utterance duration output unit, and a label correction unit, the method including: a first step in which the label string output unit recognizes input speech and outputs a label string indicating the speech recognition result; a second step in which the utterance duration output unit outputs the utterance durations of at least one label among the labels included in the label string and of the labels before and after it; and a third step in which the label correction unit corrects the label to a correction candidate label based on the utterance durations of the label and of the labels before and after it, and on a first predetermined condition based on statistical information regarding the utterance duration of the label.
 According to this configuration, label correction (speech recognition) can be performed sequentially, based on utterance durations, using only each label included in the label string that is the recognition result of the input speech and the labels before and after it. Therefore, highly accurate speech recognition that is not easily affected by changes in speech speed or differences in how each speaker speaks can be performed.
 According to another aspect of the present invention, there is provided a speech recognition program for causing a computer to execute: a first step of recognizing input speech and outputting a label string indicating the speech recognition result; a second step of outputting the utterance durations of at least one label among the labels included in the label string and of the labels before and after it; and a third step of correcting the label to a correction candidate label based on the utterance durations of the label and of the labels before and after it, and on a first predetermined condition based on statistical information regarding the utterance duration of the label.
 According to this configuration, label correction (speech recognition) can be performed sequentially, based on utterance durations, using only each label included in the label string that is the recognition result of the input speech and the labels before and after it. Therefore, highly accurate speech recognition that is not easily affected by changes in speech speed or differences in how each speaker speaks can be performed.
 According to the present invention, highly accurate speech recognition that is not easily affected by changes in speech speed or differences in how each speaker speaks can be performed.
FIG. 1 is a diagram showing a configuration example of a speech recognition apparatus according to an embodiment of the present invention. FIG. 2 is a diagram showing a specific example of correct labels and recognition result labels. FIG. 3 is a diagram showing a specific example of the utterance durations, Pre_ratio, and Post_ratio of correct labels and recognition result labels. FIG. 4 is a diagram showing an example of a scatter diagram of correct and incorrect samples for the number “5”. FIG. 5 is a diagram showing another specific example of correct labels and recognition result labels. FIG. 6 is a diagram showing a specific example of the average voice power, Pre_ratio, and Post_ratio of correct labels and recognition result labels. FIG. 7 is a diagram showing a time-series image of the voice power of recognition result labels. FIG. 8 is a diagram showing an example of a scatter diagram of correct and incorrect samples for the number “2”. FIG. 9 is a diagram showing another specific example of correct labels and recognition result labels. FIG. 10 is a diagram showing a specific example of the utterance durations, Pre_ratio, and Post_ratio of correct labels and recognition result labels. FIG. 11 is a diagram showing an example of a scatter diagram of correct and incorrect samples for the number “1”.
FIG. 12 is a flowchart showing the flow of processing in a speech recognition apparatus according to an embodiment of the present invention. FIG. 13 is a diagram showing the recognition performance for Chinese continuous numerals of a speech recognition apparatus according to an embodiment of the present invention. FIG. 14 is a diagram showing the change in the average utterance duration of one digit (in syllable units) by speaker. FIG. 15 is a diagram showing the change in the average utterance duration of one digit (in syllable units) depending on the length of the digit string. FIG. 16 is a diagram showing the average syllable time length calculated from recorded data of Chinese mobile phone numbers (11 digits) uttered with typical reading habits. FIG. 17 is a diagram showing the recognition performance for Chinese continuous numerals of a speech recognition apparatus according to an embodiment of the present invention.
 Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the drawings referred to in the following description, parts equivalent to those in other drawings are denoted by the same reference numerals.
 In the embodiments described below, speech recognition of the pronunciation of Chinese numerals is described as an example, but speech recognition according to the present embodiment is not limited to this; it is applicable to recognition targets in various languages.
(Configuration of the speech recognition apparatus)
FIG. 1 is a diagram showing a configuration example of the speech recognition apparatus according to this embodiment. As shown in FIG. 1, the speech recognition apparatus 1 includes a label string output unit 110, a label selection unit 120, an utterance duration output unit 130, a speech power information output unit 140, a label correction unit 150, and a result output unit 160.
(Label string output unit 110)
The label string output unit 110 performs speech recognition on the input speech and outputs a label string representing the recognition result. Here, a "label" is one unit obtained by dividing the input speech into syllables; in this embodiment, one digit corresponds to one label. A "label string" contains at least one label. The "speech recognition result (of the input speech)" is the result of acoustically analyzing the speech waveform of the input speech and performing recognition using feature quantities such as MFCC (Mel-Frequency Cepstrum Coefficients).
That is, the label string output unit 110 recognizes digits from the speech waveform of the input speech using, for example, MFCC features, and outputs a digit string (that is, a label string) containing at least one digit (that is, a label). At the same time, the utterance start time and utterance end time indicating the utterance timing of each digit in the digit string (hereinafter referred to as "time boundary information" where appropriate) are recognized together with the label string.
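As a minimal illustration (a Python sketch; the field names and time values are hypothetical and not part of the apparatus itself), the output of the label string output unit 110 can be thought of as a list of labels, each carrying its time boundary information:

```python
from dataclasses import dataclass

@dataclass
class Label:
    digit: str    # recognized digit (one syllable), e.g. "5"
    start: float  # utterance start time in seconds
    end: float    # utterance end time in seconds

# A label string: at least one label, each with time boundary information.
label_string = [
    Label("0", 0.00, 0.41),
    Label("3", 0.41, 0.79),
    Label("6", 0.79, 1.18),
    Label("4", 1.18, 1.60),
]

# Utterance durations follow directly from the time boundaries.
durations = [round(lb.end - lb.start, 2) for lb in label_string]
print(durations)  # [0.41, 0.38, 0.39, 0.42]
```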
(Label selection unit 120)
The label selection unit 120 selects at least one of the labels included in the label string output by the label string output unit 110. The label selected by the label selection unit 120 is the label to be corrected by the label correction unit 150 described later. That is, some of the labels in the label string are selected in turn by the label selection unit 120, and each selected label is corrected as appropriate by the label correction unit 150. By repeating this selection and correction, speech recognition of the entire label string is performed.
The labels corrected by the label correction unit 150 (that is, the labels selected by the label selection unit 120) may be all of the labels in the label string, or only some of them. For example, when recognizing spoken Chinese digits, digits consisting of a single vowel-only syllable, such as "1", "2", and "5", tend to be misrecognized. Therefore, label correction by the label correction unit 150 may be performed not on all labels (digits) in the label string output by the label string output unit 110, but only on such vowel-only single-syllable labels. This reduces the processing load of speech recognition and increases the processing speed.
Furthermore, the labels in the label string may, for example, be selected and corrected in ascending order of utterance duration or of speech power. In this way, labels with a higher likelihood of misrecognition are corrected first, improving the efficiency of label correction.
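A rough sketch of such a selection policy (Python; the tuple layout, the time values, and the vowel-only digit set are illustrative assumptions):

```python
# Each label is (digit, start_time_s, end_time_s); the values are illustrative.
labels = [("0", 0.00, 0.41), ("5", 0.41, 0.53),
          ("3", 0.53, 0.91), ("2", 0.91, 1.05)]

def select_candidates(labels, vowel_only=("1", "2", "5")):
    # Restrict correction to the easily misrecognized vowel-only digits and
    # process the shortest utterances (most likely insertion errors) first.
    picked = [lb for lb in labels if lb[0] in vowel_only]
    return sorted(picked, key=lambda lb: lb[2] - lb[1])

order = [lb[0] for lb in select_candidates(labels)]
print(order)  # ['5', '2']
```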
(Utterance duration output unit 130)
The utterance duration output unit 130 outputs the utterance duration of at least one of the labels included in the label string output by the label string output unit 110, together with the utterance durations of the labels before and after it. As described above, the label string output unit 110 outputs a label string by recognizing labels from the speech waveform of the input speech using, for example, MFCC features. At this time, along with the label string, the utterance start time and utterance end time indicating the utterance timing of each label in the string are also recognized. The utterance duration output unit 130 calculates each utterance duration, for example, by subtracting the utterance start time from the utterance end time of each label.
(Speech power information output unit 140)
The speech power information output unit 140 outputs speech power information, which is information indicating the speech power in the utterance section of at least one of the labels included in the label string output by the label string output unit 110 and of the labels before and after it. The "information indicating the speech power in the utterance section of a label" may be anything that directly or indirectly indicates the speech power of the label; one example is the average speech power between the utterance start time and the utterance end time of the label.
(Label correction unit 150)
The label correction unit 150 corrects the label selected by the label selection unit 120 (hereinafter, the "selected label") to a correction candidate label, based on the utterance durations of the selected label and of the labels before and after it, and on a first predetermined condition based on statistical information about the utterance duration of the selected label. At this time, the label correction unit 150 may correct the selected label to the correction candidate label based on the ratios between the utterance duration of the selected label and the utterance durations of the labels before and after it, and on a first predetermined condition concerning those ratios. The first predetermined condition may be a condition based on statistical information about the ratios between the utterance duration of the selected label and the utterance durations of the labels before and after it.
In this embodiment, the first predetermined condition is stored in a storage device 170 such as a hard disk provided in the speech recognition apparatus 1, but this is not restrictive. For example, it may be received from an external device or the like each time the label correction unit 150 executes its processing. The statistical information about the utterance duration of each label may likewise be stored in an external device or the like, or in the storage device 170 of the speech recognition apparatus 1.
Furthermore, the label correction unit 150 may correct the selected label to the correction candidate label based on the speech power information of the selected label and of the labels before and after it, and on a second predetermined condition based on statistical information about the speech power of the selected label. At this time, the label correction unit 150 may correct the selected label to the correction candidate label based on the ratios between the speech power of the selected label and the speech power of the labels before and after it, and on a second predetermined condition concerning those ratios. The second predetermined condition may be a condition based on statistical information about the ratios between the speech power of the selected label and the speech power of the labels before and after it.
Furthermore, the label correction unit 150 may correct the selected label to the correction candidate label based not only on the utterance durations of the selected label and of the labels before and after it and on the first predetermined condition, but additionally on the speech power information of the selected label and of the labels before and after it and on the second predetermined condition based on statistical information about the speech power of the selected label. This enables speech recognition with higher accuracy.
In this embodiment, the second predetermined condition is stored in the storage device 170 such as a hard disk provided in the speech recognition apparatus 1, but this is not restrictive. For example, it may be received from an external device or the like each time the label correction unit 150 executes its processing. The statistical information about the speech power of each label may likewise be stored in an external device or the like, or in the storage device 170 of the speech recognition apparatus 1.
Although FIG. 1 shows the case where the speech recognition apparatus 1 has both the utterance duration output unit 130 and the speech power information output unit 140, it may have only one of them.
(Result output unit 160)
The result output unit 160 outputs the label string corrected by the label correction unit 150 to an external device or the like as the final speech recognition result.
The functions of the components described above are realized by a CPU (Central Processing Unit, not shown) provided in the speech recognition apparatus 1 reading programs stored in a storage device such as a hard disk or ROM (Read Only Memory) into a memory such as RAM (Random Access Memory) and executing them. The label string, labels, utterance start times, utterance end times, utterance durations, speech power information, first predetermined condition 10, second predetermined condition 20, and so on are data stored in the storage device, the memory, or the like.
(Specific example 1)
Hereinafter, the processing in the speech recognition apparatus according to this embodiment will be described using specific examples. This example deals with the case where an extraneous label is erroneously recognized when recognizing the spoken input speech, so that the recognition result obtained from the speech waveform contains more labels than the input speech (insertion error).
In this example, as shown in FIG. 2, the actual input speech (correct answer) is "0, 3, 6, 4", but the label string recognized by the label string output unit 110 of the speech recognition apparatus 1 (recognition result) is "0, 3, 6, 5, 4". The utterance start time and utterance end time, which are the time boundary information of each label (digit) of the recognition result, are as shown in FIG. 3.
In this example, the label correction unit 150 calculates the ratios between the utterance duration of the label selected by the label selection unit 120 and the utterance durations of the labels before and after it. Thresholds (the first predetermined condition) for the ratios of each label's utterance duration to those of its neighboring labels are determined in advance based on statistical information (the threshold determination method is described later). The label correction unit 150 then compares the calculated ratios with the thresholds and, according to the comparison result, decides whether to correct the selected label to the correction candidate label.
Specifically, first, from the utterance start time and utterance end time of each label "0, 3, 6, 5, 4" of the recognition result, the utterance duration output unit 130 calculates the utterance duration of each label by the following formula:

  Utterance duration = utterance end time − utterance start time

Then, based on the calculated utterance duration of each label, Pre_ratio and Post_ratio are computed for each label (FIG. 3). Pre_ratio is the ratio of the utterance duration of the correction target label to the utterance duration of the immediately preceding label, and Post_ratio is the ratio of the utterance duration of the correction target label to the utterance duration of the immediately following label:

  Pre_ratio = utterance duration of the target digit label / utterance duration of the immediately preceding digit label
  Post_ratio = utterance duration of the target digit label / utterance duration of the immediately following digit label

The thresholds for the Pre_ratio and Post_ratio of each label, determined in advance based on statistical information, generally differ from label to label, but in this example they are all assumed to be "0.5" for simplicity.
Comparing the Pre_ratio and Post_ratio of each label shown in FIG. 3 with the threshold "0.5", the digit label "5 (wu)" satisfies Pre_ratio < 0.5 and Post_ratio < 0.5, so "5 (wu)" is judged to be a digit label that was erroneously recognized and inserted (hereinafter, an insertion error). The label correction unit 150 therefore deletes "5 (wu)", and the final recognition result of the speech recognition apparatus 1 becomes "0, 3, 6, 4".
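The insertion-error check of this example can be sketched as follows (Python; the time values are illustrative stand-ins for FIG. 3, and the single shared threshold of 0.5 reflects the simplification above, whereas in practice each label would have its own statistically determined thresholds):

```python
# Each entry: (digit, start_time_s, end_time_s). Times are illustrative.
recognized = [("0", 0.00, 0.40), ("3", 0.40, 0.78),
              ("6", 0.78, 1.16), ("5", 1.16, 1.28), ("4", 1.28, 1.70)]

def remove_insertion_errors(labels, threshold=0.5):
    """Delete inner labels whose duration ratios to BOTH neighbours in the
    original recognition result fall below the threshold."""
    d = [end - start for _, start, end in labels]
    kept = []
    for i, lb in enumerate(labels):
        if 0 < i < len(labels) - 1:
            pre_ratio = d[i] / d[i - 1]
            post_ratio = d[i] / d[i + 1]
            if pre_ratio < threshold and post_ratio < threshold:
                continue  # judged an insertion error -> drop this label
        kept.append(lb)
    return kept

result = [lb[0] for lb in remove_insertion_errors(recognized)]
print(result)  # ['0', '3', '6', '4']
```

Here "5" has a duration of 0.12 s against neighbours of 0.38 s and 0.42 s, so both ratios fall below 0.5 and the label is dropped.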
(Threshold determination method)
The thresholds for Pre_ratio and Post_ratio are determined in a prior learning phase. For example, to determine the Pre_ratio and Post_ratio thresholds for the digit "5", speech recognition is performed on learning speech signals, and the samples in which "5" was recognized correctly are separated from those in which it was recognized incorrectly. Pre_ratio and Post_ratio are then calculated for each sample.
When the samples are plotted in a two-dimensional space with Pre_ratio on the horizontal axis and Post_ratio on the vertical axis, it suffices to find, for example, the coordinates (Tx, Ty) of the point R (Pre_ratio = Tx, Post_ratio = Ty) for which the rectangular region whose diagonal connects the origin and R maximizes the difference between the number of incorrect samples and the number of correct samples it contains.
FIG. 4 is a scatter diagram of the correct samples (●) and the insertion-error (INS) incorrect samples (△) of the digit "5", plotted in a two-dimensional space with Pre_ratio on the horizontal axis and Post_ratio on the vertical axis. In this case, for example, the coordinates of the point R are varied in fixed steps (e.g., 0.05) within a certain range (e.g., from 0 to 1), and the coordinates Tx and Ty of the point R that maximizes the difference between the number of incorrect samples and the number of correct samples are taken as the thresholds for Pre_ratio and Post_ratio, respectively. FIG. 4 shows an example in which the coordinates (Tx, Ty) of this point R are (0.5, 0.5).
The above processing is performed for all labels that the label string output unit 110 can output, and the Pre_ratio and Post_ratio thresholds for each label are determined.
(Specific example 2)
Another specific example will now be described, again for the case of an insertion error.
In this example, as shown in FIG. 5, the actual input speech (correct answer) is "0, 3, 4", but the label string recognized by the label string output unit 110 of the speech recognition apparatus 1 (recognition result) is "0, 3, 2, 4". The utterance start time and utterance end time, which are the time boundary information of each label (digit) of the recognition result, are as shown in FIG. 6.
In this example, the label correction unit 150 calculates the ratios between the average speech power of the label selected by the label selection unit 120 and the average speech power of the labels before and after it. Thresholds (the second predetermined condition) for the ratios of each label's average speech power to those of its neighboring labels are determined in advance based on statistical information (the threshold determination method is described later). The label correction unit 150 then compares the calculated ratios with the thresholds and, according to the comparison result, decides whether to correct the selected label to the correction candidate label.
Specifically, first, the average speech power between the utterance start time and the utterance end time of each label "0, 3, 2, 4" of the recognition result is calculated. Then, based on the calculated average speech power of each label, Pre_ratio and Post_ratio are computed for each label (FIG. 6). Pre_ratio is the ratio of the average speech power of the correction target label to that of the immediately preceding label, and Post_ratio is the ratio of the average speech power of the correction target label to that of the immediately following label:

  Pre_ratio = average speech power of the target digit label / average speech power of the immediately preceding digit label
  Post_ratio = average speech power of the target digit label / average speech power of the immediately following digit label

The thresholds for the Pre_ratio and Post_ratio of each label, determined in advance based on statistical information, generally differ from label to label, but in this example they are all assumed to be "0.5" for simplicity.
Comparing the Pre_ratio and Post_ratio of each label shown in FIG. 6 with the threshold "0.5", the digit label "2 (er)" satisfies Pre_ratio < 0.5 and Post_ratio < 0.5, so "2 (er)" is judged to be a digit label that was erroneously recognized and inserted (insertion error). The label correction unit 150 therefore deletes "2 (er)", and the final recognition result of the speech recognition apparatus 1 becomes "0, 3, 4".
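The power-based variant of the check can be sketched as follows (Python; the per-label average power values are illustrative, and the single shared threshold again stands in for per-label statistical thresholds):

```python
# Hypothetical per-label average speech power for the recognition result
# "0, 3, 2, 4" (arbitrary linear-power units).
digits = ["0", "3", "2", "4"]
avg_power = [0.80, 0.95, 0.30, 0.88]

def remove_power_insertion_errors(labels, power, threshold=0.5):
    # Drop an inner label when its average power is below `threshold` times
    # the average power of BOTH neighbouring labels.
    kept = []
    for i, lb in enumerate(labels):
        if 0 < i < len(labels) - 1:
            if (power[i] / power[i - 1] < threshold
                    and power[i] / power[i + 1] < threshold):
                continue  # judged an insertion error -> drop this label
        kept.append(lb)
    return kept

print(remove_power_insertion_errors(digits, avg_power))  # ['0', '3', '4']
```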
(Speech power and threshold determination method)
FIG. 7 shows an example of the relationship between the time series of the speech power and the time boundaries of each label of the recognition result. The average speech power of each digit label is the average of the time series of the speech power over the segment within the time boundaries from the utterance start time to the utterance end time of that digit.
For example, to determine the Pre_ratio and Post_ratio thresholds for the digit "2", speech recognition is performed on learning speech signals in the learning phase, and the samples in which "2" was recognized correctly are separated from those in which it was recognized incorrectly. Pre_ratio and Post_ratio are then calculated for each sample. When the samples are plotted in a two-dimensional space with Pre_ratio on the horizontal axis and Post_ratio on the vertical axis, it suffices to find, for example, the coordinates (Tx, Ty) of the point R (Pre_ratio = Tx, Post_ratio = Ty) for which the rectangular region whose diagonal connects the origin and R maximizes the difference between the number of incorrect samples and the number of correct samples it contains.
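A small sketch of the segment-average computation (Python; the frame period, frame-level power values, and time boundaries are all illustrative assumptions):

```python
# Frame-level speech power (one value per 10 ms frame) and the time
# boundaries of a label; both are illustrative.
FRAME_S = 0.01
frame_power = [0.1] * 40 + [0.9] * 38 + [0.3] * 12 + [0.8] * 40  # 1.30 s total

def average_power(start_s, end_s, frames, frame_s=FRAME_S):
    """Time-series mean of the power over the segment [start, end)."""
    i, j = round(start_s / frame_s), round(end_s / frame_s)
    seg = frames[i:j]
    return sum(seg) / len(seg)

# Average power of a label whose time boundaries are 0.40 s - 0.78 s:
p = average_power(0.40, 0.78, frame_power)
```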
FIG. 8 is a scatter diagram of the correct samples (●) and the insertion-error (INS) incorrect samples (△) of the digit "2", plotted in a two-dimensional space with Pre_ratio on the horizontal axis and Post_ratio on the vertical axis. In this case, for example, the coordinates of the point R are varied in fixed steps (e.g., 0.05) within a certain range (e.g., from 0 to 1), and the coordinates Tx and Ty of the point R that maximizes the difference between the number of incorrect samples and the number of correct samples are taken as the thresholds for Pre_ratio and Post_ratio, respectively. FIG. 8 shows an example in which the coordinates (Tx, Ty) of this point R are (0.5, 0.5).
The above processing is performed for all labels that the label string output unit 110 can output, and the Pre_ratio and Post_ratio thresholds for each label are determined.
(Specific example 3)
Another specific example will now be described. This example deals with the case where a label that should have been recognized is erroneously dropped when recognizing the spoken input speech, so that the recognition result obtained from the speech waveform contains fewer labels than the input speech (deletion error).
In this example, as shown in FIG. 9, the actual input speech (correct answer) is "0, 1, 1, 4", but the label string recognized by the label string output unit 110 of the speech recognition apparatus 1 (recognition result) is "0, 1, 4". The utterance start time and utterance end time, which are the time boundary information of each label (digit) of the recognition result, are as shown in FIG. 10.
In the following description, as in specific example 1, the label correction unit 150 performs label correction using the utterance duration of each label; the same applies when label correction is performed using the power information of each label as in specific example 2.
Specifically, first, the utterance duration output unit 130 calculates the utterance duration of each label from the utterance start time and utterance end time of each label "0, 1, 4" of the recognition result. Then, based on the calculated utterance duration of each label, Pre_ratio and Post_ratio are computed for each label (FIG. 10). The utterance duration, Pre_ratio, and Post_ratio of each label are calculated in the same way as in specific example 1.
The thresholds (the first predetermined condition) for the Pre_ratio and Post_ratio of each label, determined in advance based on statistical information, are generally different for each label, but may also be the same. In this example they are all assumed to be "1.8" for simplicity.
Comparing the Pre_ratio and Post_ratio of each label shown in FIG. 10 with the threshold "1.8", the digit label "1 (yi)" satisfies Pre_ratio > 1.8 and Post_ratio > 1.8. The digit label "1 (yi)" is therefore judged to have been misrecognized.
In addition, frequently occurring misrecognition patterns are determined in advance from the results of matching the recognition results against the correct answers in the prior learning phase. Here, for example, if the misrecognition pattern "correct = 1 (yi) 1 (yi), misrecognized = 1 (yi)" exists, then in the recognition result of this example the label correction unit 150 replaces the digit "1 (yi)" with "1 (yi) 1 (yi)", and the final recognition result of the speech recognition apparatus 1 becomes "0, 1, 1, 4".
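The deletion-error correction of this example can be sketched as follows (Python; the time values and the pattern table are illustrative, and the shared threshold of 1.8 again stands in for per-label statistical thresholds):

```python
# Illustrative data: (digit, start_s, end_s). The middle "1" spans what was
# actually two spoken "1"s, so its duration dwarfs its neighbours'.
recognized = [("0", 0.00, 0.38), ("1", 0.38, 1.20), ("4", 1.20, 1.58)]

# Frequent misrecognition patterns learned beforehand: label -> replacement.
patterns = {"1": ["1", "1"]}

def fix_deletion_errors(labels, patterns, threshold=1.8):
    d = [end - start for _, start, end in labels]
    out = []
    for i, (digit, start, end) in enumerate(labels):
        if (0 < i < len(labels) - 1
                and d[i] / d[i - 1] > threshold
                and d[i] / d[i + 1] > threshold
                and digit in patterns):
            # Suspiciously long label: apply the learned replacement pattern.
            out.extend(patterns[digit])
        else:
            out.append(digit)
    return out

print(fix_deletion_errors(recognized, patterns))  # ['0', '1', '1', '4']
```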
(Threshold determination method)
The thresholds for Pre_ratio and Post_ratio are determined in the prior learning phase. For example, to determine the Pre_ratio and Post_ratio thresholds for the number "1", the results of speech recognition on the training speech signals are analyzed as follows.
First, the recognition results are sorted into three groups: samples in which the number "1" was recognized correctly; incorrect samples in which a spurious number "1" was inserted (insertion errors); and incorrect samples in which the utterance segment of a number "1" was recognized as also covering the segments of the adjacent numbers "1", so that the preceding or following "1" was deleted (deletion errors). Pre_ratio and Post_ratio are then calculated for each sample.
Then, when the samples are plotted in a two-dimensional space with Pre_ratio on the horizontal axis and Post_ratio on the vertical axis, one may, for example, take the point P determined by a priori maximum values of Pre_ratio and Post_ratio (Pre_ratio = Mx, Post_ratio = My) and another point R (Pre_ratio = Tx, Post_ratio = Ty), and find the coordinates (Tx, Ty) of the point R that maximizes the difference between the number of incorrect samples and the number of correct samples contained in the rectangular region whose diagonal is the line connecting P and R.
FIG. 11 is a scatter plot, in a two-dimensional space with Pre_ratio on the horizontal axis and Post_ratio on the vertical axis, of the correct samples (●), the incorrect samples of insertion errors (INS) (△), and the incorrect samples of deletion errors (DEL) (×). In this case, for example, the coordinates of the point R are varied in fixed increments (e.g., 0.05) over a certain range (e.g., from 1 to 10), and the coordinates Tx and Ty of the point R that maximizes the difference between the number of deletion-error incorrect samples and the number of correct samples are taken as the Pre_ratio and Post_ratio thresholds, respectively. FIG. 11 shows an example in which the coordinates (Tx, Ty) of this point R are (1.8, 1.8).
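The grid search described here might look like the following. This is a sketch under the stated assumptions (step 0.05, search range 1 to 10, rectangle spanning from R to P = (Mx, My)); the sample coordinates are fabricated and chosen only so that the result reproduces the (1.8, 1.8) of FIG. 11.

```python
# Learning-phase threshold search: sweep the corner point R = (Tx, Ty) and
# keep the rectangle with corners R and P = (Mx, My) that maximises
# (deletion-error samples inside) - (correct samples inside).

def choose_thresholds(correct, deletion_errors, mx=10.0, my=10.0,
                      lo=1.0, hi=10.0, step=0.05):
    """Samples are (Pre_ratio, Post_ratio) pairs."""
    def count_inside(samples, tx, ty):
        return sum(1 for pre, post in samples
                   if tx <= pre <= mx and ty <= post <= my)

    grid, t = [], lo
    while t <= hi + 1e-9:
        grid.append(round(t, 2))               # fixed 0.05 increments
        t += step
    best, best_r = None, (lo, lo)
    for tx in grid:
        for ty in grid:
            score = (count_inside(deletion_errors, tx, ty)
                     - count_inside(correct, tx, ty))
            if best is None or score > best:
                best, best_r = score, (tx, ty)
    return best_r

correct_samples = [(1.75, 5.0), (5.0, 1.75)]   # fabricated coordinates
deletion_samples = [(1.8, 1.8), (2.0, 2.0)]
print(choose_thresholds(correct_samples, deletion_samples))  # → (1.8, 1.8)
```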
The above processing is performed for every label that the label string output unit 110 can output, and the Pre_ratio and Post_ratio thresholds for each label are thereby determined.
Although the specific examples above treated insertion errors and deletion errors separately, the two can of course be processed simultaneously. That is, both the Pre_ratio and Post_ratio thresholds for correcting insertion errors and those for correcting deletion errors are held in the storage device 170, and the Pre_ratio and Post_ratio of each label in the recognition result are compared against both sets of thresholds, so that insertion errors and deletion errors are corrected in a single pass.
(Processing flow)
FIG. 12 is a flowchart showing the flow of processing in the speech recognition device according to this embodiment.
The label string output unit 110 outputs the label string together with the time boundary information and voice power 30 of each label (step S101). One label is then selected from the labels included in the label string (step S102).
When only specific labels among those selected in step S102 are to be corrected, the labels subject to correction are determined in advance as the high-frequency error patterns 40 and stored in a storage device 170 such as a hard disk (for example, "1 (yi)", "2 (er)", "5 (wu)"). The high-frequency error patterns 40 are consulted, and if the label selected in step S102 is subject to label-correction verification (step S103), the labels immediately before and immediately after the selected label are acquired (step S104). If the selected label is not subject to label-correction verification (step S103), the process returns to step S101 and is repeated.
Next, the utterance durations or the voice power 50 of the selected label and of the labels before and after it are output by the utterance duration output unit 130 or the voice power information output unit 140, respectively (step S105).
Then, from the prosodic information (utterance duration or voice power) 50 output in step S105, the label correction unit 150 calculates the ratios of the utterance duration of the selected label to the utterance durations of the preceding and following labels, or the ratios of the voice power of the selected label to the voice powers of the preceding and following labels (step S106).
The ratios calculated in step S106 are compared with the thresholds determined in advance from statistical information (the first predetermined condition 10 and the second predetermined condition 20), and it is determined whether a recognition error (insertion error, deletion error, or the like) has occurred (step S107). If there is a recognition error, the label correction unit 150 corrects the selected label (step S108).
After the above processing has been repeated for all labels in the recognition result, speech recognition of the label string ends (step S109), and the final speech recognition result is output by the result output unit 160.
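Steps S101 to S109 can be condensed into a sketch like the following. The recognizer itself is assumed to have already produced time-aligned labels; the interface names and the threshold/pattern tables are illustrative, not the patent's.

```python
# One pass over a time-aligned recognition result, correcting deletion errors
# for the high-frequency error labels only (cf. steps S102-S108 in FIG. 12).

HIGH_FREQ_ERROR_LABELS = {"1", "2", "5"}             # e.g. yi, er, wu

def recognize_and_correct(aligned_labels, thresholds, patterns):
    """aligned_labels: [(label, start, end)]; thresholds: {label: (tx, ty)}."""
    out = []
    for i, (lab, start, end) in enumerate(aligned_labels):        # S102
        if lab not in HIGH_FREQ_ERROR_LABELS:                     # S103
            out.append(lab)
            continue
        if i == 0 or i == len(aligned_labels) - 1:                # needs both neighbours (S104)
            out.append(lab)
            continue
        dur = end - start                                         # S105
        pre = dur / (aligned_labels[i - 1][2] - aligned_labels[i - 1][1])
        post = dur / (aligned_labels[i + 1][2] - aligned_labels[i + 1][1])  # S106
        tx, ty = thresholds.get(lab, (float("inf"), float("inf")))
        if pre > tx and post > ty and lab in patterns:            # S107
            out.extend(patterns[lab])                             # S108
        else:
            out.append(lab)
    return out                                                    # S109

result = recognize_and_correct(
    [("0", 0.00, 0.25), ("1", 0.25, 0.80), ("4", 0.80, 1.05)],
    thresholds={"1": (1.8, 1.8)}, patterns={"1": ["1", "1"]})
print(result)  # → ['0', '1', '1', '4']
```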
In the flow of FIG. 12, the label string is output in the first step S101; however, instead of waiting for all labels in the label string to be output, the processing from step S102 onward may be executed on labels as they are sequentially recognized, while label recognition is still in progress.
(Advantages of the speech recognition method according to this embodiment)
FIG. 13 shows the recognition performance of the speech recognition device according to this embodiment on Chinese continuous digits. In FIG. 13, "Baseline" is the recognition rate obtained without the present method (speech recognition from the speech waveform only); "2-Dim_時間長Ratio" is the recognition rate of the correction method using the ratios of the utterance duration of the selected label to those of the preceding and following labels; and "2-Dim_PowerRatio" is the recognition rate of the correction method using the ratios of the average voice power of the selected label to those of the preceding and following labels. "Long" denotes the results for continuous digit strings of 11 to 15 digits, and "Short" those for continuous digit strings of 1 to 8 digits.
Both the method using the utterance-duration ratios and the method using the average voice-power ratios achieved higher recognition rates than the Baseline.
The advantages of correcting labels using the ratios of the selected label's utterance duration and voice power to those of the preceding and following labels are explained below.
FIG. 14 shows how the average utterance duration per digit (one syllable) varies across speakers. Comparing the leftmost speaker (speaker ID = CTM008_4B) with the rightmost speaker (speaker ID = CTF006_3A), the average utterance duration differs by a factor of about 1.7. Consequently, if, for example, the average utterance duration over all speakers were used as the reference and the ratio of the selected label's utterance duration to that average were adopted, the variation in speaking rate between speakers could not be accommodated.
In contrast, the speech recognition method according to this embodiment adopts the ratios of the selected label's utterance duration to those of the preceding and following labels, and can therefore judge misrecognition while reducing the influence of per-speaker variation in speaking rate.
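The point is easy to verify numerically. The durations below are invented for illustration; the 1.7 factor mirrors the spread reported for FIG. 14, and the "global average" is a hypothetical value.

```python
# Scaling every duration by a constant speaker-speed factor leaves the
# neighbour ratios unchanged, while a ratio against a fixed global average
# shifts by exactly that factor.

def neighbour_ratios(durs, i):
    return durs[i] / durs[i - 1], durs[i] / durs[i + 1]

durations_fast = [0.20, 0.44, 0.20]                  # fast speaker (seconds)
durations_slow = [d * 1.7 for d in durations_fast]   # ~1.7x slower speaker

print(neighbour_ratios(durations_fast, 1))   # same for both speakers
print(neighbour_ratios(durations_slow, 1))

global_avg = 0.30                                    # hypothetical all-speaker mean
print(durations_fast[1] / global_avg)        # these two differ by ~1.7x,
print(durations_slow[1] / global_avg)        # so a single threshold cannot fit both
```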
FIG. 15 shows how the average utterance duration per digit (one syllable) varies with the length of the digit string. Comparing the shortest string (1 digit, leftmost) with the longest (15 digits, rightmost), the average utterance duration differs by a factor of about 1.6. Thus, if, for example, the overall average utterance duration were used as the reference and the ratio of the selected label's utterance duration to it were adopted, the variation in utterance duration caused by the length of the digit string could not be accommodated.
In contrast, the speech recognition method according to this embodiment adopts the ratios of the selected label's utterance duration to those of the preceding and following labels, and can therefore judge misrecognition while reducing the influence of the variation in utterance duration caused by string length.
FIG. 16 shows the average syllable durations calculated from recordings of Chinese mobile phone numbers (11 digits) spoken with the common reading habit of "(first) 3 digits + (middle) 3 digits + (last) 5 digits". Even for groups of the same three digits, the utterance duration differs depending on the position within the utterance, that is, whether the group is the first three digits or the next three. Since utterance duration thus varies with position even within a single utterance, using local ratios between neighboring labels, as in the speech recognition method of this embodiment, is effective.
Furthermore, when the average duration or average voice power of the entire utterance is used, speech recognition cannot proceed until the utterance has finished. By contrast, when the utterance durations of each label and its neighboring labels are used, as in the speech recognition method of this embodiment, the speech can be processed sequentially as it is uttered, which also shortens the overall processing time. In addition, the processing load is lower, and the method can be implemented with a simpler configuration.
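Because each decision needs only a three-label window, the correction can run with one-label latency as labels arrive. The generator below is a hypothetical sketch of such streaming processing, assuming the same neighbour-ratio rule as above; the interface is not the patent's.

```python
def streaming_correct(label_stream, thresholds, patterns, watch=frozenset({"1"})):
    """label_stream yields (label, start, end); corrected labels are emitted
    as soon as the next neighbour is known, without waiting for utterance end."""
    win = []
    for item in label_stream:
        win.append(item)
        if len(win) == 1:
            continue
        if len(win) == 2:
            yield win[0][0]                  # first label has no left neighbour
            continue
        (_, ps, pe), (cl, cs, ce), (_, ns, ne) = win
        pre = (ce - cs) / (pe - ps)
        post = (ce - cs) / (ne - ns)
        tx, ty = thresholds.get(cl, (float("inf"), float("inf")))
        if cl in watch and pre > tx and post > ty and cl in patterns:
            yield from patterns[cl]          # e.g. deletion error: "1" -> "1 1"
        else:
            yield cl
        win.pop(0)                           # slide the three-label window
    if win:
        yield win[-1][0]                     # last label has no right neighbour

stream = iter([("0", 0.00, 0.25), ("1", 0.25, 0.80), ("4", 0.80, 1.05)])
print(list(streaming_correct(stream, {"1": (1.8, 1.8)}, {"1": ["1", "1"]})))
```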
FIG. 17 shows the recognition performance of the speech recognition device according to this embodiment on Chinese continuous digits. Baseline is the recognition rate obtained without the present method (speech recognition from the speech waveform only). 1-Dim_時間長Ratio is the recognition rate of the correction method using the ratio of the selected label's utterance duration to the overall average duration, and 1-Dim_PowerRatio is that of the correction method using the ratio of the selected label's voice power to the overall average power. 2-Dim_時間長Ratio is the recognition rate of the correction method using the ratios of the utterance duration of the selected label to those of the preceding and following labels, and 2-Dim_PowerRatio is that of the correction method using the ratios of the average voice power of the selected label to those of the preceding and following labels.
The evaluation results shown in FIG. 17 indicate that the speech recognition methods according to this embodiment (2-Dim_時間長Ratio, 2-Dim_PowerRatio) outperform the methods that use the overall average duration or the overall average power as the reference (1-Dim_時間長Ratio, 1-Dim_PowerRatio).
The description so far has centered on an example in which the label correction unit 150 corrects the selected label to a correction candidate label based on the ratios of the selected label's utterance duration to the utterance durations of the labels before and after it, together with the first predetermined condition concerning those ratios; however, the method is not limited to this.
When either the Pre_ratio or the Post_ratio of a label falls within the range of the incorrect samples, the other is very likely to fall within that range as well. The label correction unit 150 may therefore correct the selected label to the correction candidate label based on the ratio of the selected label's utterance duration to that of the label either before or after it, together with the first predetermined condition concerning that ratio. In other words, the label correction unit 150 may correct at least one of two adjacent labels based on the ratio of the utterance durations of the adjacent labels and on a first predetermined condition based on statistical information concerning that ratio.
With this configuration, although the correction accuracy is somewhat lower than when the selected label is corrected based on the ratios of its utterance duration to those of both the preceding and following labels, the processing speed can be improved.
(Speech recognition program)
The processing in the speech recognition device according to this embodiment described above is one aspect of processing realized by a speech recognition program that causes a computer to execute: a first step of performing speech recognition on input speech and outputting a label string indicating the speech recognition result; a second step of outputting the utterance durations of at least one label included in the label string and of the labels before and after it; and a third step of correcting the label to a correction candidate label, which is a label that is a candidate for correction, based on the utterance durations of the label and of the labels before and after it and on a first predetermined condition based on statistical information concerning the utterance duration of the label.
The processing in the speech recognition device according to this embodiment may also be realized by a speech recognition program that causes a computer to execute: a first step of performing speech recognition on input speech and outputting a label string indicating the speech recognition result; a second step of outputting the utterance durations of the labels included in the label string; and a third step of correcting at least one of two adjacent labels based on the ratio of the utterance durations of the adjacent labels and on a first predetermined condition based on statistical information concerning that ratio.
This speech recognition program can be stored on any computer-readable storage medium, such as semiconductor storage media (RAM, ROM, etc.), magnetic storage media (FD, HD, etc.), optically read storage media (CD, CDV, LD, DVD, etc.), and magneto-optical storage media (MO, etc.), regardless of whether the reading method is electronic, magnetic, or optical, and can be distributed and sold on such media or downloaded over a network or the like.
Although embodiments of the present invention have been described above, the scope of the present invention is not limited to the illustrated and described exemplary embodiments, and also includes all embodiments that provide effects equivalent to those intended by the present invention. Furthermore, the scope of the invention can be defined by any desired combination of particular features among all the disclosed features.
DESCRIPTION OF REFERENCE SIGNS
1 Speech recognition device
10 First predetermined condition
20 Second predetermined condition
110 Label string output unit
120 Label selection unit
130 Utterance duration output unit
140 Voice power information output unit
150 Label correction unit
160 Result output unit
170 Storage device

Claims (9)

  1.  A speech recognition device comprising:
     a label string output unit that performs speech recognition on input speech and outputs a label string indicating the speech recognition result;
     an utterance duration output unit that outputs the utterance durations of at least one label included in the label string and of the labels before and after it; and
     a label correction unit that corrects the label to a correction candidate label, which is a label that is a candidate for correction, based on the utterance durations of the label and of the labels before and after it and on a first predetermined condition based on statistical information concerning the utterance duration of the label.
  2.  The speech recognition device according to claim 1, wherein the label correction unit corrects the label to the correction candidate label based on the ratios of the utterance duration of the label to the utterance durations of the labels before and after it and on the first predetermined condition, which concerns those ratios.
  3.  The speech recognition device according to claim 1 or 2, further comprising a voice power information output unit that outputs voice power information, which is information indicating the voice power in the utterance segments of at least one label included in the label string and of the labels before and after it,
     wherein the label correction unit corrects the label to the correction candidate label based not only on the utterance durations of the label and of the labels before and after it and the first predetermined condition, but also on the voice power information of the label and of the labels before and after it and on a second predetermined condition based on statistical information concerning the voice power of the label.
  4.  The speech recognition device according to claim 3, wherein the label correction unit corrects the label to the correction candidate label based not only on the utterance durations of the label and of the labels before and after it and the first predetermined condition, but also on the ratios of the voice power of the label to the voice powers of the labels before and after it and on the second predetermined condition, which concerns those ratios.
  5.  A speech recognition device comprising:
     a label string output unit that performs speech recognition on input speech and outputs a label string indicating the speech recognition result;
     a voice power information output unit that outputs voice power information, which is information indicating the voice power in the utterance segments of at least one label included in the label string and of the labels before and after it; and
     a label correction unit that corrects the label to a correction candidate label, which is a label that is a candidate for correction, based on the voice power information of the label and of the labels before and after it and on a second predetermined condition based on statistical information concerning the voice power of the label.
  6.  A speech recognition device comprising:
     a label string output unit that performs speech recognition on input speech and outputs a label string indicating the speech recognition result;
     an utterance duration output unit that outputs the utterance durations of the labels included in the label string; and
     a correction unit that corrects at least one of two adjacent labels based on the ratio of the utterance durations of the adjacent labels and on a first predetermined condition based on statistical information concerning that ratio.
  7.  A speech recognition method executed by a speech recognition device having a label string output unit, an utterance duration output unit, and a label correction unit, the method comprising:
     a first step in which the label string output unit performs speech recognition on input speech and outputs a label string indicating the speech recognition result;
     a second step in which the utterance duration output unit outputs the utterance durations of at least one label included in the label string and of the labels before and after it; and
     a third step in which the label correction unit corrects the label to a correction candidate label, which is a label that is a candidate for correction, based on the utterance durations of the label and of the labels before and after it and on a first predetermined condition based on statistical information concerning the utterance duration of the label.
  8.  A speech recognition program for causing a computer to execute:
     a first step of performing speech recognition on input speech and outputting a label string indicating the speech recognition result;
     a second step of outputting the utterance durations of at least one label included in the label string and of the labels before and after it; and
     a third step of correcting the label to a correction candidate label, which is a label that is a candidate for correction, based on the utterance durations of the label and of the labels before and after it and on a first predetermined condition based on statistical information concerning the utterance duration of the label.
  9.  A speech recognition program for causing a computer to execute:
     a first step of performing speech recognition on input speech and outputting a label string indicating the speech recognition result;
     a second step of outputting the utterance durations of the labels included in the label string; and
     a third step of correcting at least one of two adjacent labels based on the ratio of the utterance durations of the adjacent labels and on a first predetermined condition based on statistical information concerning that ratio.
PCT/JP2012/002861 2011-05-02 2012-04-26 Voice recognition device and voice recognition method WO2012150658A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2011103110 2011-05-02
JP2011-103110 2011-05-02

Publications (1)

Publication Number Publication Date
WO2012150658A1 true WO2012150658A1 (en) 2012-11-08

Family

ID=47107849

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2012/002861 WO2012150658A1 (en) 2011-05-02 2012-04-26 Voice recognition device and voice recognition method

Country Status (1)

Country Link
WO (1) WO2012150658A1 (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS58224396A (en) * 1982-06-23 1983-12-26 富士通株式会社 Voice recognition equipment
JPH056196A (en) * 1991-06-27 1993-01-14 Matsushita Electric Ind Co Ltd Voice recognizing device
JPH07210193A (en) * 1994-01-12 1995-08-11 Matsushita Electric Ind Co Ltd Voice conversation device
JPH08248983A (en) * 1995-03-09 1996-09-27 Mitsubishi Electric Corp Voice recognition device
JPH11184496A (en) * 1997-12-19 1999-07-09 Toshiba Corp Device and method for speech recognition
JP2003345388A (en) * 2002-05-23 2003-12-03 Nec Corp Method, device, and program for voice recognition

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107871503A (en) * 2016-09-28 2018-04-03 Toyota Motor Corp Speech dialogue system and utterance intention understanding method
CN107871503B (en) * 2016-09-28 2023-02-17 丰田自动车株式会社 Speech dialogue system and utterance intention understanding method
CN110232923A (en) * 2019-05-09 2019-09-13 Qingdao Hisense Electric Co Ltd Voice control instruction generation method and device, and electronic equipment
CN110232923B (en) * 2019-05-09 2021-05-11 海信视像科技股份有限公司 Voice control instruction generation method and device and electronic equipment
CN112420016A (en) * 2020-11-20 2021-02-26 Sichuan Changhong Electric Co Ltd Method and device for aligning synthesized voice and text and computer storage medium
CN112420016B (en) * 2020-11-20 2022-06-03 四川长虹电器股份有限公司 Method and device for aligning synthesized voice and text and computer storage medium

Similar Documents

Publication Publication Date Title
US8731926B2 (en) Spoken term detection apparatus, method, program, and storage medium
EP3114679B1 (en) Predicting pronunciation in speech recognition
US8140330B2 (en) System and method for detecting repeated patterns in dialog systems
JP6131537B2 (en) Speech recognition system, speech recognition program, recording medium, and speech recognition method
JP5207642B2 (en) System, method and computer program for acquiring a character string to be newly recognized as a phrase
US8494853B1 (en) Methods and systems for providing speech recognition systems based on speech recordings logs
US8645139B2 (en) Apparatus and method of extending pronunciation dictionary used for speech recognition
US8990086B2 (en) Recognition confidence measuring by lexical distance between candidates
EP2685452A1 (en) Method of recognizing speech and electronic device thereof
JP3834169B2 (en) Continuous speech recognition apparatus and recording medium
JP6812843B2 (en) Computer program for voice recognition, voice recognition device and voice recognition method
US8532990B2 (en) Speech recognition of a list entry
CN111797632B (en) Information processing method and device and electronic equipment
US9484019B2 (en) System and method for discriminative pronunciation modeling for voice search
CN109036471B (en) Voice endpoint detection method and device
US20130289987A1 (en) Negative Example (Anti-Word) Based Performance Improvement For Speech Recognition
JP6276513B2 (en) Speech recognition apparatus and speech recognition program
US20170270923A1 (en) Voice processing device and voice processing method
WO2012150658A1 (en) Voice recognition device and voice recognition method
JP6481939B2 (en) Speech recognition apparatus and speech recognition program
KR101122591B1 (en) Apparatus and method for speech recognition by keyword recognition
JP6027754B2 (en) Adaptation device, speech recognition device, and program thereof
KR101483947B1 (en) Apparatus for discriminative training acoustic model considering error of phonemes in keyword and computer recordable medium storing the method thereof
JP7098587B2 (en) Information processing device, keyword detection device, information processing method and program
CN112863496A (en) Voice endpoint detection method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12779301

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12779301

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP