WO2012150658A1 - Voice recognition device and voice recognition method - Google Patents

Voice recognition device and voice recognition method Download PDF

Info

Publication number
WO2012150658A1
WO2012150658A1 (PCT/JP2012/002861, JP2012002861W)
Authority
WO
WIPO (PCT)
Prior art keywords
label
labels
ratio
utterance
speech
Prior art date
Application number
PCT/JP2012/002861
Other languages
French (fr)
Japanese (ja)
Inventor
暁東 王
邦彦 尾和
誠 庄境
Original Assignee
旭化成株式会社 (Asahi Kasei Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 旭化成株式会社 (Asahi Kasei Corporation)
Publication of WO2012150658A1 publication Critical patent/WO2012150658A1/en

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech

Definitions

  • The present invention relates to a voice recognition device and a voice recognition method.
  • In a known approach, word string candidates are selected by pruning, using a correlation probability model generated in a learning phase before speech recognition is executed, and higher-accuracy speech recognition is attempted by weighting each selected word string candidate word by word.
  • This correlation probability model includes statistical information such as the time length and speech power of each recognition word, the speaking speed based on the time-length ratio of vowels to consonants, and the number of words per sentence.
  • The speech recognition apparatus described in Patent Document 2 comprises: speech recognition means that recognizes an input speech and outputs a plurality of recognition result candidates, each annotated with the duration of each syllable; syllable boundary candidate detection means that obtains syllable boundary candidates from the input speech; average syllable length estimation means that obtains an average syllable length from the syllable boundary candidates; and candidate selection means that selects a recognition result from the plurality of recognition result candidates based on the recognition result candidates and the average syllable length.
  • The speech recognition method described in Patent Document 1 is premised mainly on recognizing spoken sentences, and recognizes speech using a correlation probability model that includes statistical information on time length and speech power in units of words.
  • However, the speaking speed of an utterance may change from moment to moment, and speaking speed and timing vary from speaker to speaker.
  • The accuracy of such voice recognition is therefore greatly affected by how the speaker happens to speak each time.
  • The speech recognition apparatus described in Patent Document 2, which compares the duration of each syllable of a recognition result candidate with the average syllable length, is likewise vulnerable to instantaneous changes in speaking speed. The object of the present invention is therefore to provide a high-accuracy speech recognition apparatus and speech recognition method that are fast in processing speed, low in cost, and not easily affected by changes in speaking speed or by each speaker's manner of speaking.
  • According to one aspect, a speech recognition device comprises: a label string output unit that recognizes an input voice and outputs a label string indicating the voice recognition result; an utterance duration output unit that outputs the utterance duration of at least one label included in the label string and of the labels before and after it; and a label correction unit that corrects the label to a correction candidate label based on a first predetermined condition, which is based on the utterance durations of the label and of the preceding and following labels and on statistical information relating to the utterance duration of the label.
  • The label correction unit may correct the label to the correction candidate label based on the first predetermined condition, which relates to the ratio between the utterance duration of the label and the utterance durations of the labels before and after it.
  • With this configuration, speech recognition can be performed sequentially based on utterance durations, using only each label included in the label string that is the recognition result of the input speech and the labels before and after it. Highly accurate speech recognition that is not easily influenced by changes in speaking speed or by each speaker's manner of speaking therefore becomes possible.
  • The device may further comprise a voice power information output unit that outputs voice power information, i.e., information indicating the voice power in the utterance section of at least one label included in the label string and of the labels before and after it.
  • The label correction unit may then correct the label to the correction candidate label based on a second predetermined condition, which is based on the voice power information of the label and of the preceding and following labels and on statistical information relating to the voice power of the label.
  • In addition to the first predetermined condition, the label correction unit may correct the label to the correction candidate label based on the second predetermined condition, which relates to the ratio between the voice power of the label and the voice power of the labels before and after it.
  • With this configuration, label correction is performed based on voice power in addition to utterance duration, again using only each label included in the label string that is the recognition result of the input speech and the labels before and after it, so even more accurate speech recognition can be performed.
  • According to another aspect, a speech recognition device comprises: a label string output unit that recognizes an input voice and outputs a label string indicating the voice recognition result; a voice power information output unit that outputs voice power information, i.e., information indicating the voice power in the utterance section of at least one label included in the label string and of the labels before and after it; and a label correction unit that corrects the label to a correction candidate label based on a second predetermined condition, which is based on the voice power information of the label and of the preceding and following labels and on statistical information relating to the voice power of the label.
  • According to yet another aspect, a speech recognition device comprises: a label string output unit that recognizes an input voice and outputs a label string indicating the voice recognition result; an utterance duration output unit that outputs the utterance duration of each label included in the label string; and a label correction unit that corrects at least one of two adjacent labels based on a first predetermined condition, which is based on the ratio of the utterance durations of the adjacent labels and on statistical information relating to that ratio.
  • With this configuration, label correction (and hence speech recognition) can be performed sequentially based on the ratio of the utterance durations of adjacent labels among the labels included in the label string that is the recognition result of the input voice, so highly accurate speech recognition that is not easily influenced by changes in speaking speed or by each speaker's manner of speaking becomes possible.
  • According to a further aspect, there is provided a speech recognition method executed by a speech recognition apparatus having a label string output unit, an utterance duration output unit, and a label correction unit, the method comprising: a first step in which the label string output unit recognizes input speech and outputs a label string indicating the speech recognition result; a second step in which the utterance duration output unit outputs the utterance duration of at least one label included in the label string and of the labels before and after it; and a third step in which the label correction unit corrects the label to a correction candidate label based on a first predetermined condition, which is based on the utterance durations of the label and of the preceding and following labels and on statistical information relating to the utterance duration of the label.
  • Also provided are the second step of outputting the utterance durations of at least one label and of the preceding and following labels, and the third step of correcting the label to a correction candidate label based on the first predetermined condition, which is based on those utterance durations and on statistical information relating to the utterance duration of the label.
  • FIG. 1 is a diagram showing a configuration example of a speech recognition apparatus according to one embodiment of the present invention. FIG. 2 is a diagram showing a specific example of correct labels and recognition result labels. FIG. 3 is a diagram showing a specific example of the utterance durations, Pre_ratio, and Post_ratio of correct labels and recognition result labels. FIG. 4 is a diagram showing an example of a scatter diagram of correct samples and incorrect samples for the number “5”. FIG. 5 is a diagram showing another specific example of correct labels and recognition result labels. FIG. 6 is a diagram showing a specific example of the average voice power, Pre_ratio, and Post_ratio of correct labels and recognition result labels.
  • FIG. 1 is a diagram illustrating a configuration example of a speech recognition apparatus according to the present embodiment.
  • The speech recognition apparatus 1 includes a label string output unit 110, a label selection unit 120, an utterance duration output unit 130, a voice power information output unit 140, a label correction unit 150, and a result output unit 160.
  • the label string output unit 110 recognizes the input voice and outputs a label string indicating the voice recognition result.
  • Here, a “label” refers to one unit obtained by dividing the input voice into syllables. In the present embodiment, one number corresponds to one label. A “label string” includes at least one label.
  • The “speech recognition result (of the input speech)” means the result of performing speech recognition by acoustically analyzing the speech waveform of the input speech and using a feature quantity such as MFCC (Mel-Frequency Cepstrum Coefficients).
  • In the label string output unit 110, for example, numbers are recognized from the speech waveform of the input speech using MFCCs or the like, and a number string (that is, a label string) including at least one number (that is, a label) is output. At this time, the utterance start time and utterance end time indicating the utterance timing of each number included in the number string (hereinafter referred to as “time boundary information” as appropriate) are also recognized together with the label string.
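The label-plus-time-boundary output described above can be sketched as a small data structure. This is an illustrative sketch only; the field names are assumptions, not taken from the patent.

```python
# Hypothetical sketch of one recognized label with its time boundary
# information, as output by the label string output unit 110.
from dataclasses import dataclass

@dataclass
class Label:
    text: str      # the recognized number, e.g. "5"
    start: float   # utterance start time in seconds
    end: float     # utterance end time in seconds

    @property
    def duration(self) -> float:
        # Utterance duration = utterance end time - utterance start time
        return self.end - self.start

# A label string is simply a list of such labels.
label_string = [Label("0", 0.0, 0.4), Label("3", 0.5, 0.9)]
print([round(lab.duration, 3) for lab in label_string])
```

The later ratio-based checks need only `duration` (and, in the voice-power variant, an average power per label), so this record is the minimal shape the correction stage consumes.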
  • The label selection unit 120 selects at least one of the labels included in the label string output by the label string output unit 110.
  • The label selected by the label selection unit 120 is the label to be corrected by the label correction unit 150 described later. That is, labels included in the label string are sequentially selected by the label selection unit 120, the selected labels are corrected as appropriate by the label correction unit 150, and in this way voice recognition of the entire label string is performed.
  • The labels to be corrected by the label correction unit 150 may be all of the labels included in the label string, or only some of them; for example, only labels that are frequently misrecognized may be subjected to label correction. This reduces the processing load of voice recognition and increases the processing speed.
  • The labels included in the label string may also be selected and corrected in order, starting from the label with the shortest utterance duration or the lowest voice power.
  • In this way, label correction is performed preferentially on the labels most likely to be misrecognized, improving the efficiency of label correction.
  • The utterance duration output unit 130 outputs the utterance duration of at least one of the labels included in the label string output by the label string output unit 110 and of the labels before and after it.
  • the label string output unit 110 outputs a label string by, for example, recognizing a label using an MFCC or the like from the voice waveform of the input voice.
  • the utterance start time and utterance end time indicating the utterance timing of each label included in the label string are also recognized.
  • The utterance duration output unit 130 calculates the utterance duration of each label by subtracting the utterance start time from the utterance end time.
  • The voice power information output unit 140 outputs voice power information, which is information indicating the voice power in the utterance section of at least one of the labels included in the label string output by the label string output unit 110 and of the labels before and after that label.
  • the “information indicating the voice power in the utterance section of the label” may be any information that directly or indirectly indicates the voice power of the label. For example, an average value of voice power between the utterance start time and the utterance end time of the label can be used.
  • The label correction unit 150 corrects the label selected by the label selection unit 120 (hereinafter, the “selected label”) to a correction candidate label based on a first predetermined condition, which is based on the utterance durations of the selected label and of the labels before and after it and on statistical information relating to the utterance duration of the selected label. At this time, the label correction unit 150 may correct the selected label to the correction candidate label based on the first predetermined condition relating to the ratio between the utterance duration of the selected label and the utterance durations of the labels before and after it. The first predetermined condition may also be a condition based on statistical information regarding that ratio.
  • In the present embodiment, the first predetermined condition is stored in a storage device 170, such as a hard disk included in the voice recognition device 1, but this is not limiting; for example, it may be received from an external device each time the processing in the label correction unit 150 is executed.
  • Likewise, the statistical information regarding the utterance duration of each label may be stored in an external device, or in the storage device 170 of the speech recognition device 1.
  • The label correction unit 150 may also correct the selected label to a correction candidate label based on a second predetermined condition, which is based on the voice power information of the selected label and of the labels before and after it and on statistical information relating to the voice power of the selected label. At this time, the label correction unit 150 may correct the selected label based on the second predetermined condition relating to the ratio between the voice power of the selected label and the voice power of the labels before and after it; the second predetermined condition may likewise be a condition based on statistical information regarding that ratio.
  • The second predetermined condition is likewise stored in the storage device 170, such as a hard disk included in the voice recognition device 1, but this is not limiting; it may be received from an external device each time the processing in the label correction unit 150 is executed. Statistical information regarding the voice power of each label may also be stored in an external device or in the storage device 170 of the voice recognition device 1.
  • Although FIG. 1 shows the case where the speech recognition apparatus 1 has both the utterance duration output unit 130 and the voice power information output unit 140, the apparatus may have only one of them.
  • the result output unit 160 outputs the label sequence corrected by the label correction unit 150 to an external device or the like as a final speech recognition result.
  • The functions of the components described above are realized by a CPU (Central Processing Unit), not shown, included in the speech recognition apparatus 1, which reads a program stored in a storage device such as a hard disk or a ROM (Read Only Memory) out onto a memory such as a RAM and executes it.
  • The label string, labels, utterance start times, utterance end times, utterance durations, voice power information, first predetermined condition 10, second predetermined condition 20, and the like are data stored in the storage device, memory, or the like.
  • the process in the speech recognition apparatus will be described with a specific example.
  • First, a case (insertion misrecognition) will be described in which the number of labels included in the recognition result recognized from the speech waveform is larger than the number of labels in the input speech, because extra labels were erroneously recognized.
  • The actual input speech is “0, 3, 6, 4”, but the label string (recognition result) recognized by the label string output unit 110 of the speech recognition device 1 is “0, 3, 6, 5, 4”.
  • the utterance start time and the utterance end time which are time boundary information of each label (number) of the recognition result, are as shown in FIG.
  • The label correction unit 150 calculates the ratio between the utterance duration of the selected label selected by the label selection unit 120 and the utterance durations of the labels before and after it.
  • A threshold (first predetermined condition) for the ratio of each label's utterance duration to those of the labels before and after it is determined in advance based on statistical information (the method for determining the threshold will be described later). The label correction unit 150 then compares the calculated ratio with the threshold, and determines whether or not to correct the selected label to a correction candidate label according to the comparison result.
  • Specifically, the utterance duration of each label is first calculated in the utterance duration output unit 130 according to the following formula:
  • Utterance duration = utterance end time − utterance start time
  • Next, the Pre_ratio and Post_ratio of each label are calculated from the calculated utterance duration of each label (FIG. 3).
  • Pre_ratio is the ratio between the utterance duration of the correction target label and the utterance duration of the immediately preceding label.
  • Post_ratio is the ratio between the utterance duration of the correction target label and the utterance duration of the label immediately after that.
  • Pre_ratio = utterance duration of the numeric label to be corrected / utterance duration of the numeric label immediately before it
  • Post_ratio = utterance duration of the numeric label to be corrected / utterance duration of the numeric label immediately after it
  • The thresholds for the Pre_ratio and Post_ratio of each label, determined in advance based on statistical information, are in general different for each label; in this example, however, they are all assumed to be “0.5” in order to simplify the description.
  • Since the numeric label “5 (wu)” has Pre_ratio < 0.5 and Post_ratio < 0.5, “5 (wu)” is determined to be a numeric label that was erroneously recognized and inserted (hereinafter, an insertion error). “5 (wu)” is therefore deleted by the label correction unit 150, and the final recognition result of the speech recognition apparatus 1 is “0, 3, 6, 4”.
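The insertion-error check just described can be sketched as follows. The time boundaries below are hypothetical stand-ins (FIG. 2 is not reproduced here), chosen so that “5” is much shorter than its neighbours, and the single 0.5 threshold is the simplified value used in this example.

```python
# Hypothetical sketch: delete a label as an insertion error when the ratio
# of its utterance duration to both neighbouring labels' durations falls
# below the threshold.

def duration(boundary):
    """Utterance duration = utterance end time - utterance start time."""
    start, end = boundary
    return end - start

def correct_insertions(labels, boundaries, threshold=0.5):
    """Drop labels whose Pre_ratio and Post_ratio are both below the
    threshold (a single illustrative value here; per-label in the text)."""
    durs = [duration(b) for b in boundaries]
    kept = []
    for i, label in enumerate(labels):
        if 0 < i < len(labels) - 1:
            pre_ratio = durs[i] / durs[i - 1]
            post_ratio = durs[i] / durs[i + 1]
            if pre_ratio < threshold and post_ratio < threshold:
                continue  # judged an insertion error; delete the label
        kept.append(label)
    return kept

# Hypothetical boundaries in seconds: "5" spans only 0.1 s.
labels = ["0", "3", "6", "5", "4"]
boundaries = [(0.0, 0.4), (0.5, 0.9), (1.0, 1.4), (1.45, 1.55), (1.6, 2.0)]
print(correct_insertions(labels, boundaries))  # → ['0', '3', '6', '4']
```

Note that only the label and its immediate neighbours are consulted, which is what makes the check robust to gradual changes in overall speaking speed.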
  • Threshold determination method: the thresholds for Pre_ratio and Post_ratio are determined in a prior learning phase. For example, when determining the Pre_ratio and Post_ratio thresholds for the number “5”, speech recognition of a learning speech signal is performed, and the samples in which the number “5” is the correct answer are separated from those in which it is an incorrect answer. Pre_ratio and Post_ratio are then calculated for each sample.
  • FIG. 4 is a scatter diagram in which the correct samples (○) and incorrect samples (×) for the number “5” are plotted in a two-dimensional space with Pre_ratio on the horizontal axis and Post_ratio on the vertical axis.
  • The coordinates of a point R are varied in fixed steps (for example, 0.05) within a certain range (for example, from 0 to 1), and the coordinates Tx and Ty of the point R at which the difference between the number of incorrect samples and the number of correct samples is maximized are taken as the thresholds for Pre_ratio and Post_ratio, respectively.
  • Next, label correction based on voice power will be described. The label correction unit 150 calculates the ratio between the average voice power of the selected label selected by the label selection unit 120 and the average voice power of the labels before and after it. A threshold (second predetermined condition) for the ratio of each label's average voice power to those of the labels before and after it is determined in advance based on statistical information (the method for determining the threshold will be described later). The label correction unit 150 then compares the calculated ratio with the threshold, and determines whether or not to correct the selected label to a correction candidate label according to the comparison result.
  • Pre_ratio is the ratio between the average voice power of the label to be corrected and the average voice power of the label immediately before it.
  • Post_ratio is the ratio between the average voice power of the label to be corrected and the average voice power of the label immediately after it.
  • Pre_ratio = average voice power of the numeric label to be corrected / average voice power of the numeric label immediately before it
  • Post_ratio = average voice power of the numeric label to be corrected / average voice power of the numeric label immediately after it
  • The thresholds for the Pre_ratio and Post_ratio of each label, determined in advance based on statistical information, are in general different for each label; in this example, however, they are all assumed to be “0.5” in order to simplify the description.
  • Since the numeric label “2 (er)” has Pre_ratio < 0.5 and Post_ratio < 0.5, “2 (er)” is determined to be a numeric label that was erroneously recognized and inserted (an insertion error). “2 (er)” is therefore deleted by the label correction unit 150, and the final recognition result of the speech recognition apparatus 1 is “0, 3, 4”.
  • FIG. 7 is a diagram illustrating an example of the relationship between the time series of voice power and the time boundaries of each label of the recognition result. The average voice power of each numeric label is the average of the voice power time series within the segment between that number's utterance start time and utterance end time.
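Under that definition, the per-label average power can be computed from a sampled power contour and each label's time boundary. The contour and boundaries below are hypothetical, chosen so that a low-power middle label is flagged by the 0.5 threshold used in this example.

```python
# Hypothetical sketch: average voice power per label from a frame-wise
# power time series, then the Pre_ratio / Post_ratio check on it.

def mean_power(times, powers, start, end):
    """Average of the power samples falling inside [start, end]."""
    vals = [p for t, p in zip(times, powers) if start <= t <= end]
    return sum(vals) / len(vals)

# A hypothetical power contour sampled every 10 ms: a low-power dip
# between 0.4 s and 0.6 s.
times = [i / 100 for i in range(100)]
powers = [1.0 if t < 0.4 else 0.2 if t < 0.6 else 1.0 for t in times]

# Hypothetical boundaries of three labels; the middle one sits in the dip.
p0 = mean_power(times, powers, 0.00, 0.39)
p1 = mean_power(times, powers, 0.40, 0.59)
p2 = mean_power(times, powers, 0.60, 0.99)
pre_ratio, post_ratio = p1 / p0, p1 / p2
print(pre_ratio < 0.5 and post_ratio < 0.5)  # middle label is flagged
```

In a real system the power contour would come from the acoustic front end; the ratio test itself is identical to the duration-based one.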
  • When the Pre_ratio and Post_ratio thresholds for the number “2” are determined, the learning speech signal is recognized in the learning phase, and the samples in which the number “2” is the correct answer are separated from those in which it is an incorrect answer. Pre_ratio and Post_ratio are then calculated for each sample.
  • FIG. 8 is a scatter diagram in which the correct samples (○) and incorrect samples (×) for the number “2” are plotted in a two-dimensional space with Pre_ratio on the horizontal axis and Post_ratio on the vertical axis.
  • The coordinates of a point R are varied in fixed steps (for example, 0.05) within a certain range (for example, from 0 to 1), and the coordinates Tx and Ty of the point R at which the difference between the number of incorrect samples and the number of correct samples is maximized are taken as the thresholds for Pre_ratio and Post_ratio, respectively.
  • Next, a case (deletion misrecognition) will be described in which the number of labels included in the recognition result recognized from the speech waveform is smaller than the number of labels in the input speech, because labels that should have been recognized are missing.
  • The actual input speech (correct answer) is “0, 1, 1, 4”, but the label string (recognition result) recognized by the label string output unit 110 of the speech recognition device 1 is “0, 1, 4”.
  • the utterance start time and the utterance end time which are time boundary information of each label (number) of the recognition result, are as shown in FIG.
  • In this example, the label correction unit 150 corrects labels using the utterance duration of each label; the same processing can also be performed when the voice power information of each label is used.
  • First, the utterance duration of each label “0, 1, 4” of the recognition result is calculated from its utterance start time and utterance end time by the formula given earlier.
  • Next, the Pre_ratio and Post_ratio of each label are calculated (FIG. 10). The utterance duration, Pre_ratio, and Post_ratio of each label are calculated in the same manner as in the first specific example.
  • The thresholds (first predetermined condition) for the Pre_ratio and Post_ratio of each label, determined in advance based on statistical information, are in general different for each label, but may also be the same value; in this example they are all assumed to be “1.8” for simplicity. Comparing the Pre_ratio and Post_ratio of each label shown in FIG. 10 with the threshold “1.8”, the numeric label “1 (yi)” has Pre_ratio > 1.8 and Post_ratio > 1.8, so the numeric label “1 (yi)” is determined to have been erroneously recognized.
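The deletion-error case can be sketched symmetrically to the insertion case: a label whose duration greatly exceeds both neighbours' (both ratios above the threshold, here the illustrative 1.8) is assumed to have swallowed an adjacent identical label and is expanded. The durations below are hypothetical, not taken from FIG. 9 or FIG. 10.

```python
# Hypothetical sketch: restore a deleted duplicate label when a label's
# duration ratio to both neighbours exceeds the threshold.

def correct_deletions(labels, durations, threshold=1.8):
    out = []
    for i, label in enumerate(labels):
        if 0 < i < len(labels) - 1:
            pre_ratio = durations[i] / durations[i - 1]
            post_ratio = durations[i] / durations[i + 1]
            if pre_ratio > threshold and post_ratio > threshold:
                # Judged a deletion error: the label's utterance section
                # likely covered an adjacent identical label as well.
                out.extend([label, label])
                continue
        out.append(label)
    return out

labels = ["0", "1", "4"]
durations = [0.4, 0.9, 0.4]  # "1" spans roughly two syllables
print(correct_deletions(labels, durations))  # → ['0', '1', '1', '4']
```

Which correction to apply on detection (here, duplicating the label) follows the high-frequency misrecognition pattern learned beforehand, as the text explains next.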
  • The correction to apply is a frequently occurring misrecognition pattern determined in advance by collating recognition results with the correct answers in the prior learning phase.
  • The thresholds for Pre_ratio and Post_ratio are again determined in the prior learning phase. For example, when the Pre_ratio and Post_ratio thresholds for the number “1” are determined, the result of speech recognition of the learning speech signal is analyzed as follows. First, the recognition results are sorted into three groups: samples in which the number “1” is correct; incorrect samples (insertion errors) in which the number “1” was inserted; and incorrect samples (deletion errors) in which the utterance section of the number “1” was recognized so as to include an adjacent number “1”, with the result that the adjacent number “1” was deleted. Pre_ratio and Post_ratio are then calculated for each sample.
  • FIG. 11 is a scatter diagram, plotted in a two-dimensional space with Pre_ratio on the horizontal axis and Post_ratio on the vertical axis, of the correct samples (○), the incorrect samples of insertion errors (INS), and the incorrect samples of deletion errors (DEL) (×).
  • The coordinates of a point R are varied in fixed steps (for example, 0.05) within a certain range (for example, 1 to 10), and the coordinates Tx and Ty of the point R at which the difference between the number of incorrect samples of deletion errors and the number of correct samples is maximized are taken as the thresholds for Pre_ratio and Post_ratio, respectively.
  • FIG. 11 is an example in which the coordinates (Tx, Ty) of the point R at which the difference between the number of incorrect samples of deletion errors and the number of correct samples is maximum are (1.8, 1.8).
  • the above processing is performed for all labels that can be output by the label string output unit 110, and the Pre_ratio and Post_ratio threshold values for each label are determined.
  • In the above, the insertion error and the deletion error have been described individually, but they can of course be processed simultaneously. That is, both the Pre_ratio and Post_ratio thresholds for correcting insertion errors and those for correcting deletion errors are held in the storage device 170, and the Pre_ratio and Post_ratio of each label of the recognition result are compared with both sets of thresholds, so that insertion errors and deletion errors are corrected at the same time.
  • FIG. 12 is a flowchart showing the flow of processing in the speech recognition apparatus according to this embodiment.
  • the label string output unit 110 outputs the label string, the time boundary information of each label, and the audio power 30 (step S101).
  • Next, one label is selected from the labels included in the label string (step S102).
  • The labels to be corrected (for example, “1 (yi)”, “2 (er)”, “5 (wu)”) are determined in advance as the high-frequency error patterns 40 and stored in the storage device 170, such as a hard disk.
  • When the label selected in step S102 is a verification target of the label correction (step S103), the label immediately preceding and the label immediately following the selected label are acquired (step S104). If the selected label is not a verification target of the label correction (step S103), the process returns to step S101 and the processing is repeated.
  • Next, the utterance duration or the voice power 50 of the selected label and of the labels before and after it is output by the utterance duration output unit 130 or the voice power information output unit 140, respectively (step S105). Then, based on the prosodic information (utterance duration or voice power) 50 output in step S105, the label correction unit 150 calculates the ratio of the selected label's utterance duration to the utterance durations of the preceding and following labels, or the ratio of the selected label's voice power to the voice power of the preceding and following labels (step S106).
  • The ratio calculated in step S106 is compared with the threshold values determined in advance from statistical information (the first predetermined condition 10 and the second predetermined condition 20), and it is determined whether a recognition error (insertion error, deletion error, etc.) has occurred (step S107). If it is a recognition error, the label correction unit 150 corrects the selected label (step S108). After the above processing has been repeated for all labels of the recognition result, the speech recognition of the label string ends (step S109), and the final speech recognition result is output by the result output unit 160.
  • Although the label string is output first in step S101 in the above description, the processing from step S102 onward may be executed sequentially as labels are recognized one by one, without waiting for all the labels in the label string to be output.
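The loop of steps S102 through S108 might be sketched as follows, here for the duration-based check only. Labels are represented as (text, duration) pairs, `targets` plays the role of the high-frequency error patterns 40, and `thresholds` holds the learned (Tx, Ty) per label; the flagged-index output and the deletion-error-style threshold region are illustrative assumptions.

```python
def correct_labels(labels, targets, thresholds):
    """One pass over the recognized label sequence.
    labels: list of (text, duration_in_seconds) pairs.
    Returns the indices of labels flagged as recognition errors."""
    flagged = []
    for i in range(1, len(labels) - 1):      # both neighbours required
        text, dur = labels[i]
        if text not in targets:              # S103: not a target
            continue
        pre = dur / labels[i - 1][1]         # S105-S106: ratios
        post = dur / labels[i + 1][1]
        tx, ty = thresholds[text]
        if pre >= tx and post >= ty:         # S107: compare to thresholds
            flagged.append(i)                # S108: mark for correction
    return flagged
```

In a real implementation the flagged labels would then be replaced by their correction candidate labels rather than merely indexed.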
  • FIG. 13 is a diagram illustrating the recognition performance of Chinese continuous numerals of the speech recognition apparatus according to the present embodiment.
  • “Baseline” is the recognition rate evaluation result when this method is not used (speech recognition from the speech waveform only).
  • “2-Dim_Time Length Ratio” is the recognition rate evaluation result of the correction method using the ratio of the utterance durations of the selected label and the labels before and after it.
  • “2-Dim_PowerRatio” is the recognition rate evaluation result of the correction method using the ratio of the average voice power of the selected label and the labels before and after it.
  • “Long” is the evaluation result for continuous numeric strings of 11 to 15 digits.
  • “Short” is the evaluation result for continuous numeric strings of 1 to 8 digits.
  • The recognition rate was higher than Baseline both when using the ratio of utterance durations and when using the ratio of average voice power. The advantages of performing label correction using the ratio of the selected label's utterance duration or voice power to those of the preceding and following labels are described below.
  • FIG. 14 is a diagram showing the change in the average utterance duration of one digit (in syllable units) by speaker.
  • In the speech recognition method according to the present embodiment, since the ratio of the selected label's utterance duration to those of the labels before and after it is adopted, misrecognition can be determined while reducing the influence of variation in speaking speed from speaker to speaker.
  • FIG. 15 is a diagram showing the change in the average utterance duration of one digit (in syllable units) depending on the length of the digit string. Comparing the shortest one-digit number sequence at the far left with the longest 15-digit number sequence at the far right, the average utterance durations differ by a factor of about 1.6. Therefore, if, for example, the ratio of the selected label's utterance duration to the overall average utterance duration were used as the reference, variation in utterance duration due to the length of the numeric string could not be handled.
  • In contrast, in the speech recognition method according to the present embodiment, since the ratio of the selected label's utterance duration to those of the labels before and after it is adopted, misrecognition can be determined while reducing the influence of variation in utterance duration due to the length of the numeric string.
  • FIG. 16 shows the average syllable time length calculated from data recorded with the typical reading habit for Chinese mobile phone numbers (11 digits), namely “(front) 3 digits + (middle) 3 digits + (rear) 5 digits”. Even for the same three-digit group, the utterance duration differs depending on whether it is the first three digits or the next three digits of the utterance. Thus, even within a single utterance, the utterance duration varies with position, so using a local ratio with the preceding and following labels, as in the speech recognition method according to the present embodiment, is effective.
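The robustness argument can be checked numerically: uniformly rescaling all durations by a speaking-rate factor (whether between speakers or between string lengths) leaves the neighbour ratios unchanged, while the durations themselves shift by the full factor. A tiny illustration with invented values:

```python
def neighbour_ratios(durs):
    """(Pre_ratio, Post_ratio) for every interior label in a duration list."""
    return [(durs[i] / durs[i - 1], durs[i] / durs[i + 1])
            for i in range(1, len(durs) - 1)]

slow = [0.30, 0.60, 0.30]        # seconds per syllable, slow reading
fast = [d / 1.6 for d in slow]   # same digits read about 1.6x faster
# The absolute durations differ by the full factor of 1.6, but the
# neighbour ratios coincide, so a ratio-based check is unaffected.
```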
  • In a method based on an overall average, speech recognition can be performed only after the utterance has finished to the end. In contrast, by using the utterance durations of each label and the labels before and after it, as in the speech recognition method of the present embodiment, processing can proceed sequentially from the speech already uttered, which also has the effect of shortening the processing time. In addition, the processing load is lower, and the method can be implemented with a simpler configuration.
  • FIG. 17 is a diagram showing the recognition performance of Chinese continuous numbers of the speech recognition apparatus according to the present embodiment.
  • Baseline is the recognition rate evaluation result when this method is not used (speech recognition from the speech waveform only).
  • 1-Dim_Time Length Ratio is the recognition rate evaluation result of the correction method using the ratio of the selected label's utterance duration to the overall average time length.
  • 1-Dim_PowerRatio is the recognition rate evaluation result of the correction method using the ratio of the selected label's voice power to the overall average power.
  • 2-Dim_Time Length Ratio is the recognition rate evaluation result of the correction method using the ratio of the utterance durations of the selected label and the labels before and after it.
  • 2-Dim_PowerRatio is the recognition rate evaluation result of the correction method using the ratio of the average voice power of the selected label and the labels before and after it.
  • It can be seen that the speech recognition methods according to the present embodiment (2-Dim_Time Length Ratio, 2-Dim_PowerRatio) outperform the methods based on the overall average time length and overall average power (1-Dim_Time Length Ratio, 1-Dim_PowerRatio).
  • In the above embodiment, the description has centered on the example in which the label correction unit 150 corrects the selected label to the correction candidate label based on the ratio between the utterance duration of the selected label and the utterance durations of the labels before and after it, together with the first predetermined condition regarding that ratio; however, the present invention is not limited to this.
  • For example, the label correction unit 150 may correct labels that are highly likely to fall within the incorrect-sample range based on the first predetermined condition regarding the ratio of the utterance durations of adjacent labels. That is, the label correction unit 150 may correct at least one of two adjacent labels based on the ratio of their utterance durations and the first predetermined condition based on statistical information regarding that ratio. With this configuration, although the correction accuracy is somewhat inferior to correcting the selected label based on the ratio of its utterance duration to those of the labels before and after it, the processing speed can be improved.
  • The processing in the speech recognition apparatus described above may also be realized by a speech recognition program that causes a computer to execute a first step of recognizing input speech and outputting a label string indicating the speech recognition result, a second step of outputting the utterance durations of the labels included in the label string, and a third step of correcting at least one of two adjacent labels based on the ratio of the utterance durations of the adjacent labels and the first predetermined condition based on statistical information regarding that ratio.
  • This speech recognition program can be stored in a computer-readable storage medium, such as a semiconductor storage medium (RAM, ROM, etc.), a magnetic storage medium (FD, HD, etc.), an optically read storage medium (CD, CDV, LD, DVD, etc.), or a magneto-optical storage medium (MO, etc.), regardless of whether the reading method is electronic, magnetic, or optical, and distributed and sold, or it can be downloaded through a network.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The purpose of the present invention is to provide a voice recognition device capable of voice recognition with higher accuracy while reducing the effects caused by changes in speaking speed, in how each speaker speaks, and so forth. The voice recognition device is provided with a label sequence output unit (110) which recognizes an input voice and outputs a label sequence indicating the voice recognition result thereof, a speech duration length output unit (130) which outputs the speech duration length of at least one label among the labels included in said label sequence and of the labels before and after the at least one label, and a label modification unit (150) which, on the basis of said speech duration lengths of said label and the labels before and after it, and a first given condition based on statistical information associated with the speech duration length of said label, modifies said label to a modification candidate label.

Description

Speech recognition apparatus and speech recognition method
 The present invention relates to a speech recognition apparatus and a speech recognition method.
 Conventionally, various methods for performing speech recognition with higher accuracy have been proposed. For example, in the speech recognition method described in Patent Document 1, word string candidates are selected by pruning using information from a correlation probability model generated in a learning phase before speech recognition is executed. By weighting each selected word string candidate for each word, speech recognition with higher accuracy is attempted. This correlation probability model includes statistical information on the time length and speech power of recognized word units, the speech speed based on the time-length ratio of vowels and consonants, the number of words per sentence, and the like.
 The speech recognition apparatus described in Patent Document 2 includes speech recognition means that recognizes input speech and outputs a plurality of recognition result candidates with information on the duration of each syllable attached; syllable boundary candidate detection means that obtains syllable boundary candidates from the input speech; average syllable length estimation means that obtains an average syllable length from the syllable boundary candidates; and candidate selection means that selects a recognition result from the plurality of recognition result candidates based on the recognition result candidates and the average syllable length.
Patent Document 1: JP 2008-176202 A
Patent Document 2: JP-A-9-292899
 Certainly, even the speech recognition method described in Patent Document 1 can perform speech recognition with a certain degree of accuracy.
 However, the method described in Patent Document 1 is premised mainly on recognizing spoken sentences, and performs speech recognition using a correlation probability model that includes statistical information on time length and speech power in units of words.
 When speaking, the speaking speed of the entire utterance may change from occasion to occasion, and the speaking speed, utterance timing, and so on may also differ from speaker to speaker. In such cases, with a method that uses time length or speech power in units of words, as in Patent Document 1, the accuracy of speech recognition is greatly affected by how the speaker happens to speak each time.
 In addition, the speech recognition apparatus described in Patent Document 2 compares the duration information of each syllable of the recognition result candidates with the average syllable length, and therefore has the problem of being vulnerable to instantaneous changes in speech speed.
 The present invention therefore aims to provide a high-accuracy speech recognition apparatus and speech recognition method that are fast in processing, low in cost, and not easily affected by changes in speech speed or differences in how each speaker speaks.
 To solve the above problem, according to one aspect of the present invention, there is provided a speech recognition apparatus including: a label string output unit that recognizes input speech and outputs a label string indicating the speech recognition result; an utterance duration output unit that outputs the utterance durations of at least one label among the labels included in the label string and of the labels before and after it; and a label correction unit that corrects the label to a correction candidate label based on the utterance durations of the label and of the labels before and after it, and on a first predetermined condition based on statistical information regarding the utterance duration of the label.
 According to this configuration, label correction (speech recognition) can be performed sequentially, based on utterance durations, using only each label included in the label string that is the recognition result of the input speech and the labels before and after it. Therefore, highly accurate speech recognition that is not easily affected by changes in speech speed or differences in how each speaker speaks can be performed.
 According to another aspect of the present invention, there is provided a speech recognition apparatus in which the label correction unit corrects the label to the correction candidate label based on the ratio between the utterance duration of the label and the utterance durations of the labels before and after it, and on the first predetermined condition regarding that ratio.
 According to this configuration, label correction (speech recognition) can be performed sequentially, based on utterance durations, using only each label included in the label string that is the recognition result of the input speech and the labels before and after it. Therefore, highly accurate speech recognition that is not easily affected by changes in speech speed or differences in how each speaker speaks can be performed.
 According to another aspect of the present invention, there is provided a speech recognition apparatus further including a voice power information output unit that outputs voice power information, which is information indicating the voice power in the utterance sections of at least one label among the labels included in the label string and of the labels before and after it, wherein the label correction unit corrects the label to the correction candidate label based not only on the utterance durations of the label and of the labels before and after it and the first predetermined condition, but also on the voice power information of the label and of the labels before and after it and a second predetermined condition based on statistical information regarding the voice power of the label.
 According to this configuration, since label correction is performed based not only on utterance durations but also on voice power, using only each label included in the label string that is the recognition result of the input speech and the labels before and after it, more accurate speech recognition can be performed.
 According to another aspect of the present invention, there is provided a speech recognition apparatus in which the label correction unit corrects the label to the correction candidate label based not only on the utterance durations of the label and of the labels before and after it and the first predetermined condition, but also on the ratio between the voice power of the label and the voice power of the labels before and after it and the second predetermined condition regarding that ratio.
 According to this configuration, since label correction is performed based not only on utterance durations but also on voice power, using only each label included in the label string that is the recognition result of the input speech and the labels before and after it, more accurate speech recognition can be performed.
 According to another aspect of the present invention, there is provided a speech recognition apparatus including: a label string output unit that recognizes input speech and outputs a label string indicating the speech recognition result; a voice power information output unit that outputs voice power information, which is information indicating the voice power in the utterance sections of at least one label among the labels included in the label string and of the labels before and after it; and a label correction unit that corrects the label to a correction candidate label based on the voice power information of the label and of the labels before and after it, and on a second predetermined condition based on statistical information regarding the voice power of the label.
 According to this configuration, label correction (speech recognition) can be performed sequentially, based on voice power, using only each label included in the label string that is the recognition result of the input speech and the labels before and after it. Therefore, highly accurate speech recognition that is not easily affected by changes in speech speed or differences in how each speaker speaks can be performed.
 According to another aspect of the present invention, there is provided a speech recognition apparatus including: a label string output unit that recognizes input speech and outputs a label string indicating the speech recognition result; an utterance duration output unit that outputs the utterance durations of the labels included in the label string; and a correction unit that corrects at least one of two adjacent labels based on the ratio of the utterance durations of the adjacent labels and a first predetermined condition based on statistical information regarding that ratio.
 According to this configuration, label correction (speech recognition) can be performed sequentially, using adjacent labels among the labels included in the label string that is the recognition result of the input speech, based on the ratio of their utterance durations. Therefore, highly accurate speech recognition that is not easily affected by changes in speech speed or differences in how each speaker speaks can be performed.
 According to another aspect of the present invention, there is provided a speech recognition method executed by a speech recognition apparatus having a label string output unit, an utterance duration output unit, and a label correction unit, the method including: a first step in which the label string output unit recognizes input speech and outputs a label string indicating the speech recognition result; a second step in which the utterance duration output unit outputs the utterance durations of at least one label among the labels included in the label string and of the labels before and after it; and a third step in which the label correction unit corrects the label to a correction candidate label based on the utterance durations of the label and of the labels before and after it, and on a first predetermined condition based on statistical information regarding the utterance duration of the label.
 According to this configuration, label correction (speech recognition) can be performed sequentially, based on utterance durations, using only each label included in the label string that is the recognition result of the input speech and the labels before and after it. Therefore, highly accurate speech recognition that is not easily affected by changes in speech speed or differences in how each speaker speaks can be performed.
 According to another aspect of the present invention, there is provided a speech recognition program for causing a computer to execute: a first step of recognizing input speech and outputting a label string indicating the speech recognition result; a second step of outputting the utterance durations of at least one label among the labels included in the label string and of the labels before and after it; and a third step of correcting the label to a correction candidate label based on the utterance durations of the label and of the labels before and after it, and on a first predetermined condition based on statistical information regarding the utterance duration of the label.
 According to this configuration, label correction (speech recognition) can be performed sequentially, based on utterance durations, using only each label included in the label string that is the recognition result of the input speech and the labels before and after it. Therefore, highly accurate speech recognition that is not easily affected by changes in speech speed or differences in how each speaker speaks can be performed.
 According to the present invention, highly accurate speech recognition that is not easily affected by changes in speech speed or differences in how each speaker speaks can be performed.
FIG. 1 is a diagram showing a configuration example of a speech recognition apparatus according to an embodiment of the present invention. FIG. 2 is a diagram showing a specific example of correct labels and recognition result labels. FIG. 3 is a diagram showing a specific example of the utterance durations, Pre_ratio, and Post_ratio of correct labels and recognition result labels. FIG. 4 is a diagram showing an example of a scatter diagram of correct and incorrect samples for the number “5”. FIG. 5 is a diagram showing another specific example of correct labels and recognition result labels. FIG. 6 is a diagram showing a specific example of the average voice power, Pre_ratio, and Post_ratio of correct labels and recognition result labels. FIG. 7 is a diagram showing a time-series image of the voice power of recognition result labels. FIG. 8 is a diagram showing an example of a scatter diagram of correct and incorrect samples for the number “2”. FIG. 9 is a diagram showing another specific example of correct labels and recognition result labels. FIG. 10 is a diagram showing a specific example of the utterance durations, Pre_ratio, and Post_ratio of correct labels and recognition result labels. FIG. 11 is a diagram showing an example of a scatter diagram of correct and incorrect samples for the number “1”.
FIG. 12 is a flowchart showing the flow of processing in a speech recognition apparatus according to an embodiment of the present invention. FIG. 13 is a diagram showing the recognition performance for Chinese continuous numerals of a speech recognition apparatus according to an embodiment of the present invention. FIG. 14 is a diagram showing the change in the average utterance duration of one digit (in syllable units) by speaker. FIG. 15 is a diagram showing the change in the average utterance duration of one digit (in syllable units) depending on the length of the digit string. FIG. 16 is a diagram showing the average syllable time length calculated from recorded data of Chinese mobile phone numbers (11 digits) uttered with typical reading habits. FIG. 17 is a diagram showing the recognition performance for Chinese continuous numerals of a speech recognition apparatus according to an embodiment of the present invention.
 Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the drawings referred to in the following description, parts equivalent to those in other drawings are denoted by the same reference numerals.
 In the embodiments described below, speech recognition of the pronunciation of Chinese numerals is described as an example, but speech recognition according to the present embodiment is not limited to this; it is applicable to recognition targets in various languages.
(Configuration of the speech recognition apparatus)
FIG. 1 is a diagram showing a configuration example of the speech recognition apparatus according to this embodiment. As shown in FIG. 1, the speech recognition apparatus 1 includes a label string output unit 110, a label selection unit 120, an utterance duration output unit 130, a speech power information output unit 140, a label correction unit 150, and a result output unit 160.
(Label string output unit 110)
The label string output unit 110 performs speech recognition on the input speech and outputs a label string representing the recognition result. Here, a "label" is one unit obtained by dividing the input speech into syllables; in this embodiment, one digit corresponds to one label. A "label string" contains at least one label. The "speech recognition result (of the input speech)" is the result of acoustically analyzing the speech waveform of the input speech and performing recognition using feature quantities such as MFCC (Mel-Frequency Cepstrum Coefficients).
That is, the label string output unit 110 recognizes digits from the speech waveform of the input speech using, for example, MFCC features, and outputs a digit string (that is, a label string) containing at least one digit (that is, a label). At the same time, the utterance start time and utterance end time indicating the utterance timing of each digit in the digit string (hereinafter referred to as "time boundary information" where appropriate) are recognized together with the label string.
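As a minimal illustration (a Python sketch; the field names and time values are hypothetical and not part of the apparatus itself), the output of the label string output unit 110 can be thought of as a list of labels, each carrying its time boundary information:

```python
from dataclasses import dataclass

@dataclass
class Label:
    digit: str    # recognized digit (one syllable), e.g. "5"
    start: float  # utterance start time in seconds
    end: float    # utterance end time in seconds

# A label string: at least one label, each with time boundary information.
label_string = [
    Label("0", 0.00, 0.41),
    Label("3", 0.41, 0.79),
    Label("6", 0.79, 1.18),
    Label("4", 1.18, 1.60),
]

# Utterance durations follow directly from the time boundaries.
durations = [round(lb.end - lb.start, 2) for lb in label_string]
print(durations)  # [0.41, 0.38, 0.39, 0.42]
```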
(Label selection unit 120)
The label selection unit 120 selects at least one of the labels included in the label string output by the label string output unit 110. The label selected by the label selection unit 120 is the label to be corrected by the label correction unit 150 described later. That is, some of the labels in the label string are selected in turn by the label selection unit 120, and each selected label is corrected as appropriate by the label correction unit 150. By repeating this selection and correction, speech recognition of the entire label string is performed.
The labels corrected by the label correction unit 150 (that is, the labels selected by the label selection unit 120) may be all of the labels in the label string, or only some of them. For example, when recognizing spoken Chinese digits, digits consisting of a single vowel-only syllable, such as "1", "2", and "5", tend to be misrecognized. Therefore, label correction by the label correction unit 150 may be performed not on all labels (digits) in the label string output by the label string output unit 110, but only on such vowel-only single-syllable labels. This reduces the processing load of speech recognition and increases the processing speed.
Furthermore, the labels in the label string may, for example, be selected and corrected in ascending order of utterance duration or of speech power. In this way, labels with a higher likelihood of misrecognition are corrected first, improving the efficiency of label correction.
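A rough sketch of such a selection policy (Python; the tuple layout, the time values, and the vowel-only digit set are illustrative assumptions):

```python
# Each label is (digit, start_time_s, end_time_s); the values are illustrative.
labels = [("0", 0.00, 0.41), ("5", 0.41, 0.53),
          ("3", 0.53, 0.91), ("2", 0.91, 1.05)]

def select_candidates(labels, vowel_only=("1", "2", "5")):
    # Restrict correction to the easily misrecognized vowel-only digits and
    # process the shortest utterances (most likely insertion errors) first.
    picked = [lb for lb in labels if lb[0] in vowel_only]
    return sorted(picked, key=lambda lb: lb[2] - lb[1])

order = [lb[0] for lb in select_candidates(labels)]
print(order)  # ['5', '2']
```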
(Utterance duration output unit 130)
The utterance duration output unit 130 outputs the utterance duration of at least one of the labels included in the label string output by the label string output unit 110, together with the utterance durations of the labels before and after it. As described above, the label string output unit 110 outputs a label string by recognizing labels from the speech waveform of the input speech using, for example, MFCC features. At this time, along with the label string, the utterance start time and utterance end time indicating the utterance timing of each label in the string are also recognized. The utterance duration output unit 130 calculates each utterance duration, for example, by subtracting the utterance start time from the utterance end time of each label.
(Speech power information output unit 140)
The speech power information output unit 140 outputs speech power information, which is information indicating the speech power in the utterance section of at least one of the labels included in the label string output by the label string output unit 110 and of the labels before and after it. The "information indicating the speech power in the utterance section of a label" may be anything that directly or indirectly indicates the speech power of the label; one example is the average speech power between the utterance start time and the utterance end time of the label.
(Label correction unit 150)
The label correction unit 150 corrects the label selected by the label selection unit 120 (hereinafter, the "selected label") to a correction candidate label, based on the utterance durations of the selected label and of the labels before and after it, and on a first predetermined condition based on statistical information about the utterance duration of the selected label. At this time, the label correction unit 150 may correct the selected label to the correction candidate label based on the ratios between the utterance duration of the selected label and the utterance durations of the labels before and after it, and on a first predetermined condition concerning those ratios. The first predetermined condition may be a condition based on statistical information about the ratios between the utterance duration of the selected label and the utterance durations of the labels before and after it.
In this embodiment, the first predetermined condition is stored in a storage device 170 such as a hard disk provided in the speech recognition apparatus 1, but this is not restrictive. For example, it may be received from an external device or the like each time the label correction unit 150 executes its processing. The statistical information about the utterance duration of each label may likewise be stored in an external device or the like, or in the storage device 170 of the speech recognition apparatus 1.
Furthermore, the label correction unit 150 may correct the selected label to the correction candidate label based on the speech power information of the selected label and of the labels before and after it, and on a second predetermined condition based on statistical information about the speech power of the selected label. At this time, the label correction unit 150 may correct the selected label to the correction candidate label based on the ratios between the speech power of the selected label and the speech power of the labels before and after it, and on a second predetermined condition concerning those ratios. The second predetermined condition may be a condition based on statistical information about the ratios between the speech power of the selected label and the speech power of the labels before and after it.
Furthermore, the label correction unit 150 may correct the selected label to the correction candidate label based not only on the utterance durations of the selected label and of the labels before and after it and on the first predetermined condition, but additionally on the speech power information of the selected label and of the labels before and after it and on the second predetermined condition based on statistical information about the speech power of the selected label. This enables speech recognition with higher accuracy.
In this embodiment, the second predetermined condition is stored in the storage device 170 such as a hard disk provided in the speech recognition apparatus 1, but this is not restrictive. For example, it may be received from an external device or the like each time the label correction unit 150 executes its processing. The statistical information about the speech power of each label may likewise be stored in an external device or the like, or in the storage device 170 of the speech recognition apparatus 1.
Although FIG. 1 shows the case where the speech recognition apparatus 1 has both the utterance duration output unit 130 and the speech power information output unit 140, it may have only one of them.
(Result output unit 160)
The result output unit 160 outputs the label string corrected by the label correction unit 150 to an external device or the like as the final speech recognition result.
The functions of the components described above are realized by a CPU (Central Processing Unit, not shown) provided in the speech recognition apparatus 1 reading programs stored in a storage device such as a hard disk or ROM (Read Only Memory) into a memory such as RAM (Random Access Memory) and executing them. The label string, labels, utterance start times, utterance end times, utterance durations, speech power information, first predetermined condition 10, second predetermined condition 20, and so on are data stored in the storage device, the memory, or the like.
(Specific example 1)
Hereinafter, the processing in the speech recognition apparatus according to this embodiment will be described using specific examples. This example deals with the case where an extraneous label is erroneously recognized when recognizing the spoken input speech, so that the recognition result obtained from the speech waveform contains more labels than the input speech (insertion error).
In this example, as shown in FIG. 2, the actual input speech (correct answer) is "0, 3, 6, 4", but the label string recognized by the label string output unit 110 of the speech recognition apparatus 1 (recognition result) is "0, 3, 6, 5, 4". The utterance start time and utterance end time, which are the time boundary information of each label (digit) of the recognition result, are as shown in FIG. 3.
In this example, the label correction unit 150 calculates the ratios between the utterance duration of the label selected by the label selection unit 120 and the utterance durations of the labels before and after it. Thresholds (the first predetermined condition) for the ratios of each label's utterance duration to those of its neighboring labels are determined in advance based on statistical information (the threshold determination method is described later). The label correction unit 150 then compares the calculated ratios with the thresholds and, according to the comparison result, decides whether to correct the selected label to the correction candidate label.
Specifically, first, from the utterance start time and utterance end time of each label "0, 3, 6, 5, 4" of the recognition result, the utterance duration output unit 130 calculates the utterance duration of each label by the following formula:

  Utterance duration = utterance end time − utterance start time

Then, based on the calculated utterance duration of each label, Pre_ratio and Post_ratio are computed for each label (FIG. 3). Pre_ratio is the ratio of the utterance duration of the correction target label to the utterance duration of the immediately preceding label, and Post_ratio is the ratio of the utterance duration of the correction target label to the utterance duration of the immediately following label:

  Pre_ratio = utterance duration of the target digit label / utterance duration of the immediately preceding digit label
  Post_ratio = utterance duration of the target digit label / utterance duration of the immediately following digit label

The thresholds for the Pre_ratio and Post_ratio of each label, determined in advance based on statistical information, generally differ from label to label, but in this example they are all assumed to be "0.5" for simplicity.
Comparing the Pre_ratio and Post_ratio of each label shown in FIG. 3 with the threshold "0.5", the digit label "5 (wu)" satisfies Pre_ratio < 0.5 and Post_ratio < 0.5, so "5 (wu)" is judged to be a digit label that was erroneously recognized and inserted (hereinafter, an insertion error). The label correction unit 150 therefore deletes "5 (wu)", and the final recognition result of the speech recognition apparatus 1 becomes "0, 3, 6, 4".
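The insertion-error check of this example can be sketched as follows (Python; the time values are illustrative stand-ins for FIG. 3, and the single shared threshold of 0.5 reflects the simplification above, whereas in practice each label would have its own statistically determined thresholds):

```python
# Each entry: (digit, start_time_s, end_time_s). Times are illustrative.
recognized = [("0", 0.00, 0.40), ("3", 0.40, 0.78),
              ("6", 0.78, 1.16), ("5", 1.16, 1.28), ("4", 1.28, 1.70)]

def remove_insertion_errors(labels, threshold=0.5):
    """Delete inner labels whose duration ratios to BOTH neighbours in the
    original recognition result fall below the threshold."""
    d = [end - start for _, start, end in labels]
    kept = []
    for i, lb in enumerate(labels):
        if 0 < i < len(labels) - 1:
            pre_ratio = d[i] / d[i - 1]
            post_ratio = d[i] / d[i + 1]
            if pre_ratio < threshold and post_ratio < threshold:
                continue  # judged an insertion error -> drop this label
        kept.append(lb)
    return kept

result = [lb[0] for lb in remove_insertion_errors(recognized)]
print(result)  # ['0', '3', '6', '4']
```

Here "5" has a duration of 0.12 s against neighbours of 0.38 s and 0.42 s, so both ratios fall below 0.5 and the label is dropped.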
(Threshold determination method)
The thresholds for Pre_ratio and Post_ratio are determined in a prior learning phase. For example, to determine the Pre_ratio and Post_ratio thresholds for the digit "5", speech recognition is performed on learning speech signals, and the samples in which "5" was recognized correctly are separated from those in which it was recognized incorrectly. Pre_ratio and Post_ratio are then calculated for each sample.
When the samples are plotted in a two-dimensional space with Pre_ratio on the horizontal axis and Post_ratio on the vertical axis, it suffices to find, for example, the coordinates (Tx, Ty) of the point R (Pre_ratio = Tx, Post_ratio = Ty) for which the rectangular region whose diagonal connects the origin and R maximizes the difference between the number of incorrect samples and the number of correct samples it contains.
FIG. 4 is a scatter diagram of the correct samples (●) and the insertion-error (INS) incorrect samples (△) of the digit "5", plotted in a two-dimensional space with Pre_ratio on the horizontal axis and Post_ratio on the vertical axis. In this case, for example, the coordinates of the point R are varied in fixed steps (e.g., 0.05) within a certain range (e.g., from 0 to 1), and the coordinates Tx and Ty of the point R that maximizes the difference between the number of incorrect samples and the number of correct samples are taken as the thresholds for Pre_ratio and Post_ratio, respectively. FIG. 4 shows an example in which the coordinates (Tx, Ty) of this point R are (0.5, 0.5).
The above processing is performed for all labels that the label string output unit 110 can output, and the Pre_ratio and Post_ratio thresholds for each label are determined.
(Specific example 2)
Another specific example will now be described, again for the case of an insertion error.
In this example, as shown in FIG. 5, the actual input speech (correct answer) is "0, 3, 4", but the label string recognized by the label string output unit 110 of the speech recognition apparatus 1 (recognition result) is "0, 3, 2, 4". The utterance start time and utterance end time, which are the time boundary information of each label (digit) of the recognition result, are as shown in FIG. 6.
In this example, the label correction unit 150 calculates the ratios between the average speech power of the label selected by the label selection unit 120 and the average speech power of the labels before and after it. Thresholds (the second predetermined condition) for the ratios of each label's average speech power to those of its neighboring labels are determined in advance based on statistical information (the threshold determination method is described later). The label correction unit 150 then compares the calculated ratios with the thresholds and, according to the comparison result, decides whether to correct the selected label to the correction candidate label.
Specifically, first, the average speech power between the utterance start time and the utterance end time of each label "0, 3, 2, 4" of the recognition result is calculated. Then, based on the calculated average speech power of each label, Pre_ratio and Post_ratio are computed for each label (FIG. 6). Pre_ratio is the ratio of the average speech power of the correction target label to that of the immediately preceding label, and Post_ratio is the ratio of the average speech power of the correction target label to that of the immediately following label:

  Pre_ratio = average speech power of the target digit label / average speech power of the immediately preceding digit label
  Post_ratio = average speech power of the target digit label / average speech power of the immediately following digit label

The thresholds for the Pre_ratio and Post_ratio of each label, determined in advance based on statistical information, generally differ from label to label, but in this example they are all assumed to be "0.5" for simplicity.
Comparing the Pre_ratio and Post_ratio of each label shown in FIG. 6 with the threshold "0.5", the digit label "2 (er)" satisfies Pre_ratio < 0.5 and Post_ratio < 0.5, so "2 (er)" is judged to be a digit label that was erroneously recognized and inserted (insertion error). The label correction unit 150 therefore deletes "2 (er)", and the final recognition result of the speech recognition apparatus 1 becomes "0, 3, 4".
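The power-based variant of the check can be sketched as follows (Python; the per-label average power values are illustrative, and the single shared threshold again stands in for per-label statistical thresholds):

```python
# Hypothetical per-label average speech power for the recognition result
# "0, 3, 2, 4" (arbitrary linear-power units).
digits = ["0", "3", "2", "4"]
avg_power = [0.80, 0.95, 0.30, 0.88]

def remove_power_insertion_errors(labels, power, threshold=0.5):
    # Drop an inner label when its average power is below `threshold` times
    # the average power of BOTH neighbouring labels.
    kept = []
    for i, lb in enumerate(labels):
        if 0 < i < len(labels) - 1:
            if (power[i] / power[i - 1] < threshold
                    and power[i] / power[i + 1] < threshold):
                continue  # judged an insertion error -> drop this label
        kept.append(lb)
    return kept

print(remove_power_insertion_errors(digits, avg_power))  # ['0', '3', '4']
```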
(Speech power and threshold determination method)
FIG. 7 shows an example of the relationship between the time series of the speech power and the time boundaries of each label of the recognition result. The average speech power of each digit label is the average of the time series of the speech power over the segment within the time boundaries from the utterance start time to the utterance end time of that digit.
For example, to determine the Pre_ratio and Post_ratio thresholds for the digit "2", speech recognition is performed on learning speech signals in the learning phase, and the samples in which "2" was recognized correctly are separated from those in which it was recognized incorrectly. Pre_ratio and Post_ratio are then calculated for each sample. When the samples are plotted in a two-dimensional space with Pre_ratio on the horizontal axis and Post_ratio on the vertical axis, it suffices to find, for example, the coordinates (Tx, Ty) of the point R (Pre_ratio = Tx, Post_ratio = Ty) for which the rectangular region whose diagonal connects the origin and R maximizes the difference between the number of incorrect samples and the number of correct samples it contains.
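A small sketch of the segment-average computation (Python; the frame period, frame-level power values, and time boundaries are all illustrative assumptions):

```python
# Frame-level speech power (one value per 10 ms frame) and the time
# boundaries of a label; both are illustrative.
FRAME_S = 0.01
frame_power = [0.1] * 40 + [0.9] * 38 + [0.3] * 12 + [0.8] * 40  # 1.30 s total

def average_power(start_s, end_s, frames, frame_s=FRAME_S):
    """Time-series mean of the power over the segment [start, end)."""
    i, j = round(start_s / frame_s), round(end_s / frame_s)
    seg = frames[i:j]
    return sum(seg) / len(seg)

# Average power of a label whose time boundaries are 0.40 s - 0.78 s:
p = average_power(0.40, 0.78, frame_power)
```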
FIG. 8 is a scatter diagram of the correct samples (●) and the insertion-error (INS) incorrect samples (△) of the digit "2", plotted in a two-dimensional space with Pre_ratio on the horizontal axis and Post_ratio on the vertical axis. In this case, for example, the coordinates of the point R are varied in fixed steps (e.g., 0.05) within a certain range (e.g., from 0 to 1), and the coordinates Tx and Ty of the point R that maximizes the difference between the number of incorrect samples and the number of correct samples are taken as the thresholds for Pre_ratio and Post_ratio, respectively. FIG. 8 shows an example in which the coordinates (Tx, Ty) of this point R are (0.5, 0.5).
The above processing is performed for all labels that the label string output unit 110 can output, and the Pre_ratio and Post_ratio thresholds for each label are determined.
(Specific example 3)
Another specific example will now be described. This example deals with the case where a label that should have been recognized is erroneously dropped when recognizing the spoken input speech, so that the recognition result obtained from the speech waveform contains fewer labels than the input speech (deletion error).
In this example, as shown in FIG. 9, the actual input speech (correct answer) is "0, 1, 1, 4", but the label string recognized by the label string output unit 110 of the speech recognition apparatus 1 (recognition result) is "0, 1, 4". The utterance start time and utterance end time, which are the time boundary information of each label (digit) of the recognition result, are as shown in FIG. 10.
In the following description, as in specific example 1, the label correction unit 150 performs label correction using the utterance duration of each label; the same applies when label correction is performed using the power information of each label as in specific example 2.
Specifically, first, the utterance duration output unit 130 calculates the utterance duration of each label from the utterance start time and utterance end time of each label "0, 1, 4" of the recognition result. Then, based on the calculated utterance duration of each label, Pre_ratio and Post_ratio are computed for each label (FIG. 10). The utterance duration, Pre_ratio, and Post_ratio of each label are calculated in the same way as in specific example 1.
The thresholds (the first predetermined condition) for the Pre_ratio and Post_ratio of each label, determined in advance based on statistical information, are generally different for each label, but may also be the same. In this example they are all assumed to be "1.8" for simplicity.
Comparing the Pre_ratio and Post_ratio of each label shown in FIG. 10 with the threshold "1.8", the digit label "1 (yi)" satisfies Pre_ratio > 1.8 and Post_ratio > 1.8. The digit label "1 (yi)" is therefore judged to have been misrecognized.
In addition, frequently occurring misrecognition patterns are determined in advance from the results of matching the recognition results against the correct answers in the prior learning phase. Here, for example, if the misrecognition pattern "correct = 1 (yi) 1 (yi), misrecognized = 1 (yi)" exists, then in the recognition result of this example the label correction unit 150 replaces the digit "1 (yi)" with "1 (yi) 1 (yi)", and the final recognition result of the speech recognition apparatus 1 becomes "0, 1, 1, 4".
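The deletion-error correction of this example can be sketched as follows (Python; the time values and the pattern table are illustrative, and the shared threshold of 1.8 again stands in for per-label statistical thresholds):

```python
# Illustrative data: (digit, start_s, end_s). The middle "1" spans what was
# actually two spoken "1"s, so its duration dwarfs its neighbours'.
recognized = [("0", 0.00, 0.38), ("1", 0.38, 1.20), ("4", 1.20, 1.58)]

# Frequent misrecognition patterns learned beforehand: label -> replacement.
patterns = {"1": ["1", "1"]}

def fix_deletion_errors(labels, patterns, threshold=1.8):
    d = [end - start for _, start, end in labels]
    out = []
    for i, (digit, start, end) in enumerate(labels):
        if (0 < i < len(labels) - 1
                and d[i] / d[i - 1] > threshold
                and d[i] / d[i + 1] > threshold
                and digit in patterns):
            # Suspiciously long label: apply the learned replacement pattern.
            out.extend(patterns[digit])
        else:
            out.append(digit)
    return out

print(fix_deletion_errors(recognized, patterns))  # ['0', '1', '1', '4']
```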
(Threshold determination method)
The thresholds for Pre_ratio and Post_ratio are determined in the prior learning phase. For example, to determine the Pre_ratio and Post_ratio thresholds for the number "1", the results of speech recognition on the training speech signals are analyzed as follows.
First, the recognition results are sorted into three groups: samples in which the number "1" was recognized correctly; incorrect samples in which a spurious number "1" was inserted (insertion errors); and incorrect samples in which the utterance segment of a number "1" was recognized as also covering the segments of the adjacent numbers "1", so that the preceding or following "1" was deleted (deletion errors). Pre_ratio and Post_ratio are then calculated for each sample.
Then, when the samples are plotted in a two-dimensional space with Pre_ratio on the horizontal axis and Post_ratio on the vertical axis, one may, for example, take the point P determined by a priori maximum values of Pre_ratio and Post_ratio (Pre_ratio = Mx, Post_ratio = My) and another point R (Pre_ratio = Tx, Post_ratio = Ty), and find the coordinates (Tx, Ty) of the point R that maximizes the difference between the number of incorrect samples and the number of correct samples contained in the rectangular region whose diagonal is the line connecting P and R.
FIG. 11 is a scatter plot, in a two-dimensional space with Pre_ratio on the horizontal axis and Post_ratio on the vertical axis, of the correct samples (●), the incorrect samples of insertion errors (INS) (△), and the incorrect samples of deletion errors (DEL) (×). In this case, for example, the coordinates of the point R are varied in fixed increments (e.g., 0.05) over a certain range (e.g., from 1 to 10), and the coordinates Tx and Ty of the point R that maximizes the difference between the number of deletion-error incorrect samples and the number of correct samples are taken as the Pre_ratio and Post_ratio thresholds, respectively. FIG. 11 shows an example in which the coordinates (Tx, Ty) of this point R are (1.8, 1.8).
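The grid search described here might look like the following. This is a sketch under the stated assumptions (step 0.05, search range 1 to 10, rectangle spanning from R to P = (Mx, My)); the sample coordinates are fabricated and chosen only so that the result reproduces the (1.8, 1.8) of FIG. 11.

```python
# Learning-phase threshold search: sweep the corner point R = (Tx, Ty) and
# keep the rectangle with corners R and P = (Mx, My) that maximises
# (deletion-error samples inside) - (correct samples inside).

def choose_thresholds(correct, deletion_errors, mx=10.0, my=10.0,
                      lo=1.0, hi=10.0, step=0.05):
    """Samples are (Pre_ratio, Post_ratio) pairs."""
    def count_inside(samples, tx, ty):
        return sum(1 for pre, post in samples
                   if tx <= pre <= mx and ty <= post <= my)

    grid, t = [], lo
    while t <= hi + 1e-9:
        grid.append(round(t, 2))               # fixed 0.05 increments
        t += step
    best, best_r = None, (lo, lo)
    for tx in grid:
        for ty in grid:
            score = (count_inside(deletion_errors, tx, ty)
                     - count_inside(correct, tx, ty))
            if best is None or score > best:
                best, best_r = score, (tx, ty)
    return best_r

correct_samples = [(1.75, 5.0), (5.0, 1.75)]   # fabricated coordinates
deletion_samples = [(1.8, 1.8), (2.0, 2.0)]
print(choose_thresholds(correct_samples, deletion_samples))  # → (1.8, 1.8)
```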
The above processing is performed for every label that the label string output unit 110 can output, and the Pre_ratio and Post_ratio thresholds for each label are thereby determined.
Although the specific examples above treated insertion errors and deletion errors separately, the two can of course be processed simultaneously. That is, both the Pre_ratio and Post_ratio thresholds for correcting insertion errors and those for correcting deletion errors are held in the storage device 170, and the Pre_ratio and Post_ratio of each label in the recognition result are compared against both sets of thresholds, so that insertion errors and deletion errors are corrected in a single pass.
(Processing flow)
FIG. 12 is a flowchart showing the flow of processing in the speech recognition device according to this embodiment.
The label string output unit 110 outputs the label string together with the time boundary information and voice power 30 of each label (step S101). One label is then selected from the labels included in the label string (step S102).
When only specific labels among those selected in step S102 are to be corrected, the labels subject to correction are determined in advance as the high-frequency error patterns 40 and stored in a storage device 170 such as a hard disk (for example, "1 (yi)", "2 (er)", "5 (wu)"). The high-frequency error patterns 40 are consulted, and if the label selected in step S102 is subject to label-correction verification (step S103), the labels immediately before and immediately after the selected label are acquired (step S104). If the selected label is not subject to label-correction verification (step S103), the process returns to step S101 and is repeated.
Next, the utterance durations or the voice power 50 of the selected label and of the labels before and after it are output by the utterance duration output unit 130 or the voice power information output unit 140, respectively (step S105).
Then, from the prosodic information (utterance duration or voice power) 50 output in step S105, the label correction unit 150 calculates the ratios of the utterance duration of the selected label to the utterance durations of the preceding and following labels, or the ratios of the voice power of the selected label to the voice powers of the preceding and following labels (step S106).
The ratios calculated in step S106 are compared with the thresholds determined in advance from statistical information (the first predetermined condition 10 and the second predetermined condition 20), and it is determined whether a recognition error (insertion error, deletion error, or the like) has occurred (step S107). If there is a recognition error, the label correction unit 150 corrects the selected label (step S108).
After the above processing has been repeated for all labels in the recognition result, speech recognition of the label string ends (step S109), and the final speech recognition result is output by the result output unit 160.
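Steps S101 to S109 can be condensed into a sketch like the following. The recognizer itself is assumed to have already produced time-aligned labels; the interface names and the threshold/pattern tables are illustrative, not the patent's.

```python
# One pass over a time-aligned recognition result, correcting deletion errors
# for the high-frequency error labels only (cf. steps S102-S108 in FIG. 12).

HIGH_FREQ_ERROR_LABELS = {"1", "2", "5"}             # e.g. yi, er, wu

def recognize_and_correct(aligned_labels, thresholds, patterns):
    """aligned_labels: [(label, start, end)]; thresholds: {label: (tx, ty)}."""
    out = []
    for i, (lab, start, end) in enumerate(aligned_labels):        # S102
        if lab not in HIGH_FREQ_ERROR_LABELS:                     # S103
            out.append(lab)
            continue
        if i == 0 or i == len(aligned_labels) - 1:                # needs both neighbours (S104)
            out.append(lab)
            continue
        dur = end - start                                         # S105
        pre = dur / (aligned_labels[i - 1][2] - aligned_labels[i - 1][1])
        post = dur / (aligned_labels[i + 1][2] - aligned_labels[i + 1][1])  # S106
        tx, ty = thresholds.get(lab, (float("inf"), float("inf")))
        if pre > tx and post > ty and lab in patterns:            # S107
            out.extend(patterns[lab])                             # S108
        else:
            out.append(lab)
    return out                                                    # S109

result = recognize_and_correct(
    [("0", 0.00, 0.25), ("1", 0.25, 0.80), ("4", 0.80, 1.05)],
    thresholds={"1": (1.8, 1.8)}, patterns={"1": ["1", "1"]})
print(result)  # → ['0', '1', '1', '4']
```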
In the flow of FIG. 12, the label string is output in the first step S101; however, instead of waiting for all labels in the label string to be output, the processing from step S102 onward may be executed on labels as they are sequentially recognized, while label recognition is still in progress.
(Advantages of the speech recognition method according to this embodiment)
FIG. 13 shows the recognition performance of the speech recognition device according to this embodiment on Chinese continuous digits. In FIG. 13, "Baseline" is the recognition rate obtained without the present method (speech recognition from the speech waveform only); "2-Dim_時間長Ratio" is the recognition rate of the correction method using the ratios of the utterance duration of the selected label to those of the preceding and following labels; and "2-Dim_PowerRatio" is the recognition rate of the correction method using the ratios of the average voice power of the selected label to those of the preceding and following labels. "Long" denotes the results for continuous digit strings of 11 to 15 digits, and "Short" those for continuous digit strings of 1 to 8 digits.
Both the method using the utterance-duration ratios and the method using the average voice-power ratios achieved higher recognition rates than the Baseline.
The advantages of correcting labels using the ratios of the selected label's utterance duration and voice power to those of the preceding and following labels are explained below.
FIG. 14 shows how the average utterance duration per digit (one syllable) varies across speakers. Comparing the leftmost speaker (speaker ID = CTM008_4B) with the rightmost speaker (speaker ID = CTF006_3A), the average utterance duration differs by a factor of about 1.7. Consequently, if, for example, the average utterance duration over all speakers were used as the reference and the ratio of the selected label's utterance duration to that average were adopted, the variation in speaking rate between speakers could not be accommodated.
In contrast, the speech recognition method according to this embodiment adopts the ratios of the selected label's utterance duration to those of the preceding and following labels, and can therefore judge misrecognition while reducing the influence of per-speaker variation in speaking rate.
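The point is easy to verify numerically. The durations below are invented for illustration; the 1.7 factor mirrors the spread reported for FIG. 14, and the "global average" is a hypothetical value.

```python
# Scaling every duration by a constant speaker-speed factor leaves the
# neighbour ratios unchanged, while a ratio against a fixed global average
# shifts by exactly that factor.

def neighbour_ratios(durs, i):
    return durs[i] / durs[i - 1], durs[i] / durs[i + 1]

durations_fast = [0.20, 0.44, 0.20]                  # fast speaker (seconds)
durations_slow = [d * 1.7 for d in durations_fast]   # ~1.7x slower speaker

print(neighbour_ratios(durations_fast, 1))   # same for both speakers
print(neighbour_ratios(durations_slow, 1))

global_avg = 0.30                                    # hypothetical all-speaker mean
print(durations_fast[1] / global_avg)        # these two differ by ~1.7x,
print(durations_slow[1] / global_avg)        # so a single threshold cannot fit both
```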
FIG. 15 shows how the average utterance duration per digit (one syllable) varies with the length of the digit string. Comparing the shortest string (1 digit, leftmost) with the longest (15 digits, rightmost), the average utterance duration differs by a factor of about 1.6. Thus, if, for example, the overall average utterance duration were used as the reference and the ratio of the selected label's utterance duration to it were adopted, the variation in utterance duration caused by the length of the digit string could not be accommodated.
In contrast, the speech recognition method according to this embodiment adopts the ratios of the selected label's utterance duration to those of the preceding and following labels, and can therefore judge misrecognition while reducing the influence of the variation in utterance duration caused by string length.
FIG. 16 shows the average syllable durations calculated from recordings of Chinese mobile phone numbers (11 digits) spoken with the common reading habit of "(first) 3 digits + (middle) 3 digits + (last) 5 digits". Even for groups of the same three digits, the utterance duration differs depending on the position within the utterance, that is, whether the group is the first three digits or the next three. Since utterance duration thus varies with position even within a single utterance, using local ratios between neighboring labels, as in the speech recognition method of this embodiment, is effective.
Furthermore, when the average duration or average voice power of the entire utterance is used, speech recognition cannot proceed until the utterance has finished. By contrast, when the utterance durations of each label and its neighboring labels are used, as in the speech recognition method of this embodiment, the speech can be processed sequentially as it is uttered, which also shortens the overall processing time. In addition, the processing load is lower, and the method can be implemented with a simpler configuration.
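Because each decision needs only a three-label window, the correction can run with one-label latency as labels arrive. The generator below is a hypothetical sketch of such streaming processing, assuming the same neighbour-ratio rule as above; the interface is not the patent's.

```python
def streaming_correct(label_stream, thresholds, patterns, watch=frozenset({"1"})):
    """label_stream yields (label, start, end); corrected labels are emitted
    as soon as the next neighbour is known, without waiting for utterance end."""
    win = []
    for item in label_stream:
        win.append(item)
        if len(win) == 1:
            continue
        if len(win) == 2:
            yield win[0][0]                  # first label has no left neighbour
            continue
        (_, ps, pe), (cl, cs, ce), (_, ns, ne) = win
        pre = (ce - cs) / (pe - ps)
        post = (ce - cs) / (ne - ns)
        tx, ty = thresholds.get(cl, (float("inf"), float("inf")))
        if cl in watch and pre > tx and post > ty and cl in patterns:
            yield from patterns[cl]          # e.g. deletion error: "1" -> "1 1"
        else:
            yield cl
        win.pop(0)                           # slide the three-label window
    if win:
        yield win[-1][0]                     # last label has no right neighbour

stream = iter([("0", 0.00, 0.25), ("1", 0.25, 0.80), ("4", 0.80, 1.05)])
print(list(streaming_correct(stream, {"1": (1.8, 1.8)}, {"1": ["1", "1"]})))
```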
FIG. 17 shows the recognition performance of the speech recognition device according to this embodiment on Chinese continuous digits. Baseline is the recognition rate obtained without the present method (speech recognition from the speech waveform only). 1-Dim_時間長Ratio is the recognition rate of the correction method using the ratio of the selected label's utterance duration to the overall average duration, and 1-Dim_PowerRatio is that of the correction method using the ratio of the selected label's voice power to the overall average power. 2-Dim_時間長Ratio is the recognition rate of the correction method using the ratios of the utterance duration of the selected label to those of the preceding and following labels, and 2-Dim_PowerRatio is that of the correction method using the ratios of the average voice power of the selected label to those of the preceding and following labels.
The evaluation results shown in FIG. 17 indicate that the speech recognition methods according to this embodiment (2-Dim_時間長Ratio, 2-Dim_PowerRatio) outperform the methods that use the overall average duration or the overall average power as the reference (1-Dim_時間長Ratio, 1-Dim_PowerRatio).
The description so far has centered on an example in which the label correction unit 150 corrects the selected label to a correction candidate label based on the ratios of the selected label's utterance duration to the utterance durations of the labels before and after it, together with the first predetermined condition concerning those ratios; however, the method is not limited to this.
When either the Pre_ratio or the Post_ratio of a label falls within the range of the incorrect samples, the other is very likely to fall within that range as well. The label correction unit 150 may therefore correct the selected label to the correction candidate label based on the ratio of the selected label's utterance duration to that of the label either before or after it, together with the first predetermined condition concerning that ratio. In other words, the label correction unit 150 may correct at least one of two adjacent labels based on the ratio of the utterance durations of the adjacent labels and on a first predetermined condition based on statistical information concerning that ratio.
With this configuration, although the correction accuracy is somewhat lower than when the selected label is corrected based on the ratios of its utterance duration to those of both the preceding and following labels, the processing speed can be improved.
(Speech recognition program)
The processing in the speech recognition device according to this embodiment described above is one aspect of processing realized by a speech recognition program that causes a computer to execute: a first step of performing speech recognition on input speech and outputting a label string indicating the speech recognition result; a second step of outputting the utterance durations of at least one label included in the label string and of the labels before and after it; and a third step of correcting the label to a correction candidate label, which is a label that is a candidate for correction, based on the utterance durations of the label and of the labels before and after it and on a first predetermined condition based on statistical information concerning the utterance duration of the label.
The processing in the speech recognition device according to this embodiment may also be realized by a speech recognition program that causes a computer to execute: a first step of performing speech recognition on input speech and outputting a label string indicating the speech recognition result; a second step of outputting the utterance durations of the labels included in the label string; and a third step of correcting at least one of two adjacent labels based on the ratio of the utterance durations of the adjacent labels and on a first predetermined condition based on statistical information concerning that ratio.
This speech recognition program can be stored on any computer-readable storage medium, such as semiconductor storage media (RAM, ROM, etc.), magnetic storage media (FD, HD, etc.), optically read storage media (CD, CDV, LD, DVD, etc.), and magneto-optical storage media (MO, etc.), regardless of whether the reading method is electronic, magnetic, or optical, and can be distributed and sold on such media or downloaded over a network or the like.
Although embodiments of the present invention have been described above, the scope of the present invention is not limited to the illustrated and described exemplary embodiments, and also includes all embodiments that provide effects equivalent to those intended by the present invention. Furthermore, the scope of the invention can be defined by any desired combination of particular features among all the disclosed features.
DESCRIPTION OF REFERENCE SIGNS
1 Speech recognition device
10 First predetermined condition
20 Second predetermined condition
110 Label string output unit
120 Label selection unit
130 Utterance duration output unit
140 Voice power information output unit
150 Label correction unit
160 Result output unit
170 Storage device

Claims (9)

  1.  A speech recognition device comprising:
     a label string output unit that performs speech recognition on input speech and outputs a label string indicating the speech recognition result;
     an utterance duration output unit that outputs the utterance durations of at least one label included in the label string and of the labels before and after it; and
     a label correction unit that corrects the label to a correction candidate label, which is a label that is a candidate for correction, based on the utterance durations of the label and of the labels before and after it and on a first predetermined condition based on statistical information concerning the utterance duration of the label.
  2.  The speech recognition device according to claim 1, wherein the label correction unit corrects the label to the correction candidate label based on the ratios of the utterance duration of the label to the utterance durations of the labels before and after it and on the first predetermined condition, which concerns those ratios.
  3.  The speech recognition device according to claim 1 or 2, further comprising a voice power information output unit that outputs voice power information, which is information indicating the voice power in the utterance segments of at least one label included in the label string and of the labels before and after it,
     wherein the label correction unit corrects the label to the correction candidate label based not only on the utterance durations of the label and of the labels before and after it and the first predetermined condition, but also on the voice power information of the label and of the labels before and after it and on a second predetermined condition based on statistical information concerning the voice power of the label.
  4.  The speech recognition device according to claim 3, wherein the label correction unit corrects the label to the correction candidate label based not only on the utterance durations of the label and of the labels before and after it and the first predetermined condition, but also on the ratios of the voice power of the label to the voice powers of the labels before and after it and on the second predetermined condition, which concerns those ratios.
  5.  A speech recognition device comprising:
     a label string output unit that performs speech recognition on input speech and outputs a label string indicating the speech recognition result;
     a voice power information output unit that outputs voice power information, which is information indicating the voice power in the utterance segments of at least one label included in the label string and of the labels before and after it; and
     a label correction unit that corrects the label to a correction candidate label, which is a label that is a candidate for correction, based on the voice power information of the label and of the labels before and after it and on a second predetermined condition based on statistical information concerning the voice power of the label.
  6.  A speech recognition device comprising:
     a label string output unit that performs speech recognition on input speech and outputs a label string indicating the speech recognition result;
     an utterance duration output unit that outputs the utterance durations of the labels included in the label string; and
     a correction unit that corrects at least one of two adjacent labels based on the ratio of the utterance durations of the adjacent labels and on a first predetermined condition based on statistical information concerning that ratio.
  7.  A speech recognition method executed by a speech recognition device having a label string output unit, an utterance duration output unit, and a label correction unit, the method comprising:
     a first step in which the label string output unit performs speech recognition on input speech and outputs a label string indicating the speech recognition result;
     a second step in which the utterance duration output unit outputs the utterance durations of at least one label included in the label string and of the labels before and after it; and
     a third step in which the label correction unit corrects the label to a correction candidate label, which is a label that is a candidate for correction, based on the utterance durations of the label and of the labels before and after it and on a first predetermined condition based on statistical information concerning the utterance duration of the label.
  8.  A speech recognition program for causing a computer to execute:
     a first step of performing speech recognition on input speech and outputting a label string indicating the speech recognition result;
     a second step of outputting the utterance durations of at least one label included in the label string and of the labels before and after it; and
     a third step of correcting the label to a correction candidate label, which is a label that is a candidate for correction, based on the utterance durations of the label and of the labels before and after it and on a first predetermined condition based on statistical information concerning the utterance duration of the label.
  9.  A speech recognition program for causing a computer to execute:
     a first step of performing speech recognition on input speech and outputting a label string indicating the speech recognition result;
     a second step of outputting the utterance durations of the labels included in the label string; and
     a third step of correcting at least one of two adjacent labels based on the ratio of the utterance durations of the adjacent labels and on a first predetermined condition based on statistical information concerning that ratio.
PCT/JP2012/002861 2011-05-02 2012-04-26 Voice recognition device and voice recognition method WO2012150658A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2011103110 2011-05-02
JP2011-103110 2011-05-02

Publications (1)

Publication Number Publication Date
WO2012150658A1 true WO2012150658A1 (en) 2012-11-08

Family

ID=47107849

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2012/002861 WO2012150658A1 (en) 2011-05-02 2012-04-26 Voice recognition device and voice recognition method

Country Status (1)

Country Link
WO (1) WO2012150658A1 (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS58224396A (en) * 1982-06-23 1983-12-26 富士通株式会社 Voice recognition equipment
JPH056196A (en) * 1991-06-27 1993-01-14 Matsushita Electric Ind Co Ltd Voice recognizing device
JPH07210193A (en) * 1994-01-12 1995-08-11 Matsushita Electric Ind Co Ltd Voice conversation device
JPH08248983A (en) * 1995-03-09 1996-09-27 Mitsubishi Electric Corp Voice recognition device
JPH11184496A (en) * 1997-12-19 1999-07-09 Toshiba Corp Device and method for speech recognition
JP2003345388A (en) * 2002-05-23 2003-12-03 Nec Corp Method, device, and program for voice recognition

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107871503A (en) * 2016-09-28 2018-04-03 Toyota Motor Corp Speech dialogue system and utterance intention understanding method
CN107871503B (en) * 2016-09-28 2023-02-17 丰田自动车株式会社 Speech dialogue system and utterance intention understanding method
CN110232923A (en) * 2019-05-09 2019-09-13 Qingdao Hisense Electric Co Ltd Voice control instruction generation method and device, and electronic equipment
CN110232923B (en) * 2019-05-09 2021-05-11 海信视像科技股份有限公司 Voice control instruction generation method and device and electronic equipment
CN112420016A (en) * 2020-11-20 2021-02-26 Sichuan Changhong Electric Co Ltd Method and device for aligning synthesized voice and text and computer storage medium
CN112420016B (en) * 2020-11-20 2022-06-03 四川长虹电器股份有限公司 Method and device for aligning synthesized voice and text and computer storage medium

Similar Documents

Publication Publication Date Title
US8731926B2 (en) Spoken term detection apparatus, method, program, and storage medium
EP3114679B1 (en) Predicting pronunciation in speech recognition
US8140330B2 (en) System and method for detecting repeated patterns in dialog systems
JP6131537B2 (en) Speech recognition system, speech recognition program, recording medium, and speech recognition method
JP5207642B2 (en) System, method and computer program for acquiring a character string to be newly recognized as a phrase
US8494853B1 (en) Methods and systems for providing speech recognition systems based on speech recordings logs
US8645139B2 (en) Apparatus and method of extending pronunciation dictionary used for speech recognition
US8990086B2 (en) Recognition confidence measuring by lexical distance between candidates
EP2685452A1 (en) Method of recognizing speech and electronic device thereof
JP3834169B2 (en) Continuous speech recognition apparatus and recording medium
JP6812843B2 (en) Computer program for voice recognition, voice recognition device and voice recognition method
US8532990B2 (en) Speech recognition of a list entry
CN111797632B (en) Information processing method and device and electronic equipment
US9484019B2 (en) System and method for discriminative pronunciation modeling for voice search
CN109036471B (en) Voice endpoint detection method and device
US20130289987A1 (en) Negative Example (Anti-Word) Based Performance Improvement For Speech Recognition
JP6276513B2 (en) Speech recognition apparatus and speech recognition program
US20170270923A1 (en) Voice processing device and voice processing method
WO2012150658A1 (en) Voice recognition device and voice recognition method
JP6481939B2 (en) Speech recognition apparatus and speech recognition program
KR101122591B1 (en) Apparatus and method for speech recognition by keyword recognition
JP6027754B2 (en) Adaptation device, speech recognition device, and program thereof
KR101483947B1 (en) Apparatus for discriminative training acoustic model considering error of phonemes in keyword and computer recordable medium storing the method thereof
JP7098587B2 (en) Information processing device, keyword detection device, information processing method and program
CN112863496A (en) Voice endpoint detection method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12779301

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12779301

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP