WO2011007627A1 - Speech processing apparatus and method, and storage medium - Google Patents
Speech processing apparatus and method, and storage medium
- Publication number
- WO2011007627A1 (PCT/JP2010/059515)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- phrase
- speech
- recognition
- word
- unit
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
- G10L15/08—Speech classification or search
Definitions
- the present invention relates to a speech processing apparatus and method for recognizing input speech, and a storage medium.
- A technique of automatic interpretation (speech translation) is known in which input speech is recognized and the recognition result is translated.
- In speech translation, it is important to output the translation result as immediately as possible. For example, if the start and end of the input speech, that is, of an utterance, can be specified (set) based on the system specification or a user instruction, translation processing can be performed on the specified units; by making these units shorter, more immediate translation results can be obtained.
- However, when speech translation is performed on continuously input speech such as a telephone call, the start and end of an utterance cannot be specified by a user instruction or the like. In such a case, speech translation is simply performed on the speech up to the point where the call is temporarily interrupted, which causes a long waiting time. Few technologies and methods for performing sequential speech translation in such cases have been developed or proposed.
- A method has been proposed in which, using a multipath search method, the first recognition pass is processed at certain time intervals, the stable section within each interval is determined by the second recognition pass, and the speech recognition results are output sequentially (see Patent Document 1).
- A method has also been developed in which the timing for driving the second recognition pass is estimated based on frame reliability, thereby eliminating the wasted speech processing of always running the second recognition pass at regular time intervals (see Patent Document 2).
- However, the above-described techniques are speech recognition techniques, and there is no description of how to combine them with translation processing, which is a separate process performed on the speech content after the speech is recognized. Further, the recognition results obtained by the above-described techniques are not always in units suitable for translation.
- Japanese Patent No. 3834169; JP 2004-12615 A; Japanese Patent No. 3766111; Japanese Patent No. 3009642; JP 2008-269122 A
- The present invention has been made to solve the above-described problems. Its purpose is to shorten the waiting time by increasing real-time responsiveness, and to enable speech translation results to be output sequentially with high accuracy.
- the speech processing apparatus includes an analysis unit that detects and analyzes input speech and outputs a feature amount, and a speech recognition unit that performs speech recognition based on the feature amount and outputs a recognition result.
- The speech recognition means includes phrase determination means for determining a phrase boundary based on a comparison between the hypothetical word group generated by speech recognition and words representing set phrase boundaries, and outputs recognition results in phrase units according to the phrase boundaries determined by the phrase determination means.
- the speech processing method includes an analysis step of detecting and analyzing input speech and outputting a feature amount, and a speech recognition step of performing speech recognition based on the feature amount and outputting a recognition result.
- The speech recognition step includes a phrase determination step of determining a phrase boundary based on a comparison between the hypothetical word group generated by the speech recognition and words representing set phrase boundaries, and outputs recognition results in phrase units according to the phrase boundaries determined in the phrase determination step.
- The storage medium is a computer-readable storage medium storing a program that causes a computer to realize an analysis function of detecting and analyzing input speech and outputting a feature value, and a speech recognition function of performing speech recognition based on the feature value and outputting a recognition result. The speech recognition function includes a phrase determination function of determining a phrase boundary based on a comparison between the hypothetical word group generated by speech recognition and words representing set phrase boundaries, and outputs recognition results in phrase units according to the phrase boundaries determined by the phrase determination function.
- According to the present invention, the phrase boundary is determined based on a comparison between the hypothetical word group generated by speech recognition and words representing set phrase boundaries. As a result, speech translation results can be output sequentially with high accuracy for the input speech, increasing real-time responsiveness and shortening the waiting time.
- FIG. 1 is a configuration diagram showing the configuration of the speech processing apparatus according to Embodiment 1 of the present invention.
- FIG. 2 is a block diagram showing the configuration of the speech processing apparatus according to Embodiment 2 of the present invention.
- FIG. 3 is a flowchart for explaining an operation example of the speech processing apparatus according to Embodiment 2 of the present invention.
- FIG. 4 is a configuration diagram showing the configuration of the call translation system according to the third embodiment of the present invention using the speech processing apparatus according to the second embodiment.
- FIG. 5 is a flowchart for explaining an operation example of the system according to the third embodiment of the present invention.
- FIG. 6 is a configuration diagram showing the configuration of the speech processing apparatus according to Embodiment 4 of the present invention.
- FIG. 7 is a flowchart for explaining an operation example of the speech processing apparatus according to Embodiment 4 of the present invention.
- FIG. 8 is a configuration diagram illustrating the configuration of a caption generation system according to the fifth embodiment of the present invention using the speech processing apparatus according to the fourth embodiment.
- FIG. 9 is a flowchart for explaining an operation example of the system according to the fifth embodiment of the present invention.
- FIG. 1 is a configuration diagram showing the configuration of the speech processing apparatus according to the first embodiment.
- First, the speech processing apparatus includes an analysis unit 101 that detects and analyzes input speech and outputs a feature value, and a speech recognition unit 102 that performs speech recognition based on the feature value and outputs a recognition result.
- the speech recognition unit 102 includes a phrase determination unit 103 that determines a phrase boundary based on a comparison between a hypothetical word group generated by speech recognition and a word representing a set phrase boundary.
- the speech recognition unit 102 outputs the recognition result in units of phrases based on the phrase boundaries determined by the phrase determination unit 103.
- the analysis unit 101 detects and analyzes the input speech and outputs a feature value.
- the phrase determination unit 103 determines a phrase boundary based on a comparison between a hypothetical word group generated by speech recognition and a word representing the set phrase boundary.
- the speech recognition unit 102 outputs a recognition result in units of phrases based on the determined phrase boundaries.
- In this way, speech translation is performed while phrase boundaries for translation are determined; the recognition result word string is extracted in units suitable for translation and translation processing is performed, so that speech translation results can be obtained sequentially. Since phrase boundaries are determined for the hypothetical word group generated by speech recognition, the determination is performed during the word search of the speech recognition process. Therefore, in the present embodiment, phrase boundaries are not determined after the recognition process is completed, and there is little risk of impairing the sequentiality and real-time property of the recognition result output. Further, if the hypothesis likelihood and occupancy rate are taken into consideration in the word search process, the deterioration of speech recognition accuracy due to the sequential output of recognition results can be suppressed.
- FIG. 2 is a configuration diagram showing the configuration of the speech processing apparatus 200 according to the second embodiment.
- the speech processing apparatus 200 includes an analysis unit 202, a speech recognition unit 203, an acoustic model storage unit 204, a recognition dictionary storage unit 205, a translation dictionary storage unit 206, and a translation unit 207.
- The analysis unit 202 detects a voice section from the voice data input from the input unit 201, acoustically analyzes the detected section, and outputs a feature amount series, for example a time series of cepstra.
- a technique for performing voice detection and acoustic analysis is well known as a known technique, and detailed description thereof is omitted here.
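As a sketch of what such an acoustic analysis step might look like, the following computes a simple real cepstrum per frame with NumPy. The frame length, hop size, and number of coefficients are illustrative assumptions, not values from this document:

```python
import numpy as np

def cepstral_features(signal, frame_len=400, hop=160, n_ceps=12):
    """Frame the waveform and compute a real cepstrum per frame."""
    frames = []
    window = np.hamming(frame_len)
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame)) + 1e-10  # avoid log(0)
        # Real cepstrum: inverse FFT of the log magnitude spectrum
        cepstrum = np.fft.irfft(np.log(spectrum))
        frames.append(cepstrum[:n_ceps])  # keep low-order coefficients
    return np.array(frames)

# 1 second of a synthetic 440 Hz tone at 16 kHz stands in for input speech
sr = 16000
t = np.arange(sr) / sr
speech = np.sin(2 * np.pi * 440 * t)
feats = cepstral_features(speech)
print(feats.shape)  # (98, 12): 98 frames of 12 cepstral coefficients
```

A real system would typically use mel-frequency cepstra and add delta features, but the frame-by-frame time series of feature vectors is the same kind of output the analysis unit produces.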
- the voice recognition unit 203 includes a distance calculation unit 231, a word search unit 232, and an output unit 234. Furthermore, the word search unit 232 includes a phrase determination unit 233.
- The speech recognition unit 203 uses an acoustic model that gives acoustic likelihood and a recognition dictionary containing the words to be recognized, receives the feature amount series output by the analysis unit 202, and outputs a recognition result word string from the output unit 234.
- the acoustic model is stored in the acoustic model storage unit 204, and the recognition dictionary is stored in the recognition dictionary storage unit 205.
- the distance calculation unit 231 performs acoustic calculation of the feature amount series obtained from the analysis unit 202 using an acoustic model. Further, the word search unit 232 performs a word search for the distance calculation result by the distance calculation unit 231 using the recognition dictionary, and outputs a word string that becomes the recognition result.
- the translation unit 207 receives the word string output by the speech recognition unit 203 as input, performs translation using the translation dictionary stored in the translation dictionary storage unit 206, and outputs the translation result.
- the translation dictionary may include grammar knowledge for translation.
- the above-described voice processing apparatus 200 is a general-purpose computer system, and includes a CPU (Central Processing Unit), a RAM (Random Access Memory), a ROM (Read Only Memory), and a nonvolatile storage device as components not shown.
- The voice processing device 200 is a computer including, for example, a CPU, which reads an OS (Operating System) and a voice processing program stored in a storage medium 209 such as a RAM, a ROM, or a nonvolatile storage device, and performs voice processing by executing them. Thereby, speech translation results can be output sequentially for continuous input speech.
- the voice processing apparatus 200 may be configured by a single computer or may be configured by a plurality of computers. The same applies to the other embodiments.
- The acoustic model storage unit 204, the recognition dictionary storage unit 205, and the translation dictionary storage unit 206 may each be configured with a nonvolatile storage device such as a fixed disk, a magneto-optical disk, or a flash memory, or a volatile storage device such as a DRAM (Dynamic Random Access Memory).
- the acoustic model storage unit 204, the recognition dictionary storage unit 205, and the translation dictionary storage unit 206 may be storage devices connected to the outside of the computer constituting the speech processing device 200.
- step S301 a voice is input by the input unit 201.
- the input unit 201 is a microphone, and for example, an English speech waveform input from the microphone is obtained.
- step S302 the end of voice input is determined. For example, if the input voice exists, the subsequent processing is continued, but if it is finished, the processing is ended.
- step S303 the analysis unit 202 detects a voice section from the input voice, acoustically analyzes the detected section, and outputs a feature amount series.
- In step S304, the distance calculation unit 231 of the speech recognition unit 203 calculates the distance between the feature amount series obtained from the analysis unit 202 and the acoustic model stored in the acoustic model storage unit 204.
- the closeness between the input speech and the acoustic model is calculated.
- the distance calculation unit 231 performs an acoustic distance calculation between the feature amount series obtained by the analysis unit 202 and the acoustic model, and outputs a distance calculation result.
- the technique for calculating the distance from the acoustic model is well known as a known technique, and detailed description thereof is omitted here.
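The distance calculation can be illustrated under the common assumption that acoustic model states are diagonal Gaussians, in which case the negative log-likelihood serves as the "acoustic distance". The function name and model values below are hypothetical:

```python
import numpy as np

def acoustic_distance(feature, mean, var):
    """Negative log-likelihood of a feature vector under a diagonal Gaussian;
    smaller values mean the frame is acoustically closer to the model state."""
    return 0.5 * np.sum(np.log(2 * np.pi * var) + (feature - mean) ** 2 / var)

# Hypothetical 12-dimensional model state and two input frames
mean = np.zeros(12)
var = np.ones(12)
close_frame = np.full(12, 0.1)
far_frame = np.full(12, 3.0)

d_close = acoustic_distance(close_frame, mean, var)
d_far = acoustic_distance(far_frame, mean, var)
print(d_close < d_far)  # True: the matching frame scores a smaller distance
```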
- In step S305, the word search unit 232 of the speech recognition unit 203 searches for the most likely word string for the distance calculation result obtained by the distance calculation unit 231, using the recognition dictionary stored in the recognition dictionary storage unit 205, and generates word hypotheses. For example, if the input speech is English, English speech recognition is performed to generate word hypotheses consisting of probable English words or word strings.
- the word search technique in speech recognition is well known as a known technique, and detailed description thereof is omitted here.
- In step S306, the phrase determination unit 233 of the word search unit 232 determines a phrase boundary based on a comparison between the obtained word hypotheses and words representing set phrase boundaries. For example, using the property that the first word of a prepositional phrase, which is suitable as a translation unit for English, is a preposition, words whose part of speech is a preposition are determined in advance as words representing a phrase boundary.
- When the ratio of the number of word hypotheses representing a phrase boundary, Hp, to the total number of word hypotheses, Hall, exceeds a predetermined threshold Hthre, in other words, when "Hp / Hall > Hthre" holds, a phrase boundary is determined.
- In the phrase boundary determination, if the hypothesis occupancy exceeds the threshold, for example, the start time of the most likely word hypothesis representing the phrase boundary is determined as the phrase boundary, and the maximum likelihood hypothesis whose end time is immediately before the determined start time is output as the recognition result up to that end time, as the end of the preceding phrase.
- Alternatively, the end time of the most likely word hypothesis representing the phrase boundary can be determined as the phrase boundary, and the maximum likelihood hypothesis ending at that time can be output as the recognition result up to the determined end time.
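The occupancy criterion "Hp / Hall > Hthre" described above can be sketched as follows. The hypothesis tuple layout, the threshold value, and the example hypotheses are illustrative assumptions:

```python
# Each hypothesis is (word, is_boundary_word, log_likelihood, start_time, end_time)
def detect_phrase_boundary(hypotheses, h_thre=0.4):
    """Return the phrase-boundary time if the occupancy of boundary-word
    hypotheses among all current hypotheses exceeds the threshold."""
    h_all = len(hypotheses)
    boundary = [h for h in hypotheses if h[1]]
    h_p = len(boundary)
    if h_all == 0 or h_p / h_all <= h_thre:
        return None  # no boundary determined yet
    # Start time of the most likely boundary-word hypothesis becomes the boundary
    best = max(boundary, key=lambda h: h[2])
    return best[3]

hyps = [
    ("meeting", False, -120.0, 0.00, 0.45),
    ("in",      True,  -118.5, 0.45, 0.55),  # preposition = boundary word
    ("inn",     True,  -119.2, 0.45, 0.60),
    ("meet",    False, -121.0, 0.00, 0.40),
]
print(detect_phrase_boundary(hyps))  # 0.45, since Hp/Hall = 2/4 > 0.4
```

With a stricter threshold (for example h_thre=0.6), the same hypotheses would yield no boundary, and the search would continue with the next input frames.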
- the result is output from the output unit 234.
- If no phrase boundary is determined in step S306, the process returns to step S301 to accept the next voice input.
- In the above description, words whose part of speech is a preposition are determined in advance as the words representing a phrase boundary. However, the present invention is not limited to this; other parts of speech such as conjunctions, as well as punctuation marks, silence, and fillers (connecting words), may also be used.
- an algorithm for learning a model representing a phrase boundary is described in Patent Document 5.
- For the hypothesis count, the number of hypotheses at the same time point of the speech input may be used for the calculation, or the number of hypotheses within a time width including the times immediately before and after a certain time may be calculated.
- In step S307, the translation unit 207 translates the recognition result word string up to the determined phrase boundary using the translation dictionary stored in the translation dictionary storage unit 206, and outputs the translation result. For example, when the input language is English and the output language is Japanese, the English word string obtained as the recognition result word string is translated and a Japanese word string is output as the translation result.
- the technique for translating a word string is well known as a known technique, and detailed description thereof is omitted here.
- step S308 the translation result described above is output by the output unit 208 in a state visible to the user.
- the process returns to step S301, and the above-described steps S301 to S308 are continued until the voice input is completed.
- In the above description, the hypothesis occupancy rate is used as the criterion in the phrase determination unit 233. Alternatively, when a word hypothesis representing a phrase boundary is the maximum likelihood hypothesis (first hypothesis) and its likelihood difference from the next most likely word hypothesis (second hypothesis) exceeds a threshold, the start time or end time of the word representing the phrase boundary may be determined as the phrase boundary.
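This likelihood-difference criterion might be sketched as follows; the gap threshold, tuple layout, and example hypotheses are hypothetical:

```python
# Each hypothesis is (word, is_boundary_word, log_likelihood, start_time, end_time)
def boundary_by_likelihood_gap(hypotheses, gap_thre=5.0):
    """Determine a boundary when the top hypothesis is a boundary word and
    its likelihood exceeds the runner-up's by more than the threshold."""
    ranked = sorted(hypotheses, key=lambda h: h[2], reverse=True)
    first, second = ranked[0], ranked[1]
    if first[1] and (first[2] - second[2]) > gap_thre:
        return first[3]  # start time of the boundary word
    return None

hyps = [
    ("to",   True,  -100.0, 1.20, 1.30),  # maximum-likelihood hypothesis
    ("two",  False, -108.0, 1.20, 1.35),
    ("too",  False, -109.5, 1.20, 1.32),
]
print(boundary_by_likelihood_gap(hyps))  # 1.2: the gap of 8.0 exceeds 5.0
```

If the first and second hypotheses were closer in likelihood than the threshold, the boundary would be deferred, which is the same effect the occupancy criterion achieves by waiting for boundary-word hypotheses to dominate.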
- As described above, the speech processing apparatus 200 performs speech translation on continuously input speech while determining phrase boundaries; that is, recognition result word strings are output and translated in units suitable for translation, so that speech translation results can be output sequentially.
- Phrase boundary determination is not performed after the recognition process is completed, but is performed in the process of word search in the speech recognition process, so there is little risk of impairing the sequentiality and real-time performance of the recognition result output. Also, by considering the likelihood and occupancy rate of the hypothesis during the word search process, it is possible to suppress the deterioration of the speech recognition accuracy due to the sequential output of recognition results.
- FIG. 4 is a configuration diagram showing the configuration of the call translation system according to the third embodiment using the speech processing device 200.
- This system includes a receiving unit 401, a voice synthesizing unit 408, an output unit 409, and a communication network 420 in addition to the voice processing device 200 in the second embodiment.
- the communication network 420 is, for example, a public telephone network.
- Communication network 420 may be an Internet communication network.
- the receiving unit 401 receives audio input from the communication network 420 and outputs the received audio to the audio processing device 200.
- the receiving unit 401 is a receiving unit in a telephone that realizes a voice call, for example.
- the analysis unit 202 performs speech detection / analysis using the speech received by the reception unit 401 as an input.
- the translation unit 207 sequentially outputs the translation results to the voice synthesis unit 408. For example, when the input language is English and the output language is Japanese, English-Japanese translation is performed and a Japanese word string is output as a translation result.
- the speech synthesizer 408 synthesizes the translation results obtained sequentially and outputs synthesized speech. Specifically, for example, when a Japanese word string is obtained as a translation result, Japanese speech synthesis is performed. A technique for synthesizing text data into voice data is well known as a known technique, and detailed description thereof is omitted here.
- the output unit 409 is, for example, a speaker, and outputs sound by using the sound data obtained by the sound synthesizing unit 408 as an input.
- In step S501, the reception unit 401 receives voice waveforms continuously input from the communication network 420.
- In step S502, the speech processing described in the second embodiment is performed by the analysis unit 202, speech recognition unit 203, and translation unit 207 in the speech processing apparatus 200, and the translation results are output sequentially.
- In step S503, the speech synthesizer 408 synthesizes speech from the translation result obtained from the speech processing apparatus 200 (S502). For example, a Japanese word string output as a translation result is synthesized into speech.
- step S504 the output unit 409 outputs the synthesized voice from, for example, a speaker.
- the audio data continuously received / input from the communication network 420 is sequentially processed, and as a result, the sequential speech translation results are output (synthesized speech output).
- the speech translation result is output as synthesized speech.
- the present invention is not limited to this and may be output as text information.
- The above-described system is, for example, a general-purpose computer system including a CPU, a RAM, a ROM, and a nonvolatile storage device as components not shown. The CPU reads the OS and the call translation program stored in the RAM, ROM, or nonvolatile storage device and executes them to perform the call translation process. As a result, the voice during a call can be translated and output sequentially.
- the system described above does not have to be a single computer, and may be configured by a plurality of computers.
- FIG. 6 is a configuration diagram showing the configuration of the audio processing device 600 according to the fourth embodiment.
- the speech processing apparatus 600 includes an analysis unit 602, a speech recognition unit 603, an acoustic model storage unit 604, a recognition dictionary storage unit 605, a translation dictionary storage unit 606, and a translation unit 607.
- the analysis unit 602 detects a voice section from the voice data input from the input unit 601, acoustically analyzes the detected section, and outputs a time series of, for example, a cepstrum, which is a feature amount series.
- a technique for performing voice detection and acoustic analysis is well known as a known technique, and detailed description thereof is omitted here.
- the voice recognition unit 603 includes a distance calculation unit 631 and a word search unit 632 therein.
- the word search unit 632 includes a phrase determination unit 633.
- phrase determining unit 633 includes section specifying unit 634.
- Based on section information from the start of input, for example time information, the section specifying unit 634 temporarily changes, for each set section, the threshold used by the phrase determining unit 633 for phrase determination. For example, for every 500 ms (milliseconds) of input speech, the threshold used by the phrase determination unit 633 is reduced within that section, so that a phrase boundary is more easily determined.
- the sound processing device 600 is a general-purpose computer system, and includes a CPU, a RAM, a ROM (Read Only Memory), and a non-volatile storage device as components not shown.
- the CPU reads the OS and the sound processing program stored in the RAM, ROM, or non-volatile storage device, and executes these to execute sound processing.
- the speech translation result can be sequentially output with respect to continuous input speech.
- the voice processing apparatus 600 may be configured by a single computer or a plurality of computers.
- the acoustic model storage unit 604, the recognition dictionary storage unit 605, and the translation dictionary storage unit 606 are configured by a nonvolatile storage device such as a fixed disk, a magneto-optical disk, or a flash memory, or a volatile storage device such as a DRAM. It only has to be done.
- the acoustic model storage unit 604, the recognition dictionary storage unit 605, and the translation dictionary storage unit 606 may be storage devices connected to the outside of the computer that constitutes the speech processing device 600.
- step S701 a voice is input by the input unit 601.
- the input unit 601 is a microphone, and for example, an English speech waveform input from the microphone is obtained.
- step S702 the end of voice input is determined. For example, if the input voice exists, the subsequent processing is continued, but if it is finished, the processing is ended.
- step S703 the analysis unit 602 detects a voice section from the input voice, acoustically analyzes the detected section, and outputs a feature amount series.
- In step S704, the distance calculation unit 631 of the speech recognition unit 603 calculates the distance between the feature amount series obtained from the analysis unit 602 and the acoustic model stored in the acoustic model storage unit 604.
- the closeness between the input speech and the acoustic model is calculated.
- the distance calculation unit 631 performs an acoustic distance calculation between the feature amount series obtained by the analysis unit 602 and the acoustic model, and outputs a distance calculation result.
- the technique for calculating the distance from the acoustic model is well known as a known technique, and detailed description thereof is omitted here.
- In step S705, the word search unit 632 of the speech recognition unit 603 searches for the most likely word string for the distance calculation result obtained by the distance calculation unit 631, using the recognition dictionary stored in the recognition dictionary storage unit 605, and generates word hypotheses. For example, if the input speech is English, English speech recognition is performed to generate word hypotheses consisting of probable English words or word strings.
- the word search technique in speech recognition is well known as a known technique, and detailed description thereof is omitted here.
- In step S706, the section specifying unit 634 of the word search unit 632 determines whether a set time interval (for example, 500 milliseconds) has elapsed since the start of voice input (voice processing) or since step S706 was last performed.
- In step S707, the section specifying unit 634 sets the threshold used by the phrase determining unit 633 to a value smaller by the set amount.
- step S708 the phrase determination unit 633 determines a phrase boundary based on a comparison between the obtained word hypothesis and a word representing the set phrase boundary.
- the phrase determination unit 633 determines a phrase boundary in the same manner as the phrase determination unit 233 of the second embodiment described above. In this determination, if the hypothesis occupancy is equal to or less than the threshold (“N” in step S708), the process returns to step S701 to accept the next voice input.
- In step S709, the threshold used by the phrase determination unit 633 is initialized. Thus, as long as the hypothesis occupancy rate remains at or below the threshold, the threshold is decreased each time the time interval set in step S706 is determined to have elapsed, so that phrase boundaries become easier to determine.
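The interval-based threshold control performed by the section specifying unit 634 (steps S706, S707, and S709) might look like the following sketch; the initial threshold, step size, and floor value are illustrative assumptions:

```python
class SectionSpecifier:
    """Lowers the phrase-determination threshold every fixed interval until a
    boundary is found, then resets it (hypothetical parameter values)."""
    def __init__(self, initial_thre=0.5, step=0.1, interval_ms=500, floor=0.1):
        self.initial_thre = initial_thre
        self.threshold = initial_thre
        self.step = step
        self.interval_ms = interval_ms
        self.floor = floor
        self.last_change_ms = 0

    def update(self, now_ms):
        # Steps S706/S707: after each elapsed interval, lower the threshold
        if now_ms - self.last_change_ms >= self.interval_ms:
            self.threshold = max(self.floor, self.threshold - self.step)
            self.last_change_ms = now_ms

    def reset(self):
        # Step S709: reinitialize once a phrase boundary has been determined
        self.threshold = self.initial_thre
        self.last_change_ms = 0

spec = SectionSpecifier()
for t in (500, 1000, 1500):  # no boundary found for 1.5 s of input
    spec.update(t)
print(round(spec.threshold, 2))  # 0.2 after three reductions
spec.reset()
print(spec.threshold)  # 0.5 again
```

The floor keeps the threshold from decaying to zero, so a boundary is never forced on pure noise; the exact schedule is a design choice the embodiment leaves open.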
- step S710 the translation unit 607 translates the recognition result word string up to the determined phrase boundary using the translation dictionary stored in the translation dictionary storage unit 606, and outputs the translation result.
- In step S711, the translation result described above is output in a state visible to the user by the output unit 608.
- the process returns to step S701, and the above-described steps S701 to S711 are continued until the voice input is completed.
- As described above, the speech processing apparatus 600 performs speech translation on continuously input speech while determining phrase boundaries; that is, recognition result word strings are output and translated in units suitable for translation, so that speech translation results can be output sequentially. In addition, since the threshold for phrase determination is changed over time, even when a phrase boundary is difficult to determine, the boundary can be determined more easily and the translation processing can be performed more sequentially.
- The phrase boundary determination is not performed after the recognition processing is completed, but in the process of word search in the speech recognition processing, so there is little risk of impairing the sequentiality and real-time performance of the recognition result output. Also, by considering the likelihood and occupancy rate of the hypotheses during the word search process, the deterioration of speech recognition accuracy due to the sequential output of recognition results can be suppressed.
- In the embodiment described above, the threshold for phrase determination is changed at predetermined intervals until a phrase boundary is determined, but the present invention is not limited to this.
- For example, the threshold for phrase determination may be changed in two steps (twice) within a set fixed time.
- FIG. 8 is a configuration diagram illustrating the configuration of a caption generation system according to the fifth embodiment using the speech processing apparatus 600.
- This system includes a reception unit 801, a shaping unit 808, an output unit 809, and a communication network 820 in addition to the speech processing apparatus 600 of the fourth embodiment described above.
- The reception unit 801 receives speech input from the communication network 820 and outputs the received speech to the speech processing apparatus 600.
- The analysis unit 602 performs speech detection and analysis using the speech received by the reception unit 801 as input.
- The translation unit 607 sequentially outputs the translation results to the shaping unit 808. For example, when the input language is English and the output language is Japanese, English-Japanese translation is performed and a Japanese word string is output as the translation result.
- The shaping unit 808 shapes the translation results (text data) obtained sequentially and outputs the shaped text data. Specifically, for example, when a Japanese word string is obtained as the translation result, summarization is performed or line feeds are inserted. Techniques for summarizing text data, inserting line feeds, and the like are well known and will not be described in detail here.
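As a rough illustration of the shaping step, the following sketch wraps a translation result into caption-sized lines using only the standard library. The width and line-count limits are illustrative assumptions; real summarization and line-feed insertion would be considerably more involved:

```python
import textwrap


def shape_caption(text, width=20, max_lines=2):
    """Hypothetical shaping step: wrap a translation result to a
    caption-sized block, keeping at most 'max_lines' lines
    (a crude stand-in for summarization)."""
    lines = textwrap.wrap(text, width=width)
    return "\n".join(lines[:max_lines])
```

The shaped text would then be handed to the output unit for display.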
- The output unit 809 is, for example, a display, and displays the formatted text data obtained from the shaping unit 808 as its input.
- In step S901, the reception unit 801 receives speech waveforms that are continuously input from the communication network 820.
- In step S902, the speech processing described in Embodiment 4 is performed by the analysis unit 602, speech recognition unit 603, and translation unit 607 in the speech processing apparatus 600, and the translation results are output sequentially.
- In step S903, the shaping unit 808 shapes the translation result obtained from the speech processing apparatus 600 in step S902. For example, the Japanese word string (text data) output as the translation result is summarized, line feeds are inserted, and the text is formatted so as to be easy to read when displayed; the formatted text data is then output.
- In step S904, the output unit 809 displays the formatted text on, for example, a display.
- As described above, the speech data continuously received from the communication network 820 is processed sequentially, and as a result, sequential translation results are output as formatted text data. Compared with a configuration in which output is performed only at regular intervals, this sequential output can be said to be highly effective.
- The above-described system is, for example, a general-purpose computer system and includes a CPU, a RAM, a ROM, and a nonvolatile storage device as components (not shown). The CPU reads and executes the OS and the call translation program stored in the RAM, the ROM, or the nonvolatile storage device to perform the call translation processing. As a result, the speech during a call can be translated and output sequentially.
- The system described above need not be a single computer, and may be configured from a plurality of computers.
- (Appendix 1) A speech processing apparatus comprising: analysis means for detecting and analyzing input speech and outputting a feature quantity; and speech recognition means for performing speech recognition based on the feature quantity and outputting a recognition result, wherein the speech recognition means comprises phrase determination means for determining a phrase boundary based on a comparison between a hypothesis word group generated by the speech recognition and a word representing a set phrase boundary, and outputs the recognition result in units of phrases based on the phrase boundaries determined by the phrase determination means.
- (Appendix 2) The speech processing apparatus according to appendix 1, wherein the phrase determination means determines the phrase boundary based on the likelihood, in the hypothesis word group, of the word representing the phrase boundary.
- (Appendix 4) The speech processing apparatus according to appendix 2, wherein the phrase determination means determines the phrase boundary when the word hypothesis representing the phrase boundary has the maximum likelihood among all word hypotheses and the likelihood difference from the next most likely word hypothesis exceeds a set threshold.
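The criterion just described (the boundary word is the maximum-likelihood hypothesis and leads the runner-up by more than a threshold) can be sketched as follows. The dictionary-of-scores representation and the margin value are illustrative assumptions:

```python
def is_phrase_boundary(word_scores, boundary_words, margin=2.0):
    """word_scores: {word: log-likelihood} for the current word hypotheses.
    Returns True when the best-scoring hypothesis is a boundary word and
    its lead over the runner-up exceeds 'margin' (appendix-4 style
    criterion, illustrative values)."""
    if len(word_scores) < 2:
        return False
    ranked = sorted(word_scores.items(), key=lambda kv: kv[1], reverse=True)
    (best_word, best), (_, second) = ranked[0], ranked[1]
    return best_word in boundary_words and (best - second) > margin
```

Requiring both conditions guards against cutting a phrase on a boundary word that is only narrowly preferred over competing hypotheses.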
- (Appendix 7) The speech processing apparatus according to appendix 6, wherein the word representing the phrase boundary is a preposition or a conjunction, and the position immediately before the word is taken as the phrase boundary.
- (Appendix 9) A computer-readable storage medium storing a program for causing a computer to realize: an analysis function of detecting and analyzing input speech and outputting a feature quantity; and a speech recognition function of performing speech recognition based on the feature quantity and outputting a recognition result, wherein the speech recognition function comprises a phrase determination function of determining a phrase boundary based on a comparison between a hypothesis word group generated by the speech recognition and a word representing a set phrase boundary, and outputs the recognition result in units of phrases based on the phrase boundaries determined by the phrase determination function.
- The present invention can be applied to applications such as speech input and speech translation services using speech recognition and machine translation technologies.
- 101 ... analysis unit, 102 ... speech recognition unit, 103 ... phrase determination unit.
Abstract
Description
First, Embodiment 1 of the present invention will be described. FIG. 1 is a configuration diagram showing the configuration of the speech processing apparatus according to Embodiment 1. This speech processing apparatus comprises an analysis unit 101 that detects and analyzes input speech and outputs a feature quantity, and a speech recognition unit 102 that performs speech recognition based on the feature quantity and outputs a recognition result. In addition, the speech recognition unit 102 comprises a phrase determination unit 103 that determines a phrase boundary based on a comparison between a hypothesis word group generated by the speech recognition and a word representing a set phrase boundary. In this speech processing apparatus, the speech recognition unit 102 outputs the recognition result in units of phrases based on the phrase boundaries determined by the phrase determination unit 103.
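As an illustration of this phrase-unit output, the following sketch imitates units 102 and 103: a hypothetical phrase determination object flags boundary words, and the recognition result word string is cut immediately before each boundary word (per the preposition/conjunction variant described in the appendices). The names and data structures are assumptions, not the patent's implementation:

```python
class PhraseDeterminationUnit:
    """Sketch of unit 103: flags a boundary when a recognized word is in
    the configured set of boundary words (e.g. prepositions, conjunctions)."""

    def __init__(self, boundary_words):
        self.boundary_words = set(boundary_words)

    def is_boundary(self, word):
        return word in self.boundary_words


def emit_phrases(recognized_words, phrase_unit):
    """Sketch of unit-102 behaviour: output the recognition result in
    phrase units, cutting immediately before each boundary word."""
    phrase, phrases = [], []
    for word in recognized_words:
        if phrase and phrase_unit.is_boundary(word):
            phrases.append(phrase)  # close the phrase before the boundary word
            phrase = []
        phrase.append(word)
    if phrase:
        phrases.append(phrase)
    return phrases
```

Each emitted phrase is then small enough to be translated and output without waiting for the end of the utterance.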
Next, Embodiment 2 of the present invention will be described. FIG. 2 is a configuration diagram showing the configuration of the speech processing apparatus 200 according to Embodiment 2. The speech processing apparatus 200 comprises an analysis unit 202, a speech recognition unit 203, an acoustic model storage unit 204, a recognition dictionary storage unit 205, a translation dictionary storage unit 206, and a translation unit 207.
Next, Embodiment 3 of the present invention will be described. FIG. 4 is a configuration diagram showing the configuration of a call translation system according to Embodiment 3 using the speech processing apparatus 200. This system comprises, in addition to the speech processing apparatus 200 of Embodiment 2 described above, a reception unit 401, a speech synthesis unit 408, an output unit 409, and a communication network 420. The communication network 420 is, for example, a public telephone network. The communication network 420 may also be an Internet communication network.
Next, Embodiment 4 of the present invention will be described. FIG. 6 is a configuration diagram showing the configuration of the speech processing apparatus 600 according to Embodiment 4. The speech processing apparatus 600 comprises an analysis unit 602, a speech recognition unit 603, an acoustic model storage unit 604, a recognition dictionary storage unit 605, a translation dictionary storage unit 606, and a translation unit 607.
Next, Embodiment 5 of the present invention will be described. FIG. 8 is a configuration diagram showing the configuration of a caption generation system according to Embodiment 5 using the speech processing apparatus 600. This system comprises, in addition to the speech processing apparatus 600 of Embodiment 4 described above, a reception unit 801, a shaping unit 808, an output unit 809, and a communication network 820.
(Appendix 1) A speech processing apparatus comprising: analysis means for detecting and analyzing input speech and outputting a feature quantity; and speech recognition means for performing speech recognition based on the feature quantity and outputting a recognition result, wherein the speech recognition means comprises phrase determination means for determining a phrase boundary based on a comparison between a hypothesis word group generated by the speech recognition and a word representing a set phrase boundary, and outputs the recognition result in units of phrases based on the phrase boundaries determined by the phrase determination means.
(Appendix 2) The speech processing apparatus according to appendix 1, wherein the phrase determination means determines the phrase boundary based on the likelihood, in the hypothesis word group, of the word representing the phrase boundary.
(Appendix 3) The speech processing apparatus according to appendix 2, wherein the phrase determination means determines the phrase boundary when the occupancy rate of the word representing the phrase boundary in the hypothesis word group exceeds a set threshold.
(Appendix 4) The speech processing apparatus according to appendix 2, wherein the phrase determination means determines the phrase boundary when the word hypothesis representing the phrase boundary has the maximum likelihood among all word hypotheses and the likelihood difference from the next most likely word hypothesis exceeds a set threshold.
(Appendix 5) The speech processing apparatus according to any one of appendices 1 to 4, wherein the phrase determination means further comprises section specification means for specifying section information of the input speech, and the phrase determination means temporarily changes the threshold within the set section, for each section set in the section specification means.
(Appendix 6) The speech processing apparatus according to any one of appendices 1 to 5, wherein the word representing the phrase boundary is a word that appears at the head or the tail of a phrase.
(Appendix 7) The speech processing apparatus according to appendix 6, wherein the word representing the phrase boundary is a preposition or a conjunction, and the position immediately before the word is taken as the phrase boundary.
(Appendix 8) A speech processing method comprising: an analysis step of detecting and analyzing input speech and outputting a feature quantity; and a speech recognition step of performing speech recognition based on the feature quantity and outputting a recognition result, wherein the speech recognition step comprises a phrase determination step of determining a phrase boundary based on a comparison between a hypothesis word group generated by the speech recognition and a word representing a set phrase boundary, and the recognition result is output in units of phrases based on the phrase boundaries determined in the phrase determination step.
(Appendix 9) A computer-readable storage medium storing a program for causing a computer to realize: an analysis function of detecting and analyzing input speech and outputting a feature quantity; and a speech recognition function of performing speech recognition based on the feature quantity and outputting a recognition result, wherein the speech recognition function comprises a phrase determination function of determining a phrase boundary based on a comparison between a hypothesis word group generated by the speech recognition and a word representing a set phrase boundary, and outputs the recognition result in units of phrases based on the phrase boundaries determined by the phrase determination function.
This application claims priority based on Japanese Patent Application No. 2009-168764 filed on July 17, 2009, the entire disclosure of which is incorporated herein by reference.
Claims (9)
- A speech processing apparatus comprising: analysis means for detecting and analyzing input speech and outputting a feature quantity; and speech recognition means for performing speech recognition based on the feature quantity and outputting a recognition result, wherein the speech recognition means comprises phrase determination means for determining a phrase boundary based on a comparison between a hypothesis word group generated by the speech recognition and a word representing a set phrase boundary, and outputs the recognition result in units of phrases based on the phrase boundaries determined by the phrase determination means.
- The speech processing apparatus according to claim 1, wherein the phrase determination means determines the phrase boundary based on the likelihood, in the hypothesis word group, of the word representing the phrase boundary.
- The speech processing apparatus according to claim 2, wherein the phrase determination means determines the phrase boundary when the occupancy rate of the word representing the phrase boundary in the hypothesis word group exceeds a set threshold.
- The speech processing apparatus according to claim 2, wherein the phrase determination means determines the phrase boundary when the word hypothesis representing the phrase boundary has the maximum likelihood among all word hypotheses and the likelihood difference from the next most likely word hypothesis exceeds a set threshold.
- The speech processing apparatus according to claim 1, wherein the phrase determination means further comprises section specification means for specifying section information of the input speech, and the phrase determination means temporarily changes the threshold within the set section, for each section set in the section specification means.
- The speech processing apparatus according to claim 1, wherein the word representing the phrase boundary is a word that appears at the head or the tail of a phrase.
- The speech processing apparatus according to claim 6, wherein the word representing the phrase boundary is a preposition or a conjunction, and the position immediately before the word is taken as the phrase boundary.
- A speech processing method comprising: an analysis step of detecting and analyzing input speech and outputting a feature quantity; and a speech recognition step of performing speech recognition based on the feature quantity and outputting a recognition result, wherein the speech recognition step comprises a phrase determination step of determining a phrase boundary based on a comparison between a hypothesis word group generated by the speech recognition and a word representing a set phrase boundary, and the recognition result is output in units of phrases based on the phrase boundaries determined in the phrase determination step.
- A computer-readable storage medium storing a program for causing a computer to realize: an analysis function of detecting and analyzing input speech and outputting a feature quantity; and a speech recognition function of performing speech recognition based on the feature quantity and outputting a recognition result, wherein the speech recognition function comprises a phrase determination function of determining a phrase boundary based on a comparison between a hypothesis word group generated by the speech recognition and a word representing a set phrase boundary, and outputs the recognition result in units of phrases based on the phrase boundaries determined by the phrase determination function.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2011522761A JP5418596B2 (ja) | 2009-07-17 | 2010-06-04 | 音声処理装置および方法ならびに記憶媒体 |
US13/383,527 US9583095B2 (en) | 2009-07-17 | 2010-06-04 | Speech processing device, method, and storage medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2009168764 | 2009-07-17 | ||
JP2009-168764 | 2009-07-17 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2011007627A1 true WO2011007627A1 (ja) | 2011-01-20 |
Family
ID=43449236
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2010/059515 WO2011007627A1 (ja) | 2009-07-17 | 2010-06-04 | 音声処理装置および方法ならびに記憶媒体 |
Country Status (3)
Country | Link |
---|---|
US (1) | US9583095B2 (ja) |
JP (1) | JP5418596B2 (ja) |
WO (1) | WO2011007627A1 (ja) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9672820B2 (en) | 2013-09-19 | 2017-06-06 | Kabushiki Kaisha Toshiba | Simultaneous speech processing apparatus and method |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9583095B2 (en) * | 2009-07-17 | 2017-02-28 | Nec Corporation | Speech processing device, method, and storage medium |
US10102851B1 (en) * | 2013-08-28 | 2018-10-16 | Amazon Technologies, Inc. | Incremental utterance processing and semantic stability determination |
US10121471B2 (en) * | 2015-06-29 | 2018-11-06 | Amazon Technologies, Inc. | Language model speech endpointing |
US11158307B1 (en) * | 2019-03-25 | 2021-10-26 | Amazon Technologies, Inc. | Alternate utterance generation |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH04117560A (ja) * | 1990-09-07 | 1992-04-17 | Fujitsu Ltd | 節/句境界抽出方式 |
JPH07200591A (ja) * | 1993-12-28 | 1995-08-04 | Fujitsu Ltd | 構文解析装置 |
JPH07261782A (ja) * | 1994-03-22 | 1995-10-13 | Atr Onsei Honyaku Tsushin Kenkyusho:Kk | 音声認識装置 |
JPH08123469A (ja) * | 1994-10-28 | 1996-05-17 | Mitsubishi Electric Corp | 句境界確率計算装置および句境界確率利用連続音声認識装置 |
JPH1011439A (ja) * | 1996-06-21 | 1998-01-16 | Oki Electric Ind Co Ltd | 英日機械翻訳システム |
JPH11259474A (ja) * | 1998-03-10 | 1999-09-24 | Matsushita Electric Ind Co Ltd | 機械翻訳装置及び機械翻訳方法 |
JP2000112941A (ja) * | 1998-10-07 | 2000-04-21 | Internatl Business Mach Corp <Ibm> | イディオム処理機能を有する電子辞書 |
JP2004012615A (ja) * | 2002-06-04 | 2004-01-15 | Sharp Corp | 連続音声認識装置および連続音声認識方法、連続音声認識プログラム、並びに、プログラム記録媒体 |
Family Cites Families (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3766111B2 (ja) | 1991-08-13 | 2006-04-12 | 株式会社東芝 | 音声認識装置 |
JPH0695684A (ja) * | 1992-09-17 | 1994-04-08 | Meidensha Corp | 音声認識システム |
JP3009642B2 (ja) | 1997-10-22 | 2000-02-14 | 株式会社エイ・ティ・アール音声翻訳通信研究所 | 音声言語処理単位変換装置 |
JP3614648B2 (ja) * | 1998-03-13 | 2005-01-26 | 富士通株式会社 | 文書理解支援装置、要約文生成方法、並びに文書理解支援プログラムを記録したコンピュータ読み取り可能な記録媒体 |
US6453292B2 (en) * | 1998-10-28 | 2002-09-17 | International Business Machines Corporation | Command boundary identifier for conversational natural language |
JP3834169B2 (ja) | 1999-09-22 | 2006-10-18 | 日本放送協会 | 連続音声認識装置および記録媒体 |
DE10018134A1 (de) * | 2000-04-12 | 2001-10-18 | Siemens Ag | Verfahren und Vorrichtung zum Bestimmen prosodischer Markierungen |
NO316480B1 (no) * | 2001-11-15 | 2004-01-26 | Forinnova As | Fremgangsmåte og system for tekstuell granskning og oppdagelse |
US7386454B2 (en) * | 2002-07-31 | 2008-06-10 | International Business Machines Corporation | Natural error handling in speech recognition |
US8818793B1 (en) * | 2002-12-24 | 2014-08-26 | At&T Intellectual Property Ii, L.P. | System and method of extracting clauses for spoken language understanding |
JP3998668B2 (ja) * | 2004-07-14 | 2007-10-31 | 沖電気工業株式会社 | 形態素解析装置、方法及びプログラム |
EP1681670A1 (en) * | 2005-01-14 | 2006-07-19 | Dialog Semiconductor GmbH | Voice activation |
US20070192309A1 (en) * | 2005-10-12 | 2007-08-16 | Gordon Fischer | Method and system for identifying sentence boundaries |
US7908552B2 (en) * | 2007-04-13 | 2011-03-15 | A-Life Medical Inc. | Mere-parsing with boundary and semantic driven scoping |
JP2008269122A (ja) * | 2007-04-18 | 2008-11-06 | National Institute Of Information & Communication Technology | 処理単位分割装置、処理単位分割方法、及びプログラム |
US8364485B2 (en) * | 2007-08-27 | 2013-01-29 | International Business Machines Corporation | Method for automatically identifying sentence boundaries in noisy conversational data |
DE602007004733D1 (de) * | 2007-10-10 | 2010-03-25 | Harman Becker Automotive Sys | Sprechererkennung |
JP2010230695A (ja) * | 2007-10-22 | 2010-10-14 | Toshiba Corp | 音声の境界推定装置及び方法 |
US9583095B2 (en) * | 2009-07-17 | 2017-02-28 | Nec Corporation | Speech processing device, method, and storage medium |
- 2010
- 2010-06-04 US US13/383,527 patent/US9583095B2/en active Active
- 2010-06-04 WO PCT/JP2010/059515 patent/WO2011007627A1/ja active Application Filing
- 2010-06-04 JP JP2011522761A patent/JP5418596B2/ja not_active Expired - Fee Related
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH04117560A (ja) * | 1990-09-07 | 1992-04-17 | Fujitsu Ltd | 節/句境界抽出方式 |
JPH07200591A (ja) * | 1993-12-28 | 1995-08-04 | Fujitsu Ltd | 構文解析装置 |
JPH07261782A (ja) * | 1994-03-22 | 1995-10-13 | Atr Onsei Honyaku Tsushin Kenkyusho:Kk | 音声認識装置 |
JPH08123469A (ja) * | 1994-10-28 | 1996-05-17 | Mitsubishi Electric Corp | 句境界確率計算装置および句境界確率利用連続音声認識装置 |
JPH1011439A (ja) * | 1996-06-21 | 1998-01-16 | Oki Electric Ind Co Ltd | 英日機械翻訳システム |
JPH11259474A (ja) * | 1998-03-10 | 1999-09-24 | Matsushita Electric Ind Co Ltd | 機械翻訳装置及び機械翻訳方法 |
JP2000112941A (ja) * | 1998-10-07 | 2000-04-21 | Internatl Business Mach Corp <Ibm> | イディオム処理機能を有する電子辞書 |
JP2004012615A (ja) * | 2002-06-04 | 2004-01-15 | Sharp Corp | 連続音声認識装置および連続音声認識方法、連続音声認識プログラム、並びに、プログラム記録媒体 |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9672820B2 (en) | 2013-09-19 | 2017-06-06 | Kabushiki Kaisha Toshiba | Simultaneous speech processing apparatus and method |
Also Published As
Publication number | Publication date |
---|---|
JP5418596B2 (ja) | 2014-02-19 |
US9583095B2 (en) | 2017-02-28 |
JPWO2011007627A1 (ja) | 2012-12-27 |
US20120116765A1 (en) | 2012-05-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10643609B1 (en) | Selecting speech inputs | |
US9972318B1 (en) | Interpreting voice commands | |
US11061644B2 (en) | Maintaining context for voice processes | |
CN110675855B (zh) | 一种语音识别方法、电子设备及计算机可读存储介质 | |
US10460034B2 (en) | Intention inference system and intention inference method | |
US8635070B2 (en) | Speech translation apparatus, method and program that generates insertion sentence explaining recognized emotion types | |
CN105632499B (zh) | 用于优化语音识别结果的方法和装置 | |
JP3004883B2 (ja) | 終話検出方法及び装置並びに連続音声認識方法及び装置 | |
EP4018437B1 (en) | Optimizing a keyword spotting system | |
US20080077387A1 (en) | Machine translation apparatus, method, and computer program product | |
EP3739583B1 (en) | Dialog device, dialog method, and dialog computer program | |
JP5418596B2 (ja) | 音声処理装置および方法ならびに記憶媒体 | |
JP2010230695A (ja) | 音声の境界推定装置及び方法 | |
KR101747873B1 (ko) | 음성인식을 위한 언어모델 생성 장치 및 방법 | |
Tran et al. | Joint modeling of text and acoustic-prosodic cues for neural parsing | |
US20230343332A1 (en) | Joint Segmenting and Automatic Speech Recognition | |
KR20180127020A (ko) | 자연어 대화체 음성 인식 방법 및 장치 | |
JP5184467B2 (ja) | 適応化音響モデル生成装置及びプログラム | |
KR20200102309A (ko) | 단어 유사도를 이용한 음성 인식 시스템 및 그 방법 | |
US6772116B2 (en) | Method of decoding telegraphic speech | |
US11627185B1 (en) | Wireless data protocol | |
US11277304B1 (en) | Wireless data protocol | |
US11043212B2 (en) | Speech signal processing and evaluation | |
US11393451B1 (en) | Linked content in voice user interface | |
JP2009146043A (ja) | 音声翻訳装置、音声翻訳方法、及びプログラム |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 10799689 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2011522761 Country of ref document: JP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 13383527 Country of ref document: US |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 10799689 Country of ref document: EP Kind code of ref document: A1 |