WO2007097176A1 - Speech recognition dictionary creation support system, speech recognition dictionary creation support method, and speech recognition dictionary creation support program - Google Patents
Speech recognition dictionary creation support system, speech recognition dictionary creation support method, and speech recognition dictionary creation support program
Info
- Publication number
- WO2007097176A1 (PCT/JP2007/051778)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speech recognition
- text data
- dictionary
- language model
- virtual
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
Definitions
- Speech recognition dictionary creation support system, speech recognition dictionary creation support method, and program for speech recognition dictionary creation support
- The present invention relates to a speech recognition dictionary creation support system, a speech recognition dictionary creation support method, and a speech recognition dictionary creation support program, and more particularly to a speech recognition dictionary creation support system, method, and program for creating a speech recognition dictionary that stores the vocabulary forming the components of speech recognition processing, and a language model that orders word sequences.
- An outline of a conventional speech recognition dictionary creation support system will now be described. As shown in FIG. 6, it comprises a text analysis unit 201, an appearance frequency counting unit 202, an updating unit 203, a background dictionary storage unit 204, a recognition dictionary storage unit 205, and a language model storage unit 206.
- The conventional speech recognition dictionary creation support system having such a configuration operates as follows.
- The text analysis unit 201 receives text data containing the vocabulary targeted for speech recognition from the outside and performs text analysis using the word dictionary stored in the background dictionary storage unit 204: the text is divided into individual words, readings are assigned, part-of-speech tags are added as necessary, and the result is sent to the appearance frequency counting unit 202.
- The appearance frequency counting unit 202 receives the word sequence from the text analysis unit 201, counts the appearance frequency of each word, and sends the result to the updating unit 203.
- The updating unit 203 calculates the appearance probability of each word from the word appearance frequencies received from the appearance frequency counting unit 202 and compares it with the appearance probability of the same word stored in the language model storage unit 206, correcting the stored appearance probability so that it approaches the calculated one. In addition, for each word appearing in the text data whose appearance probability is at or above a certain level, it checks whether the word is registered in the recognition dictionary stored in the recognition dictionary storage unit 205; if it is not registered, the word is treated as an unknown word, and the word and its appearance probability are registered in the recognition dictionary storage unit 205 and the language model storage unit 206, respectively.
- The appearance frequency counting unit 202 usually performs counting in units of contiguous sequences of two or three words. In addition, because the morphological analysis in the text analysis unit 201 may divide words incorrectly or assign incorrect readings, an interface is often provided for manually correcting word boundaries or entering readings, which is used in the updating unit 203 and elsewhere (see Patent Document 1, described later).
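- As a rough illustration of this counting step, the following sketch (not part of the patent; the words and helper name are hypothetical) counts contiguous two- and three-word sequences in analyzed text:

```python
from collections import Counter

def count_ngrams(words, n):
    """Count contiguous n-word sequences, as the appearance
    frequency counting unit 202 is described as doing."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

# Toy analyzed text, already divided into words.
words = ["speech", "recognition", "dictionary", "speech", "recognition"]
print(count_ngrams(words, 2).most_common(2))  # bigram counts
print(count_ngrams(words, 3).most_common(2))  # trigram counts
```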
- FIG. 7 redraws the speech recognition dictionary creation support system of Patent Document 1 so that it can be compared with FIG. 6. It consists of a character string comparison unit 301, an unknown word extraction unit 302, an updating unit 303, a recognition dictionary storage unit 305, and a language model storage unit 306; its major feature is that unknown words are detected not with a statistical method but by using the results of manually correcting misrecognitions.
- The conventional speech recognition dictionary creation support system having such a configuration operates as follows.
- The character string comparison unit 301 receives from the outside recognition result text data, which is the result of recognizing the target speech with a speech recognition means whose recognition dictionary (stored in the recognition dictionary storage unit 305) and language model (stored in the language model storage unit 306) are not shown as components, together with misrecognition-corrected text data, in which the recognition errors contained in the recognition result text data have been corrected manually. It then extracts the words or word strings that contain the recognition errors and sends them to the unknown word extraction unit 302.
- For each word or word string received from the character string comparison unit 301, the unknown word extraction unit 302 checks whether it is registered in the recognition dictionary stored in the recognition dictionary storage unit 305; if it is not registered, the word or word string is registered as a new word in the recognition dictionary storage unit 305. Furthermore, the registered new word and a predetermined appearance probability are also registered in the language model storage unit 306.
- Patent Documents 2 to 4 describe other methods for extracting unknown words and registering them in speech recognition dictionaries. In Patent Document 2, a document file containing unknown words is subjected to morphological analysis or the like to extract words, and a word that is not in the speech recognition dictionary is looked up in a background dictionary to obtain its reading and is then registered.
- Patent Documents 3 and 4 disclose unknown word registration devices that have a function of estimating the part of speech and pronunciation of an unknown word and that register the unknown word automatically.
- Patent Document 5 discloses a method of counting the appearance frequency of words in pages collected from sites on the Internet and updating the selection order among words with the same reading in the speech recognition dictionary.
- Patent Document 6 discloses an acoustic model management server and a language model management server that transmit the speech models (acoustic model and language model) used for matching against input speech to a speech recognition device and that have a function of periodically updating the acoustic model and the language model.
- Patent Document 7 is mentioned as background art of the present invention. It relates to a speech recognition device and describes a technique for generating phoneme strings from unknown words that are not registered in the background dictionary (morphological analysis dictionary).
- Patent Document 1: Japanese Patent Application Laid-Open No. 2002-229585
- Patent Document 2: Japanese Patent Application Laid-Open No. 2003-316376
- Patent Document 3: Japanese Patent Application Laid-Open No. 2004-265440
- Patent Document 4: Japanese Patent Application Laid-Open No. 2002-014693
- Patent Document 5: Japanese Patent Application Laid-Open No. 2005-099741
- Patent Document 6: Japanese Patent Application Laid-Open No. 2002-091477
- Patent Document 7: Japanese Patent Application Laid-Open No. 2004-294542
- The above-mentioned "similarity to the speech to be recognized" refers both to similarity of content or topic and to similarity of speech style. For example, a newscaster's speech and a newspaper article may share the same topic, but the speech style, that is, the expressions specific to the spoken language, differs: the newscaster's speech is delivered in the polite sentence-final style of spoken Japanese, whereas the newspaper article is written in the plain declarative style, and in speech, fillers such as "eeto" and "anoo" frequently appear. If a dictionary and language model are created while ignoring such differences in utterance style, expressions specific to speech are likely not to be recognized correctly, which is a harmful effect.
- In Patent Documents 3, 5, and 6 above, it is proposed to collect text from the Internet and from multimedia broadcasts as well, but as a matter of course a mismatch with the above-mentioned "speech to be recognized" occurs, and limits naturally appear in the recognition results.
- The second problem of the prior art is that recognition errors caused by the presence of phonologically similar words or word strings are not reflected in the dictionary and language model. For example, Patent Documents 2 to 5 consider only whether each word appears in the text data and how many times it appears, and take no account of the phonological information that is actually involved in recognition processing. Whether a word should be included in the dictionary ought to be judged in view of whether it is phonologically identical or similar to other words in the dictionary; when similar words exist, it may be necessary to exclude one of them from the dictionary or to lower its priority (appearance probability) in the language model, but with the conventional technology the possibility of such double registration cannot be denied.
- The third problem of the prior art is that it is not always easy to construct a dictionary and language model such that compound words, in which a plurality of words are connected, are recognized correctly. Even when each word constituting a compound is a known word already registered in the dictionary, if the connection probabilities of the constituent words in the language model used for speech recognition are low, the probability that the compound as a whole is recognized correctly is low. In addition, collecting text data containing many compound words is itself difficult, as described above, and raises cost problems.
- The fourth problem of the prior art is that, as a result of the above, it is difficult to feed recognition errors back into the dictionary and language model correctly and thereby prevent recognition errors in advance.
- If the recognition errors actually occurring in a speech recognition system in operation are used, they can be reflected in the dictionary and language model reliably; however, this requires actually observing the recognition errors that occur in the operating system, which is an inconvenience in itself. Moreover, recognition errors not caused by the dictionary and language model cannot be excluded, so another problem remains: the recognition errors that occur in speech recognition systems include those caused by acoustic factors, for example background noise or utterances that are unclear and difficult to hear, as well as those caused by the dictionary and language model, and from such acoustically caused errors it is difficult to derive meaningful corrections to the dictionary and language model.
- The present invention has been made in view of the above circumstances, and its object is to provide a speech recognition dictionary creation support system, a speech recognition dictionary creation support method, and a speech recognition dictionary creation support program that can make use of low-cost text data, take the phonological similarity between words into consideration, and efficiently create a dictionary and language model optimized to reduce speech recognition errors caused by linguistic factors.
- According to the present invention, there are provided a speech recognition dictionary creation support system comprising: a storage unit for storing a dictionary, a language model, and an acoustic model; a text analysis unit that performs morphological analysis processing on text data; a virtual speech recognition processing unit that generates, for the analyzed text data produced by the text analysis unit, virtual speech recognition result text data using the dictionary, the language model, and the acoustic model, and extracts the difference between the analyzed text data and the virtual speech recognition result text data; and an update processing unit that corrects at least one of the dictionary and the language model based on the difference; a speech recognition dictionary creation support method performed using the system; and a program for realizing the system.
- The speech recognition dictionary creation support system configured as described above generates virtual speech recognition result text data from given text data and uses the result of comparing the virtual speech recognition result text data with the original text data to update the dictionary and language model.
Effect of the invention
- According to the present invention, recognition errors in the speech recognition processing of a system in operation can be predicted using text data that is relatively easy to obtain, and a dictionary and language model reflecting the prediction results can be created. The reason is that virtual speech recognition is performed using the dictionary, the language model, and the acoustic model, and the dictionary and language model are updated using the result.
- FIG. 1 is a diagram showing a schematic configuration of a speech recognition dictionary creation support system according to a first embodiment of the present invention.
- FIG. 2 is a block diagram showing the speech recognition dictionary creation support system according to the first embodiment of the present invention in functional blocks.
- FIG. 3 is a diagram showing a configuration example of a virtual speech recognition processing unit of the speech recognition dictionary creation support system according to the first embodiment of the present invention.
- FIG. 4 is a flowchart showing the operation of the speech recognition dictionary creation support system according to the first embodiment of the present invention.
- FIG. 5 is a view for explaining a specific example of the operation of the speech recognition dictionary creation support system according to the first embodiment of the present invention.
- FIG. 6 is a block diagram showing a conventional speech recognition dictionary creation support system by functional blocks.
- FIG. 7 is a block diagram showing a conventional speech recognition dictionary creation support system by functional blocks.
- FIG. 1 is a diagram showing a schematic configuration of a speech recognition dictionary creation support system according to a first embodiment of the present invention.
- As shown in FIG. 1, the system is configured as a data processing device (computer) 73 provided with an input device 71 and a storage device 74. The storage device 74 includes a background dictionary storage unit 741, a recognition dictionary storage unit 742, a language model storage unit 743, and an acoustic model storage unit 744, and can be implemented, for example, as a hard disk capable of holding the background dictionary, the recognition dictionary, the language model, and the acoustic model.
- FIG. 2 is a block diagram representing the speech recognition dictionary creation support system by functional blocks.
- As shown in FIG. 2, the system comprises a text analysis unit 101, a virtual speech recognition processing unit 102, an update processing unit 103, a background dictionary storage unit 104, a recognition dictionary storage unit 105, a language model storage unit 106, and an acoustic model storage unit 107.
- The text analysis unit 101 divides the text (character string) data 108 given from the outside into words and assigns part-of-speech tags and readings. More specifically, the text analysis unit 101 reads the text data 108 and the background dictionary stored in the background dictionary storage unit 104, analyzes the text data 108, and outputs the analyzed text data.
- The virtual speech recognition processing unit 102 identifies words and phrases that are likely to cause speech recognition errors because they are not included in the recognition dictionary or are given a low priority in the language model. More specifically, the virtual speech recognition processing unit 102 reads the recognition dictionary, the language model, and the acoustic model stored in the recognition dictionary storage unit 105, the language model storage unit 106, and the acoustic model storage unit 107, respectively; virtually performs recognition processing on the analyzed text data output from the text analysis unit 101; generates virtual recognition result text data corresponding to the analyzed text data; compares the original analyzed text data with the virtual recognition result text data; and extracts and outputs the difference.
- The update processing unit 103 changes the recognition dictionary and language model in consideration of the words and phrases judged by the virtual speech recognition processing unit 102 to be likely to cause recognition errors. More specifically, based on the difference output from the virtual speech recognition processing unit 102, the update processing unit 103 corrects the recognition dictionary and the language model stored in the recognition dictionary storage unit 105 and the language model storage unit 106.
- Background dictionary storage unit 104 and recognition dictionary storage unit 105 store a background dictionary and a recognition dictionary, respectively.
- The background dictionary is also referred to as a morphological analysis dictionary and holds a vocabulary several tens to several hundreds of times as large as that of the recognition dictionary. In many cases, therefore, readings and other information can be assigned to almost all of the given text data. Even when an unknown word not registered in the background dictionary appears, reading information can be assigned using, for example, the technology described in Patent Document 7.
- The language model storage unit 106 and the acoustic model storage unit 107 store a language model and an acoustic model, respectively.
- It is preferable to use, as the recognition dictionary and language model initially stored in the recognition dictionary storage unit 105 and the language model storage unit 106, the same ones used in the speech recognition system to be put into actual operation. In principle, the acoustic model stored in the acoustic model storage unit 107 should likewise be equivalent to the acoustic model used in the speech recognition system to be actually operated.
- FIG. 3 is a diagram showing an example of a configuration of the virtual speech recognition processing unit 102.
- As shown in FIG. 3, the virtual speech recognition processing unit 102 includes a reading/phoneme string conversion unit 61, a phoneme/state sequence conversion unit 62, a state/feature sequence conversion unit 63, an optimum word string search unit 64, and a text data comparison unit 65.
- The reading/phoneme string conversion unit 61 reads the analyzed text data, which has been divided into words and given readings, in appropriate units, for example one sentence at a time, and, according to a pre-stored syllable-to-phoneme string conversion table, converts each reading string, usually written in hiragana or katakana, into a phoneme string and outputs it.
- A phoneme is the minimum unit of recognition in speech recognition, that is, the recognition unit; each phoneme is represented by a symbol, such as the vowels a, i, u, ... or the consonants k, s, t, ....
- For example, given the reading "おはようございます" (ohayō gozaimasu), the reading/phoneme string conversion unit 61 outputs the phoneme string "#/o/h/a/y/o/o/g/o/z/a/i/m/a/s/u/#" (here "#" represents the silence before and after the utterance). When the acoustic model is built from context-dependent phonemes (triphones), the unit instead outputs the triphone string "#/#-o+h/o-h+a/h-a+y/a-y+o/y-o+o/o-o+g/o-g+o/g-o+z/o-z+a/z-a+i/a-i+m/i-m+a/m-a+s/a-s+u/s-u+#/#".
- Since most recent speech recognition systems use phonemes as recognition units, this embodiment of the present invention follows them and uses phonemes as recognition units. However, the present invention can in principle also be implemented with recognition units other than phonemes, for example syllables or semi-syllables; there is no particular limitation on the choice of recognition unit.
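- The table-driven conversion performed by the reading/phoneme string conversion unit 61 can be sketched as follows (hypothetical code; the kana-to-phoneme table is a tiny illustrative fragment, not the patent's actual conversion table):

```python
# Tiny illustrative kana-to-phoneme table; a real table covers all kana,
# and long vowels need context-dependent handling ("う" is mapped to "o"
# here only because it lengthens the preceding "o" in this example).
KANA_TO_PHONEMES = {
    "お": ["o"], "は": ["h", "a"], "よ": ["y", "o"], "う": ["o"],
    "ご": ["g", "o"], "ざ": ["z", "a"], "い": ["i"], "ま": ["m", "a"],
    "す": ["s", "u"],
}

def reading_to_phonemes(reading):
    """Convert a kana reading string into a phoneme string, with "#"
    marking the silence before and after the utterance."""
    phonemes = ["#"]
    for kana in reading:
        phonemes.extend(KANA_TO_PHONEMES[kana])
    phonemes.append("#")
    return phonemes

print(reading_to_phonemes("おはようございます"))
# ['#', 'o', 'h', 'a', 'y', 'o', 'o', 'g', 'o', 'z', 'a', 'i', 'm', 'a', 's', 'u', '#']
```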
- For the phoneme sequence received from the reading/phoneme string conversion unit 61, the phoneme/state sequence conversion unit 62 refers to the configuration information of the acoustic model stored in the acoustic model storage unit 107 and outputs the state sequence obtained by expanding each phoneme into its constituent states.
- Here, "state" is a concept attached to the Hidden Markov Model (hereinafter "HMM") generally used as the acoustic model in speech recognition. The acoustic model is configured as a set of HMMs, one per phoneme, and each phoneme HMM is composed of several states; by referring to the acoustic model, a phoneme can therefore easily be converted into a state sequence.
- For example, if each phoneme HMM consists of three states and silence of a single state, the above-described phoneme string is expanded into the state sequence #[1], o[1], o[2], o[3], h[1], h[2], h[3], a[1], a[2], a[3], y[1], y[2], y[3], o[1], ....
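- A minimal sketch of this expansion step (hypothetical code, assuming three states per phoneme and one state for silence, as in the example above):

```python
def phonemes_to_states(phonemes, states_per_phoneme=3):
    """Expand a phoneme string into an HMM state sequence; silence "#"
    is modeled with a single state here, matching the example above."""
    states = []
    for p in phonemes:
        n = 1 if p == "#" else states_per_phoneme
        states.extend(f"{p}[{i}]" for i in range(1, n + 1))
    return states

print(phonemes_to_states(["#", "o", "h"]))
# ['#[1]', 'o[1]', 'o[2]', 'o[3]', 'h[1]', 'h[2]', 'h[3]']
```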
- The state/feature sequence conversion unit 63 reads the acoustic model stored in the acoustic model storage unit 107, sequentially receives the state sequence output by the phoneme/state sequence conversion unit 62, and outputs a sequence of feature vectors comprising the feature parameters used for speech recognition. That is, each feature vector is generated randomly according to the probability distribution defined for the corresponding state in the acoustic model, for example a mixture Gaussian distribution. The number of feature vectors generated per state is likewise defined for each state and is determined randomly based on the state transition probability.
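- The random generation step can be sketched as follows (hypothetical code; a single Gaussian per state is used for brevity, whereas the text allows Gaussian mixtures, and the frame count is drawn from the geometric duration implied by the state self-transition probability):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_features_for_state(mean, cov, self_loop_prob):
    """Sample the feature vectors emitted while in one HMM state: the
    number of frames follows a geometric distribution derived from the
    self-transition probability, and each frame is drawn from the
    state's output distribution (a single Gaussian here)."""
    n_frames = rng.geometric(1.0 - self_loop_prob)  # frames spent in the state
    return rng.multivariate_normal(mean, cov, size=n_frames)

mean = np.zeros(13)          # e.g. a 13-dimensional MFCC-like feature space
cov = np.eye(13) * 0.1
frames = sample_features_for_state(mean, cov, self_loop_prob=0.6)
print(frames.shape)          # (n_frames, 13)
```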
- The optimum word string search unit 64 reads the recognition dictionary, the language model, and the acoustic model stored in the recognition dictionary storage unit 105, the language model storage unit 106, and the acoustic model storage unit 107, respectively; sequentially receives the feature vector sequence output from the state/feature sequence conversion unit 63; and, using a search method such as the frame-synchronous beam search generally used in speech recognition systems, searches for and outputs the word sequence that best matches the feature vector sequence, that is, the virtual recognition result text data (generally, mixed kanji/kana sentences).
- The text data comparison unit 65 compares the virtual recognition result text data output from the optimum word string search unit 64 with the corresponding portion of the analyzed text data input to the virtual speech recognition processing unit 102, extracts the differing character strings as pairs, that is, pairs of a virtual correct string and a virtual recognition result string, counts the appearance frequency of each identical pair, and sends the result to the update processing unit 103 as the virtual recognition error case data illustrated in FIG. 5.
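- The comparison and counting performed by the text data comparison unit 65 can be sketched with a standard sequence alignment (hypothetical code; the example words are illustrative, not taken from FIG. 5):

```python
import difflib
from collections import Counter

def extract_error_pairs(correct_words, recognized_words):
    """Align the analyzed (virtual-correct) word sequence with the virtual
    recognition result and collect each differing span as a
    (correct string, recognized string) pair, counting identical pairs."""
    matcher = difflib.SequenceMatcher(a=correct_words, b=recognized_words)
    pairs = Counter()
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            pairs[(" ".join(correct_words[i1:i2]),
                   " ".join(recognized_words[j1:j2]))] += 1
    return pairs

correct = ["HTML", "の", "規格"]
result = ["えい", "てぃー", "える", "の", "規格"]
print(extract_error_pairs(correct, result))
# Counter({('HTML', 'えい てぃー える'): 1})
```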
- FIG. 4 is a flowchart showing the operation of the speech recognition dictionary creation support system according to the present embodiment.
- First, the text analysis unit 101 reads the background dictionary stored in the background dictionary storage unit 104 (step A1) and performs morphological analysis processing on the given text data (step A2).
- In the morphological analysis processing, the text data is divided into words, and each word is given a part-of-speech tag and a reading (a symbol string representing the pronunciation of the word) as necessary.
- As described above, the background dictionary holds a vocabulary several tens to several hundreds of times the size of the recognition dictionary, so readings and other information can be assigned to almost all of the given text data. Even when an unknown word not registered in the background dictionary appears, reading information can be assigned using, for example, the technology described in Patent Document 7.
- Next, the virtual speech recognition processing unit 102 reads the recognition dictionary, the language model, and the acoustic model stored in the recognition dictionary storage unit 105, the language model storage unit 106, and the acoustic model storage unit 107, respectively (see FIG. 4, steps A3 to A5), executes virtual speech recognition processing on the analyzed text output from the text analysis unit 101, and creates virtual recognition result text data (step A6).
- Furthermore, the virtual speech recognition processing unit 102 compares the analyzed text data with the corresponding virtual recognition result text data and extracts the differences, that is, the words and word strings from both text data that constitute virtual recognition error cases, to generate virtual recognition error case data (see FIG. 5) (step A7).
- Finally, the virtual speech recognition processing unit 102 sends the word-level and/or phrase-level character string pairs and their readings, together with their respective appearance frequencies, to the update processing unit 103.
- FIG. 5 shows an example of the virtual recognition error case data that the virtual speech recognition processing unit 102 sends to the update processing unit 103.
- The update processing unit 103 receives the virtual recognition error case data output by the virtual speech recognition processing unit 102, takes out the entries one by one in order, and, according to their contents, modifies the recognition dictionary and language model stored in the recognition dictionary storage unit 105 and the language model storage unit 106, for example as follows (steps A8 to A10).
- First, the first entry (HTML, E-Z sluggishness) is taken out; if the word "HTML" of the analyzed text, corresponding to the correct character string in speech recognition, does not exist in the recognition dictionary, the update processing unit 103 adds "HTML" to the recognition dictionary and sets a default value (an appropriately defined medium priority) as the priority of the word "HTML" in the language model.
- If "HTML" is already registered, the update processing unit 103 does not update the recognition dictionary but raises the priority of the word "HTML" in the language model by an appropriately predetermined value.
- It is not necessary to use all entries of the virtual recognition error case data; for example, it is also effective to arrange that entries with an extremely low appearance frequency are not used to change the recognition dictionary and language model. The entries to be reflected in the recognition dictionary and language model may be selected using the appearance frequency information and the like, and the process may be repeated until the portion of the virtual recognition result text data corresponding to recognition errors falls below a certain percentage.
- In the above, the recognition dictionary and language model are changed using the analyzed-text strings "HTML" and "terrestrial digital" corresponding to the correct character strings, but the recognition dictionary and language model may also be changed using the virtual recognition result text corresponding to the recognition errors. For example, for the entry (HTML, E-Z sluggishness), the update processing unit 103 may raise the priority of the word "HTML" in the language model while lowering the priorities of the two misrecognized words "Eiichi" and "Slowness".
- Alternatively, a process of deleting those words from the recognition dictionary may be performed.
- Furthermore, the amount of change may be controlled according to the appearance frequency: for entries with a high appearance frequency, the priority of the corresponding word or word string is changed significantly, and conversely, for entries with a low appearance frequency, only a small change is made.
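- Combining the rules above, the per-entry update can be sketched as follows (hypothetical code; the priority representation, default value, and step size are illustrative assumptions, not values from the patent):

```python
def update_from_error_case(correct_word, wrong_words, freq,
                           recog_dict, language_model,
                           default_prio=0.5, step=0.01):
    """Register the virtual-correct word if it is unknown, otherwise raise
    its language-model priority; optionally penalize the words that were
    output in its place. The change scales with the error-case frequency."""
    if correct_word not in recog_dict:
        recog_dict.add(correct_word)
        language_model[correct_word] = default_prio
    else:
        language_model[correct_word] = (
            language_model.get(correct_word, default_prio) + step * freq)
    for w in wrong_words:  # optional penalty on the misrecognized words
        if w in language_model:
            language_model[w] = max(0.0, language_model[w] - step * freq)

recog_dict = {"えい", "てぃー", "える"}
language_model = {"えい": 0.4, "てぃー": 0.3, "える": 0.3}
update_from_error_case("HTML", ["えい", "てぃー", "える"], freq=2,
                       recog_dict=recog_dict, language_model=language_model)
print(sorted(recog_dict), language_model)
```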
- In addition, it is preferable to provide, as appropriate, an interface for presenting the planned updates of the dictionary and language model to the system operator in advance and an interface for asking the system operator whether to apply them, so that inappropriate changes can be avoided when updating the dictionary and language model.
- the virtual speech recognition processing unit 102 illustrated in FIG. 3 may be configured more simply.
- For example, a configuration may be considered in which the state/feature sequence conversion unit 63 is removed and the phoneme/state sequence conversion unit 62 is directly connected to the optimum word string search unit 64.
- In this case, for each element of the HMM state sequence received from the phoneme/state sequence conversion unit 62, the optimum word string search unit 64 calculates the degree of similarity or the distance to all the states in the acoustic model and then seeks the optimal word sequence in accordance with the linguistic constraints defined by the recognition dictionary and the language model. The distance between states can be calculated using a distance measure between the probability distributions associated with the states, for example the Kullback-Leibler divergence.
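- For single-Gaussian state distributions, the Kullback-Leibler divergence between two states has the following closed form (hypothetical sketch; real acoustic models with Gaussian mixtures need an approximation instead):

```python
import numpy as np

def kl_gaussians(mu0, cov0, mu1, cov1):
    """Closed-form KL divergence KL(N0 || N1) between two multivariate
    Gaussians, one possible distance measure between HMM states."""
    k = mu0.shape[0]
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0)
                  + diff @ inv1 @ diff
                  - k
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

mu_a, cov_a = np.zeros(2), np.eye(2)
mu_b, cov_b = np.ones(2), np.eye(2) * 2.0
print(kl_gaussians(mu_a, cov_a, mu_b, cov_b))
```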
- In the search, limitation (pruning) of the search range, similar to that of the above-mentioned frame-synchronous beam search, may be applied as appropriate. The frame-synchronous beam search is organized around distance calculations between feature vectors and states, whereas in this variant, where the state/feature sequence conversion unit 63 is omitted, the search is organized around distance calculations between states; apart from this point, the principle is almost the same.
- The virtual speech recognition processing unit 102 illustrated in FIG. 3 can be simplified even further: a configuration is conceivable in which the phoneme/state sequence conversion unit 62 and the state/feature sequence conversion unit 63 are removed and the reading/phoneme string conversion unit 61 is directly connected to the optimum word string search unit 64.
- In this case, for each element of the phoneme string received from the reading/phoneme string conversion unit 61, the optimum word string search unit 64 calculates the degree of similarity or the distance to all phonemes in the acoustic model and finds the optimal word string according to the linguistic constraints defined by the recognition dictionary and language model. The distance between phonemes may be calculated as the sum of the distances between corresponding states.
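- The phoneme-level distance then reduces to a sum over state pairs (hypothetical sketch, assuming both phonemes have the same number of states):

```python
def phoneme_distance(states_p, states_q, state_dist):
    """Distance between two phonemes as the sum of distances between
    corresponding HMM states, as suggested in the text."""
    return sum(state_dist(a, b) for a, b in zip(states_p, states_q))

# Toy usage with scalar "states" and absolute difference as the distance.
print(phoneme_distance([0.0, 1.0, 2.0], [0.5, 1.5, 2.5],
                       lambda a, b: abs(a - b)))  # 1.5
```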
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200780006299XA CN101432801B (zh) | 2006-02-23 | 2007-02-02 | 语音识别词典制作支持系统、语音识别词典制作支持方法 |
US12/280,594 US8719021B2 (en) | 2006-02-23 | 2007-02-02 | Speech recognition dictionary compilation assisting system, speech recognition dictionary compilation assisting method and speech recognition dictionary compilation assisting program |
JP2008501662A JP5040909B2 (ja) | 2006-02-23 | 2007-02-02 | 音声認識辞書作成支援システム、音声認識辞書作成支援方法及び音声認識辞書作成支援用プログラム |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2006-046812 | 2006-02-23 | ||
JP2006046812 | 2006-02-23 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2007097176A1 true WO2007097176A1 (ja) | 2007-08-30 |
Family
ID=38437215
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2007/051778 WO2007097176A1 (ja) | 2006-02-23 | 2007-02-02 | 音声認識辞書作成支援システム、音声認識辞書作成支援方法及び音声認識辞書作成支援用プログラム |
Country Status (4)
Country | Link |
---|---|
US (1) | US8719021B2 (ja) |
JP (1) | JP5040909B2 (ja) |
CN (1) | CN101432801B (ja) |
WO (1) | WO2007097176A1 (ja) |
Families Citing this family (46)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8301446B2 (en) * | 2009-03-30 | 2012-10-30 | Adacel Systems, Inc. | System and method for training an acoustic model with reduced feature space variation |
JP5471106B2 (ja) * | 2009-07-16 | 2014-04-16 | 独立行政法人情報通信研究機構 | 音声翻訳システム、辞書サーバ装置、およびプログラム |
US9045098B2 (en) * | 2009-12-01 | 2015-06-02 | Honda Motor Co., Ltd. | Vocabulary dictionary recompile for in-vehicle audio system |
US20120330662A1 (en) * | 2010-01-29 | 2012-12-27 | Nec Corporation | Input supporting system, method and program |
US20130202270A1 (en) * | 2010-06-28 | 2013-08-08 | Nokia Corporation | Method and apparatus for accessing multimedia content having subtitle data |
US8484024B2 (en) * | 2011-02-24 | 2013-07-09 | Nuance Communications, Inc. | Phonetic features for speech recognition |
US11062615B1 (en) | 2011-03-01 | 2021-07-13 | Intelligibility Training LLC | Methods and systems for remote language learning in a pandemic-aware world |
US10019995B1 (en) | 2011-03-01 | 2018-07-10 | Alice J. Stiebel | Methods and systems for language learning based on a series of pitch patterns |
US8676580B2 (en) * | 2011-08-16 | 2014-03-18 | International Business Machines Corporation | Automatic speech and concept recognition |
WO2013085409A1 (ru) * | 2011-12-08 | 2013-06-13 | Общество С Ограниченной Ответственностью Базелевс-Инновации | Способ анимации sms-сообщений |
CN103165129B (zh) * | 2011-12-13 | 2015-07-01 | 北京百度网讯科技有限公司 | 一种优化语音识别声学模型的方法及系统 |
JP5787780B2 (ja) * | 2012-01-25 | 2015-09-30 | 株式会社東芝 | 書き起こし支援システムおよび書き起こし支援方法 |
JP6019604B2 (ja) * | 2012-02-14 | 2016-11-02 | 日本電気株式会社 | 音声認識装置、音声認識方法、及びプログラム |
US9489940B2 (en) * | 2012-06-11 | 2016-11-08 | Nvoq Incorporated | Apparatus and methods to update a language model in a speech recognition system |
CN103680498A (zh) * | 2012-09-26 | 2014-03-26 | 华为技术有限公司 | 一种语音识别方法和设备 |
US9035884B2 (en) * | 2012-10-17 | 2015-05-19 | Nuance Communications, Inc. | Subscription updates in multiple device language models |
US20140316783A1 (en) * | 2013-04-19 | 2014-10-23 | Eitan Asher Medina | Vocal keyword training from text |
US20180317019A1 (en) | 2013-05-23 | 2018-11-01 | Knowles Electronics, Llc | Acoustic activity detecting microphone |
TWI508057B (zh) * | 2013-07-15 | 2015-11-11 | Chunghwa Picture Tubes Ltd | 語音辨識系統以及方法 |
US20160004502A1 (en) * | 2013-07-16 | 2016-01-07 | Cloudcar, Inc. | System and method for correcting speech input |
JP2015060095A (ja) * | 2013-09-19 | 2015-03-30 | 株式会社東芝 | 音声翻訳装置、音声翻訳方法およびプログラム |
US9508345B1 (en) | 2013-09-24 | 2016-11-29 | Knowles Electronics, Llc | Continuous voice sensing |
CN103578465B (zh) * | 2013-10-18 | 2016-08-17 | 威盛电子股份有限公司 | 语音辨识方法及电子装置 |
US9953634B1 (en) | 2013-12-17 | 2018-04-24 | Knowles Electronics, Llc | Passive training for automatic speech recognition |
CN103903615B (zh) * | 2014-03-10 | 2018-11-09 | 联想(北京)有限公司 | 一种信息处理方法及电子设备 |
US9437188B1 (en) | 2014-03-28 | 2016-09-06 | Knowles Electronics, Llc | Buffered reprocessing for multi-microphone automatic speech recognition assist |
US10045140B2 (en) | 2015-01-07 | 2018-08-07 | Knowles Electronics, Llc | Utilizing digital microphones for low power keyword detection and noise suppression |
JP6516585B2 (ja) * | 2015-06-24 | 2019-05-22 | パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカPanasonic Intellectual Property Corporation of America | 制御装置、その方法及びプログラム |
US10152298B1 (en) * | 2015-06-29 | 2018-12-11 | Amazon Technologies, Inc. | Confidence estimation based on frequency |
CN106935239A (zh) * | 2015-12-29 | 2017-07-07 | 阿里巴巴集团控股有限公司 | 一种发音词典的构建方法及装置 |
JP6545633B2 (ja) * | 2016-03-17 | 2019-07-17 | 株式会社東芝 | 単語スコア計算装置、単語スコア計算方法及びプログラム |
CN105845139B (zh) * | 2016-05-20 | 2020-06-16 | 北方民族大学 | 一种离线语音控制方法和装置 |
CN106297797B (zh) * | 2016-07-26 | 2019-05-31 | 百度在线网络技术(北京)有限公司 | 语音识别结果纠错方法和装置 |
CN106710587A (zh) * | 2016-12-20 | 2017-05-24 | 广东东田数码科技有限公司 | 一种语音识别数据预处理方法 |
CN107015969A (zh) * | 2017-05-19 | 2017-08-04 | 四川长虹电器股份有限公司 | 可自我更新的语义理解系统与方法 |
KR102353486B1 (ko) * | 2017-07-18 | 2022-01-20 | 엘지전자 주식회사 | 이동 단말기 및 그 제어 방법 |
JP6790003B2 (ja) * | 2018-02-05 | 2020-11-25 | 株式会社東芝 | 編集支援装置、編集支援方法及びプログラム |
US10846319B2 (en) * | 2018-03-19 | 2020-11-24 | Adobe Inc. | Online dictionary extension of word vectors |
CN108831473B (zh) * | 2018-03-30 | 2021-08-17 | 联想(北京)有限公司 | 一种音频处理方法及装置 |
JP6910987B2 (ja) * | 2018-06-07 | 2021-07-28 | 株式会社東芝 | 認識装置、認識システム、端末装置、サーバ装置、方法及びプログラム |
US20220005462A1 (en) * | 2018-11-05 | 2022-01-06 | Systran International | Method and device for generating optimal language model using big data |
KR20200063521A (ko) * | 2018-11-28 | 2020-06-05 | 삼성전자주식회사 | 전자 장치 및 이의 제어 방법 |
CN110718226B (zh) * | 2019-09-19 | 2023-05-05 | 厦门快商通科技股份有限公司 | 语音识别结果处理方法、装置、电子设备及介质 |
CN111475611B (zh) * | 2020-03-02 | 2023-09-15 | 北京声智科技有限公司 | 词典管理方法、装置、计算机设备及存储介质 |
CN112037770B (zh) * | 2020-08-03 | 2023-12-29 | 北京捷通华声科技股份有限公司 | 发音词典的生成方法、单词语音识别的方法和装置 |
US11829720B2 (en) * | 2020-09-01 | 2023-11-28 | Apple Inc. | Analysis and validation of language models |
Family Cites Families (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5855000A (en) * | 1995-09-08 | 1998-12-29 | Carnegie Mellon University | Method and apparatus for correcting and repairing machine-transcribed input using independent or cross-modal secondary input |
US5864805A (en) * | 1996-12-20 | 1999-01-26 | International Business Machines Corporation | Method and apparatus for error correction in a continuous dictation system |
US5933804A (en) * | 1997-04-10 | 1999-08-03 | Microsoft Corporation | Extensible speech recognition system that provides a user with audio feedback |
KR100277694B1 (ko) * | 1998-11-11 | 2001-01-15 | 정선종 | 음성인식시스템에서의 발음사전 자동생성 방법 |
US6434521B1 (en) * | 1999-06-24 | 2002-08-13 | Speechworks International, Inc. | Automatically determining words for updating in a pronunciation dictionary in a speech recognition system |
US6622121B1 (en) * | 1999-08-20 | 2003-09-16 | International Business Machines Corporation | Testing speech recognition systems using test data generated by text-to-speech conversion |
JP3976959B2 (ja) | 1999-09-24 | 2007-09-19 | 三菱電機株式会社 | 音声認識装置、音声認識方法および音声認識プログラム記録媒体 |
JP2002014693A (ja) | 2000-06-30 | 2002-01-18 | Mitsubishi Electric Corp | 音声認識システム用辞書提供方法、および音声認識インタフェース |
US6856956B2 (en) * | 2000-07-20 | 2005-02-15 | Microsoft Corporation | Method and apparatus for generating and displaying N-best alternatives in a speech recognition system |
JP2002091477A (ja) | 2000-09-14 | 2002-03-27 | Mitsubishi Electric Corp | 音声認識システム、音声認識装置、音響モデル管理サーバ、言語モデル管理サーバ、音声認識方法及び音声認識プログラムを記録したコンピュータ読み取り可能な記録媒体 |
US6975985B2 (en) * | 2000-11-29 | 2005-12-13 | International Business Machines Corporation | Method and system for the automatic amendment of speech recognition vocabularies |
JP2003316376A (ja) | 2002-04-22 | 2003-11-07 | Toshiba Corp | 未知語登録装置および未知語登録方法 |
JP4217495B2 (ja) * | 2003-01-29 | 2009-02-04 | キヤノン株式会社 | 音声認識辞書作成方法、音声認識辞書作成装置及びプログラム、記録媒体 |
US7437296B2 (en) * | 2003-03-13 | 2008-10-14 | Matsushita Electric Industrial Co., Ltd. | Speech recognition dictionary creation apparatus and information search apparatus |
JP2004294542A (ja) | 2003-03-25 | 2004-10-21 | Mitsubishi Electric Corp | 音声認識装置及びそのプログラム |
US20040243412A1 (en) * | 2003-05-29 | 2004-12-02 | Gupta Sunil K. | Adaptation of speech models in speech recognition |
JP4515186B2 (ja) | 2003-09-02 | 2010-07-28 | 株式会社ジー・エフグループ | 音声辞書作成装置、音声辞書作成方法、及びプログラム |
US7266495B1 (en) * | 2003-09-12 | 2007-09-04 | Nuance Communications, Inc. | Method and system for learning linguistically valid word pronunciations from acoustic data |
US7783474B2 (en) * | 2004-02-27 | 2010-08-24 | Nuance Communications, Inc. | System and method for generating a phrase pronunciation |
US7392186B2 (en) * | 2004-03-30 | 2008-06-24 | Sony Corporation | System and method for effectively implementing an optimized language model for speech recognition |
JP2004265440A (ja) | 2004-04-28 | 2004-09-24 | A I Soft Inc | 未知語登録装置および方法並びに記録媒体 |
CN100524457C (zh) * | 2004-05-31 | 2009-08-05 | 国际商业机器公司 | 文本至语音转换以及调整语料库的装置和方法 |
US7684988B2 (en) * | 2004-10-15 | 2010-03-23 | Microsoft Corporation | Testing and tuning of automatic speech recognition systems using synthetic inputs generated from its acoustic models |
US7590536B2 (en) * | 2005-10-07 | 2009-09-15 | Nuance Communications, Inc. | Voice language model adjustment based on user affinity |
US7756708B2 (en) * | 2006-04-03 | 2010-07-13 | Google Inc. | Automatic language model update |
2007
- 2007-02-02 US US12/280,594 patent/US8719021B2/en active Active
- 2007-02-02 WO PCT/JP2007/051778 patent/WO2007097176A1/ja active Application Filing
- 2007-02-02 JP JP2008501662A patent/JP5040909B2/ja active Active
- 2007-02-02 CN CN200780006299XA patent/CN101432801B/zh active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002229585A (ja) * | 2001-01-31 | 2002-08-16 | Mitsubishi Electric Corp | 音声認識文章入力装置 |
JP2003108180A (ja) * | 2001-09-26 | 2003-04-11 | Seiko Epson Corp | 音声合成方法および音声合成装置 |
JP2003186494A (ja) * | 2001-12-17 | 2003-07-04 | Sony Corp | 音声認識装置および方法、記録媒体、並びにプログラム |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2009271465A (ja) * | 2008-05-12 | 2009-11-19 | Nippon Telegr & Teleph Corp <Ntt> | 単語追加装置、単語追加方法、そのプログラム |
JP2017167188A (ja) * | 2016-03-14 | 2017-09-21 | 株式会社東芝 | 情報処理装置、情報処理方法、プログラムおよび認識システム |
CN110032626A (zh) * | 2019-04-19 | 2019-07-19 | 百度在线网络技术(北京)有限公司 | 语音播报方法和装置 |
US20220215168A1 (en) * | 2019-05-30 | 2022-07-07 | Sony Group Corporation | Information processing device, information processing method, and program |
US11934779B2 (en) * | 2019-05-30 | 2024-03-19 | Sony Group Corporation | Information processing device, information processing method, and program |
JP7479249B2 (ja) | 2020-09-02 | 2024-05-08 | 株式会社日立ソリューションズ・テクノロジー | 未知語検出方法及び未知語検出装置 |
JP7481999B2 (ja) | 2020-11-05 | 2024-05-13 | 株式会社東芝 | 辞書編集装置、辞書編集方法及び辞書編集プログラム |
WO2023162513A1 (ja) * | 2022-02-28 | 2023-08-31 | 国立研究開発法人情報通信研究機構 | 言語モデル学習装置、対話装置及び学習済言語モデル |
Also Published As
Publication number | Publication date |
---|---|
US20090024392A1 (en) | 2009-01-22 |
CN101432801B (zh) | 2012-04-18 |
JPWO2007097176A1 (ja) | 2009-07-09 |
US8719021B2 (en) | 2014-05-06 |
JP5040909B2 (ja) | 2012-10-03 |
CN101432801A (zh) | 2009-05-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5040909B2 (ja) | 音声認識辞書作成支援システム、音声認識辞書作成支援方法及び音声認識辞書作成支援用プログラム | |
US6934683B2 (en) | Disambiguation language model | |
Schuster et al. | Japanese and korean voice search | |
Karpov et al. | Large vocabulary Russian speech recognition using syntactico-statistical language modeling | |
US7299178B2 (en) | Continuous speech recognition method and system using inter-word phonetic information | |
US5878390A (en) | Speech recognition apparatus equipped with means for removing erroneous candidate of speech recognition | |
Hirsimaki et al. | Importance of high-order n-gram models in morph-based speech recognition | |
CN107705787A (zh) | 一种语音识别方法及装置 | |
Illina et al. | Grapheme-to-phoneme conversion using conditional random fields | |
EP3948849A1 (en) | Phoneme-based contextualization for cross-lingual speech recognition in end-to-end models | |
Al-Anzi et al. | The impact of phonological rules on Arabic speech recognition | |
KR20230156125A (ko) | 룩업 테이블 순환 언어 모델 | |
Ablimit et al. | Lexicon optimization based on discriminative learning for automatic speech recognition of agglutinative language | |
Azim et al. | Large vocabulary Arabic continuous speech recognition using tied states acoustic models | |
Pellegrini et al. | Automatic word decompounding for asr in a morphologically rich language: Application to amharic | |
Demuynck et al. | The ESAT 2008 system for N-Best Dutch speech recognition benchmark | |
JP4764203B2 (ja) | 音声認識装置及び音声認識プログラム | |
JP4733436B2 (ja) | 単語・意味表現組データベースの作成方法、音声理解方法、単語・意味表現組データベース作成装置、音声理解装置、プログラムおよび記憶媒体 | |
Al-Anzi et al. | Performance evaluation of Sphinx and htk speech recognizers for spoken Arabic language | |
JP2006031278A (ja) | 音声検索システムおよび方法ならびにプログラム | |
JP4674609B2 (ja) | 情報処理装置および方法、プログラム、並びに記録媒体 | |
KR100511247B1 (ko) | 음성 인식 시스템의 언어 모델링 방법 | |
Hasegawa-Johnson et al. | Fast transcription of speech in low-resource languages | |
WO2014035437A1 (en) | Using character describer to efficiently input ambiguous characters for smart chinese speech dictation correction | |
JP2000075885A (ja) | 音声認識装置 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
| ENP | Entry into the national phase | Ref document number: 2008501662; Country of ref document: JP; Kind code of ref document: A |
| WWE | Wipo information: entry into national phase | Ref document number: 200780006299.X; Country of ref document: CN |
| WWE | Wipo information: entry into national phase | Ref document number: 12280594; Country of ref document: US |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 07713778; Country of ref document: EP; Kind code of ref document: A1 |