WO2011068170A1 - Search device, search method, and program - Google Patents
Search device, search method, and program
- Publication number
- WO2011068170A1 (PCT application PCT/JP2010/071605)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- search
- search result
- result
- word string
- string
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3343—Query execution using phonetics
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/187—Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
Definitions
- The present invention relates to a search device, a search method, and a program, and in particular to a search device, a search method, and a program that make it possible to robustly search for a word string corresponding to an input speech.
- In a conventional voice recognition device, voice recognition of the input speech is performed with the words (vocabulary) registered in advance in a dictionary as the targets of the voice recognition result, and the recognition result is output as a search result word string, that is, as the result of a search for the word string corresponding to the input speech.
- In such a device, the word strings that can be targets of the search for the word string corresponding to the input voice (hereinafter also referred to as search result target word strings) are limited to the targets of the voice recognition result. The user's utterances are therefore constrained by the vocabulary of words registered in the dictionary used for speech recognition.
- In recent years, a search method called voice search has been proposed.
- In voice search, continuous speech recognition is performed using a language model such as an N-gram, and a text search is then performed: the speech recognition result is matched against text registered in a DB (database) prepared separately from the dictionary used for speech recognition, and text corresponding to the speech recognition result is retrieved from that DB.
- The text that best matches the speech recognition result, or the texts within the top N matches, are output as the search result word string.
- In voice search, the text registered in the DB prepared separately from the dictionary used for speech recognition becomes the search result target word string. Therefore, by registering a large number of texts in the DB, a large number of texts can be used as search result target word strings.
- However, in conventional voice search, the matching between the speech recognition result and the text serving as the search result target word string is performed in units of words or in units of notation symbols, using notation symbols, which are symbols representing the respective written forms of the speech recognition result and of the search result target word string.
- Consequently, a search result target word string completely different from the word string corresponding to the input speech may match the speech recognition result, and as a result a search result target word string completely different from the word string corresponding to the input speech may be output as the search result word string.
- For example, in word-unit matching, the notation symbol string 「都市の世界」 ("toshi no sekai", "world of the city") of a speech recognition result is divided into single words, as in 都市/の/世界 (the slash (/) indicates a separator), and matching is performed.
- In notation-symbol-unit matching, the same notation symbol string 「都市の世界」 is divided into notation symbols one by one, as in 都/市/の/世/界, and matching is performed.
- On the other hand, for a speech recognition result with the same pronunciation but the different written form 「年の瀬かい」, word-unit matching divides it as 年/の/瀬/かい, and notation-symbol-unit matching divides it as 年/の/瀬/か/い, so the units obtained from the two written forms barely overlap even though the pronunciations are identical.
- As another example, suppose the user utters "tolkien" (the author of "The Lord of the Rings") as the input speech, and the notation symbol string of the speech recognition result is the single word "tolkien". In word-unit matching, the single word "tolkien", which is the notation symbol string of the speech recognition result, is used as it is, and in notation-symbol (alphabetic character) unit matching, the notation symbol string "tolkien" of the speech recognition result is divided into notation symbols one by one, as in t/o/l/k/i/e/n, and matching is performed.
- However, if the notation symbol string of the speech recognition result of the input speech "tolkien" is, for example, "toll keene", then in word-unit matching the notation symbol string "toll keene" of the speech recognition result is divided into single words, as in toll/keene, and in notation-symbol (alphabetic character) unit matching it is divided into single notation symbols, as in t/o/l/l/k/e/e/n/e, and matching is performed.
- Therefore, the search result target word string that matches the speech recognition result differs greatly depending on whether the notation symbol string of the speech recognition result of the input speech "tolkien" is "tolkien" or "toll keene".
- As a result, a search result target word string completely different from the word string corresponding to the input voice may be output as the search result word string, while the word string corresponding to the input voice is not output as the search result word string.
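- The effect of the unit choice can be made concrete with a small sketch (illustrative Python written for this description, not part of the patent): it computes a cosine similarity over word units and over notation-symbol (character) units, and shows how the score against the target "tolkien" collapses in word units, and drops in character units, once the recognizer writes the same utterance as "toll keene".

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two unit-count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def word_units(text: str) -> Counter:
    """Divide a notation symbol string into word units."""
    return Counter(text.lower().split())

def char_units(text: str) -> Counter:
    """Divide a notation symbol string into single notation symbols."""
    return Counter(c for c in text.lower() if c.isalpha())

target = "tolkien"  # search result target word string
for recognized in ("tolkien", "toll keene"):  # two written forms, one pronunciation
    print(recognized,
          "word-unit:", round(cosine(word_units(recognized), word_units(target)), 2),
          "char-unit:", round(cosine(char_units(recognized), char_units(target)), 2))
# tolkien     word-unit: 1.0  char-unit: 1.0
# toll keene  word-unit: 0.0  char-unit: ~0.83
```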
- The present invention has been made in view of such a situation, and makes it possible to robustly search for a word string corresponding to an input voice and to obtain the word string corresponding to the input voice as the search result word string.
- The search device or program according to one aspect of the present invention includes: a speech recognition unit that recognizes input speech; a matching unit that, for each of a plurality of search result target word strings, which are word strings that can be targets of the result of a search for the word string corresponding to the input speech, performs matching between a search result target pronunciation symbol string, which is a sequence of pronunciation symbols representing the pronunciation of the search result target word string, and a recognition result pronunciation symbol string, which is a sequence of pronunciation symbols representing the pronunciation of the speech recognition result of the input speech; and an output unit that outputs, from among the plurality of search result target word strings and based on the matching results between the search result target pronunciation symbol strings and the recognition result pronunciation symbol string, a search result word string that is the result of the search for the word string corresponding to the input speech. Alternatively, it is a program for causing a computer to function as such a search device.
- The search method according to one aspect of the present invention is a search method in which a search device that searches for a word string corresponding to an input voice performs the steps of: recognizing the input voice; performing, for each of a plurality of search result target word strings to be searched for the word string corresponding to the input voice, matching between a search result target pronunciation symbol string, which is a sequence of pronunciation symbols representing the pronunciation of the search result target word string, and a recognition result pronunciation symbol string, which is a sequence of pronunciation symbols representing the pronunciation of the speech recognition result of the input voice; and outputting, from among the plurality of search result target word strings and based on the matching results, a search result word string that is the result of the search for the word string corresponding to the input voice.
- In one aspect as described above, the input speech is recognized, and for each of the plurality of search result target word strings to be searched for the word string corresponding to the input voice, matching is performed between the search result target pronunciation symbol string, which is a sequence of pronunciation symbols representing the pronunciation of the search result target word string, and the recognition result pronunciation symbol string, which is a sequence of pronunciation symbols representing the pronunciation of the speech recognition result of the input speech. Then, based on the matching results between the search result target pronunciation symbol strings and the recognition result pronunciation symbol string, a search result word string, which is the result of the search for the word string corresponding to the input speech, is output from among the plurality of search result target word strings.
- The search device may be an independent device, or may be an internal block constituting a single device.
- the program can be provided by being transmitted through a transmission medium or by being recorded on a recording medium.
- The drawings include:
- a figure showing processing in the case where matching between a speech recognition result and a search result target word string is performed in units of words, using the notation symbols of the speech recognition result and of the search result target word string;
- figures showing the case where such matching is performed in units of words and in units of one or more notation symbols, and a figure illustrating a relationship between sizes;
- a block diagram illustrating a configuration example of the voice recognition unit 51;
- a figure showing an example of metadata of programs stored as search result target word strings;
- a block diagram illustrating a configuration example of the total score calculation unit 91;
- figures illustrating processing of the voice search device 50;
- a block diagram showing a configuration example of the part that calculates the inner product V_UTR * V_TITLE(i), and a figure explaining a method of calculating the inner product V_UTR * V_TITLE(i) using a reverse lookup (inverted) index;
- a flowchart for explaining processing of the voice search device 50; and
- a block diagram showing a configuration example of an embodiment of a computer to which the present invention is applied.
- In the voice search device according to the present invention, therefore, the matching between the speech recognition result and the search result target word string is performed using pronunciation symbols, which are symbols representing the pronunciations of the speech recognition result and of the search result target word string.
- By performing matching in pronunciation symbol units, the search for the word string corresponding to the input voice can be performed robustly, and the word string corresponding to the input voice can be prevented from failing to be output as the search result word string.
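- As a minimal illustration (Python written for this description; the phoneme dictionary below is a hypothetical stand-in for the pronunciation symbol conversion performed by the device, and the phoneme strings are assumptions, not taken from the patent), the written forms "tolkien" and "toll keene" map to the same pronunciation symbol string, so matching in pronunciation symbol units succeeds where matching in notation symbol units failed:

```python
# Hypothetical pronunciation dictionary standing in for a grapheme-to-phoneme
# converter; the phoneme sequences are illustrative assumptions.
PRONUNCIATIONS = {
    "tolkien": "T OW L K IY N",
    "toll": "T OW L",
    "keene": "K IY N",
}

def to_pronunciation_symbols(text: str) -> list:
    """Convert a word string into a sequence of pronunciation symbols."""
    symbols = []
    for word in text.lower().split():
        symbols.extend(PRONUNCIATIONS[word].split())
    return symbols

recognition = to_pronunciation_symbols("toll keene")  # recognition result
target = to_pronunciation_symbols("tolkien")          # search result target word string
print(recognition == target)  # True: the pronunciation symbol strings agree
```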
- In the matching, for example, the cosine distance of the vector space method is used as the degree of similarity between the speech recognition result and the search result target word string.
- The cosine distance is obtained by dividing the inner product of a vector X representing the speech recognition result and a vector Y representing the search result target word string by the product of their magnitudes (norms) |X| and |Y|.
- When the cosine distance is adopted as the similarity, a search result target word string that contains the same word string as the speech recognition result but is longer than it tends to have a lower similarity, while a shorter search result target word string tends to have a higher similarity.
- Consequently, if the search result target word strings within the top N ranks of similarity obtained as a result of matching are output as the search result word strings, a search result target word string that contains the same word string as the speech recognition result but is long has a low similarity and is often not output as the search result word string, so that the search accuracy for the word string corresponding to the input speech is degraded.
- Therefore, in the matching, a corrected distance obtained by correcting the cosine distance so as to reduce the influence of the difference in length between the speech recognition result and the search result target word string can be adopted as the similarity between the speech recognition result and the search result target word string.
- As methods of obtaining the corrected distance, there are a method that uses, in place of the magnitude |Y| of the vector Y representing the search result target word string used when obtaining the cosine distance, a value that is not proportional to that magnitude, and a method that does not use the magnitude |Y| at all.
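- The following sketch (illustrative Python; word units stand in for pronunciation symbols for readability) computes the cosine distance and one possible corrected distance. The specific correction shown, substituting sqrt(|X| * |Y|) for |Y|, is an assumption chosen as one example of a value that is not proportional to the length of the search result target word string:

```python
import math
from collections import Counter

def vec(symbols):
    """Vector whose components count the occurrences of each symbol."""
    return Counter(symbols)

def dot(x, y):
    return sum(x[k] * y[k] for k in x)

def norm(x):
    return math.sqrt(sum(v * v for v in x.values()))

def cosine_distance(x, y):
    return dot(x, y) / (norm(x) * norm(y))

def corrected_distance(x, y):
    # |Y| is replaced with sqrt(|X| * |Y|), a value not proportional to the
    # length of the search result target word string (an assumed example).
    return dot(x, y) / (norm(x) * math.sqrt(norm(x) * norm(y)))

utterance = vec("world heritage".split())
for title in ("world heritage", "world heritage one hundred selected cities"):
    y = vec(title.split())
    print(title,
          "cosine:", round(cosine_distance(utterance, y), 3),
          "corrected:", round(corrected_distance(utterance, y), 3))
# The long title is penalized less by the corrected distance (about 0.76 vs 0.58).
```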
- Meanwhile, the texts serving as search result target word strings may number in the hundreds of thousands or more, and in order to output the search result word string quickly in response to the user's utterance (input voice), the matching must be performed at high speed.
- In the present invention, matching is therefore performed at high speed by using a reverse lookup (inverted) index or the like.
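- A minimal sketch of this idea (illustrative Python; the patent's figures mention calculating the inner product V_UTR * V_TITLE(i) using a reverse lookup index): an index from each symbol to the titles containing it lets the inner products against all titles be accumulated while touching only the titles that share at least one symbol with the utterance. Word units again stand in for pronunciation symbols:

```python
from collections import Counter, defaultdict

titles = ["world heritage city", "heritage of the world", "dentist of the sea"]

# Reverse lookup (inverted) index: symbol -> list of (title id, count).
index = defaultdict(list)
for i, title in enumerate(titles):
    for symbol, count in Counter(title.split()).items():
        index[symbol].append((i, count))

def inner_products(utterance: str) -> dict:
    """Accumulate V_UTR . V_TITLE(i) for every title sharing a symbol with
    the utterance, without scanning the titles that share none."""
    scores = defaultdict(int)
    for symbol, count in Counter(utterance.split()).items():
        for i, title_count in index.get(symbol, ()):
            scores[i] += count * title_count
    return dict(scores)

print(inner_products("world heritage"))  # {0: 2, 1: 2}; title 2 is never touched
```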
- In speech recognition, an acoustic score representing the acoustic likelihood of a recognition hypothesis, which is a candidate for the speech recognition result, is obtained using an acoustic model such as an HMM (Hidden Markov Model), and a language score representing the linguistic likelihood of the recognition hypothesis is obtained using a language model such as an N-gram; the speech recognition result (recognition hypothesis) is determined taking both the acoustic score and the language score into account.
- A general-purpose language model used in the voice recognition for voice search is generated using, for example, word strings appearing in newspaper text.
- However, the user may wish to obtain, as the search result word string, a search result target word string that contains word strings appearing only rarely (or not at all) in newspaper text (hereinafter, low-frequency word strings).
- For an utterance containing such a low-frequency word string, the language score obtained in voice recognition may be low, and a correct voice recognition result may not be obtained.
- In that case, even in the matching performed after speech recognition in voice search, the search result target word string corresponding to the input speech (the search result target word string appropriate for the input speech) does not match the erroneous speech recognition result, and the search result target word string corresponding to the input voice may not be output as the search result word string.
- Consider, for example, a case where a program whose title is uttered by the user is searched for by voice search from an EPG (Electronic Program Guide), and the program is reserved for recording.
- In this voice search, first, voice recognition of the program title spoken by the user is performed.
- Here, it is possible to use the plurality of search result target word strings, that is, the word strings to be matched against the voice recognition result in the voice search, to generate a so-called dedicated language model, and to perform speech recognition using that dedicated language model, thereby improving the accuracy of speech recognition.
- When the search result target word strings are the constituent elements of the EPG (program titles, performer names, and the like), a dedicated language model is generated using those constituent elements, and the search result target word strings can be said to be classified into fields such as program title and performer name.
- If a language model is generated for each field using the word strings of that field, the language models of the respective fields are interpolated into one language model, and speech recognition is performed using that single language model, then the language score of a recognition hypothesis in which word strings (or parts of word strings) from different fields are concatenated may become high.
- However, a word string in which, for example, part of the title of program A and part of the name of a performer of program B are concatenated does not exist among the constituent elements of the EPG serving as the search result target word strings, so it is not preferable for such a word string to become a recognition hypothesis with a high language score that can become the speech recognition result.
- Therefore, in the present invention, a language model (field-dependent language model) is generated for each field using the search result target word strings of that field, and speech recognition is performed using the language model of each field.
- Then, the search result target word string that matches the voice recognition result is output as the search result word string.
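- A toy sketch of field-dependent language models (illustrative Python; real recognizers use smoothed N-gram models inside the decoder, and the training word strings below are invented): each field's search result target word strings train a bigram model, and a hypothesis that concatenates fragments from different fields scores poorly under every per-field model:

```python
from collections import Counter

def train_bigram(word_strings):
    """Train a toy (unsmoothed) bigram language model on one field's
    search result target word strings and return a scoring function."""
    bigrams, unigrams = Counter(), Counter()
    for s in word_strings:
        words = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
    def score(hypothesis, floor=1e-6):
        words = ["<s>"] + hypothesis.split() + ["</s>"]
        p = 1.0
        for a, b in zip(words, words[1:]):
            if bigrams[(a, b)]:
                p *= bigrams[(a, b)] / unigrams[a]
            else:
                p *= floor  # unseen event: crude floor instead of smoothing
        return p
    return score

title_lm = train_bigram(["world heritage city", "world heritage of the sea"])
performer_lm = train_bigram(["john smith", "jane doe"])

# A within-field hypothesis scores well under its own field's model, while a
# hypothesis mixing a title fragment with a performer fragment scores poorly
# under both per-field models.
for hyp in ("world heritage city", "world heritage jane"):
    print(hyp, max(title_lm(hyp), performer_lm(hyp)))
```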
- Note that a program unrelated to the program whose title the user uttered, that is, a program whose title is not similar to the uttered title but which contains, in its detailed information or the like, a word string similar to (or matching) a word string contained in the uttered title, may be obtained as a result of the voice search.
- When the search result target word strings are classified into a plurality of fields, the matching with the speech recognition result can be performed only on the search result target word strings of a predetermined field, such as a field desired by the user.
- In this case, the user can perform a flexible search, such as searching only for programs that contain a certain word string in the title, or searching only by performer name, so that a situation in which the intended program cannot be found by voice search can be avoided.
- A recorder to which voice search is applied has, for example, a program search function for searching, by voice search on the user's utterance, for programs that contain the utterance in their titles or the like.
- Such a recorder also has a voice control function in which, in accordance with the user's utterance "selection", the recorder selects one of the one or more programs found by the program search function as the program to be played.
- The voice control function of selecting a program in accordance with the utterance "selection" can be realized by making "selection" a target of the voice recognition result in the voice recognition for voice search, and by having the recorder interpret "selection", when obtained as a voice recognition result, as a command for controlling the recorder.
- In this way, by uttering "selection", the user can cause the recorder to select one program to be played from among the programs obtained by the program search function.
- However, in that case, when searching for a program by the program search function using voice search, the user cannot utter "selection" itself, since it matches the command "selection" for controlling the recorder: the recorder interprets the user's utterance "selection" as a command, and does not search for programs that contain "selection" in their titles or the like.
- FIG. 1 is a block diagram showing a first configuration example of an embodiment of a voice search device to which the present invention is applied.
- the speech search apparatus includes a speech recognition unit 11, a phonetic symbol conversion unit 12, a search result target storage unit 13, a morpheme analysis unit 14, a phonetic symbol conversion unit 15, a matching unit 16, and an output unit 17.
- the voice recognition unit 11 is supplied with input voice (data) which is a user's utterance from a microphone or the like (not shown).
- the speech recognition unit 11 recognizes the input speech supplied thereto, and supplies the speech recognition result (for example, a notation symbol) to the pronunciation symbol conversion unit 12.
- The pronunciation symbol conversion unit 12 converts the speech recognition result (for example, notation symbols) of the input speech supplied from the speech recognition unit 11 into a recognition result pronunciation symbol string, which is a sequence of pronunciation symbols representing the pronunciation of the speech recognition result, and supplies it to the matching unit 16.
- The search result target storage unit 13 stores a plurality of search result target word strings, that is, word strings (for example, text as notation symbols) that can become the search result word string, which is the result of the search for the word string corresponding to the input voice, through the matching with the voice recognition result in the matching unit 16.
- The morpheme analysis unit 14 divides each search result target word string stored in the search result target storage unit 13 into units of words (morphemes), for example, by performing morphological analysis, and supplies the result to the pronunciation symbol conversion unit 15.
- The pronunciation symbol conversion unit 15 converts the search result target word string (for example, notation symbols) supplied from the morpheme analysis unit 14 into a search result target pronunciation symbol string, which is a sequence of pronunciation symbols representing the pronunciation of the search result target word string, and supplies it to the matching unit 16.
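- These two units can be illustrated with a small stand-in (illustrative Python; a real system would use a morphological analyzer such as MeCab with a dictionary that supplies readings, and the lexicon below is hypothetical): a longest-match segmenter divides a notation symbol string into morpheme-like units, and a reading lookup then converts them into a pronunciation symbol string:

```python
# Hypothetical lexicon: surface form -> reading in pronunciation symbols
# (romanized here for readability).
LEXICON = {
    "世界": "sekai",   # world
    "遺産": "isan",    # heritage
    "都市": "toshi",   # city
    "の": "no",
}

def analyze(text: str) -> list:
    """Greedy longest-match segmentation into morpheme-like units."""
    morphemes, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):  # try longest candidate first
            candidate = text[i:i + length]
            if candidate in LEXICON:
                morphemes.append(candidate)
                i += length
                break
        else:
            morphemes.append(text[i])  # unknown character kept as its own unit
            i += 1
    return morphemes

def to_pronunciation_symbol_string(text: str) -> str:
    return " ".join(LEXICON.get(m, m) for m in analyze(text))

print(analyze("世界遺産の都市"))                        # ['世界', '遺産', 'の', '都市']
print(to_pronunciation_symbol_string("世界遺産の都市"))  # sekai isan no toshi
```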
- the matching unit 16 performs matching between the recognition result pronunciation symbol string from the pronunciation symbol conversion unit 12 and the search result target pronunciation symbol string from the pronunciation symbol conversion unit 15, and supplies the matching result to the output unit 17.
- That is, the matching unit 16 performs the matching between each of the search result target word strings stored in the search result target storage unit 13 and the speech recognition result of the input speech, using the pronunciation symbols of the speech recognition result and the pronunciation symbols of the search result target word string, and supplies the matching results to the output unit 17.
- Based on the matching results from the matching unit 16, the output unit 17 outputs the search result word string, which is the result of the search for the word string corresponding to the input speech, from among the search result target word strings stored in the search result target storage unit 13.
- In the voice search device of FIG. 1, the following voice search process is performed in accordance with the user's utterance.
- That is, when the user speaks, the speech recognition unit 11 recognizes the input speech and supplies the speech recognition result of the input speech to the pronunciation symbol conversion unit 12.
- The pronunciation symbol conversion unit 12 converts the voice recognition result of the input voice from the voice recognition unit 11 into a recognition result pronunciation symbol string and supplies it to the matching unit 16.
- Meanwhile, the morpheme analysis unit 14 performs morphological analysis on all the search result target word strings stored in the search result target storage unit 13 and supplies the results to the pronunciation symbol conversion unit 15.
- The pronunciation symbol conversion unit 15 converts the search result target word strings from the morpheme analysis unit 14 into search result target pronunciation symbol strings and supplies them to the matching unit 16.
- For each of all the search result target word strings stored in the search result target storage unit 13, the matching unit 16 performs matching with the speech recognition result of the input speech using the recognition result pronunciation symbol string from the pronunciation symbol conversion unit 12 and the search result target pronunciation symbol string from the pronunciation symbol conversion unit 15, and supplies the matching results to the output unit 17.
- Based on the matching results from the matching unit 16, the output unit 17 selects and outputs, from among the search result target word strings stored in the search result target storage unit 13, the search result word string that is the result of the search for the word string corresponding to the input speech.
- Simply by speaking, the user can thus obtain, from among the search result target word strings stored in the search result target storage unit 13, a search result target word string that matches the utterance, as the search result word string.
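- The flow described above can be summarized in a self-contained sketch (illustrative Python; the recognizer is stubbed out as a given recognition result, and the pronunciation dictionary is a hypothetical stand-in for the pronunciation symbol conversion units 12 and 15): each word string is converted into pronunciation symbol counts, matched by cosine similarity, and the top-N targets are output as search result word strings:

```python
import math
from collections import Counter

PRONUNCIATIONS = {"tolkien": "T OW L K IY N", "toll": "T OW L",
                  "keene": "K IY N", "token": "T OW K AH N"}  # hypothetical

def pronounce(text: str) -> Counter:
    """Pronunciation symbol counts of a word string (units 12 and 15)."""
    symbols = []
    for w in text.lower().split():
        symbols.extend(PRONUNCIATIONS.get(w, "").split())
    return Counter(symbols)

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    n = math.sqrt(sum(v * v for v in a.values())) \
      * math.sqrt(sum(v * v for v in b.values()))
    return dot / n if n else 0.0

def voice_search(recognition_result: str, targets, top_n: int = 1):
    """Matching unit 16 and output unit 17: rank every search result target
    word string against the recognition result and output the top N."""
    query = pronounce(recognition_result)
    ranked = sorted(targets, key=lambda t: cosine(query, pronounce(t)),
                    reverse=True)
    return ranked[:top_n]

# "toll keene", a misrecognition of the utterance "tolkien", still retrieves
# the intended target because the matching is done in pronunciation units.
print(voice_search("toll keene", ["tolkien", "token"]))  # ['tolkien']
```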
- FIG. 2 is a block diagram showing a second configuration example of an embodiment of a voice search device to which the present invention is applied.
- The voice search device of FIG. 2 is common to the case of FIG. 1 in that it has the voice recognition unit 11, the search result target storage unit 13, the morpheme analysis unit 14, the matching unit 16, and the output unit 17, and differs from the case of FIG. 1 in that a single pronunciation symbol conversion unit 21 is provided in place of the pronunciation symbol conversion units 12 and 15.
- The pronunciation symbol conversion unit 21 converts the speech recognition result of the input speech supplied from the speech recognition unit 11 into a recognition result pronunciation symbol string and supplies it to the matching unit 16, and also converts the search result target word string from the morpheme analysis unit 14 into a search result target pronunciation symbol string and supplies it to the matching unit 16.
- That is, in FIG. 1, the conversion of the speech recognition result into the recognition result pronunciation symbol string and the conversion of the search result target word string into the search result target pronunciation symbol string are performed by the separate pronunciation symbol conversion units 12 and 15, whereas in FIG. 2 both conversions are performed by the single pronunciation symbol conversion unit 21.
- Except for this point, the same voice search process as in FIG. 1 is performed in the voice search device of FIG. 2.
- FIG. 3 is a block diagram showing a third configuration example of an embodiment of a voice search device to which the present invention is applied.
- The voice search device of FIG. 3 is common to the case of FIG. 1 in that it has the voice recognition unit 11, the pronunciation symbol conversion unit 12, the matching unit 16, and the output unit 17, and differs from the case of FIG. 1 in that a search result target storage unit 31 is provided in place of the search result target storage unit 13, the morpheme analysis unit 14, and the pronunciation symbol conversion unit 15.
- The search result target storage unit 31 stores, in addition to the same search result target word strings (for example, notation symbols) stored in the search result target storage unit 13, the search result target pronunciation symbol strings obtained by converting those search result target word strings into pronunciation symbols.
- In the voice search device of FIG. 3, since the search result target pronunciation symbol strings used for matching in the matching unit 16 are stored in the search result target storage unit 31 in advance, the same voice search process as in FIG. 1 is performed except that the morphological analysis of the search result target word strings and their conversion into search result target pronunciation symbol strings are not performed.
- FIG. 4 is a block diagram showing a fourth configuration example of an embodiment of a voice search device to which the present invention is applied.
- The voice search device of FIG. 4 is common to the case of FIG. 3 in that it has the matching unit 16, the output unit 17, and the search result target storage unit 31, and differs from the case of FIG. 3 in that a voice recognition unit 41 is provided in place of the voice recognition unit 11 and the pronunciation symbol conversion unit 12.
- the voice recognition unit 41 recognizes the input voice and supplies a recognition result pronunciation symbol string as a voice recognition result of the input voice to the matching unit 16.
- That is, the voice recognition unit 41 corresponds to a combination of the voice recognition unit 11 and the pronunciation symbol conversion unit 12 described above.
- In the voice search device of FIG. 4, the same voice search process as in FIG. 3 is performed, except that the voice recognition unit 41 outputs, as the voice recognition result, a recognition result pronunciation symbol string instead of notation symbols.
- The voice search devices of FIGS. 1 to 4 can be applied to an information processing system, for example, a recorder that records and reproduces programs (here, a system means a logical collection of a plurality of devices, regardless of whether the constituent devices are in the same housing).
- A recorder as an information processing system to which the voice search device of FIGS. 1 to 4 is applied (hereinafter also referred to as an information processing system with a voice search function) can, for example, search for a program desired by the user from among recorded programs by voice search and reproduce it.
- That is, when the user utters, for example, the input voice "world heritage" as a keyword for searching for a program to be played, the recorder performs a voice search using the titles of the recorded programs as the search result target word strings, and searches the recorded programs for programs whose titles are similar in pronunciation to the input voice "world heritage".
- When N programs, for example about five, are displayed as reproduction candidate programs as a result of the voice search, there is a method in which the user selects one program from among them by operating a remote commander that remotely controls the recorder.
- The user can also select one program from among the N reproduction candidate programs by voice.
- For example, if the title of the second reproduction candidate program is "World Heritage / Great Wall", the user can select that reproduction candidate program by uttering "second", which is the rank of the reproduction candidate program, or its title "World Heritage / Great Wall".
- In the recorder as an information processing system with a voice search function, a program desired by the user can also be searched for by voice search from the programs in the EPG, and a recording reservation (or viewing reservation) can be made.
- That is, when the user utters, for example, the input voice "world heritage", the recorder performs a voice search using the program titles and other constituent elements of the EPG as the search result target word strings, and searches the EPG for programs whose titles or the like are similar in pronunciation to the input voice "world heritage".
- Programs whose titles or the like are similar in pronunciation to the input voice "world heritage" are then displayed as recording candidate programs, which are candidates for the recording reservation.
- When the user selects a program to be reserved from among the recording candidate programs, the recording reservation of that program is made in the recorder, and recording is performed in accordance with the recording reservation.
- As a method for the user to select the program to be reserved from among the recording candidate programs, a method similar to that for selecting one program from among the N reproduction candidate programs in the reproduction of recorded programs can be adopted.
- The voice search devices of FIGS. 1 to 4 can also be applied, in addition to the recorder described above, to a system for searching for and purchasing programs (video content) through a video-on-demand site connected via a network, a system for searching for and purchasing games through a game software sales site connected via a network, and the like.
- Here, various word strings can be adopted as the search result target word strings.
- For example, when searching for a program, the program title, performer names, detailed information explaining the content of the program, metadata of the program, subtitles (closed captions) superimposed on the program image, or the like (part or all of them) can be used as search result target word strings.
- When searching for a piece of music, its title, lyrics, artist name, and the like (part or all of them) can be adopted as search result target word strings.
- FIG. 5 is a diagram for explaining the process of reproducing a recorded program in the recorder as an information processing system with a voice search function.
- When a program desired by the user is to be searched for by voice search from the recorded programs and reproduced, the user utters, as a keyword for searching for the program to be reproduced, for example, a Japanese input voice meaning "city of world heritage" or the English input voice "World Heritage City".
- In the recorder, a voice search is performed using the titles of the recorded programs as the search result target word strings, and programs whose titles are similar in pronunciation to the input voice "city of world heritage" or "World Heritage City" are searched for from among the recorded programs.
- Then, in the recorder, the top N programs (their titles and the like) whose titles are similar in pronunciation to the input voice are displayed as reproduction candidate programs, which are candidates for the program to be reproduced.
- The user can request, by utterance, that the next N programs after the currently displayed top N programs be displayed as reproduction candidate programs, or that a voice search be performed with another keyword.
- The user can then select the desired program from among the reproduction candidate programs by operating the touch panel, operating the remote commander, selecting by voice, or the like.
- When the user selects the desired program from among the reproduction candidate programs, the recorder as an information processing system with a voice search function reproduces that program.
- FIG. 6 is a diagram for explaining methods by which the user selects a desired program from among the N reproduction candidate programs.
- When the N reproduction candidate programs are displayed on a touch panel, the user can select the desired program by touching the portion where the desired program (for example, its title) is displayed.
- When the N reproduction candidate programs are displayed together with a cursor that focuses on one reproduction candidate program at a time and can be moved by a remote commander, the user can select the desired program by operating the remote commander to move the cursor so that the desired program is focused, and then confirming the selection of the focused program.
- When the N reproduction candidate programs are displayed with numbers indicating their ranks, and the remote commander is provided with number buttons for designating numbers, the user can select the desired program by operating the number button that designates the number attached to the desired program.
- The user can also select the desired program by uttering its title from among the N reproduction candidate programs, or, when the reproduction candidate programs are displayed with numbers indicating their ranks, by uttering the number attached to the desired program.
- FIG. 7 is a diagram for explaining another process of the recorder as an information processing system with a voice search function.
- In FIG. 5, a plurality of reproduction candidate programs, for example five, are displayed as the result of the voice search over the recorded programs, whereas in FIG. 7 only one reproduction candidate program is displayed.
- That is, in the recorder, a voice search is performed using the titles of the recorded programs as the search result target word strings, programs whose titles are similar in pronunciation to the input voice are searched for from among the recorded programs, and the one program whose title is most similar in pronunciation to the input voice (the top-ranked program, its title and the like) is displayed as the reproduction candidate program.
- The user can then designate whether to select (accept) the one reproduction candidate program obtained as a result of the voice search as the program to be reproduced, or to have another program displayed again as the reproduction candidate program.
- For example, when the remote commander that remotely controls the recorder is provided with an accept button for designating acceptance and an another-program button for designating that another program be displayed again as the reproduction candidate program, the user can make this designation by operating the accept button or the another-program button.
- The user can also make the designation by voice, uttering, for example, "OK" as a voice designating acceptance, or "different" as a voice designating that another program be displayed again as the reproduction candidate program.
- When the user accepts the reproduction candidate program, the recorder as an information processing system with a voice search function reproduces that reproduction candidate program.
- FIG. 8 is a diagram for explaining processing performed by various devices as information processing systems with a voice search function.
- FIG. 8A is a diagram for explaining the process of making a recording reservation in the recorder as an information processing system with a voice search function.
- When the user utters an input voice, the recorder performs a voice search using the program titles and other constituent elements of the EPG as the search result target word strings, and searches the EPG for programs whose titles or the like are similar in pronunciation to the input voice.
- Then, the top N programs (their titles and the like) whose titles are similar in pronunciation to the input voice are displayed as recording candidate programs, which are candidates for the recording reservation.
- When the user selects a program to be reserved from among the recording candidate programs, the recording reservation of that program is made in the recorder, and recording is performed in accordance with the recording reservation.
- FIG. 8B is a diagram for explaining the process of purchasing a program in a program purchase system, as an information processing system with a voice search function, for purchasing programs (video content).
- When the user utters an input voice, the program purchase system accesses a video-on-demand site that sells programs via a network such as the Internet, and performs a voice search (video-on-demand search) using the titles of the programs sold by the video-on-demand site as the search result target word strings, thereby searching for programs whose titles are similar in pronunciation to the input voice.
- Then, the top N programs (their titles and the like) whose titles are similar in pronunciation to the input voice are displayed as purchase candidate programs, which are candidates for purchase.
- When the user selects a program to be purchased from among the purchase candidate programs, the program purchase system performs the purchase processing of that program, that is, downloading of the program from the video-on-demand site and billing processing for paying for the program.
- FIG. 8C is a diagram for explaining the process of purchasing a piece of music in a music purchase system, as an information processing system with a voice search function, for purchasing music.
- When the user utters an input voice, the music purchase system accesses a music sales site that sells music via a network such as the Internet, and performs a voice search using the titles (song names) of the music sold by the site as the search result target word strings, thereby searching for pieces of music whose titles are similar in pronunciation to the input voice.
- Then, the top N pieces of music (their titles and the like) whose titles are similar in pronunciation to the input voice are displayed as purchase candidate pieces of music, which are candidates for purchase.
- When the user selects a piece of music to be purchased from among the purchase candidates, the music purchase system performs the purchase processing of that piece of music.
- FIG. 8D is a diagram for explaining the process of reproducing music recorded on a recording medium in a music reproduction system, as an information processing system with a voice search function, for reproducing music.
- When the user utters an input voice, the music reproduction system performs a voice search using the titles (song names) and the like of the music recorded on the recording medium as the search result target word strings, and the top N pieces of music (their titles and the like) whose titles are similar in pronunciation to the input voice are displayed as reproduction candidate pieces of music, which are candidates for reproduction.
- When the user selects a piece of music to be reproduced from among the reproduction candidates, the music reproduction system reproduces that piece of music.
- FIG. 8E is a diagram for explaining the process of purchasing game software in a game software purchase system, as an information processing system with a voice search function, for purchasing game software.
- When the user utters an input voice, the game software purchase system accesses a game software sales site that sells game software via a network such as the Internet, and performs a voice search using the titles of the game software sold by the site as the search result target word strings; the top N titles of game software whose titles are similar in pronunciation to the input voice are then displayed as purchase candidate game software, which are candidates for purchase.
- When the user selects game software to be purchased from among the purchase candidates, the game software purchase system performs the purchase processing of that game software.
- Note that the voice search performed on the information processing system side connected to sites such as the video-on-demand site (FIG. 8B), the music sales site (FIG. 8C), and the game software sales site (FIG. 8E) can instead be performed on the site side.
- The voice search devices of FIGS. 1 to 4 can also be applied to information processing systems other than those described above.
- For example, they can be applied to an information processing system that, when the user utters part of the lyrics of a song, searches for music containing those lyrics; an information processing system that, when the user utters part of a line of dialogue, searches for movie content containing that line; and an information processing system that, when the user utters part of a description, searches for (electronic) books and magazines containing that description.
- FIG. 9 is a block diagram showing a configuration example of a recorder as an information processing system to which the voice search device of FIGS. 1 to 4 is applied.
- the recorder includes a voice search device 50, a recorder function unit 60, a command determination unit 71, a control unit 72, and an output I / F (Interface) 73.
- In FIG. 9, the voice search device 50 is configured in the same manner as the voice search device of FIG. 1 among the voice search devices of FIGS. 1 to 4.
- the voice search device 50 includes a voice recognition unit 51, a phonetic symbol conversion unit 52, a search result target storage unit 53, a morpheme analysis unit 54, a phonetic symbol conversion unit 55, a matching unit 56, and an output unit 57.
- The voice recognition unit 51 to the output unit 57 are configured in the same manner as the voice recognition unit 11 to the output unit 17 of FIG. 1, respectively.
- The voice search device 50 can also be configured in the same manner as any of the voice search devices of FIGS. 2 to 4, instead of the voice search device of FIG. 1.
- the recorder function unit 60 includes a tuner 61, a recording / reproducing unit 62, and a recording medium 63, and records (records) and reproduces a television broadcast program.
- The tuner 61 is supplied with a television broadcast signal, for example a digital broadcast signal, received by an antenna (not shown).
- The tuner 61 receives the television broadcast signal supplied thereto, extracts the television broadcast signal of a predetermined channel from it, demodulates its bit stream, and supplies the bit stream to the recording/reproducing unit 62.
- the recording / playback unit 62 extracts EPG, program data, and the like from the bitstream supplied from the tuner 61 and supplies the extracted data to the output I / F 73.
- In addition, the recording/reproducing unit 62 records the EPG and program data on the recording medium 63.
- the recording / reproducing unit 62 reproduces program data from the recording medium 63 and supplies it to the output I / F 73.
- The recording medium 63 is, for example, an HD (Hard Disk), and the EPG and program data are recorded on the recording medium 63 by the recording/reproducing unit 62.
- The command determination unit 71 is supplied with the voice recognition result of the input voice from the voice recognition unit 51.
- The command determination unit 71 determines, based on the voice recognition result of the input voice from the voice recognition unit 51, whether the input voice is a command for controlling the recorder, and supplies the determination result to the control unit 72.
- Based on the determination result from the command determination unit 71 as to whether the input voice is a command, the control unit 72 performs processing according to the command, and controls the blocks constituting the recorder, such as the voice search device 50 and the recorder function unit 60. The control unit 72 also performs processing according to operations of a remote commander (not shown).
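- A minimal sketch of this division of labor (illustrative Python; the command vocabulary is an assumption, not taken from the patent): if the recognition result matches a registered control command it is handled as recorder control, otherwise it is forwarded to the voice search device 50 as a search keyword:

```python
COMMANDS = {"selection", "ok", "different"}  # hypothetical command vocabulary

def handle_utterance(recognition_result: str) -> str:
    """Stand-in for the command determination unit 71 and control unit 72."""
    text = recognition_result.strip().lower()
    if text in COMMANDS:
        return f"control: execute command '{text}'"
    return f"search: run voice search for '{text}'"

print(handle_utterance("selection"))       # treated as a recorder command
print(handle_utterance("world heritage"))  # treated as a search keyword
```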
- The output I/F 73 is supplied with the EPG and program data from the recording/reproducing unit 62, and is also supplied, from the output unit 57, with (the data of) a search result display screen on which the search result word string resulting from the voice search in the voice search device 50 is displayed.
- The output I/F 73 is an interface connected to a display device capable of displaying at least images, such as a TV (television receiver), and supplies the EPG and program data from the recording/reproducing unit 62 and the search result display screen from the output unit 57 to, for example, a TV (not shown) connected to the output I/F 73.
- In the recorder of FIG. 9, the program titles, performer names, detailed information, and the like, which are constituent elements of the EPG recorded on the recording medium 63, are supplied to and stored in the search result target storage unit 53.
- In addition, the program titles, performer names, detailed information, and the like, which are metadata of the programs recorded on the recording medium 63 (recorded programs), are supplied to and stored in the search result target storage unit 53.
- Accordingly, in the recorder of FIG. 9, a voice search is performed using these program titles, performer names, detailed information, and the like as the search result target word strings.
- FIG. 10 is a block diagram showing another configuration example of a recorder as an information processing system to which the voice search device of FIGS. 1 to 4 is applied.
- FIG. 10 portions corresponding to those in FIG. 9 are denoted by the same reference numerals, and description thereof will be omitted below as appropriate.
- The recorder of FIG. 10 is configured in the same manner as in FIG. 9, except that the voice search device 50 does not have the morphological analysis unit 54.
- While the voice search device 50 of FIG. 9, which has the morphological analysis unit 54, performs a voice search on, for example, Japanese input speech, the voice search device 50 of FIG. 10 performs a voice search on, for example, English input speech, which does not require morphological analysis.
- Alternatively, a mode in which the morphological analysis unit 54 functions and a mode in which it does not function may be provided, so that voice search can be performed on input speech in both Japanese and English.
- In the recorder of FIG. 10, as in FIG. 9, the voice recognition unit 51 performs voice recognition of the input voice, and the matching unit 56 performs matching between the voice recognition result and the search result target word strings stored in the search result target storage unit 53.
- FIG. 11 is a diagram illustrating an example of the processing in the case where the matching between the speech recognition result and the search result target word string is performed in units of words, using the notation symbols of the speech recognition result and of the search result target word string.
- That is, suppose the speech recognition result "city of the world heritage, goddess of freedom" is obtained for a Japanese input speech of the same content, and the speech recognition result is divided into word units, as in "city / of / world / heritage / freedom / of / goddess".
- The word-unit speech recognition result "city / of / world / heritage / freedom / of / goddess" is then matched against, for example, program titles as word-unit search result target word strings.
- FIG. 12 is a diagram illustrating another example of the processing when the speech recognition result and the search result target word string are matched in units of words using the notation symbols of the speech recognition result and the search result target word string. is there.
- In FIG. 12, the speech recognition result “World Heritage City The Statue of Liberty” is obtained for the English input speech “World Heritage City The Statue of Liberty”, and that speech recognition result is divided into word units, as in “World / Heritage / City / The / Statue / of / Liberty”.
- the speech recognition result “World / Heritage / City / The / Statue / of / Liberty” in units of words is matched with, for example, the title of the program as the search result target word string in units of words.
- FIGS. 13 and 14 are diagrams explaining the case where matching between a speech recognition result and a search result target word string is performed using their notation symbols in units of words, and the case where it is performed in units of notation symbols.
- Here, the search result target word string that best matches the voice recognition result of the input voice is “Lime Wire”, which is the same as the input voice.
- Meanwhile, the search result target word string that should best match the speech recognition result of the input speech “tolkien” is “tolkien”, the same word string as the input speech.
- However, the speech recognition result obtained for the input speech “tolkien” may be “toll keene”, in which case the speech recognition result “toll keene” is matched against the search result target word string “tolkien”.
- In this case, the word string corresponding to the input speech may not be output as the search result word string.
- Moreover, a notation symbol does not always match the pronunciation.
- That is, for Japanese, for example, the pronunciation (reading) of the hiragana “ha” may be “ha” or “wa”, but notation symbols cannot express this difference in pronunciation.
- In addition, one notation symbol may have a plurality of readings; for example, for the notation symbol for “city”, notation symbols cannot express whether the reading (pronunciation) is “shi” or “ichi”.
- FIG. 15 is a diagram explaining that, when matching is performed using notation symbols for Japanese, different matching results are obtained for speech recognition results that have the same pronunciation but different notations, which is not advantageous for the performance of voice search.
- In FIG. 15, the speech recognition result “year-end dissolution” is divided into units of notation symbols (units of one notation symbol, that is, one character), and matching is performed in those units.
- Further, in FIG. 15, three search result target word strings are prepared, for example, the program titles “World Heritage City Heritage”, “Seto Dentist”, and “Year of Dissolution of the House of Representatives”.
- In FIG. 15, the cosine distance is adopted as the similarity obtained in the matching in units of notation symbols.
- As the vector representing a word string, a vector is used in which, for example, the component corresponding to a notation symbol present in the word string is set to 1 and the component corresponding to a notation symbol not present in the word string is set to 0.
- Then, the cosine distance as the similarity between two word strings is obtained using the vectors representing those two word strings.
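- As a concrete illustration of this notation-symbol-unit matching, the following is a minimal Python sketch assuming 0/1 symbol-presence vectors; the English stand-in strings for the Japanese examples and all variable names are illustrative only, not the patent's data.

```python
import math

def symbol_vector(word_string):
    # 0/1 vector over notation symbols: a component is 1 when the symbol
    # (here, a character) occurs in the word string, and 0 otherwise.
    return {symbol: 1 for symbol in set(word_string)}

def cosine_distance(v1, v2):
    # Inner product divided by the product of the vector magnitudes.
    inner = sum(v1[k] * v2.get(k, 0) for k in v1)
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    return inner / (norm1 * norm2) if norm1 and norm2 else 0.0

# Misrecognized result and candidate program titles (English stand-ins
# for the Japanese examples in the text).
recognition_result = "year-end dissolution"
titles = ["World Heritage City Heritage",
          "Seto Dentist",
          "Year of Dissolution of the House of Representatives"]

v_utr = symbol_vector(recognition_result)
best = max(titles, key=lambda t: cosine_distance(v_utr, symbol_vector(t)))
print(best)  # the title with the highest symbol-unit similarity
```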
- When the search result target word string having the highest similarity obtained as a result of matching is output as the search result word string, if the voice recognition result of the input voice “city world heritage” is incorrectly obtained as “year-end dissolution”, then among the three program titles “World Heritage City Heritage”, “Seto Dentist”, and “Year of Dissolution of the House of Representatives” as the search result target word strings, the search result target word string with the highest similarity to “year-end dissolution”, namely “Year of Dissolution of the House of Representatives”, is output as the search result word string.
- That is, in matching using notation symbols, the similarity between the speech recognition result and each search result target word string differs depending on whether the speech recognition result is “city world heritage” or “year-end dissolution”.
- As a result, the appropriate program title “World Heritage City Heritage” may not be output as the search result word string for the input voice “city world heritage”, and instead the title “Year of Dissolution of the House of Representatives”, which has nothing to do with the input voice “city world heritage”, may be output as the search result word string.
- FIG. 16 is a diagram explaining that, when matching is performed using notation symbols for English, different matching results are obtained for speech recognition results that have the same pronunciation but different notations, which is not advantageous for the performance of voice search.
- the word strings “tolkien” and “toll keene” represented by the notation symbols have the same pronunciation, but the notation symbols are different.
- In word-unit matching, one word, marked with a circle in the figure, matches between the speech recognition result “toll keene” and the search result target word string “tom keene”.
- When the search result target word string having the highest similarity obtained as a result of matching is output as the search result word string, if the speech recognition of the input speech “tolkien” is incorrect and the speech recognition result “toll keene” is obtained, the search result target word string “tom keene”, which has the highest similarity to “toll keene”, is output as the search result word string.
- However, among the three program titles “tolkien”, “tom keene”, and “toe clean” mentioned above, it is appropriate to output the first program title, “tolkien”, as the search result word string.
- In matching in units of notation symbols, seven notation symbols, marked with circles in the figure, match between the speech recognition result “toll keene” and the search result target word string “tom keene”.
- As a result, 0.76 is obtained as the similarity between the speech recognition result “toll keene” and the search result target word string “tolkien”, 0.83 as the similarity with the search result target word string “tom keene”, and 0.71 as the similarity with the search result target word string “toe clean”.
- When the search result target word string having the highest similarity obtained as a result of matching is output as the search result word string, if the speech recognition of the input speech “tolkien” is incorrect and the speech recognition result “toll keene” is obtained, the search result target word string “tom keene”, whose similarity of 0.83 is the highest, is output as the search result word string.
- However, among the three program titles “tolkien”, “tom keene”, and “toe clean” mentioned above, it is appropriate to output the first program title, “tolkien”, as the search result word string.
- The similarity between the speech recognition result and the search result target word strings obtained in matching in units of two-chains of notation symbols is also highest between the speech recognition result “toll keene” and the search result target word string “tom keene”.
- That is, 0.58 is obtained as the similarity between the speech recognition result “toll keene” and the search result target word string “tolkien”, 0.67 as the similarity with the search result target word string “tom keene”, and 0.13 as the similarity with the search result target word string “toe clean”.
- When the search result target word string having the highest similarity obtained as a result of matching is output as the search result word string, if the speech recognition of the input speech “tolkien” is incorrect and the speech recognition result “toll keene” is obtained, the search result target word string “tom keene”, whose similarity of 0.67 is the highest, is output as the search result word string.
- However, among the three program titles “tolkien”, “tom keene”, and “toe clean” mentioned above, it is appropriate to output the first program title, “tolkien”, as the search result word string.
- That is, not the program title “tolkien” appropriate for the input speech “tolkien” but the program title “tom keene”, which has nothing to do with “tolkien”, is output as the search result word string.
- In other words, in matching using notation symbols, the similarity between the speech recognition result and each search result target word string differs depending on the notation of the speech recognition result.
- As a result, the appropriate program title “tolkien” may not be output as the search result word string for the input voice “tolkien”, and the program title “tom keene”, which has nothing to do with the input voice “tolkien”, may be output as the search result word string.
- Therefore, in the voice search device 50, matching using pronunciation symbols is performed so as to prevent a situation in which a program title appropriate for the input voice is not output as the search result word string.
- A pronunciation symbol is, for example, a symbol representing the pronunciation of a syllable or a phoneme; for Japanese, for example, hiragana representing the reading can be adopted as pronunciation symbols.
- In matching using pronunciation symbols, a (single) syllable, a chain of two or more syllables, a (single) phoneme, a chain of two or more phonemes, or the like can be adopted as the unit of matching.
- The matching result, and consequently the performance of the voice search, differs depending on which matching unit is used in the matching with pronunciation symbols.
- FIG. 17 is a diagram explaining the processing of the pronunciation symbol conversion unit 52 of FIG. 9 in the case where the speech recognition unit 51 performs speech recognition of Japanese input speech and a syllable double chain (two consecutive syllables) is adopted as the unit of matching in the matching unit 56 (FIG. 9).
- the phonetic symbol conversion unit 52 is supplied with a speech recognition result (for example, a notation symbol) of Japanese input speech from the speech recognition unit 51.
- the phonetic symbol conversion unit 52 converts the speech recognition result supplied from the speech recognition unit 51 into a sequence of syllables.
- Then, the phonetic symbol conversion unit 52 shifts the syllable of interest one syllable at a time from the beginning of the syllable sequence of the speech recognition result toward the end, extracts syllable double chains each consisting of the syllable of interest and the syllable immediately after it, and supplies the sequence of those syllable double chains to the matching unit 56 (FIG. 9) as the recognition result pronunciation symbol string.
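- The sliding extraction of syllable double chains described above can be pictured with the following short sketch, assuming the syllable sequence is already available as a list (the romanized syllables are illustrative):

```python
def syllable_double_chains(syllables):
    # Move the syllable of interest from the beginning of the sequence
    # backward one syllable at a time, pairing it with the syllable
    # immediately after it.
    return [syllables[i] + syllables[i + 1] for i in range(len(syllables) - 1)]

print(syllable_double_chains(["to", "shi", "no", "se", "ka", "i"]))
# ['toshi', 'shino', 'nose', 'seka', 'kai']
```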
- FIG. 18 is a diagram for explaining processing of the phonetic symbol conversion unit 55 in FIG. 9 when a syllable double chain is adopted as a unit of matching in the matching unit 56 (FIG. 9).
- The phonetic symbol conversion unit 55 is supplied with the program title or the like as the search result target word string stored in the search result target storage unit 53, after morphological analysis by the morphological analysis unit 54.
- The phonetic symbol conversion unit 55 converts the search result target word string supplied from the morphological analysis unit 54 into a sequence of syllables.
- Then, the phonetic symbol conversion unit 55 shifts the syllable of interest one syllable at a time from the beginning of the syllable sequence of the search result target word string toward the end, extracts syllable double chains each consisting of the syllable of interest and the syllable immediately after it, and supplies the sequence of those syllable double chains to the matching unit 56 (FIG. 9) as the search result target pronunciation symbol string.
- FIG. 19 is a diagram explaining the processing of the phonetic symbol conversion units 52 and 55 of FIG. 10 in the case where the speech recognition unit 51 performs speech recognition of English input speech and a phoneme double chain (two consecutive phonemes) is adopted as the unit of matching in the matching unit 56 (FIG. 10).
- the phonetic symbol conversion unit 52 is supplied with a speech recognition result (for example, a notation symbol) of English input speech from the speech recognition unit 51.
- the phonetic symbol conversion unit 55 is supplied with a program title or the like as a search result target word string stored in the search result target storage unit 53.
- The phonetic symbol conversion unit 52 converts (each word of) the speech recognition result supplied from the speech recognition unit 51 into a sequence of phonemes (pronunciation symbols representing phonemes). Then, the phonetic symbol conversion unit 52 shifts the phoneme of interest one phoneme at a time from the beginning of the phoneme sequence of the speech recognition result toward the end, extracts phoneme double chains each consisting of the phoneme of interest and the phoneme immediately after it, and supplies the sequence of those phoneme double chains to the matching unit 56 (FIG. 10) as the recognition result pronunciation symbol string.
- Similarly, the phonetic symbol conversion unit 55 converts the search result target word string supplied from the search result target storage unit 53 into a sequence of phonemes. Then, the phonetic symbol conversion unit 55 shifts the phoneme of interest one phoneme at a time from the beginning of the phoneme sequence of the search result target word string toward the end, extracts phoneme double chains each consisting of the phoneme of interest and the phoneme immediately after it, and supplies the sequence of those phoneme double chains to the matching unit 56 (FIG. 10) as the search result target pronunciation symbol string.
- In FIG. 19, characters delimited by slashes (/) represent phonemes as pronunciation symbols, given in the IPA (International Phonetic Alphabet), the phonetic notation defined by the International Phonetic Association.
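- For English, the conversion of a word sequence into a recognition result pronunciation symbol string of phoneme double chains can be sketched as follows; the tiny pronunciation dictionary is a made-up stand-in for a real lexicon, and the phoneme labels are simplified placeholders rather than actual IPA:

```python
# Illustrative, made-up pronunciation dictionary; a real system would
# draw on a full pronunciation lexicon.
PRONUNCIATIONS = {
    "toll": ["t", "ou", "l"],
    "keene": ["k", "ii", "n"],
}

def phoneme_double_chains(words):
    # Concatenate the phoneme sequences of the words, then extract
    # two-phoneme chains exactly as in the syllable case.
    phonemes = [p for w in words for p in PRONUNCIATIONS[w]]
    return [phonemes[i] + phonemes[i + 1] for i in range(len(phonemes) - 1)]

print(phoneme_double_chains(["toll", "keene"]))
# ['tou', 'oul', 'lk', 'kii', 'iin']
```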
- FIG. 20 is a diagram illustrating the matching performed by the matching unit 56 of FIG. 9 in units of syllable double chains.
- In FIG. 20, the matching unit 56 performs matching in units of syllable double chains, for example.
- When obtaining, as the matching between the recognition result pronunciation symbol string and the search result target pronunciation symbol string in units of syllable double chains, the similarity between the two, for example the cosine distance, the matching unit 56 first obtains a recognition result vector, which is a vector representing the recognition result pronunciation symbol string, based on the syllable double chains constituting the recognition result pronunciation symbol string.
- That is, the matching unit 56 obtains, as the recognition result vector representing the recognition result pronunciation symbol string, a vector in which the component corresponding to a syllable double chain present in the recognition result pronunciation symbol string is set to 1 and the component corresponding to a syllable double chain not present in it is set to 0.
- Further, the matching unit 56 similarly obtains, for example for a program title as a search result target word string stored in the search result target storage unit 53, a search result target vector, which is a vector representing the search result target pronunciation symbol string of that search result target word string, based on the syllable double chains constituting the search result target pronunciation symbol string.
- Then, the matching unit 56 performs the matching in units of syllable double chains by obtaining, as the similarity to the search result target word string corresponding to the search result target vector, the cosine distance, that is, the value obtained by dividing the inner product of the recognition result vector and the search result target vector by the product of the magnitude of the recognition result vector and the magnitude of the search result target vector.
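- Under the same 0/1-vector assumption, the syllable-double-chain matching of the matching unit 56 can be sketched as follows (romanized syllables again illustrative):

```python
import math

def double_chain_vector(units):
    # The set of double chains present in a pronunciation symbol string;
    # with 0/1 components, a set is an equivalent representation.
    return {units[i] + units[i + 1] for i in range(len(units) - 1)}

def cosine_distance(chains_utr, chains_title):
    # With 0/1 components, the inner product is the number of shared
    # chains and each magnitude is the square root of the chain count.
    if not chains_utr or not chains_title:
        return 0.0
    inner = len(chains_utr & chains_title)
    return inner / (math.sqrt(len(chains_utr)) * math.sqrt(len(chains_title)))

recognition = ["to", "shi", "no", "se", "ka", "i", "sa", "n"]
title = ["se", "ka", "i", "i", "sa", "n"]
print(cosine_distance(double_chain_vector(recognition),
                      double_chain_vector(title)))
```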
- FIG. 21 is a diagram illustrating the matching performed by the matching unit 56 of FIG. 10 in units of phoneme double chains.
- In FIG. 21, the matching unit 56 performs matching in units of phoneme double chains, for example.
- When obtaining, as the matching between the recognition result pronunciation symbol string and the search result target pronunciation symbol string in units of phoneme double chains, the similarity between the two, for example the cosine distance, the matching unit 56 first obtains a recognition result vector, which is a vector representing the recognition result pronunciation symbol string, based on the phoneme double chains constituting the recognition result pronunciation symbol string.
- That is, the matching unit 56 obtains, as the recognition result vector representing the recognition result pronunciation symbol string, a vector in which the component corresponding to a phoneme double chain present in the recognition result pronunciation symbol string is set to 1 and the component corresponding to a phoneme double chain not present in it is set to 0.
- Further, the matching unit 56 similarly obtains, for a program title as a search result target word string stored in the search result target storage unit 53, a search result target vector, which is a vector representing the search result target pronunciation symbol string of that search result target word string, based on the phoneme double chains constituting the search result target pronunciation symbol string.
- Then, the matching unit 56 performs the matching in units of phoneme double chains by obtaining, as the similarity to the search result target word string corresponding to the search result target vector, the cosine distance, that is, the value obtained by dividing the inner product of the recognition result vector and the search result target vector by the product of the magnitude of the recognition result vector and the magnitude of the search result target vector.
- FIG. 22 is a diagram showing, for Japanese, the results of matching in units of words, matching in units of (one) syllable, and matching in units of syllable double chains.
- In FIG. 22, the words or pronunciation symbols in the search result target word strings that match the words or pronunciation symbols (syllables) of the speech recognition result “year-end dissolution” are marked with circles.
- When the search result target word string having the highest similarity obtained as a result of matching is output as the search result word string, if the voice recognition result of the input voice “city world heritage” is incorrectly obtained as “year-end dissolution”, then in word-unit matching using notation symbols, among the titles of the three programs as the search result target word strings, “World Heritage City Heritage”, “Seto Dentist”, and “Year of Dissolution of the House of Representatives”, the search result target word string whose similarity to the speech recognition result “year-end dissolution” is the highest at 0.75, namely “Year of Dissolution of the House of Representatives”, is output as the search result word string.
- Further, in syllable-unit matching using pronunciation symbols, 0.82, 1.0, and 0.75 are obtained as the similarities between the speech recognition result “year-end dissolution” and the search result target word strings “World Heritage City Heritage”, “Seto Dentist”, and “Year of Dissolution of the House of Representatives”, respectively.
- Therefore, when the search result target word string having the highest similarity obtained as a result of matching is output as the search result word string, if the voice recognition result of the input voice “city world heritage” is incorrectly obtained as “year-end dissolution”, then in syllable-unit matching using pronunciation symbols, among the titles of the three programs as the search result target word strings, the search result target word string whose similarity is the highest at 1.0, namely “Seto Dentist”, is output as the search result word string.
- However, whereas in word-unit matching using notation symbols the similarity of the program title “World Heritage City Heritage”, which is appropriate for the input speech “city world heritage”, is 0.22, the third (lowest) value among the three search result target word strings, in syllable-unit matching using pronunciation symbols the similarity of the appropriate program title “World Heritage City Heritage” is 0.82, the second highest value among the three search result target word strings.
- Therefore, it can be said that syllable-unit matching using pronunciation symbols is more effective than word-unit matching using notation symbols in that the similarity of the program title “World Heritage City Heritage”, which is appropriate for the input speech “city world heritage”, becomes higher.
- Then, in matching in units of syllable double chains using pronunciation symbols, 0.68, 0.43, and 0.48 are obtained as the similarities between the speech recognition result “year-end dissolution” and the search result target word strings “World Heritage City Heritage”, “Seto Dentist”, and “Year of Dissolution of the House of Representatives”, respectively.
- Therefore, when the search result target word string having the highest similarity obtained as a result of matching is output as the search result word string, even if the voice recognition result of the input voice “city world heritage” is incorrectly obtained as “year-end dissolution”, in matching in units of syllable double chains using pronunciation symbols, among the titles of the three programs as the search result target word strings, “World Heritage City Heritage”, “Seto Dentist”, and “Year of Dissolution of the House of Representatives”, the search result target word string whose similarity to the speech recognition result “year-end dissolution” is the highest at 0.68, that is, the title “World Heritage City Heritage” appropriate for the input voice “city world heritage”, is output as the search result word string.
- FIG. 23 is a diagram showing the results of matching in units of words, matching in units of (one) phoneme, and matching in units of two phonemes for English.
- In FIG. 23, the words or pronunciation symbols in the search result target word strings that match the words or pronunciation symbols (phonemes) of the speech recognition result “toll keene” are marked with circles.
- In word-unit matching using notation symbols, the similarities (cosine distances) between the speech recognition result “toll keene” and the search result target word strings “tolkien”, “tom keene”, and “toe clean” are obtained as 0.0, 0.5, and 0.0, respectively.
- When the search result target word string having the highest similarity obtained as a result of matching is output as the search result word string, if the speech recognition of the input speech “tolkien” is incorrect and the speech recognition result “toll keene” is obtained, then in word-unit matching the highest-ranked search result target word string “tom keene”, whose similarity to the speech recognition result “toll keene” is 0.5, is output as the search result word string.
- However, among the three program titles “tolkien”, “tom keene”, and “toe clean” mentioned above, it is appropriate to output the first program title, “tolkien”, as the search result word string.
- That is, when the input speech “tolkien” is recognized as “toll keene”, which matches the input speech in pronunciation (reading) but differs in notation, word-unit matching using notation symbols outputs, as the search result word string, not the program title “tolkien” appropriate for the input speech “tolkien” but the program title “tom keene”, which has nothing to do with “tolkien”.
- When the search result target word string having the highest similarity obtained as a result of matching is output as the search result word string, even if the speech recognition of the input speech “tolkien” is incorrect and the speech recognition result “toll keene” is obtained, in phoneme-unit matching using pronunciation symbols, the search result target word string having the highest similarity to the speech recognition result “toll keene” among the three program titles “tolkien”, “tom keene”, and “toe clean” is output as the search result word string.
- Further, when the search result target word string having the highest similarity obtained as a result of matching is output as the search result word string, even if the speech recognition of the input speech “tolkien” is incorrect and the speech recognition result “toll keene” is obtained, in matching in units of phoneme double chains using pronunciation symbols, among the three program titles “tolkien”, “tom keene”, and “toe clean”, the search result target word string whose similarity to the speech recognition result “toll keene” is the highest at 1.0, that is, the program title “tolkien” appropriate for the input speech “tolkien”, is output as the search result word string.
- As described above, when matching using pronunciation symbols is performed, a search for the word string corresponding to the input speech can be performed more robustly than when matching using notation symbols is performed.
- In the matching described above, the cosine distance is adopted as the similarity between the speech recognition result (its recognition result pronunciation symbol string) and the search result target word string (its search result target pronunciation symbol string).
- In calculating the cosine distance, a vector in which the component corresponding to a syllable (double chain) present in the recognition result pronunciation symbol string is set to 1 and the component corresponding to a syllable not present in it is set to 0 is obtained as the recognition result vector representing the recognition result pronunciation symbol string.
- the matching unit 56 similarly obtains a search result target vector representing the search result target pronunciation symbol string of the search result target word string.
- Here, the value of each component of the recognition result vector is set to 1 or 0 depending on whether or not the syllable corresponding to that component exists in the recognition result pronunciation symbol string, but as the value of a vector component it is also possible to employ tf (Term Frequency), the frequency with which the syllable corresponding to the component appears in the recognition result pronunciation symbol string.
- Further, as the value of a vector component, it is also possible to employ idf (Inverse Document Frequency), which becomes large for syllables that appear concentrated in certain search result target word strings and becomes small for syllables that appear uniformly in many search result target word strings, or TF-IDF, which takes both tf and idf into account.
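- One common TF-IDF formulation is sketched below; the exact formula is not fixed by this description, so the weighting used here is an assumption:

```python
import math
from collections import Counter

def tf_idf_vector(units, all_target_strings):
    # tf: how often the unit appears in this pronunciation symbol string.
    # idf: grows for units that appear in few search result target word
    # strings and shrinks for units that appear in many of them.
    tf = Counter(units)
    n = len(all_target_strings)
    vector = {}
    for unit, count in tf.items():
        doc_freq = sum(1 for s in all_target_strings if unit in s)
        vector[unit] = count * (math.log(n / (1 + doc_freq)) + 1.0)
    return vector

targets = [["se", "ka", "i"], ["se", "to"], ["ka", "i", "sa", "n"]]
print(tf_idf_vector(["se", "ka", "i"], targets))
```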
- Now, let the recognition result vector be expressed as V_UTR, and let the search result target vector of the i-th search result target word string stored in the search result target storage unit 53 (FIGS. 9 and 10) be expressed as V_TITLE(i).
- The cosine distance D as the similarity between the speech recognition result and the i-th search result target word string is calculated according to Equation (1):

  D = V_UTR・V_TITLE(i) / (|V_UTR| |V_TITLE(i)|)   (1)

  where V_UTR・V_TITLE(i) is the inner product of the recognition result vector V_UTR and the search result target vector V_TITLE(i), and |V_UTR| and |V_TITLE(i)| are the magnitudes of the recognition result vector V_UTR and the search result target vector V_TITLE(i), respectively.
- The cosine distance D takes a value in the range of 0.0 to 1.0, and the larger its value, the more similar the recognition result pronunciation symbol string represented by the recognition result vector V_UTR and the search result target pronunciation symbol string represented by the search result target vector V_TITLE(i) are.
- Here, the lengths of the speech recognition result and of the search result target word string affect the matching between them, that is, the calculation of the cosine distance D as the similarity.
- The length of the speech recognition result and of the search result target word string means the number of notation symbols when the similarity is calculated in units of notation symbols using notation symbols, the number of words when it is calculated in units of words, the number of syllables when it is calculated in units of syllables using pronunciation symbols, the number of syllable double chains when it is calculated in units of syllable double chains, the number of phonemes when it is calculated in units of phonemes, and the number of phoneme double chains when it is calculated in units of phoneme double chains.
- Suppose, for example, that the calculation of the cosine distance D as the similarity between the speech recognition result and the search result target word string is performed in units of words using notation symbols.
- Since the calculation of the cosine distance D in Equation (1) includes division by the magnitude |V_TITLE(i)|, which is proportional to the length (here, the number of words) of the search result target word string, between a long search result target word string and a short search result target word string that each contain the words of the speech recognition result, the similarity to the short search result target word string tends to be high (the cosine distance D tends to be large), and the similarity to the long search result target word string tends to be low (the cosine distance D tends to be small).
- Therefore, even when the input voice corresponds to a program whose title as a search result target word string is long, the similarity of that long title may not become high, and the long title may not be output as the search result word string.
- Similarly, when a long speech recognition result and a short speech recognition result each include the same word string as a predetermined search result target word string, the similarity between the long speech recognition result and the predetermined search result target word string tends to be low, and the similarity between the short speech recognition result and the predetermined search result target word string tends to be high.
- Therefore, when the speech recognition result is long, the similarity of the predetermined search result target word string may not become high, and since the predetermined search result target word string is then not output as the search result word string, the accuracy of the search for the word string corresponding to the input speech may deteriorate.
- Likewise, the similarity of a short title may fail to become high, and the short title may not be output as a search result word string.
- In view of this, the matching unit 56 performs matching using a correction distance, obtained by correcting the cosine distance D so as to reduce the influence of the difference in length between the speech recognition result and the search result target word string, as the similarity between the speech recognition result and the search result target word string.
- When the correction distance is adopted as the similarity between the speech recognition result and the search result target word string, the similarity between the speech recognition result and a long search result target word string, and the similarity between a long speech recognition result and a search result target word string, are prevented from becoming low; as a result, the search for the word string corresponding to the input speech can be performed robustly, and deterioration in the accuracy of the search for the word string corresponding to the input speech can be prevented.
- Here, the correction distance includes a first correction distance and a second correction distance.
- The first correction distance is obtained by using, in the calculation of Equation (1) for the cosine distance D, a substitute size S(i) in place of the magnitude |V_TITLE(i)| of the search result target vector V_TITLE(i), which is proportional to the length of the search result target word string.
- That is, the first correction distance D1 is obtained according to Equation (2), in which the substitute size S(i) is the square root of the product of the magnitude |V_UTR| of the recognition result vector and the magnitude |V_TITLE(i)| of the search result target vector:

  S(i) = √(|V_UTR| × |V_TITLE(i)|)
  D1 = V_UTR・V_TITLE(i) / (|V_UTR| S(i))   (2)

- FIG. 24 shows the square root S(i) of the product of the magnitude |V_UTR| of the recognition result vector V_UTR and the magnitude |V_TITLE(i)| of the search result target vector V_TITLE(i), where the magnitude |V_UTR| of the recognition result vector V_UTR is set to 5.
- The first correction distance D1 obtained according to Equation (2) is, compared with the cosine distance D obtained according to Equation (1), a value in which the influence of the magnitude |V_TITLE(i)| of the search result target vector relative to the length of the speech recognition result, that is, the influence of the difference in length between the speech recognition result and the search result target word string, is reduced.
- The second correction distance is obtained by using, in the calculation of Equation (1) for the cosine distance D, the magnitude |V_UTR| of the recognition result vector V_UTR as the substitute size S(i) in place of the magnitude |V_TITLE(i)| of the search result target vector V_TITLE(i), which is proportional to the length of the search result target word string.
- That is, the second correction distance D2 is obtained according to Equation (3):

  D2 = V_UTR・V_TITLE(i) / (|V_UTR| |V_UTR|)   (3)

- Since the second correction distance D2 is obtained without using the magnitude |V_TITLE(i)| of the search result target vector V_TITLE(i), it is a value that is not affected by |V_TITLE(i)|, that is, by the difference in length between the speech recognition result and the search result target word string.
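- The three distances of Equations (1) to (3) can be compared side by side in the following sketch, under the same vector representation as above; since search result target word strings are only compared with one another, the absolute values of D1 and D2 matter less than their relative order:

```python
import math

def inner(v_utr, v_title):
    return sum(v_utr[k] * v_title.get(k, 0) for k in v_utr)

def norm(v):
    return math.sqrt(sum(x * x for x in v.values()))

def cosine_d(v_utr, v_title):
    # Equation (1): division by |V_TITLE(i)| penalizes long titles.
    return inner(v_utr, v_title) / (norm(v_utr) * norm(v_title))

def first_correction_d1(v_utr, v_title):
    # Equation (2): |V_TITLE(i)| is replaced by the substitute size
    # S(i) = sqrt(|V_UTR| * |V_TITLE(i)|), damping the length effect.
    s = math.sqrt(norm(v_utr) * norm(v_title))
    return inner(v_utr, v_title) / (norm(v_utr) * s)

def second_correction_d2(v_utr, v_title):
    # Equation (3): |V_TITLE(i)| is replaced by |V_UTR| itself, so the
    # title length no longer enters the calculation at all.
    return inner(v_utr, v_title) / (norm(v_utr) ** 2)

# Toy 0/1 vectors over double chains: the long title contains every
# chain of the recognition result, yet Equation (1) ranks it lower.
v_utr = {"ab": 1, "bc": 1, "cd": 1}
v_short = {"ab": 1, "bc": 1}
v_long = {"ab": 1, "bc": 1, "cd": 1, "de": 1, "ef": 1, "fg": 1}
print(cosine_d(v_utr, v_short), cosine_d(v_utr, v_long))                # 0.82, 0.71
print(second_correction_d2(v_utr, v_short), second_correction_d2(v_utr, v_long))  # 0.67, 1.0
```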
- FIG. 25 is a diagram showing a first example of the results of a matching simulation in which the cosine distance D, the first correction distance D1, and the second correction distance D2 are each adopted as the similarity between the speech recognition result and the search result target word string.
- FIG. 26 is a diagram showing a second example of the results of the matching simulation in which the cosine distance D, the first correction distance D1, and the second correction distance D2 are each adopted as the similarity between the speech recognition result and the search result target word string.
- In this simulation, for the long speech recognition result “World Heritage City Heritage Italy Rome Venice Naples Florence”, the similarity of the long title “Exploring Roman World Heritage Italy Florence Historic Center” is 0.4229, whereas the similarity of the short title “World Heritage” is 0.2991; that is, the similarity of the long title “Exploring Roman World Heritage Italy Florence Historic Center” is higher than the similarity of the short title “World Heritage”.
- FIG. 27 is a diagram showing a third example of the results of the matching simulation in which the cosine distance D, the first correction distance D1, and the second correction distance D2 are each adopted as the similarity between the speech recognition result and the search result target word string.
- FIG. 28 is a diagram showing a fourth example of the results of that matching simulation.
- As described above, with the correction distance, the influence of the difference in length between the speech recognition result and the search result target word string is reduced, so that the search for the word string corresponding to the input speech can be performed robustly and deterioration in the accuracy of the search for the word string corresponding to the input speech can be prevented.
- FIG. 29 is a block diagram illustrating a configuration example of the voice recognition unit 51 of FIGS. 9 and 10.
- the speech recognition unit 51 includes a recognition unit 81, a dictionary storage unit 82, an acoustic model storage unit 83, a language model storage unit 84, and a language model generation unit 85.
- the input voice is supplied to the recognition unit 81.
- The recognition unit 81 recognizes the input speech supplied thereto based on, for example, the HMM method, while referring to the dictionary storage unit 82, the acoustic model storage unit 83, and the language model storage unit 84 as necessary, and outputs the voice recognition result of the input voice.
- The dictionary storage unit 82 stores a word dictionary in which information (phonological information) related to pronunciation is described for each word (vocabulary entry) that can be a speech recognition result.
- the acoustic model storage unit 83 stores an acoustic model representing acoustic features such as individual phonemes and syllables in a speech language for performing speech recognition.
- an HMM is used as the acoustic model.
- the language model storage unit 84 stores a language model that is a grammar rule that describes how each word registered in the word dictionary of the dictionary storage unit 82 is linked (connected).
- As the language model, for example, grammar rules such as a context-free grammar (CFG) or a statistical word-chaining probability model (N-gram) can be used.
- The recognition unit 81 constructs an acoustic model of a word (a word model) by connecting the acoustic models stored in the acoustic model storage unit 83 with reference to the word dictionary in the dictionary storage unit 82.
- Further, the recognition unit 81 connects several word models with reference to the language model stored in the language model storage unit 84, and recognizes the input speech by the HMM method using the word models connected in this way.
- That is, the recognition unit 81 detects the sequence of word models from which the feature amounts (for example, cepstra) of the input speech supplied thereto are observed with the highest likelihood, and outputs the word string corresponding to that sequence of word models as the speech recognition result.
- Specifically, the recognition unit 81 accumulates the appearance probabilities of the feature amounts of the input speech for the word string corresponding to the connected word models, takes the accumulated value as a recognition score representing the likelihood that the feature amounts of the input speech are observed, and outputs the word string with the highest recognition score as the speech recognition result.
- The recognition score is generally obtained by comprehensively evaluating the acoustic likelihood given by the acoustic models stored in the acoustic model storage unit 83 (hereinafter also referred to as the acoustic score) and the linguistic likelihood given by the language model stored in the language model storage unit 84 (hereinafter also referred to as the language score).
- The acoustic score is calculated, for example, for each word in the case of the HMM method, as the probability that the feature amounts of the input speech are observed from the acoustic models constituting the word model.
- The language score is obtained, for example, in the case of a bigram, as the probability that the word of interest and the word immediately preceding it are linked (connected).
- the recognition score is obtained by comprehensively evaluating the acoustic score and the language score for each word, and the speech recognition result is determined based on the recognition score.
- Specifically, when the speech recognition result is a word string w_1, w_2, ..., w_K consisting of K words, the recognition score S of that word string is calculated, for example, according to Equation (4):

  S = Σ ( A(w_k) + C_k × L(w_k) )   (4)

- In Equation (4), Σ represents the summation taken by changing k from 1 to K, A(w_k) represents the acoustic score of the word w_k, L(w_k) represents the language score of the word w_k, and C_k represents a weight applied to the language score L(w_k) of the word w_k.
- Then, for example, the word strings w_1, w_2, ..., w_K whose recognition scores S of Equation (4) are within the top M ranks (M is an integer of 1 or more) are obtained, and those word strings w_1, w_2, ..., w_K are output as speech recognition results.
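- A sketch of Equation (4) follows; the log-likelihood-style scores are invented for illustration:

```python
def recognition_score(words, acoustic_score, language_score, weights):
    # Equation (4): sum, over the K words of a recognition hypothesis,
    # of the acoustic score plus the weighted language score.
    return sum(acoustic_score[w] + c * language_score[w]
               for w, c in zip(words, weights))

acoustic = {"world": -4.0, "heritage": -5.0}
language = {"world": -2.0, "heritage": -1.5}
print(recognition_score(["world", "heritage"], acoustic, language,
                        weights=[1.0, 1.0]))  # -12.5
```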
- In the HMM method, when a feature amount (sequence) X of the input speech is observed, the probability P(W|X) that the word string W is the speech recognition result is calculated by Bayes' theorem as P(W|X) = P(W)P(X|W)/P(X), using the probability P(X) that the feature amount X occurs, the probability P(W) that the word string W occurs, and the probability P(X|W) that the feature amount X is observed when the word string W is uttered.
- Here, the probability P(W|X) corresponds to the recognition score, the probability P(W) corresponds to the language score, and the probability P(X|W) corresponds to the acoustic score.
- Now, if T words are registered in the word dictionary of the dictionary storage unit 82, there are T^T possible arrangements of T words that can be configured using those T words. Therefore, simply speaking, the recognition unit 81 would have to evaluate these T^T word strings (calculate their recognition scores) and determine, from among them, the word strings that best fit the input speech (those whose recognition scores are within the top M ranks).
- Therefore, the recognition unit 81 performs, for example, acoustic pruning, in which the calculation of the recognition score of a recognition hypothesis is terminated when the acoustic score obtained partway through the calculation for the word string as that recognition hypothesis becomes equal to or less than a predetermined threshold, and linguistic pruning, in which the recognition hypotheses subject to calculation of the recognition score are narrowed down based on the language score.
- Incidentally, the metadata of programs, for example program titles, includes word strings that are not commonly used in articles written in newspapers and the like, such as coined words, the names of main casters (including stage names), and unique phrases.
- the speech recognition unit 51 in FIG. 29 has a language model generation unit 85.
- the language model generation unit 85 generates a language model using the search result target word string stored in the search result target storage unit 53 of the voice search device 50 of FIGS. 9 and 10.
- As described above, the search result target storage unit 53 stores, as search result target word strings, the program titles, performer names, detailed information, and the like that are constituent elements of the EPG recorded on the recording medium 63, as well as the program titles, performer names, detailed information, and the like that are metadata of the recorded programs recorded on the recording medium 63.
- FIG. 30 is a diagram illustrating an example of program metadata as a search result target word string stored in the search result target storage unit 53.
- the program metadata includes, for example, a program title, performer name, and detailed information.
- As described above, the search result target word strings are word strings such as program titles, performer names, and detailed information, which are the constituent elements (program metadata) constituting the EPG, and the search result target word strings can therefore be said to be classified into fields such as program title, performer name, and detailed information.
- In generating a dedicated language model using the search result target word strings classified into such fields, it is possible to generate one dedicated language model without distinguishing which field each search result target word string belongs to, or to generate a language model for each field using the search result target word strings of that field and interpolate the language models of the fields to generate one dedicated language model.
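- The interpolation mentioned above can be realized, for example, as a linear mixture of per-field bigram probabilities; the mixture weights below are assumptions, since the description leaves them open:

```python
def interpolate_bigram_models(models, weights):
    # Weighted sum of per-field bigram probabilities, producing one
    # dedicated language model over the union of all bigrams.
    merged = {}
    for model, weight in zip(models, weights):
        for bigram, prob in model.items():
            merged[bigram] = merged.get(bigram, 0.0) + weight * prob
    return merged

title_lm = {("world", "heritage"): 0.5}
performer_lm = {("world", "heritage"): 0.1}
print(interpolate_bigram_models([title_lm, performer_lm], [0.7, 0.3]))
# {('world', 'heritage'): 0.38}
```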
- the dedicated language model generated by the language model generation unit 85 is supplied to the language model storage unit 84 and stored therein.
- the recognition unit 81 obtains a language score using such a dedicated language model, the accuracy of speech recognition can be improved as compared with the case where a general-purpose language model is used.
- the language model generation unit 85 is provided inside the speech recognition unit 51, but the language model generation unit 85 can be provided outside the speech recognition unit 51.
- the language model storage unit 84 can store a general-purpose language model separately from the language model generated by the language model generation unit 85.
- FIG. 31 is a diagram for explaining the language model generation processing in the language model generation unit 85 of FIG.
- In FIG. 31, the language model generation unit 85 performs morphological analysis on each search result target word string stored in the search result target storage unit 53 (FIG. 9). Further, the language model generation unit 85 learns, using the morphological analysis results of the search result target word strings, a language model such as a bigram representing, for example, the probability that a word B follows a word A, and supplies it to the language model storage unit 84 as a dedicated language model to be stored there.
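- Learning such a bigram model, that is, estimating the probability that a word B follows a word A from the morphological analysis results, can be sketched as follows (hypothetical titles, and no smoothing, which a real model would need):

```python
from collections import Counter

def learn_bigram_model(analyzed_word_strings):
    # Estimate P(B | A) by counting word pairs in the morphologically
    # analyzed search result target word strings.
    bigram_counts = Counter()
    unigram_counts = Counter()
    for words in analyzed_word_strings:
        for a, b in zip(words, words[1:]):
            bigram_counts[(a, b)] += 1
            unigram_counts[a] += 1
    return {(a, b): count / unigram_counts[a]
            for (a, b), count in bigram_counts.items()}

titles = [["world", "heritage", "city"], ["world", "heritage", "journey"]]
model = learn_bigram_model(titles)
print(model[("world", "heritage")])  # 1.0
print(model[("heritage", "city")])   # 0.5
```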
- When a dedicated language model is generated using the constituent elements of the EPG as the search result target word strings, the dedicated language model can be generated using the EPG for a predetermined period, such as the EPG for a predetermined day of the week or for the latest one week of future broadcasts.
- In the recorder of FIG. 9, when a program desired by the user is searched for from the EPG and a recording reservation is made in accordance with the input voice uttered by the user, if it is known that the user is interested in programs broadcast on a predetermined day of the week, the accuracy of speech recognition for programs broadcast on that day of the week can be improved by generating the dedicated language model using the EPG for that day of the week; as a result, programs broadcast on that predetermined day of the week are easily output as search result word strings.
- Similarly, when a program desired by the user is searched for from the EPG and a recording reservation is made in accordance with the input voice uttered by the user, generating the dedicated language model using the EPG for the latest one week improves the accuracy of speech recognition for programs broadcast during the latest week; as a result, programs broadcast during the latest week are easily output as search result word strings.
- Further, when a dedicated language model is generated using the constituent elements of the EPG as the search result target word strings, the language model generation unit 85 can generate the dedicated language model so that a higher language score is given to the arrangement of words in the search result target word strings that are constituent elements of the more recent EPG, that is, the EPG of programs whose broadcast time is closer.
- By the way, when one dedicated language model is generated from the search result target word strings and speech recognition is performed using that one dedicated language model, the language score of a recognition hypothesis in which parts of search result target word strings in different fields are arranged may become high.
- That is, when speech recognition is performed using one dedicated language model generated using the search result target word strings classified into the program title, performer name, and detailed information fields, a word string in which, for example, a part of the title of a program A and a part of a performer name of another program B are arranged may become a recognition hypothesis, and the language score of that recognition hypothesis may become high.
- However, since a word string in which a part of the title of program A and a part of the performer name of program B are arranged does not exist among the constituent elements of the EPG, which are the search result target word strings, it is not preferable for such a word string to become a recognition hypothesis with a high language score that can become a speech recognition result.
- Further, when the search result target word strings classified into the program title, performer name, and detailed information fields are used without particular distinction, the matching unit 56 (FIG. 9) performs matching of the speech recognition result of the user's utterance against the search result target word strings of all fields, not only those of the program title field, even when, for example, the user utters the title of a program, and outputs the search result target word strings that match the speech recognition result as search result word strings.
- In this case, a program whose performer name or detailed information as a search result target word string includes a word string similar to (or matching) the speech recognition result may be output as a search result word string.
- That is, a program irrelevant to the program whose title the user uttered may be output as a search result word string, and a user who searches the search result word strings for a program to reserve for recording and selects one may feel annoyed by such output.
- Therefore, the matching unit 56 can perform the matching with the speech recognition result using only the search result target word strings of a predetermined field, such as a field desired by the user.
- Further, the language model generation unit 85 can generate a language model for each field using the search result target word strings of that field, and the recognition unit 81 can perform speech recognition using the language model of each field and obtain a speech recognition result for each field.
- the matching unit 56 (FIG. 9) can perform matching between the speech recognition result and the search result target word string for each field or without distinguishing the fields.
- FIG. 32 is a diagram for explaining a process of generating a language model for each field in the language model generation unit 85 of FIG.
- In FIG. 32, the language model generation unit 85 performs morphological analysis on the search result target word strings of the field of program titles (hereinafter also referred to as the program title field) stored in the search result target storage unit 53.
- the language model generation unit 85 generates a language model for the program title field by learning a language model such as a bigram using the morphological analysis result of the search result target word string in the program title field, It is supplied to the language model storage unit 84 and stored.
- the language model generation unit 85 performs morphological analysis on the search result target word string in the performer name field (hereinafter also referred to as the performer name field) stored in the search result target storage unit 53.
- Then, the language model generation unit 85 generates a language model for the performer name field by learning a language model such as a bigram, for example, using the morphological analysis results of the search result target word strings of the performer name field, and supplies it to the language model storage unit 84 for storage.
- Similarly, the language model generation unit 85 generates a language model for the field of detailed information (hereinafter also referred to as the detailed information field) using the search result target word strings of the detailed information field stored in the search result target storage unit 53, and supplies it to the language model storage unit 84 for storage.
- FIG. 33 is a diagram explaining the processing of the voice search device 50 of FIG. 9 in the case where speech recognition is performed using the language model of each field, a speech recognition result is obtained for each field, and matching between the speech recognition result and the search result target word strings is performed for each field.
- In FIG. 33, the recognition unit 81 performs voice recognition of the input voice independently using each of the language model for the program title field, the language model for the performer name field, and the language model for the detailed information field.
- In the speech recognition using the language model for the program title field, the recognition unit 81 obtains one or more recognition hypotheses with higher recognition scores and takes them as the speech recognition result of the program title field.
- Similarly, in the speech recognition using the language model for the performer name field, the recognition unit 81 obtains one or more recognition hypotheses with higher recognition scores and takes them as the speech recognition result of the performer name field, and in the speech recognition using the language model for the detailed information field, it obtains one or more recognition hypotheses with higher recognition scores and takes them as the speech recognition result of the detailed information field.
- Further, the matching unit 56 performs the matching with the speech recognition result of the program title field using only the search result target word strings of the program title field among the search result target word strings stored in the search result target storage unit 53 (FIG. 9).
- The matching unit 56 performs the matching with the voice recognition result of the performer name field using only the search result target word strings of the performer name field among the search result target word strings stored in the search result target storage unit 53.
- Likewise, the matching unit 56 performs the matching with the speech recognition result of the detailed information field using only the search result target word strings of the detailed information field among the search result target word strings stored in the search result target storage unit 53.
- Then, based on the matching results, the output unit 57 selects the search result target word strings whose similarity (for example, the cosine distance or the correction distance) to the speech recognition result is within the top N ranks, and outputs them as search result word strings.
- In FIG. 33, the speech recognition result and the search result target word strings are matched for each field, and the search result target word strings whose similarities are within the top three ranks are output as search result word strings for each of the program title field, the performer name field, and the detailed information field.
- Note that the output unit 57 (FIG. 9) can rank the search result target word strings of each field according to their similarity to the speech recognition result and output those within the top N ranks as search result word strings; alternatively, it can rank the search result target word strings across all fields, in other words, assign an overall ranking, and output the search result target word strings whose overall ranking is within the top N ranks as search result word strings.
- FIG. 34 is a block diagram illustrating a configuration example of a portion of the output unit 57 that obtains the overall ranking.
- the output unit 57 includes an overall score calculation unit 91.
- The overall score calculation unit 91 is supplied with the speech recognition reliability, obtained by the speech recognition unit 51, representing the reliability of the speech recognition result of each field.
- As the speech recognition reliability, for example, the recognition score can be adopted.
- Further, the similarity of the search result target word strings of each field, obtained by the matching unit 56, is supplied to the overall score calculation unit 91.
- The overall score calculation unit 91 comprehensively evaluates the speech recognition reliability of the speech recognition result of each field and the similarity of each search result target word string, and obtains an overall score representing the degree to which the search result target word string matches the word string corresponding to the input speech.
- That is, taking each search result target word string in turn as the word string of interest, the overall score calculation unit 91 normalizes the speech recognition reliability of the speech recognition result and the similarity between the speech recognition result and the word string of interest to values in the range of, for example, 0.0 to 1.0, as necessary.
- Then, the overall score calculation unit 91 obtains, as the overall score of the word string of interest, for example, the weighted average value or the geometric average value of the speech recognition reliability of the speech recognition result and the similarity between the speech recognition result and the word string of interest.
- the overall score calculation unit 91 ranks the search result target word strings in descending order of the overall score.
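- A minimal sketch of the overall score and the resulting overall ranking follows; the equal-weight weighted average is one possibility, as the description does not fix the weights or the normalization:

```python
def overall_score(reliability, similarity, weight=0.5):
    # Weighted average of the normalized speech recognition reliability
    # and the similarity of the word string of interest.
    return weight * reliability + (1.0 - weight) * similarity

# (word string, reliability of its field's recognition result, similarity)
candidates = [("title A", 0.9, 0.70),
              ("performer B", 0.6, 0.95)]
ranked = sorted(candidates,
                key=lambda c: overall_score(c[1], c[2]), reverse=True)
print([name for name, _, _ in ranked])  # ['title A', 'performer B']
```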
- FIG. 35 is a block diagram illustrating a configuration example of the overall score calculation unit 91 in FIG. 34.
- In FIG. 35, the overall score calculation unit 91 includes a program title overall score calculation unit 92, a performer name overall score calculation unit 93, a detailed information overall score calculation unit 94, and a score comparison ranking unit 95.
- The program title overall score calculation unit 92 is supplied with the speech recognition reliability of the speech recognition result of the program title field obtained by the speech recognition unit 51, and with the similarities between the speech recognition result of the program title field and the search result target word strings of the program title field obtained by the matching unit 56.
- The program title overall score calculation unit 92 takes the search result target word strings of the program title field in turn as the word string of interest, obtains the overall score of the word string of interest using the speech recognition reliability of the speech recognition result of the program title field and the similarity between that speech recognition result and the word string of interest, and supplies it to the score comparison ranking unit 95.
- Similarly, the performer name overall score calculation unit 93 is supplied with the speech recognition reliability of the speech recognition result of the performer name field obtained by the speech recognition unit 51, and with the similarities between the speech recognition result of the performer name field and the search result target word strings of the performer name field obtained by the matching unit 56.
- The performer name overall score calculation unit 93 takes the search result target word strings of the performer name field in turn as the word string of interest, obtains the overall score of the word string of interest using the speech recognition reliability of the speech recognition result of the performer name field and the similarity between that speech recognition result and the word string of interest, and supplies it to the score comparison ranking unit 95.
- The detailed information overall score calculation unit 94 likewise takes the search result target word strings of the detailed information field in turn as the word string of interest, obtains the overall score of the word string of interest using the speech recognition reliability of the speech recognition result of the detailed information field and the similarity between that speech recognition result and the word string of interest, and supplies it to the score comparison ranking unit 95.
- The score comparison ranking unit 95 compares the total scores from the program title total score calculation unit 92, the performer name total score calculation unit 93, and the detailed information total score calculation unit 94, arranges them in descending order, and assigns an overall ranking to the search result target word strings in descending order of the total score.
- the output unit 57 outputs the search result target word string whose overall ranking is within the top N ranks as the search result word string.
- As described above, the recognition unit 81 can perform speech recognition using the language model of each field and obtain a speech recognition result for each field; alternatively, the recognition unit 81 can obtain a so-called comprehensive speech recognition result over all fields.
- FIG. 36 is a diagram explaining the processing of the speech search device 50 of FIG. 9 when speech recognition of Japanese input speech is performed using the language model of each field, a comprehensive speech recognition result over all fields is obtained, and matching between the speech recognition result and the search result target word strings is performed for each field.
- In this case, the recognition unit 81 independently performs speech recognition of the Japanese input speech using each of the language model for the program title field, the language model for the performer name field, and the language model for the detailed information field, and obtains speech recognition results for the program title field, the performer name field, and the detailed information field.
- Then, the recognition unit 81 detects one or more speech recognition results having higher recognition scores from among all the speech recognition results of the program title field, the performer name field, and the detailed information field, and uses those speech recognition results as the comprehensive speech recognition result used for the matching in the matching unit 56.
- The matching unit 56 matches the comprehensive speech recognition result against each of the search result target word strings in the program title field, the search result target word strings in the performer name field, and the search result target word strings in the detailed information field among the search result target word strings stored in the search result target storage unit 53 (FIG. 9).
- the output unit 57 (FIG. 9) outputs, as a search result word string, a search result target word string whose similarity with the speech recognition result is within the top N based on the matching result.
- In FIG. 36, the speech recognition result and the search result target word strings are matched for each field, and for each of the program title field, the performer name field, and the detailed information field, the search result target word strings whose similarity ranks within the top three are output as search result word strings.
- FIG. 37 is a diagram explaining the processing of the speech search device 50 of FIG. 10 when speech recognition of English input speech is performed using the language model of each field, a comprehensive speech recognition result over all fields is obtained, and matching between the speech recognition result and the search result target word strings is performed for each field.
- In this case, the recognition unit 81 independently performs speech recognition of the English input speech using each of the language model for the program title field, the language model for the performer name field, and the language model for the detailed information field, and obtains speech recognition results for the program title field, the performer name field, and the detailed information field.
- Then, the recognition unit 81 detects one or more speech recognition results having higher recognition scores from among all the speech recognition results of the program title field, the performer name field, and the detailed information field, and uses those speech recognition results as the comprehensive speech recognition result used for the matching in the matching unit 56.
- The matching unit 56 matches the comprehensive speech recognition result against each of the search result target word strings in the program title field, the search result target word strings in the performer name field, and the search result target word strings in the detailed information field among the search result target word strings stored in the search result target storage unit 53 (FIG. 10).
- the output unit 57 (FIG. 10) outputs, as a search result word string, a search result target word string whose similarity to the speech recognition result is within the top N based on the matching result.
- In FIG. 37, the speech recognition result and the search result target word strings are matched for each field, and for each of the program title field, the performer name field, and the detailed information field, the search result target word strings whose similarity ranks within the top three are output as search result word strings.
- The output unit 57 can also rank the search result target word strings without depending on the field (over all fields), assign an overall ranking, and output the search result target word strings whose overall ranking is within the top N as search result word strings.
- FIG. 38 is a block diagram illustrating a configuration example of a portion for obtaining the overall ranking of the output unit 57 when the recognition unit 81 obtains a comprehensive speech recognition result.
- the output unit 57 includes a similarity comparison ranking unit 96.
- the similarity comparison ranking unit 96 is supplied with the similarity of the search result target word string in each field, which is obtained by the matching unit 56.
- Since the recognition score as the speech recognition reliability obtained by the recognition unit 81 is the recognition score of the comprehensive speech recognition result and does not exist for each field, it is not supplied to the similarity comparison ranking unit 96.
- The similarity comparison ranking unit 96 compares all the similarities of the search result target word strings in the program title field, the search result target word strings in the performer name field, and the search result target word strings in the detailed information field, arranges them in descending order, and assigns an overall ranking to the search result target word strings in descending order of similarity.
- the output unit 57 outputs the search result target word string whose overall ranking is within the top N ranks as the search result word string.
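- A minimal sketch of this overall ranking, assuming the matching unit has produced, for each field, a list of (search result target word string, similarity) pairs (all names hypothetical):

```python
def overall_ranking(similarities_per_field, top_n):
    """Merge the per-field similarity lists and rank all search result
    target word strings in descending order of similarity."""
    merged = [pair for pairs in similarities_per_field.values() for pair in pairs]
    merged.sort(key=lambda pair: pair[1], reverse=True)  # descending similarity
    return merged[:top_n]

# Usage with hypothetical per-field matching results:
results = overall_ranking({
    "title": [("news program", 0.91)],
    "cast": [("some performer", 0.84)],
    "info": [("weather info", 0.67)],
}, top_n=2)
```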
- FIG. 39 is a diagram showing an example of a search result word string display screen output by the output unit 57 (FIGS. 9 and 10).
- On the search result word string display screen (hereinafter also referred to as the search result display screen), the part of a search result word string, such as a word or syllable, that matches (or is similar to) the speech recognition result of the input speech (hereinafter also referred to as the utterance-corresponding portion) can be highlighted.
- FIG. 39 shows a search result display screen displayed without emphasizing the utterance corresponding portion and a search result display screen displayed with the utterance corresponding portion highlighted.
- Methods for emphasizing the utterance-corresponding portion include, for example, displaying the utterance-corresponding portion blinking, displaying it in a different color, and displaying it in a different font type or size.
- Not all of the utterance-corresponding portion needs to be emphasized; only a part of it, such as a part with high reliability (speech recognition reliability) in the speech recognition result, may be emphasized and displayed.
- The search result display screen can also display only the utterance-corresponding portion of each search result word string and the parts immediately before and after it.
- By highlighting the utterance-corresponding portion (or a part of it) of the search result word strings on the search result display screen, the user can grasp whether the speech recognition was performed correctly and decide whether to restate the utterance.
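- As a sketch of such highlighting, assuming the matching step can report the character span of the utterance-corresponding portion (all names hypothetical), a display layer could mark it up as follows:

```python
def render_with_highlight(result_word_string, matched_span, highlight=True):
    """Wrap the utterance-corresponding portion of a search result word
    string in markers so a display layer can emphasize it (blink, color,
    font, etc. are left to the actual renderer)."""
    if not highlight:
        return result_word_string
    start, end = matched_span  # character span matching the recognition result
    return (result_word_string[:start] + "[" +
            result_word_string[start:end] + "]" +
            result_word_string[end:])

print(render_with_highlight("world heritage special", (0, 14)))
# -> [world heritage] special
```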
- FIG. 40 and FIG. 41 are diagrams illustrating an example of a voice search using an input voice including a specific phrase.
- The command determination unit 71 determines, based on the speech recognition result supplied from the speech recognition unit 51, whether the input speech from the user is a command for controlling the recorder.
- That is, the command determination unit 71 stores character strings defined as commands for controlling the recorder (hereinafter also referred to as command character strings), and determines whether the input speech from the user is a command for controlling the recorder depending on whether the speech recognition result from the speech recognition unit 51 matches a command character string.
- When the command determination unit 71 determines that the input speech is not a command, that is, when the speech recognition result from the speech recognition unit 51 does not match any command character string, it supplies the determination result indicating that the input speech is not a command to the control unit 72.
- In this case, the control unit 72 controls the matching unit 56 so as to execute matching, for example; therefore, in the voice search device 50, the matching unit 56 performs matching between the speech recognition result and the search result target word strings, and the output unit 57 outputs search result word strings based on the matching result.
- When the command determination unit 71 determines that the input speech is a command, that is, when the speech recognition result from the speech recognition unit 51 matches a command character string, it supplies the determination result that the input speech is a command to the control unit 72, together with the command character string that matches the speech recognition result.
- In this case, the control unit 72 performs control to limit the processing of the voice search device 50; therefore, in the voice search device 50, the matching unit 56 does not perform matching, and no search result word string is output.
- Further, the control unit 72 performs processing such as controlling the recorder function unit 60 according to the command interpreted from the command character string supplied from the command determination unit 71.
- For example, when the command determination unit 71 stores a command character string “select”, interpreted as a command for selecting a program to be played from the recorded programs, and a command character string “playback”, interpreted as a command for playing a program, and the speech recognition unit 51 outputs a speech recognition result “playback” that matches the command character string “playback”, the control unit 72 controls the recorder function unit 60 to play a program, for example, according to the command interpreted from the command character string “playback”.
- When performing a voice search, by having the user utter input speech that includes a specific phrase instructing a voice search, for example “by voice search”, a voice search can be performed even with a keyword that matches a command character string.
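- The dispatch between recorder control and voice search described above might look as follows; this is a simplified sketch, and the command set, the specific phrase constant, and the function names are hypothetical:

```python
COMMANDS = {"select", "playback"}        # stored command character strings
SEARCH_PHRASE = "by voice search"        # specific phrase instructing a search

def dispatch(recognition_result):
    """Treat any utterance containing the specific phrase as a search, so
    even a keyword matching a command character string can be searched."""
    if SEARCH_PHRASE in recognition_result:
        keyword = recognition_result.replace(SEARCH_PHRASE, "").strip()
        return ("search", keyword)       # matching unit runs on the keyword
    if recognition_result in COMMANDS:
        return ("command", recognition_result)  # control unit drives the recorder
    return ("search", recognition_result)
```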
- However, suppose the speech recognition unit 51 is supplied with input speech including the specific phrase, such as “playback by program search”, and performs speech recognition of that input speech. If the language model is generated using only the search result target word strings, a speech recognition result that matches the input speech including the specific phrase may not be output; that is, a word string including the specific phrase may not be output as the speech recognition result for input speech that includes the specific phrase.
- For the speech recognition unit 51 to obtain a speech recognition result including the specific phrase for input speech that includes the specific phrase, the language score of recognition hypotheses including the specific phrase must be prevented from being lowered.
- Therefore, the language model generation unit 85 generates, using the specific phrase together with the search result target word strings stored in the search result target storage unit 53 (FIG. 9), a language model that gives a high language score when the specific phrase and the words constituting a search result target word string are arranged side by side (hereinafter also referred to as the specific phrase language model).
- Here, it is assumed that the command character strings are included in the search result target word strings stored in the search result target storage unit 53 (FIG. 9).
- The language model generation unit 85 also generates, using only the search result target word strings stored in the search result target storage unit 53 (FIG. 9) without using the specific phrase, that is, using word strings that do not include the specific phrase, a phraseless language model, which is a language model separate from the specific phrase language model.
- According to the specific phrase language model, a higher value is given as the language score of a recognition hypothesis (word string) that includes the specific phrase than as the language score of a recognition hypothesis that does not include it.
- Conversely, according to the phraseless language model, a higher value is given as the language score of a recognition hypothesis (word string) that does not include the specific phrase than as the language score of a word string that includes it.
- The speech recognition unit 51 performs speech recognition using both the specific phrase language model and the phraseless language model.
- According to the phraseless language model, a high language score is given to recognition hypotheses in which only the words constituting a search result target word string are arranged, while according to the specific phrase language model, a high language score (and acoustic score) is given to recognition hypotheses in which the specific phrase and the words constituting a search result target word string are arranged.
- This prevents the case where, for input speech including the specific phrase, a recognition hypothesis including the specific phrase fails to be output as the speech recognition result because its language score is low.
- FIG. 40 shows an example of a voice search when the speech recognition unit 51 of FIG. 9 performs speech recognition of Japanese input speech using the specific phrase language model and the phraseless language model.
- When the user utters the input speech “playback by voice search” including the specific phrase “by voice search”, the speech recognition unit 51 performs speech recognition of the input speech “playback by voice search”.
- Since the speech recognition unit 51 performs speech recognition using the specific phrase language model, the language score (and acoustic score), and hence the recognition score, of the recognition hypothesis “playback by voice search” including the specific phrase is sufficiently higher than it would be if the specific phrase language model were not used, and the recognition hypothesis “playback by voice search” including the specific phrase is output as the speech recognition result.
- the voice recognition result “playback by voice search” output from the voice recognition unit 51 is supplied to the phonetic symbol conversion unit 52 and the command determination unit 71.
- Since the speech recognition result “playback by voice search” does not match any command character string, the command determination unit 71 determines that the input speech is not a command, and the control unit 72 therefore does not perform control to restrict the processing of the voice search device 50.
- The phonetic symbol conversion unit 52 converts the speech recognition result “playback by voice search” from the speech recognition unit 51 into a recognition result pronunciation symbol string and supplies it to the matching unit 56.
- The search result target pronunciation symbol strings of the search result target word strings are supplied from the search result target storage unit 53 to the matching unit 56 via the morpheme analysis unit 54 and the pronunciation symbol conversion unit 55.
- Since the speech recognition result “playback by voice search” includes the specific phrase, the matching unit 56 removes the specific phrase from the recognition result pronunciation symbol string and performs matching between the recognition result pronunciation symbol string after the removal and the search result target pronunciation symbol strings.
- the matching unit 56 supplies the output unit 57 with the similarity as a matching result between the recognition result pronunciation symbol string and the search result target pronunciation symbol string.
- the output unit 57 outputs, as a search result word string, a search result target word string whose similarity is within the top N, based on the similarity as the matching result from the matching unit 56.
- In FIG. 40, the program titles as the search result target word strings whose similarity is within the top two are output as the search result word strings for the input speech “playback by voice search” including the specific phrase.
- In the matching unit 56, as described above, the recognition result pronunciation symbol string with the specific phrase removed is matched against the search result target pronunciation symbol strings; that is, the speech recognition result with the specific phrase removed is matched against the search result target word strings, and based on the matching result, the search result target word strings that match the speech recognition result with the specific phrase removed are output as search result word strings.
- Consequently, the output search result word strings are the result of searching the search result target word strings for the word string corresponding to the speech obtained by removing the specific phrase from the input speech.
- On the other hand, when the user utters the input speech “playback”, which does not include the specific phrase, the speech recognition unit 51 recognizes the input speech “playback” and supplies the speech recognition result “playback” to the phonetic symbol conversion unit 52 and the command determination unit 71.
- In this case, the command determination unit 71 determines that the input speech is a command, and supplies the determination result to the control unit 72 together with the command character string “playback” that matches the speech recognition result.
- the control unit 72 performs control to limit processing of the voice search device 50 when a determination result that the input voice is a command is supplied from the command determination unit 71. Therefore, the voice search device 50 does not perform a voice search and does not output a search result word string.
- Further, the control unit 72 controls the recorder function unit 60 to play a program according to the command interpreted from the command character string “playback” supplied from the command determination unit 71.
- FIG. 41 shows an example of a voice search when the speech recognition unit 51 of FIG. 10 performs speech recognition of English input speech using the specific phrase language model and the phraseless language model.
- When the user utters the input speech “Program Search, Play” including the specific phrase “Program Search”, the speech recognition unit 51 performs speech recognition of that input speech.
- Since the speech recognition unit 51 performs speech recognition using the specific phrase language model, the language score (and acoustic score), and hence the recognition score, of the recognition hypothesis “Program Search, Play” including the specific phrase is sufficiently higher than it would be if the specific phrase language model were not used, and the recognition hypothesis “Program Search, Play” including the specific phrase is output as the speech recognition result.
- the voice recognition result “Program Search, Play” output by the voice recognition unit 51 is supplied to the phonetic symbol conversion unit 52 and the command determination unit 71.
- Since the speech recognition result “Program Search, Play” does not match any command character string, the command determination unit 71 determines that the input speech is not a command, and the control unit 72 therefore does not perform control to restrict the processing of the voice search device 50.
- the phonetic symbol conversion unit 52 converts the voice recognition result “Program Search, Play” from the voice recognition unit 51 into a recognition result phonetic symbol string and supplies it to the matching unit 56.
- The search result target pronunciation symbol strings of the search result target word strings are supplied from the search result target storage unit 53 to the matching unit 56 via the pronunciation symbol conversion unit 55.
- Since the speech recognition result “Program Search, Play” includes the specific phrase, the matching unit 56 removes the specific phrase from the recognition result pronunciation symbol string and performs matching between the recognition result pronunciation symbol string after the removal and the search result target pronunciation symbol strings.
- the matching unit 56 supplies the output unit 57 with the similarity as a matching result between the recognition result pronunciation symbol string and the search result target pronunciation symbol string.
- the output unit 57 outputs, as a search result word string, a search result target word string whose similarity is within the top N, based on the similarity as the matching result from the matching unit 56.
- In FIG. 41, the program titles as the search result target word strings whose similarity is within the top two are output as the search result word strings for the input speech “Program Search, Play” including the specific phrase.
- On the other hand, when the user utters the input speech “Play”, which does not include the specific phrase, the speech recognition unit 51 recognizes the input speech “Play” and supplies the speech recognition result “Play” to the phonetic symbol conversion unit 52 and the command determination unit 71.
- In this case, the command determination unit 71 determines that the input speech is a command, and supplies the determination result to the control unit 72 together with the command character string “Play” that matches the speech recognition result.
- the control unit 72 performs control to limit processing of the voice search device 50 when a determination result that the input voice is a command is supplied from the command determination unit 71. Therefore, the voice search device 50 does not perform a voice search and does not output a search result word string.
- Further, the control unit 72 controls the recorder function unit 60 to play a program according to the command interpreted from the command character string “Play” supplied from the command determination unit 71.
- As described above, since the speech recognition unit 51 performs speech recognition using both the specific phrase language model and the phraseless language model, both input speech that includes the specific phrase and input speech that does not include it can be recognized with high accuracy.
- Then, when performing a voice search, by having the user make an utterance including the specific phrase, it is possible to distinguish whether the user's utterance is a voice search request or a command for controlling the recorder, and a voice search can be performed even with a keyword that matches a command character string.
- the voice search and the control of the recorder can be switched depending on whether or not a specific phrase is included in the user's utterance (or whether or not the user's utterance matches the command character string).
- In the case described above, the command character strings are included in the search result target word strings, and the language model generation unit 85 generates the phraseless language model using only the search result target word strings, without using the specific phrase; however, as the phraseless language model, a language model generated using only the command character strings can also be employed, for example.
- Also, in the case described above, the command determination unit 71 determines whether the input speech from the user is a command depending on whether the speech recognition result from the speech recognition unit 51 matches a command character string; however, whether the input speech is a command for controlling the recorder can also be determined based on the matching result of the matching unit 56.
- That is, when the command character strings are included in the search result target word strings, the matching unit 56 can match the search result target pronunciation symbol strings of the search result target word strings against the entire recognition result pronunciation symbol string of the speech recognition result and supply the matching result to the command determination unit 71.
- Based on the matching result from the matching unit 56, the command determination unit 71 determines that the input speech is a command if the search result target word string with the highest similarity obtained by the matching against the entire speech recognition result (its recognition result pronunciation symbol string) matches a command character string, and determines that the input speech is not a command if that highest-ranked search result target word string does not match any command character string.
- When the command determination unit 71 determines that the input speech is a command, the control unit 72 performs processing according to the command and limits the output, by the output unit 57, of search result word strings based on the matching result of the matching unit 56.
- In this case, when the speech recognition result of the input speech includes the specific phrase, the control unit 72 controls the matching unit 56 to remove the specific phrase from the recognition result pronunciation symbol string and to match the recognition result pronunciation symbol string after the removal against the search result target pronunciation symbol strings, and controls the output unit 57 to output search result word strings based on the matching result of the matching unit 56.
- In this case, the command determination unit 71 can determine whether the input speech is a command regardless of whether the specific phrase is included in the input speech, so when performing a voice search the user may utter input speech consisting only of the search keyword, without including the specific phrase (the user does not have to speak the specific phrase to perform a voice search).
- When the speech recognition result does not include the specific phrase, the control unit 72 controls the output unit 57 to output search result word strings based on the matching result, already obtained by the matching unit 56, between the entire speech recognition result and the search result target word strings.
- FIGS. 42 and 43 are diagrams showing another example of a voice search using input speech including a specific phrase.
- When the search result target word strings are classified into a plurality of fields such as the program title field, the performer name field, and the detailed information field, the speech recognition unit 51 (FIG. 9 (and FIG. 10)) generates the language model for the program title field, the language model for the performer name field, and the language model for the detailed information field, that is, a language model for each field; speech recognition is then performed using the language model of each field, and a speech recognition result can be obtained for each field.
- Further, the speech recognition unit 51 can detect one or more speech recognition results having higher recognition scores from among all the speech recognition results of the program title field, the performer name field, and the detailed information field, and use those speech recognition results as the comprehensive speech recognition result used for the matching in the matching unit 56.
- The matching unit 56 (FIG. 9) can then match the search result target word strings against the speech recognition result for each field, and based on the matching result for each field, the output unit 57 (FIG. 9) can output the search result target word strings that match the speech recognition result as the search result word strings of the program title field, the performer name field, and the detailed information field.
- In this case, when performing a voice search, by having the user utter input speech including a specific phrase that instructs a voice search and also represents the field of the search result target word strings to be matched against the speech recognition result, such as “by program name search” or “by person name search”, the field of the search result target word strings matched against the speech recognition result can be limited to the field specified by the specific phrase.
- In this case, the language model generation unit 85 of the speech recognition unit 51 generates, for each field, a language model using the search result target word strings stored in the search result target storage unit 53 (FIG. 9) and a field phrase, which is a specific phrase representing the field.
- That is, the language model generation unit 85 generates the language model for the program title field using, as the field phrase representing the program title field, for example “by program name search” or “Program Title Search by”, together with the search result target word strings in the program title field.
- Similarly, the language model generation unit 85 generates the language model for the performer name field using, as the field phrase representing the performer name field, for example “by person name search” or “Cast Search by”, together with the search result target word strings in the performer name field, and generates the language model for the detailed information field using, as the field phrase representing the detailed information field, for example “by detailed information search” or “Information Search by”, together with the search result target word strings in the detailed information field.
- According to the language model for the program title field, a high language score is given to recognition hypotheses in which the field phrase “by program name search” or “Program Title Search by” of the program title field and the words constituting a search result target word string in the program title field are arranged side by side.
- the speech recognition unit 51 performs speech recognition using a language model for the program title field, a language model for the performer name field, and a language model for the detailed information field.
- According to these language models, a high language score is given to recognition hypotheses in which the field phrase “by program name search” or “Program Title Search by” of the program title field and the words constituting a search result target word string in the program title field are arranged side by side, to recognition hypotheses in which the field phrase “by person name search” or “Cast Search by” of the performer name field and the words constituting a search result target word string in the performer name field are arranged side by side, and to recognition hypotheses in which the field phrase “by detailed information search” or “Information Search by” of the detailed information field and the words constituting a search result target word string in the detailed information field are arranged side by side.
- As a result, input speech including a field phrase can be recognized with high accuracy.
- Then, the matching unit 56 performs matching against the speech recognition result obtained by the speech recognition unit 51 (FIG. 29) using only the search result target word strings in the field represented by the field phrase included in the speech recognition result (the field of the language model used to obtain the speech recognition result), and the output unit 57 outputs search result word strings based on the matching result.
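- A minimal sketch of this field-limited matching, with hypothetical field phrases and a pluggable similarity function:

```python
FIELD_PHRASES = {
    "by program name search": "title",
    "by person name search": "cast",
    "by detailed information search": "info",
}

def field_limited_match(recognition_result, candidates_by_field, similarity):
    """Detect a field phrase in the recognition result, strip it, and match
    only against the search result target word strings of that field."""
    for phrase, field in FIELD_PHRASES.items():
        if phrase in recognition_result:
            keyword = recognition_result.replace(phrase, "").strip()
            return [(c, similarity(keyword, c))
                    for c in candidates_by_field[field]]
    # no field phrase: match against the candidates of every field
    return [(c, similarity(recognition_result, c))
            for field_candidates in candidates_by_field.values()
            for c in field_candidates]
```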
- FIG. 42 shows an example of a voice search in which the speech recognition unit 51 of FIG. 9 performs speech recognition of Japanese input speech using the language model of each field, and the matching unit 56 matches only the search result target word strings in the field represented by the field phrase included in the speech recognition result against the speech recognition result.
- For example, when the user utters the input speech “XX by program name search”, the speech recognition unit 51 performs speech recognition of the input speech “XX by program name search”.
- Since the speech recognition unit 51 performs speech recognition using the language model for the program title field, the language model for the performer name field, and the language model for the detailed information field, the language score (and acoustic score), and hence the recognition score, of the recognition hypothesis “XX by program name search”, which includes the field phrase “by program name search” of the program title field, is sufficiently higher than the recognition scores of recognition hypotheses that do not include the field phrase of the program title field (including recognition hypotheses that include a field phrase of another field), and the recognition hypothesis “XX by program name search” including the field phrase of the program title field becomes the speech recognition result.
- The speech recognition result “XX by program name search” output by the speech recognition unit 51 is converted into a recognition result pronunciation symbol string via the pronunciation symbol conversion unit 52 and supplied to the matching unit 56.
- The search result target pronunciation symbol strings of the search result target word strings are supplied from the search result target storage unit 53 to the matching unit 56 via the morpheme analysis unit 54 and the pronunciation symbol conversion unit 55.
- Since the speech recognition result includes a field phrase, the matching unit 56 removes the field phrase from the recognition result pronunciation symbol string, and matches the recognition result pronunciation symbol string after the removal against only the search result target pronunciation symbol strings of the search result target word strings in the field represented by the field phrase that was included in the recognition result pronunciation symbol string.
- the matching unit 56 supplies the output unit 57 with the similarity as a matching result between the recognition result pronunciation symbol string and the search result target pronunciation symbol string.
- That is, for the speech recognition result “XX by program name search” including the field phrase of the program title field, the matching unit 56 matches the speech recognition result (with the field phrase removed) against only the search result target word strings in the program title field.
- the output unit 57 outputs, as a search result word string, a search result target word string whose similarity is within the top N, based on the similarity as the matching result from the matching unit 56.
- Further, for example, when the user utters the input speech “XX by person name search”, the speech recognition unit 51 performs speech recognition of the input speech “XX by person name search”.
- Since the speech recognition unit 51 performs speech recognition using the language model for the program title field, the language model for the performer name field, and the language model for the detailed information field, the language score (and acoustic score), and hence the recognition score, of the recognition hypothesis “XX by person name search”, which includes the field phrase “by person name search” of the performer name field, is sufficiently higher than the recognition scores of recognition hypotheses that do not include the field phrase of the performer name field, and the recognition hypothesis “XX by person name search” including the field phrase of the performer name field becomes the speech recognition result.
- The speech recognition result “XX by person name search” output by the speech recognition unit 51 is converted into a recognition result pronunciation symbol string via the pronunciation symbol conversion unit 52 and supplied to the matching unit 56.
- The search result target pronunciation symbol strings of the search result target word strings are supplied from the search result target storage unit 53 to the matching unit 56 via the morpheme analysis unit 54 and the pronunciation symbol conversion unit 55.
- Since the speech recognition result includes a field phrase, the matching unit 56 removes the field phrase from the recognition result pronunciation symbol string, and matches the recognition result pronunciation symbol string after the removal against only the search result target pronunciation symbol strings of the search result target word strings in the field represented by the field phrase that was included in the recognition result pronunciation symbol string.
- the matching unit 56 supplies the output unit 57 with the similarity as a matching result between the recognition result pronunciation symbol string and the search result target pronunciation symbol string.
- That is, for the speech recognition result “XX by person name search” including the field phrase of the performer name field, the matching unit 56 matches the speech recognition result (with the field phrase removed) against only the search result target word strings in the performer name field.
- the output unit 57 outputs, as a search result word string, a search result target word string whose similarity is within the top N, based on the similarity as the matching result from the matching unit 56.
- In this case, the search result target word strings in the performer name field are matched against the character string “XX” obtained by removing the field phrase from the speech recognition result “XX by person name search”, and as a result, programs whose performer name matches the character string “XX” are output as search result word strings.
- FIG. 43 shows an example of a voice search in which the speech recognition unit 51 of FIG. 10 performs speech recognition of English input speech using the language model of each field, and the matching unit 56 matches only the search result target word strings in the field represented by the field phrase included in the speech recognition result against the speech recognition result.
- For example, when the user utters the input speech “Program Title Search by XX”, the speech recognition unit 51 performs speech recognition of the input speech “Program Title Search by XX”.
- Since the speech recognition unit 51 performs speech recognition using the language model for the program title field, the language model for the performer name field, and the language model for the detailed information field, the language score (and acoustic score), and hence the recognition score, of the recognition hypothesis “Program Title Search by XX”, which includes the field phrase “Program Title Search by” of the program title field, is sufficiently higher than the recognition scores of recognition hypotheses that do not include the field phrase of the program title field (including recognition hypotheses that include a field phrase of another field), so the recognition hypothesis “Program Title Search by XX” including the field phrase of the program title field becomes the speech recognition result, and recognition hypotheses that do not include the field phrase of the program title field can be prevented from becoming the speech recognition result.
- The speech recognition result “Program Title Search by XX” output by the speech recognition unit 51 is converted into a recognition result pronunciation symbol string via the pronunciation symbol conversion unit 52 and supplied to the matching unit 56.
- The search result target pronunciation symbol strings of the search result target word strings are supplied from the search result target storage unit 53 to the matching unit 56 via the pronunciation symbol conversion unit 55.
- Since the speech recognition result includes a field phrase, the matching unit 56 removes the field phrase from the recognition result pronunciation symbol string, and matches the recognition result pronunciation symbol string after the removal against only the search result target pronunciation symbol strings of the search result target word strings in the field represented by the field phrase that was included in the recognition result pronunciation symbol string.
- the matching unit 56 supplies the output unit 57 with the similarity as a matching result between the recognition result pronunciation symbol string and the search result target pronunciation symbol string.
- That is, for the speech recognition result “Program Title Search by XX” including the field phrase of the program title field, the matching unit 56 matches the speech recognition result (with the field phrase removed) against only the search result target word strings in the program title field.
- the output unit 57 outputs, as a search result word string, a search result target word string whose similarity is within the top N, based on the similarity as the matching result from the matching unit 56.
- Further, for example, when the user utters the input speech “Cast Search by XX”, the speech recognition unit 51 performs speech recognition of the input speech “Cast Search by XX”.
- Since the speech recognition unit 51 performs speech recognition using the language model for the program title field, the language model for the performer name field, and the language model for the detailed information field, the language score (and acoustic score), and hence the recognition score, of the recognition hypothesis “Cast Search by XX”, which includes the field phrase “Cast Search by” of the performer name field, is sufficiently higher than the recognition scores of recognition hypotheses that do not include the field phrase “Cast Search by” of the performer name field, so the recognition hypothesis “Cast Search by XX” including the field phrase of the performer name field becomes the speech recognition result, and recognition hypotheses that do not include the field phrase of the performer name field can be prevented from becoming the speech recognition result.
- The speech recognition result “Cast Search by XX” output by the speech recognition unit 51 is converted into a recognition result pronunciation symbol string via the pronunciation symbol conversion unit 52 and supplied to the matching unit 56.
- The search result target pronunciation symbol strings of the search result target word strings are supplied from the search result target storage unit 53 to the matching unit 56 via the pronunciation symbol conversion unit 55.
- Since the speech recognition result includes a field phrase, the matching unit 56 removes the field phrase from the recognition result pronunciation symbol string, and matches the recognition result pronunciation symbol string after the removal against only the search result target pronunciation symbol strings of the search result target word strings in the field represented by the field phrase that was included in the recognition result pronunciation symbol string.
- the matching unit 56 supplies the output unit 57 with the similarity as a matching result between the recognition result pronunciation symbol string and the search result target pronunciation symbol string.
- That is, for the speech recognition result “Cast Search by XX” including the field phrase of the performer name field, the matching unit 56 matches the speech recognition result (with the field phrase removed) against only the search result target word strings in the performer name field.
- the output unit 57 outputs, as a search result word string, a search result target word string whose similarity is within the top N, based on the similarity as the matching result from the matching unit 56.
- In this case, the search result target word strings in the performer name field are matched against the character string “XX” obtained by removing the field phrase from the speech recognition result “Cast Search by XX”, and as a result, programs whose performer name matches the character string “XX” are output as search result word strings.
- As the field phrase, not only a phrase representing a single field but also a phrase representing a plurality of fields can be adopted.
- Further, as a field of the search result target word strings, a field to which the commands for controlling the recorder of FIG. 9 (and FIG. 10) belong can be adopted. In this case, whether the input speech is a command can be determined based on the field phrase included in the speech recognition result, and when the input speech is a command, the matching unit 56 can perform matching to search for the type of the command (what kind of processing the command requests).
- FIG. 44 is a diagram showing search result target vectors and vector substitution information.
- When the cosine distance or the correction distance is obtained as the similarity, a search result target vector representing the search result target pronunciation symbol string and a recognition result vector representing the recognition result pronunciation symbol string are used; if the search result target word strings stored in the search result target storage unit 53 (FIG. 9) were converted into search result target vectors each time a speech recognition result is obtained, the matching would take time, which hinders speeding up of the matching.
- Therefore, the search result target vectors necessary for calculating the similarity can be obtained in advance from the search result target word strings stored in the search result target storage unit 53 (FIG. 9) and stored in a memory (not shown) built into the matching unit 56, thereby speeding up the matching.
- However, the search result target vector is a C-dimensional vector, where the number C of types of pronunciation symbols is about 100 to 300; if the number of search result target word strings is Z, the memory built into the matching unit 56 would need a storage capacity sufficient to store the C × Z components of the search result target vectors.
- the search result target vector is generally a sparse vector, that is, a vector in which most components are zero.
- Therefore, in the matching unit 56, for each search result target vector, only the pronunciation symbols of the syllables corresponding to the non-zero components of the vector (or, when a syllable two-chain is used as the unit of matching, the pronunciation symbol strings of the syllable two-chains corresponding to the non-zero components), or IDs (Identifications) identifying them, can be stored in the built-in memory.
- When, for example, the frequency (tf) with which the syllable corresponding to a component appears in the search result target pronunciation symbol string is adopted as the value of that component of the search result target vector, only the pairs of the syllable corresponding to each non-zero component of the search result target vector (an ID identifying it) and the appearance frequency of that syllable (the value of the component) are stored in the memory built into the matching unit 56.
- If the number of non-zero components in the search result target vector of the i-th search result target word string is K(i), the memory built into the matching unit 56 only needs a storage capacity sufficient to store K(1) + K(2) + ... + K(Z) pronunciation symbols.
- When the values of the components of the search result target vector are the binary values 0 and 1, one component can be expressed by 1 bit, whereas a pronunciation symbol, which can take about 100 to 300 values as described above, requires 7 to 9 bits.
- Even so, by storing in the built-in memory, for each search result target vector, only the pronunciation symbols of the syllables corresponding to the non-zero components of that vector, the matching unit 56 can reduce the required storage capacity compared with the case where the search result target vectors themselves are stored.
- Hereinafter, the syllable pronunciation symbols corresponding to the non-zero components of a search result target vector, stored in the memory built into the matching unit 56, are also referred to as vector substitution information, since they are information that substitutes for the search result target vector.
- FIG. 44 shows a search result target vector and vector substitution information replacing the search result target vector.
- the value of the component of the search result target vector is 1 or 0 depending on whether the syllable corresponding to the component exists in the search result target pronunciation symbol string.
- the vector substitution information that replaces the search result target vector is composed only of syllable pronunciation symbols corresponding to non-zero components of the search result target vector.
- In FIG. 44, pronunciation symbols of the same syllable that appears multiple times in a search result target word string are distinguished by attaching a parenthesized number. That is, in the search result target word string “SEKAI-san”, the pronunciation symbol of the same syllable “I” appears twice; in the vector substitution information, the first pronunciation symbol of the syllable “I” is represented as “I”, and the second is represented as “I(2)”, with the parenthesized number “(2)” indicating that it is the second occurrence, so that the two occurrences of the syllable “I” are distinguished.
- Pronunciation symbols of the same syllable that appears multiple times in a search result target word string can also be expressed without such distinction; for example, the pronunciation symbol of the syllable “I” that appears twice in the search result target word string “SEKAI-san” can be expressed in the vector substitution information by the pair (I, 2) of the syllable “I” (its identification ID) and the frequency “2” with which the syllable “I” appears.
- When the vector substitution information is stored in the memory built into the matching unit 56 instead of the search result target vectors, it is not necessary during matching to access the 0 components of the search result target vectors (to read the 0 components from the memory), as would be necessary if the search result target vectors themselves were stored; thus the memory capacity can be reduced and the matching can be speeded up.
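- As a sketch, the vector substitution information for one search result target word string can be held as a mapping from pronunciation symbol to appearance frequency; the romanized syllable strings below are hypothetical stand-ins:

```python
from collections import Counter

def vector_substitution_info(pronunciation_symbols):
    """Keep only the non-zero components of the sparse search result target
    vector, as pairs of pronunciation symbol and appearance frequency (tf)."""
    return dict(Counter(pronunciation_symbols))

# the syllable "I" appearing twice is stored as the single pair ("I", 2)
info = vector_substitution_info(["SE", "KA", "I", "I", "SA", "N"])
# {'SE': 1, 'KA': 1, 'I': 2, 'SA': 1, 'N': 1}
```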
- FIG. 45 is a diagram explaining the calculation of the similarity between the speech recognition result and a search result target word string when the vector substitution information is stored in the memory built into the matching unit 56 instead of the search result target vectors.
- In the same way that a search result target word string (its search result target pronunciation symbol string) is expressed by vector substitution information instead of the search result target vector, the speech recognition result (its recognition result pronunciation symbol string) is also expressed by vector substitution information instead of the recognition result vector.
- When the cosine distance is obtained as the similarity, the magnitude |V_UTR| of the recognition result vector V_UTR and the magnitude |V_TITLE(i)| of the search result target vector V_TITLE(i) are further required. The magnitude |V_UTR| of the recognition result vector V_UTR can be obtained by calculating the square root of the number of pronunciation symbols constituting the vector substitution information of the speech recognition result, and the magnitude |V_TITLE(i)| of the search result target vector V_TITLE(i) is obtained in the same manner as the magnitude |V_UTR| of the recognition result vector V_UTR.
- The inner product V_UTR·V_TITLE(i) of the recognition result vector V_UTR and the search result target vector V_TITLE(i) can be obtained by setting the initial value of the inner product V_UTR·V_TITLE(i) to 0, taking the pronunciation symbols constituting the vector substitution information of the speech recognition result in turn as the symbol of interest, and incrementing the inner product V_UTR·V_TITLE(i) by 1 whenever the vector substitution information of the search result target word string contains a pronunciation symbol that matches the symbol of interest.
- As described above, the cosine distance and the correction distance as the similarity between the speech recognition result and a search result target word string can be obtained using the vector substitution information of the speech recognition result and of the search result target word string.
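- For the binary-component case, a sketch of the cosine-distance similarity computed directly from the two pieces of vector substitution information (here treated as plain sets of pronunciation symbols; all names hypothetical):

```python
import math

def cosine_similarity(utr_symbols, title_symbols):
    """Similarity between recognition result and search result target,
    computed from vector substitution information with binary components:
    the magnitude of each vector is the square root of its symbol count."""
    if not utr_symbols or not title_symbols:
        return 0.0
    inner = len(utr_symbols & title_symbols)  # matching-symbol count
    return inner / (math.sqrt(len(utr_symbols)) * math.sqrt(len(title_symbols)))
```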
- However, with the method of obtaining the inner product V_UTR·V_TITLE(i) by incrementing it by 1 whenever the vector substitution information of the search result target word string contains a pronunciation symbol that matches the symbol of interest among the pronunciation symbols constituting the vector substitution information of the speech recognition result (hereinafter also referred to as the first inner product calculation method), every pronunciation symbol constituting the vector substitution information of the search result target word strings stored in the memory built into the matching unit 56 must be accessed to check whether it matches the symbol of interest.
- That is, among the pronunciation symbols constituting the vector substitution information of the search result target word strings, even pronunciation symbols that do not match any pronunciation symbol constituting the vector substitution information of the speech recognition result must be accessed, so the calculation of the inner product V_UTR·V_TITLE(i), and hence the matching, takes time.
- Therefore, the matching unit 56 can create in advance, from the vector substitution information of the search result target word strings, a reverse index with which the search result target word strings having a given pronunciation symbol in their vector substitution information can be looked up from that pronunciation symbol, and the inner product V_UTR·V_TITLE(i) can then be calculated using the reverse index.
- Whereas the vector substitution information is an index with which the syllable pronunciation symbols of a search result target word string can be looked up from the search result target word string, the reverse index allows the reverse lookup, that is, the search result target word strings having a given pronunciation symbol in their vector substitution information can be looked up from that pronunciation symbol.
- FIG. 46 is a diagram for explaining a method of creating a reverse index from the vector substitution information of the search result target word string.
- That is, the matching unit 56 creates the reverse index by associating, for every pronunciation symbol that can be a component of vector substitution information, the pronunciation symbol with the search result target IDs specifying the search result target word strings that have that pronunciation symbol as a component of their vector substitution information.
- According to the reverse index, it can be immediately detected (looked up), for example, that a search result target word string having the pronunciation symbol “I” as a component of its vector substitution information is the search result target word string whose search result target ID is 3.
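- A sketch of building such a reverse index from per-candidate vector substitution information (the IDs and symbols are hypothetical):

```python
from collections import defaultdict

def build_reverse_index(substitution_info_by_id):
    """Map each pronunciation symbol to the set of search result target IDs
    whose vector substitution information contains that symbol."""
    reverse_index = defaultdict(set)
    for target_id, symbols in substitution_info_by_id.items():
        for symbol in symbols:
            reverse_index[symbol].add(target_id)
    return reverse_index

index = build_reverse_index({3: {"SE", "KA", "I"}, 7: {"NI", "SE"}})
# index["I"] -> {3}: looked up immediately from the pronunciation symbol
```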
- FIG. 47 is a diagram for explaining a method for calculating the inner product V UTR ⁇ V TITLE (i) using the reverse index (hereinafter also referred to as a second inner product calculation method).
- That is, the matching unit 56 sets the initial value of the inner product V_UTR·V_TITLE(i) for each search result target word string to 0, takes the pronunciation symbols constituting the vector substitution information of the speech recognition result in turn as the symbol of interest, and sequentially detects from the reverse index the search result target word strings (their search result target IDs) that have a pronunciation symbol matching the symbol of interest as a component of their vector substitution information; for each search result target word string so detected, the matching unit 56 increments the inner product V_UTR·V_TITLE(i) for that search result target word string by 1.
- According to the second inner product calculation method, pronunciation symbols in the reverse index that do not match any pronunciation symbol constituting the vector substitution information of the speech recognition result are never accessed, so the inner product V_UTR·V_TITLE(i) can be calculated in a short time, and as a result the matching can be speeded up.
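- A sketch of the second inner product calculation method using the reverse index built above (names hypothetical):

```python
def inner_products(utr_symbols, reverse_index):
    """Walk only the recognition result's pronunciation symbols and, via the
    reverse index, increment the inner product of exactly the candidates
    sharing each symbol; non-matching symbols are never accessed."""
    products = {}  # search result target ID -> inner product with V_UTR
    for symbol in utr_symbols:
        for target_id in reverse_index.get(symbol, ()):
            products[target_id] = products.get(target_id, 0) + 1
    return products
```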
- The matching can also be speeded up by performing in advance the parts of the similarity calculation that can be performed before the speech recognition in the speech recognition unit 51, and storing the results in the memory built into the matching unit 56.
- For example, in the calculation of the cosine distance or the correction distance as the similarity, the magnitude |V_TITLE(i)| of the search result target vector V_TITLE(i) is required; since |V_TITLE(i)| can be calculated before the speech recognition is performed, it can be calculated in advance and stored in the memory built into the matching unit 56, thereby speeding up the matching.
- FIG. 48 is a flowchart for explaining processing of the voice search device 50 of FIG. 9 (and FIG. 10).
- In step S11, the voice search device 50 performs necessary preprocessing.
- That is, the voice search device 50 reads, for example, the program titles, performer names, detailed information, and the like that are constituent elements of the EPG recorded on the recording medium 63, supplies them to the search result target storage unit 53, and stores them as search result target word strings.
- the speech recognition unit 51 performs a process of generating a language model using the search result target word string stored in the search result target storage unit 53 as preprocessing.
- The preprocessing in step S11 is performed, for example, every day at a predetermined time, or when the recorded programs recorded on the recording medium 63 change, or when the EPG recorded on the recording medium 63 changes (is updated).
- In step S12, the speech recognition unit 51 performs speech recognition of the input voice.
- The speech recognition in the speech recognition unit 51 is performed using the language model generated by the latest preprocessing.
- The speech recognition result obtained by the speech recognition unit 51 performing speech recognition of the input speech is supplied to the matching unit 56 as a recognition result pronunciation symbol string via the pronunciation symbol conversion unit 52.
- Further, the search result target word strings stored in the search result target storage unit 53 are supplied to the matching unit 56 as search result target pronunciation symbol strings via the morpheme analysis unit 54 and the pronunciation symbol conversion unit 55.
- In step S13, the matching unit 56 performs, for each of the search result target word strings stored in the search result target storage unit 53, matching between the recognition result pronunciation symbol string supplied from the speech recognition unit 51 via the pronunciation symbol conversion unit 52 and the search result target pronunciation symbol string supplied from the search result target storage unit 53 via the morpheme analysis unit 54 and the pronunciation symbol conversion unit 55, and supplies the matching result to the output unit 57.
- That is, the matching unit 56 calculates, for example, the correction distance as the similarity to the speech recognition result for each search result target word string stored in the search result target storage unit 53, and supplies that similarity to the output unit 57 as the matching result.
- When the speech recognition result includes a specific phrase, the matching unit 56 performs matching between the recognition result pronunciation symbol string excluding the specific phrase and the search result target pronunciation symbol string.
- In step S14, based on the matching result from the matching unit 56, the output unit 57 selects and outputs, from the search result target word strings stored in the search result target storage unit 53, search result word strings (which are themselves search result target word strings) as the result of searching for the word string corresponding to the input speech.
- That is, the output unit 57 selects, from the search result target word strings stored in the search result target storage unit 53, those ranking highest in similarity to the speech recognition result, and outputs them as search result word strings.
- When the search result target word strings are, for example, program titles, performer names, and detailed information, the output unit 57 can select, as a search result word string, the title of a program having a performer name as metadata, together with or instead of that performer name.
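A sketch of this output step, assuming similarities have already been computed per search result target ID and using a hypothetical top_n cutoff (the function and data names are illustrative):

```python
def select_search_results(similarities, word_strings, top_n=2):
    """Rank search result target word strings by their similarity to the
    speech recognition result and output the top-ranked ones as search
    result word strings. top_n is an assumed parameter, not from the patent."""
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return [word_strings[target_id] for target_id, _ in ranked[:top_n]]

# Hypothetical usage with toy similarity values (e.g. correction distances).
word_strings = {1: "sekai isan", 2: "toshi densetsu", 3: "nihon no isan"}
print(select_search_results({1: 0.9, 2: 0.1, 3: 0.7}, word_strings))
# -> ['sekai isan', 'nihon no isan']
```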
- FIG. 49 shows a configuration example of an embodiment of a computer in which a program for executing the series of processes described above is installed.
- The program can be recorded in advance on the hard disk 105 or in the ROM 103 serving as a recording medium built into the computer.
- Alternatively, the program can be stored (recorded) on a removable recording medium 111. Such a removable recording medium 111 can be provided as so-called packaged software.
- Examples of the removable recording medium 111 include a flexible disk, a CD-ROM (Compact Disc Read Only Memory), an MO (Magneto Optical) disc, a DVD (Digital Versatile Disc), a magnetic disc, and a semiconductor memory.
- The program can be installed on the computer from the removable recording medium 111 as described above, or it can be downloaded to the computer via a communication network or a broadcast network and installed on the built-in hard disk 105. That is, the program can be transferred to the computer wirelessly from a download site via an artificial satellite for digital satellite broadcasting, or transferred to the computer by wire via a network such as a LAN (Local Area Network) or the Internet.
- The computer includes a CPU (Central Processing Unit) 102, and an input/output interface 110 is connected to the CPU 102 via a bus 101. When a command is input via the input/output interface 110, for example by the user operating the input unit 107, the CPU 102 executes a program stored in the ROM (Read Only Memory) 103 accordingly.
- Alternatively, the CPU 102 loads a program stored on the hard disk 105 into the RAM (Random Access Memory) 104 and executes it.
- The CPU 102 thereby performs the processing according to the flowcharts described above, or the processing performed by the configurations of the block diagrams described above. Then, as necessary, the CPU 102, for example, outputs the processing result from the output unit 106 via the input/output interface 110, transmits it from the communication unit 108, or records it on the hard disk 105.
- The input unit 107 includes a keyboard, a mouse, a microphone, and the like.
- The output unit 106 includes an LCD (Liquid Crystal Display), a speaker, and the like.
- The processing performed by the computer in accordance with the program does not necessarily have to be performed in time series in the order described in the flowcharts. That is, the processing performed by the computer in accordance with the program also includes processing executed in parallel or individually (for example, parallel processing or object-based processing).
- The program may be processed by a single computer (processor), or may be processed in a distributed manner by a plurality of computers. Furthermore, the program may be transferred to a remote computer and executed there.
- The language of the input voice is not limited to Japanese or English.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Acoustics & Sound (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Human Computer Interaction (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
Description
The cosine distance D (equation (1)), the first correction distance D1 obtained by substituting √(|V_TITLE(i)||V_UTR|) for |V_TITLE(i)| (equation (2)), and the second correction distance D2 obtained by substituting |V_UTR| for |V_TITLE(i)| (equation (3)):

D = V_UTR・V_TITLE(i)/(|V_UTR||V_TITLE(i)|)
・・・(1)

D1 = V_UTR・V_TITLE(i)/(|V_UTR||V_UTR|×√(|V_TITLE(i)|/|V_UTR|))
   = V_UTR・V_TITLE(i)/(|V_UTR|√(|V_TITLE(i)||V_UTR|))
・・・(2)

D2 = V_UTR・V_TITLE(i)/(|V_UTR||V_UTR|)
   = V_UTR・V_TITLE(i)/|V_UTR|²
・・・(3)

・・・(4)
Claims (8)
- A search device comprising:
a speech recognition unit that performs speech recognition of an input speech;
a matching unit that performs, for each of a plurality of search result target word strings, which are word strings to be targets of a search for a word string corresponding to the input speech, matching between a search result target pronunciation symbol string, which is a sequence of pronunciation symbols representing the pronunciation of the search result target word string, and a recognition result pronunciation symbol string, which is a sequence of pronunciation symbols representing the pronunciation of a speech recognition result of the input speech; and
an output unit that outputs, on the basis of a result of the matching between the search result target pronunciation symbol strings and the recognition result pronunciation symbol string, a search result word string that is a result of the search of the plurality of search result target word strings for the word string corresponding to the input speech.
- The search device according to claim 1, wherein the pronunciation symbols are symbols representing the pronunciation of syllables or phonemes, and
in the matching between the search result target pronunciation symbol string and the recognition result pronunciation symbol string, the matching unit obtains, for a search result target vector, which is a vector representing the search result target pronunciation symbol string, and a recognition result vector, which is a vector representing the recognition result pronunciation symbol string, a correction distance obtained by correcting the cosine distance of the vector space method so as to reduce the influence of the difference in length between the search result target pronunciation symbol string and the recognition result pronunciation symbol string.
- The search device according to claim 2, further comprising a pronunciation symbol conversion unit that converts the speech recognition result of the input speech into the recognition result pronunciation symbol string.
- The search device according to claim 3, wherein the pronunciation symbol conversion unit further converts the search result target word string into the search result target pronunciation symbol string.
- The search device according to claim 2, wherein, in the operation for obtaining the cosine distance, the matching unit obtains the correction distance by using, instead of the magnitude of the search result target vector, the square root of the product of the magnitude of the search result target vector and the magnitude of the recognition result vector.
- The search device according to claim 2, wherein, in the operation for obtaining the cosine distance, the matching unit obtains the correction distance by using the magnitude of the recognition result vector instead of the magnitude of the search result target vector.
- A search method comprising the steps, performed by a search device that searches for a word string corresponding to an input speech, of:
performing speech recognition of the input speech;
performing, for each of a plurality of search result target word strings to be targets of the search for the word string corresponding to the input speech, matching between a search result target pronunciation symbol string, which is a sequence of pronunciation symbols representing the pronunciation of the search result target word string, and a recognition result pronunciation symbol string, which is a sequence of pronunciation symbols representing the pronunciation of a speech recognition result of the input speech; and
outputting, on the basis of a result of the matching between the search result target pronunciation symbol strings and the recognition result pronunciation symbol string, a search result word string that is a result of the search of the plurality of search result target word strings for the word string corresponding to the input speech.
- A program for causing a computer to function as:
a speech recognition unit that performs speech recognition of an input speech;
a matching unit that performs, for each of a plurality of search result target word strings to be targets of a search for a word string corresponding to the input speech, matching between a search result target pronunciation symbol string, which is a sequence of pronunciation symbols representing the pronunciation of the search result target word string, and a recognition result pronunciation symbol string, which is a sequence of pronunciation symbols representing the pronunciation of a speech recognition result of the input speech; and
an output unit that outputs, on the basis of a result of the matching between the search result target pronunciation symbol strings and the recognition result pronunciation symbol string, a search result word string that is a result of the search of the plurality of search result target word strings for the word string corresponding to the input speech.
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/511,401 US9817889B2 (en) | 2009-12-04 | 2010-12-02 | Speech-based pronunciation symbol searching device, method and program using correction distance |
CN201080053823.0A CN102667773B (zh) | 2009-12-04 | 2010-12-02 | 搜索设备、搜索方法及程序 |
JP2011544293A JPWO2011068170A1 (ja) | 2009-12-04 | 2010-12-02 | 検索装置、検索方法、及び、プログラム |
RU2012121711/08A RU2012121711A (ru) | 2009-12-04 | 2010-12-02 | Устройство поиска, способ поиска программы |
EP10834620A EP2509005A1 (en) | 2009-12-04 | 2010-12-02 | Search device, search method, and program |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2009276996 | 2009-12-04 | ||
JP2009-276996 | 2009-12-04 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2011068170A1 true WO2011068170A1 (ja) | 2011-06-09 |
Family
ID=44115016
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2010/071605 WO2011068170A1 (ja) | 2009-12-04 | 2010-12-02 | 検索装置、検索方法、及び、プログラム |
Country Status (7)
Country | Link |
---|---|
US (1) | US9817889B2 (ja) |
EP (1) | EP2509005A1 (ja) |
JP (1) | JPWO2011068170A1 (ja) |
KR (1) | KR20120113717A (ja) |
CN (1) | CN102667773B (ja) |
RU (1) | RU2012121711A (ja) |
WO (1) | WO2011068170A1 (ja) |
Families Citing this family (36)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2014519071A (ja) * | 2011-03-28 | 2014-08-07 | アンビエンツ | 音響コンテキストを使用する検索システム及び方法 |
KR101231438B1 (ko) * | 2011-05-25 | 2013-02-07 | 엔에이치엔(주) | 외래어 발음 검색 서비스를 제공하는 검색결과 제공 시스템 및 방법 |
CN103065630B (zh) * | 2012-12-28 | 2015-01-07 | 科大讯飞股份有限公司 | 用户个性化信息语音识别方法及系统 |
US10424291B2 (en) * | 2012-12-28 | 2019-09-24 | Saturn Licensing Llc | Information processing device, information processing method, and program |
US9305064B1 (en) * | 2013-05-24 | 2016-04-05 | Google Inc. | Keyword-based conversational searching using voice commands |
JP6223744B2 (ja) * | 2013-08-19 | 2017-11-01 | 株式会社東芝 | 方法、電子機器およびプログラム |
US9889383B2 (en) * | 2013-10-03 | 2018-02-13 | Voyetra Turtle Beach, Inc. | Configuring headset voice morph based on player assignment |
US20150120723A1 (en) * | 2013-10-24 | 2015-04-30 | Xerox Corporation | Methods and systems for processing speech queries |
KR102092164B1 (ko) | 2013-12-27 | 2020-03-23 | 삼성전자주식회사 | 디스플레이 장치, 서버 장치 및 이들을 포함하는 디스플레이 시스템과 그 컨텐츠 제공 방법들 |
CN103761840A (zh) * | 2014-01-21 | 2014-04-30 | 小米科技有限责任公司 | 遥控器寻找方法、装置、设备及系统 |
US20150340024A1 (en) * | 2014-05-23 | 2015-11-26 | Google Inc. | Language Modeling Using Entities |
WO2016029045A2 (en) * | 2014-08-21 | 2016-02-25 | Jobu Productions | Lexical dialect analysis system |
KR102298457B1 (ko) * | 2014-11-12 | 2021-09-07 | 삼성전자주식회사 | 영상표시장치, 영상표시장치의 구동방법 및 컴퓨터 판독가능 기록매체 |
CN104598527B (zh) * | 2014-12-26 | 2018-09-25 | 论客科技(广州)有限公司 | 一种语音搜索方法及装置 |
US10019514B2 (en) * | 2015-03-19 | 2018-07-10 | Nice Ltd. | System and method for phonetic search over speech recordings |
US10249297B2 (en) * | 2015-07-13 | 2019-04-02 | Microsoft Technology Licensing, Llc | Propagating conversational alternatives using delayed hypothesis binding |
WO2017017738A1 (ja) * | 2015-07-24 | 2017-02-02 | 富士通株式会社 | 符号化プログラム、符号化装置、及び符号化方法 |
CN106024013B (zh) * | 2016-04-29 | 2022-01-14 | 努比亚技术有限公司 | 语音数据搜索方法及系统 |
US10990757B2 (en) | 2016-05-13 | 2021-04-27 | Microsoft Technology Licensing, Llc | Contextual windows for application programs |
US10068573B1 (en) * | 2016-12-21 | 2018-09-04 | Amazon Technologies, Inc. | Approaches for voice-activated audio commands |
US10726056B2 (en) * | 2017-04-10 | 2020-07-28 | Sap Se | Speech-based database access |
US11043221B2 (en) * | 2017-04-24 | 2021-06-22 | Iheartmedia Management Services, Inc. | Transmission schedule analysis and display |
US20180329592A1 (en) * | 2017-05-12 | 2018-11-15 | Microsoft Technology Licensing, Llc | Contextual windows for application programs |
CN109104634A (zh) * | 2017-06-20 | 2018-12-28 | 中兴通讯股份有限公司 | 一种机顶盒工作方法、机顶盒及计算机可读存储介质 |
CN107369450B (zh) * | 2017-08-07 | 2021-03-12 | 苏州市广播电视总台 | 收录方法和收录装置 |
CN107809667A (zh) * | 2017-10-26 | 2018-03-16 | 深圳创维-Rgb电子有限公司 | 电视机语音交互方法、语音交互控制装置及存储介质 |
US10546062B2 (en) * | 2017-11-15 | 2020-01-28 | International Business Machines Corporation | Phonetic patterns for fuzzy matching in natural language processing |
CN107808007A (zh) * | 2017-11-16 | 2018-03-16 | 百度在线网络技术(北京)有限公司 | 信息处理方法和装置 |
CN107832439B (zh) * | 2017-11-16 | 2019-03-08 | 百度在线网络技术(北京)有限公司 | 多轮状态追踪的方法、系统及终端设备 |
US10832657B2 (en) * | 2018-03-01 | 2020-11-10 | International Business Machines Corporation | Use of small unit language model for training large unit language models |
KR20200056712A (ko) | 2018-11-15 | 2020-05-25 | 삼성전자주식회사 | 전자 장치 및 그 제어 방법 |
CN110600016B (zh) * | 2019-09-20 | 2022-02-25 | 北京市律典通科技有限公司 | 卷宗推送方法和装置 |
JP2022074509A (ja) * | 2020-11-04 | 2022-05-18 | 株式会社東芝 | 差分抽出装置、方法及びプログラム |
US11620993B2 (en) * | 2021-06-09 | 2023-04-04 | Merlyn Mind, Inc. | Multimodal intent entity resolver |
CN113889146A (zh) * | 2021-09-22 | 2022-01-04 | 北京小米移动软件有限公司 | 音频识别方法、装置、电子设备和存储介质 |
CN114969339B (zh) * | 2022-05-30 | 2023-05-12 | 中电金信软件有限公司 | 一种文本匹配方法、装置、电子设备及可读存储介质 |
Family Cites Families (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4839853A (en) * | 1988-09-15 | 1989-06-13 | Bell Communications Research, Inc. | Computer information retrieval using latent semantic structure |
US7251637B1 (en) * | 1993-09-20 | 2007-07-31 | Fair Isaac Corporation | Context vector generation and retrieval |
US7725307B2 (en) * | 1999-11-12 | 2010-05-25 | Phoenix Solutions, Inc. | Query engine for processing voice based queries including semantic decoding |
JP4393648B2 (ja) * | 2000-01-11 | 2010-01-06 | 富士通株式会社 | 音声認識装置 |
CN1151489C (zh) * | 2000-11-15 | 2004-05-26 | 中国科学院自动化研究所 | 中国人名、地名和单位名的语音识别方法 |
US7043431B2 (en) * | 2001-08-31 | 2006-05-09 | Nokia Corporation | Multilingual speech recognition system using text derived recognition models |
US7353164B1 (en) * | 2002-09-13 | 2008-04-01 | Apple Inc. | Representation of orthography in a continuous vector space |
US7401019B2 (en) * | 2004-01-15 | 2008-07-15 | Microsoft Corporation | Phonetic fragment search in speech data |
US7961851B2 (en) * | 2006-07-26 | 2011-06-14 | Cisco Technology, Inc. | Method and system to select messages using voice commands and a telephone user interface |
US8166029B2 (en) * | 2006-09-07 | 2012-04-24 | Yahoo! Inc. | System and method for identifying media content items and related media content items |
US20080162125A1 (en) * | 2006-12-28 | 2008-07-03 | Motorola, Inc. | Method and apparatus for language independent voice indexing and searching |
US7912724B1 (en) * | 2007-01-18 | 2011-03-22 | Adobe Systems Incorporated | Audio comparison using phoneme matching |
US7983915B2 (en) * | 2007-04-30 | 2011-07-19 | Sonic Foundry, Inc. | Audio content search engine |
CN100470633C (zh) * | 2007-11-30 | 2009-03-18 | 清华大学 | 语音点歌方法 |
US8065300B2 (en) * | 2008-03-12 | 2011-11-22 | At&T Intellectual Property Ii, L.P. | Finding the website of a business using the business name |
2010
- 2010-12-02 WO PCT/JP2010/071605 patent/WO2011068170A1/ja active Application Filing
- 2010-12-02 KR KR1020127013649A patent/KR20120113717A/ko not_active Application Discontinuation
- 2010-12-02 RU RU2012121711/08A patent/RU2012121711A/ru not_active Application Discontinuation
- 2010-12-02 US US13/511,401 patent/US9817889B2/en not_active Expired - Fee Related
- 2010-12-02 JP JP2011544293A patent/JPWO2011068170A1/ja not_active Ceased
- 2010-12-02 EP EP10834620A patent/EP2509005A1/en not_active Withdrawn
- 2010-12-02 CN CN201080053823.0A patent/CN102667773B/zh not_active Expired - Fee Related
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2001242884A (ja) | 2000-02-28 | 2001-09-07 | Sony Corp | 音声認識装置および音声認識方法、並びに記録媒体 |
JP2002252813A (ja) * | 2001-02-23 | 2002-09-06 | Fujitsu Ten Ltd | 番組検索装置及び番組検索プログラム |
JP2005150841A (ja) * | 2003-11-11 | 2005-06-09 | Canon Inc | 情報処理方法及び情報処理装置 |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102867005A (zh) * | 2011-07-06 | 2013-01-09 | 阿尔派株式会社 | 检索装置、检索方法以及车载导航装置 |
WO2015118645A1 (ja) * | 2014-02-06 | 2015-08-13 | 三菱電機株式会社 | 音声検索装置および音声検索方法 |
CN105981099A (zh) * | 2014-02-06 | 2016-09-28 | 三菱电机株式会社 | 语音检索装置和语音检索方法 |
JPWO2015118645A1 (ja) * | 2014-02-06 | 2017-03-23 | 三菱電機株式会社 | 音声検索装置および音声検索方法 |
Also Published As
Publication number | Publication date |
---|---|
KR20120113717A (ko) | 2012-10-15 |
US20130006629A1 (en) | 2013-01-03 |
CN102667773B (zh) | 2015-02-04 |
RU2012121711A (ru) | 2013-11-27 |
EP2509005A1 (en) | 2012-10-10 |
CN102667773A (zh) | 2012-09-12 |
JPWO2011068170A1 (ja) | 2013-04-18 |
US9817889B2 (en) | 2017-11-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2011068170A1 (ja) | 検索装置、検索方法、及び、プログラム | |
JP5610197B2 (ja) | 検索装置、検索方法、及び、プログラム | |
US7949530B2 (en) | Conversation controller | |
US9418152B2 (en) | System and method for flexible speech to text search mechanism | |
US7949532B2 (en) | Conversation controller | |
US11056104B2 (en) | Closed captioning through language detection | |
US7842873B2 (en) | Speech-driven selection of an audio file | |
EP1909263A1 (en) | Exploitation of language identification of media file data in speech dialog systems | |
US8688725B2 (en) | Search apparatus, search method, and program | |
US20130090921A1 (en) | Pronunciation learning from user correction | |
US11501764B2 (en) | Apparatus for media entity pronunciation using deep learning | |
JP5326169B2 (ja) | 音声データ検索システム及び音声データ検索方法 | |
JP2009128508A (ja) | 音声データ検索システム | |
Nouza et al. | Making czech historical radio archive accessible and searchable for wide public | |
JP2011118775A (ja) | 検索装置、検索方法、及び、プログラム | |
JP2011118774A (ja) | 検索装置、検索方法、及び、プログラム | |
JP5196114B2 (ja) | 音声認識装置およびプログラム | |
JP5366050B2 (ja) | 音響モデル学習装置、音声認識装置、及び音響モデル学習のためのコンピュータプログラム | |
KR100811226B1 (ko) | 악센트구 매칭 사전선택을 이용한 일본어음성합성방법 및시스템 | |
Ohta et al. | Evaluating spoken language model based on filler prediction model in speech recognition. | |
WO2024118649A1 (en) | Systems, methods, and media for automatically transcribing lyrics of songs | |
Yu | Efficient error correction for speech systems using constrained re-recognition | |
De Villiers | Lecture transcription systems in resource–scarce environments | |
Leath | Audient: An acoustic search engine | |
JP2005099604A (ja) | 会話制御装置、会話制御方法、およびゲームシステム |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WWE | Wipo information: entry into national phase |
Ref document number: 201080053823.0 Country of ref document: CN |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 10834620 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2010834620 Country of ref document: EP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2011544293 Country of ref document: JP |
|
ENP | Entry into the national phase |
Ref document number: 20127013649 Country of ref document: KR Kind code of ref document: A |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2012121711 Country of ref document: RU |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
WWE | Wipo information: entry into national phase |
Ref document number: 13511401 Country of ref document: US |