WO2010098209A1 - Voice search device and voice search method - Google Patents
Voice search device and voice search method
- Publication number
- WO2010098209A1 (PCT/JP2010/051937)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- search
- voice
- keyword
- speech
- searching
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/12—Speech classification or search using dynamic programming techniques, e.g. dynamic time warping [DTW]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
Definitions
- the present invention relates to a voice search device and a voice search method, and more specifically to an apparatus and a method for searching speech at high speed and with high efficiency.
- Non-Patent Documents 1 and 2 propose methods of creating index data from a speech database and realizing high-speed speech search using the index data.
- Patent Document 1 describes absorbing notational variation and speeding up document retrieval by combining a suffix array with dynamic programming.
- the prior art described in Patent Document 1 is directed to document search, not to the fuzzy search over phoneme strings produced by speech recognition that the present invention addresses. Moreover, simply combining a plain suffix array with dynamic programming clearly causes the computation time to increase significantly.
- as in Non-Patent Documents 1 and 2, when the speech database becomes large, conventional speech retrieval speed-up methods must create index data whose scale matches the database. A high-speed secondary storage device is therefore required, which is undesirable from a cost standpoint. Moreover, since a secondary storage device takes longer to access than main memory, the search speed also suffers.
- these methods create index data from words or subwords. Because an exact match between the search keyword (or sub-keyword) and the indexed word or subword is assumed, sufficient search performance may not be obtained with current speech recognition, which produces many misrecognitions.
- in view of the above problems, an object of the present invention is to provide a voice search device and a voice search method that require no secondary storage device, search at high speed and at low cost, and perform a fuzzy search with good search performance.
- the voice search device and voice search method for performing this fuzzy search are specifically configured as follows.
- the invention according to claim 1 is a voice search device that takes voice as input and searches voice data obtained by sampling the input voice, comprising: a database speech recognizer for recognizing speech recorded in a speech database; a phoneme string generation unit that generates a phoneme string from the word string recognized by the database speech recognizer; a suffix array generation unit that generates a suffix array from the generated phoneme string; an input device for inputting a search keyword; an input phoneme generation unit that generates a phoneme string from the input search keyword; and a voice search unit that searches for the search keyword on the suffix array by dynamic programming, the voice search unit comprising means for setting a first threshold and means for searching for the search target by dynamic programming using the first threshold.
- this speech search device searches speech data obtained by sampling the input speech, using speech as input, and performs a fuzzy search using a suffix array and dynamic programming in combination. Since matching against the search keyword is performed in units of phonemes, a search is possible even when the word or subword registered in the index does not match exactly.
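For illustration only, combining a suffix array with DP (edit-distance) matching at the phoneme level might be sketched as follows in Python. The phoneme string is a plain character string, a normalized edit distance stands in for the patent's similarity measure, and no suffix-array pruning is performed, so this is a naive sketch rather than the patented implementation:

```python
def build_suffix_array(text):
    """Starting positions of all suffixes, sorted lexicographically (naive O(n^2 log n))."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def edit_distance(a, b):
    """Classic DP (Levenshtein) distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def fuzzy_search(text, keyword, threshold):
    """Return start positions whose window is within `threshold` edits per phoneme.

    A real implementation would prune the suffix array via shared prefixes
    instead of scanning every suffix as done here.
    """
    hits = []
    for start in build_suffix_array(text):
        window = text[start:start + len(keyword)]
        if len(window) < len(keyword):
            continue
        if edit_distance(keyword, window) / len(keyword) <= threshold:
            hits.append(start)
    return sorted(hits)
```

With threshold 0.0 this reduces to exact matching; raising the threshold admits phoneme strings that differ by a bounded fraction of edits, which is what allows misrecognized phonemes to still be found.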
- the gist of the invention according to claim 2 is the voice search device according to claim 1, wherein the voice search unit further comprises means for dividing the search keyword by phoneme when the search keyword is longer than a predetermined length, and means for determining, from the first threshold, a second threshold used for searching for the keywords produced by the dividing means, the means for searching for the search target being means that searches by dynamic programming using the second threshold.
- to prevent an exponential explosion of processing time, the voice search device of this configuration divides the search keyword, changes the first threshold into a threshold for matching at two or more locations, and determines the number of phonemes per division and whether to divide at all according to the keyword length, thereby realizing high-speed search.
- the first threshold for matching at two or more locations is changed according to Equation 1, where p is the number of divisions, t is the original first threshold obtained by the means for determining the search threshold for the divided search keywords, and t' is the second threshold after being changed by the threshold changing means.
- the gist of the invention according to claim 3 is the speech search device according to claim 1 or 2, wherein the voice search unit comprises threshold adjustment means that repeats the search while sequentially increasing the first threshold and sequentially presents the search results.
- the above configuration has a threshold adjustment function that repeats the search while sequentially increasing the first threshold (a kind of iterative deepening search) and presents the search results sequentially. While the user of the device examines the results presented in the early stages, new results are presented as the threshold is updated, so the perceived search speed improves.
- the gist of the invention according to claim 4 is the voice search device according to any one of claims 1 to 3, wherein the voice search unit further comprises keyword division means for determining whether to divide the keyword based on the length of the search keyword and for determining the number of phonemes after keyword division.
- the apparatus configured as described above can determine the presence or absence of keyword division based on the length of the search keyword, and can determine the number of phonemes after the keyword division.
- the gist of the invention according to claim 5 is the speech search device according to any one of claims 1 to 4, wherein the means for searching by dynamic programming further comprises means for calculating the similarity between phonemes using an inter-phoneme distance based on phoneme discrimination features.
- the above-mentioned inter-phoneme distance is, for example, the Hamming distance between phoneme discrimination feature vectors, i.e., the number of features in which two phonemes differ. In the above configuration, the similarity between phonemes is calculated using this Hamming distance.
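As a hedged illustration of such a distance, each phoneme can be represented by a binary vector of discrimination features and compared by Hamming distance. The feature inventory below (voiced, nasal, plosive, labial) is invented for the example and is not the one used in the patent:

```python
# Hypothetical discrimination-feature table: (voiced, nasal, plosive, labial).
# The actual feature set of the embodiment is not specified in this document.
FEATURES = {
    "p": (0, 0, 1, 1),
    "b": (1, 0, 1, 1),
    "m": (1, 1, 0, 1),
    "k": (0, 0, 1, 0),
    "g": (1, 0, 1, 0),
}

def phoneme_distance(p, q):
    """Hamming distance between the discrimination-feature vectors of two phonemes."""
    return sum(a != b for a, b in zip(FEATURES[p], FEATURES[q]))
```

Used as the substitution cost in DP matching, this graded distance makes acoustically close confusions (such as /p/ misrecognized as /b/, distance 1) cheaper than distant ones (such as /p/ for /g/, distance 2).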
- the gist of the invention according to claim 6 is a speech search method that takes speech as input and searches speech data obtained by sampling the input speech, the method comprising: converting the speech data into a phoneme string and creating a suffix array; accepting a search keyword and converting it into a phoneme string; setting a first threshold used in the search; searching for the search target by dynamic programming using the first threshold; and outputting the results found by the searching step.
- the speech search method of this configuration performs a fuzzy search using a suffix array (hereinafter sometimes referred to as "Suffix Array") together with dynamic programming (hereinafter sometimes referred to as "DP (Dynamic Programming) matching"). Since matching against the search keyword is performed in units of phonemes, a search is possible even when the word or subword registered in the index does not match exactly.
- the gist of the invention according to claim 7 is the speech search method according to claim 6, further comprising: dividing the search keyword by phoneme when the search keyword is longer than a predetermined length; and determining, from the first threshold, a second threshold used for searching for the divided keywords, the step of searching for the search target being a step of searching by dynamic programming using the second threshold.
- in this configuration, the search keyword is divided, a method of matching at two or more locations is used, the first threshold is changed, and the number of phonemes per division and whether to divide are determined according to the keyword length, realizing high-speed search. The second threshold used for searching for the divided keywords can be determined from the first threshold by Equation 1.
- the gist of the invention according to claim 8 is the speech search method according to claim 6 or 7, further comprising a threshold adjustment step of repeating the search while sequentially increasing the first threshold.
- the voice search method of this configuration has a threshold adjustment function that repeats the search while sequentially increasing the first threshold (a kind of iterative deepening search) and presents the search results sequentially. In the initial search with a small first threshold, the search behaves almost like a binary search owing to the properties of the Suffix Array, so extremely fast search is possible.
- the gist of the invention according to claim 9 is the voice search method according to any one of claims 6 to 8, further comprising a keyword division step of determining whether to divide the keyword based on the length of the search keyword and of determining the number of phonemes after keyword division.
- the voice search method of this configuration determines whether to divide the keyword based on the length of the search keyword, and can determine the number of phonemes after keyword division.
- the gist of the invention according to claim 10 is the speech search method according to any one of claims 6 to 9, wherein the step of searching for the search target comprises a step of calculating the similarity between phonemes in the dynamic programming using an inter-phoneme distance based on phoneme discrimination features.
- the speech search method of this configuration can calculate the similarity between phonemes in the dynamic programming using an inter-phoneme distance based on phoneme discrimination features (for example, the Hamming distance of the difference between phoneme discrimination features).
- a high-speed secondary storage device is not required, and the cost required for preparing secondary storage can be reduced. That is, it is possible to provide a voice search device and a voice search method that have a high search speed and low cost, and also have good search performance.
- brief description of the drawings: in the performance graphs, the horizontal axis represents the first threshold and the vertical axis the search recall, precision, and processing time; in the scalability graphs, the horizontal axis is the speech-converted length (unit: hours) of the pseudo speech database (Mainichi Newspaper Corpus) and the vertical axis the search processing time (unit: milliseconds).
- figures of the time until the first search result is output when the first threshold is set to its lowest value, for search keywords of 6, 12, 18, and 24 phonemes.
- figures of the time until half of the correct keywords are detected, for search keywords of 6 and 12 phonemes.
- figures of the time until a search result group is presented to the user for search keywords of 6 to 24 phonemes: with the first threshold at its initial value 0.0 (FIGS. 12 and 15), updated from the state of FIG. 12 or FIG. 15 to 0.2, and further updated from the state of FIG. 13 or FIG. 16 to 0.4.
- when the device starts, audio is input, and the audio data obtained by sampling the input audio (for example, 16-bit samples at a 44.1 kHz sampling frequency) is fuzzy-searched using the Suffix Array and DP matching in combination.
- voice data recorded in the voice database is converted into a phoneme string (a), and a Suffix Array is created from the phoneme string (a) (S11).
- a search keyword is received, and the search keyword is converted into a phoneme string (S12).
- a first threshold value (denoted as threshold value 1 in FIG. 1) used in the search is set (S12).
- if the length of the search keyword converted to a phoneme string is at least a predetermined length (for example, 9 or more phonemes), it is divided; otherwise it is not. Division or non-division is thus determined (S13).
- the search keyword is divided into a predetermined number of phonemes (S14).
- the number of phonemes after division can be determined in advance. For example, with the number of divided phonemes set to 3, a search keyword of 9 phonemes is divided into three sub-keywords of three phonemes each, and a search keyword of 10 to 12 phonemes is divided into four.
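Assuming fixed-size chunking (one reading of the example above, not a normative definition), the division step S14 could be sketched as:

```python
def split_keyword(phonemes, min_len=9, piece=3):
    """Divide a keyword into sub-keywords of `piece` phonemes when it is at
    least `min_len` phonemes long; shorter keywords are left undivided.
    The final sub-keyword may be shorter than `piece`."""
    if len(phonemes) < min_len:
        return [phonemes]
    return [phonemes[i:i + piece] for i in range(0, len(phonemes), piece)]
```

Under this reading, a 9-phoneme keyword yields three 3-phoneme sub-keywords and a 10- to 12-phoneme keyword yields four pieces, matching the counts in the example.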
- the similarity of the divided keywords is determined by calculating the distance between the phoneme discrimination features contained in the speech data. That is, the second threshold (denoted threshold 2 in FIG. 1) is determined from the first threshold according to Equation 1, and the divided keywords are DP-matched on the Suffix Array using the second threshold (S15). The results are temporarily stored as first-stage candidates (b) (S15), and final candidates (c) are determined from the positional relationships among the first-stage candidates (b) (S16). The final candidates (c) are DP-matched on the Suffix Array (a) using the first threshold, and the results are output (presented to the user) (S16). This completes the primary search.
- the first threshold is updated to a slightly higher value (for example, 0.2 is added), and the search step is repeated again (S17, S18). Since the first threshold value is changed to a slightly high value, the second threshold value calculated based on the first threshold value is also a slightly high value.
- a search is thereby performed for similar words (words with similar phoneme strings) whose phoneme discrimination features are slightly more distant.
- the repetition of the search step can be processed so as to end when the first threshold value reaches a predetermined value or when the total number of search results reaches a predetermined number (S18). For example, the processing can be configured to end when the first threshold reaches 1.4 or the search result reaches 100.
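The threshold-update loop with these stopping conditions can be sketched as follows; `search_fn` is a placeholder for one pass of the DP matching at a given first threshold:

```python
def iterative_search(search_fn, step=0.2, t_max=1.4, max_results=100):
    """Iterative-deepening search: repeat `search_fn` with an increasing first
    threshold, yielding each hit once, until the threshold reaches `t_max`
    or `max_results` hits have been produced."""
    seen = set()
    t = 0.0
    while t <= t_max and len(seen) < max_results:
        for hit in search_fn(t):
            if hit not in seen and len(seen) < max_results:
                seen.add(hit)
                yield t, hit          # present each new result as it is found
        t = round(t + step, 10)       # avoid floating-point drift in the threshold
```

Because previously seen hits are filtered out, the user sees results in order of increasing distance from the keyword, matching the sequential-presentation behavior described above.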
- when the search keyword is short (for example, 8 phonemes or fewer), it is not divided, and DP matching is performed on the Suffix Array (a) using the first threshold (S19); in this case the only threshold used is the first threshold.
- the result obtained here is output as it is (presented to the user) (S19). Since the search keyword is not divided, it is not necessary to refer to the positional relationship of the matched results.
- thereafter, the first threshold is updated to a slightly higher value (for example, 0.2 is added) (S20), and the search step is repeated (S21). This is for finding similar words (words with similar phoneme strings) whose phoneme discrimination features are somewhat distant.
- the repetition of the search can be configured to terminate the process when the updated threshold reaches a predetermined value or when the number of search results reaches a predetermined number.
- since the initial search with a reduced first threshold is close in its conditions to a binary search, phoneme strings very close to the search keyword can be found at high speed. By then gradually increasing the first threshold, a kind of iterative deepening search becomes possible. Furthermore, by outputting (presenting to the user) results before each update of the first threshold, results can be output in order, starting from the phoneme strings closest to the search keyword.
- in the keyword division of the above embodiment, the length at which a search keyword is divided was exemplified as 9 or more phonemes; when the number of phonemes after division is 6, this boundary can instead be set to 18 phonemes. This is because, when the number of phonemes after division is small, the number of first-stage candidates (b) can become enormous and slow the processing. The search time can therefore be further reduced by tuning the number of phonemes after division.
- in the search method described above, it is also possible to terminate the processing without updating the first threshold; in that case, the phoneme strings obtained by the search are limited to those closest to the search keyword. Conversely, by setting the first threshold somewhat larger in advance, many phoneme strings can be found in a single search step.
- the step (S13) of deciding whether to divide the keyword can also be omitted, processing instead so as always to divide into a predetermined number of phonemes, or never to divide. However, since Equation 1 calculates the second threshold from the first on the premise of dividing into three or more parts, a step of determining whether the number of divisions is less than three or three or more is required.
- the embodiment of the voice search device is configured as shown in the internal configuration block diagram of FIG.
- large-scale voice data sampled in advance (for example, 16-bit samples at a 44.1 kHz sampling frequency) is stored in the voice database 25, and the voice search unit 29, together with the Suffix Array creation unit 28, realizes the means for performing the fuzzy search by DP matching.
- to create a Suffix Array from the speech data, the speech search device 32 of this embodiment is provided with a speech database 25, a database speech recognizer 26, a speech phoneme string generation unit 27, and a Suffix Array creation unit 28.
- input devices 21 and 24 and a phoneme string generation unit 23 are provided in order to create a phoneme string of the input search keyword.
- One of the input devices 21 and 24 is a voice input device (for example, a microphone) 21, and the other is a character input device (for example, a keyboard) 24.
- when the search keyword is input from the voice input device (for example, the microphone) 21, the voice recognizer 22 needs to be provided.
- a keyword that is input as a word string or whose speech is converted into a word string is converted into a phoneme string by the phoneme string generator 23.
- the description of “speech / character phoneme string generation unit” in FIG. 2 means that both voice input and character input are supported.
- the information of the suffix array created from the speech data and the information of the phoneme string of the input search keyword are configured to be searched in the speech search unit 29.
- the voice search unit 29 realizes: means for setting the first threshold used in the search; means for dividing the search keyword by phoneme when the keyword is longer than a predetermined length; means for determining, from the first threshold according to Equation 1, the second threshold used for matching all of the divided sub-keywords without fail; means for calculating similarity from the distance between the phoneme discrimination features contained in the voice data; means for searching for the search target using the first and second thresholds; and the threshold adjustment means that repeats the search while sequentially increasing the first threshold.
- at the same time, the means for sequentially outputting the search results (presenting them to the user) is realized by the display device (for example, a display) 30 or the audio output device (for example, a speaker) 31.
- the means for determining whether or not the keyword is divided based on the length of the search keyword is realized in the voice search unit 29, and the keyword dividing means for determining the number of phonemes after the keyword division is the voice / character phoneme string generation. This is realized by the unit 23 and the voice search unit 29.
- the voice search device displays information such as characters and images related to the search on the display device 30 (for example, a display), and reproduces voice information as the voice search result from the audio output device 31 (for example, a speaker); a configuration with only one of these output devices is also possible.
- in terms of hardware, the device comprises a main memory (RAM, hereinafter referred to as memory), a ROM, a central processing unit (CPU) that controls the device, an HDD, and an audio input/output interface (for example, one capable of handling a sampling bit depth of 16 bits and a sampling frequency of 44.1 kHz); the voice database is stored in the HDD.
- FIG. 3 is an explanatory diagram for creating a suffix array (Suffix Array) from a speech database.
- the speech data stored in the speech database 25 is converted into a word string using the database speech recognizer 26, and the word string is further converted into a phoneme string (a) by the speech phoneme string generation unit 27.
- the Suffix Array generating unit 28 creates a Suffix Array from the phoneme string and stores it in the memory or HDD.
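A naive sketch of this step follows (the actual unit would presumably use an efficient construction such as SA-IS; the O(n^2 log n) version here is for illustration only), together with the binary-search prefix lookup that makes exact search on a Suffix Array fast:

```python
def build_suffix_array(phonemes):
    """Sort the starting positions of all suffixes of the phoneme string."""
    return sorted(range(len(phonemes)), key=lambda i: phonemes[i:])

def find_prefix(text, sa, prefix):
    """Binary-search the suffix array for every suffix beginning with `prefix`;
    this O(m log n) lookup is why exact search on the array is near binary search."""
    lo, hi = 0, len(sa)
    while lo < hi:                                    # lower bound
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(prefix)] < prefix:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(sa)
    while lo < hi:                                    # upper bound
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(prefix)] <= prefix:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[i] for i in range(first, lo))
```

Because all suffixes sharing a prefix are contiguous in the sorted array, the fuzzy DP matching described in FIG. 4 can likewise restrict its work to such contiguous ranges rather than the whole database.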
- when the search keyword is received by voice (input via the voice input device 21), it is converted into a word string by the voice recognizer 22 and then into a phoneme string by the voice/character phoneme string generator 23. When it is received as text (a character string, input via the character input device 24), it is likewise converted into a phoneme string by the voice/character phoneme string generator 23.
- the average first threshold per phoneme used in the search is set to a low value (for example, 0.0) by the voice search unit 29.
- FIG. 4 illustrates the fuzzy search by DP matching on the Suffix Array.
- search is performed by DP matching on the Suffix Array.
- the first threshold is used when the keyword is not divided, and the value obtained by changing the first threshold according to Equation 1 (the second threshold) is used when the keyword is divided.
- the first-stage candidates (b) of the search result are obtained, and (b) is presented to the user as the result by the display device 30 and the audio output device 31.
- the final candidate (c) is DP-matched using the phoneme string (a) and the first threshold value, and the search result is presented to the user by the display device 30 and the voice output device 31.
- the first threshold is updated to a slightly higher value (for example, 0.2 is added), and then the process returns to DP matching using the first threshold.
- a search experiment was conducted with the voice search device of FIG. 2 built on a personal computer (Intel (registered trademark) Pentium (registered trademark) D 2.8 GHz, 4 GB memory), targeting the audio data of the CSJ (Corpus of Spontaneous Japanese) corpus (male speakers, 390 hours). The results are shown in FIGS. 6 to 9, where the horizontal axis of each graph represents the first threshold and the vertical axis represents the search recall, precision, and processing time.
- for search keywords of 6 phonemes (see FIG. 6), 12 phonemes (see FIG. 7), 18 phonemes (see FIG. 8), and 24 phonemes (see FIG. 9), the time until the first search result was output at the lowest first threshold was a few milliseconds in each case.
- FIGS. 10 and 11 show the time required to detect half of the correct keywords included in the corpus, for search keywords of 6 phonemes (see FIG. 10) and 12 phonemes (see FIG. 11); the horizontal axis of each graph represents the first threshold and the vertical axis the search recall, precision, and processing time.
- for comparison, Non-Patent Document 1 reports that it takes 2.17 seconds to search for a search keyword of 5.2 morae (in the range of 5 to 11 phonemes) in a 2031-hour speech database.
- search keywords of 6 to 24 phonemes were searched with the first threshold set to 0.0, and the time until the initial search result group was presented to the user was a few milliseconds.
- the speech search device 32 of FIG. 2 was built in C++ on a personal computer (Intel (registered trademark) Core2Duo E8600 3.3 GHz, 8 GB memory) and applied to newspaper article data equivalent to 10,000 hours of converted speech.
- the results of the search experiment are shown in FIGS.
- as shown in FIG. 15, it takes several milliseconds to search for keywords from 6 to 24 phonemes with the first threshold set to 0.0 and to present the first group of search results to the user.
- FIG. 15 also shows that when the first threshold is updated to 0.2 and keywords from 6 to 24 phonemes are searched again, the time until the newly obtained group of search results is presented to the user is a few milliseconds.
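The iterative procedure above (DP matching with a low first threshold, presenting results, then raising the threshold by 0.2 and matching again) can be sketched as follows. This is a hedged illustration, not the patent's C++ implementation: `dp_match`, the toy corpus of whole phoneme strings (standing in for substring search over a real corpus), and the reading of the first threshold as an average edit distance per keyword phoneme are all assumptions.

```python
def dp_match(keyword, utterance):
    """Edit-distance DP between two phoneme strings, averaged per keyword phoneme."""
    m, n = len(keyword), len(utterance)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if keyword[i - 1] == utterance[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # match / substitution
        prev = cur
    return prev[n] / max(m, 1)

def incremental_search(keyword, corpus, start=0.0, step=0.2, stop=1.0):
    """Search repeatedly with a rising threshold, yielding each new batch of hits."""
    threshold, seen = start, set()
    while threshold <= stop:
        new = [u for u in corpus
               if u not in seen and dp_match(keyword, u) <= threshold]
        seen.update(new)
        if new:
            yield threshold, new  # present this batch to the user immediately
        threshold += step

corpus = ["onseikensaku", "onseikansaku", "bunshokensaku"]  # toy phoneme strings
for th, batch in incremental_search("onseikensaku", corpus):
    print(f"threshold {th:.1f}: {batch}")
```

Yielding each batch as soon as its threshold pass completes is what lets the first (strictest) results appear quickly while the more permissive passes continue.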
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
In that case, an exact match between the search keyword (or sub-keyword) and a word or subword is assumed, so sufficient search performance may not be obtained with current speech recognition, which produces frequent misrecognitions.
The gist is thus a speech search device characterized in that its speech search unit comprises these means.
The gist also includes a speech search method characterized by comprising: a step of setting a first threshold used in the search; a step of searching the search target by dynamic programming using the first threshold; and a step of outputting the results retrieved by the searching step.
and a step of determining the second threshold from the first threshold, wherein the step of searching the search target is a step of searching the search target by dynamic programming using the second threshold; a speech search method so characterized is also part of the gist.
and the phoneme-string information of the input search keyword is processed for search by the speech search unit 29. This speech search unit 29 comprises: means for setting a first threshold used in the search; means for splitting the search keyword by phonemes when the keyword is at least a predetermined length; means for determining, from the first threshold, a second threshold used for searching the keywords produced by the splitting means; and means for searching the search target by dynamic programming using at least one of the first and second thresholds.
The average first threshold per phoneme used in the search is set to a low value (for example, 0.0).
seconds to 600 milliseconds. From the above, it can be seen that the speech search is performed at high speed.
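As a rough illustration of the indexing path described above (recognized speech → phoneme string → Suffix Array → keyword lookup), the sketch below builds a suffix array and locates a keyword by binary search. It shows exact matching only; the patent's search unit additionally performs approximate DP matching on the array, and the phoneme string and names here are toy stand-ins, not the patent's implementation.

```python
import bisect

def build_suffix_array(s):
    """Indices of all suffixes of s, sorted lexicographically."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def find_exact(s, sa, keyword):
    """All start positions where keyword occurs in s, via the suffix array."""
    suffixes = [s[i:] for i in sa]  # materialized for clarity; real code compares in place
    lo = bisect.bisect_left(suffixes, keyword)            # first suffix >= keyword
    hi = bisect.bisect_right(suffixes, keyword + "\uffff")  # past suffixes with this prefix
    return sorted(sa[lo:hi])

phonemes = "onseikensakuhoho"  # toy recognized phoneme string
sa = build_suffix_array(phonemes)
print(find_exact(phonemes, sa, "kensaku"))  # -> [5]
```

Because the suffixes are sorted, all occurrences of a keyword occupy one contiguous range of the array, which is what makes the initial lookup fast even on very large phoneme strings.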
22 speech recognizer
23 phoneme string generator for speech/text
24 text input device
25 speech database
26 speech recognizer for the database
27 phoneme string generator for speech
28 Suffix Array generator
29 speech search unit
30 display device
31 audio output device
32 speech search device
Claims (10)
- A speech search device that receives speech as input and searches speech data obtained by sampling the input speech, comprising:
a database speech recognizer that recognizes speech recorded in a speech database;
a speech phoneme string generator that generates a phoneme string from the word sequence recognized by the database speech recognizer;
a Suffix Array generator that generates a suffix array from the phoneme string generated by the speech phoneme string generator;
an input device through which a search keyword is input;
an input phoneme string generator that generates a phoneme string from the search keyword input through the input device; a speech search unit that searches for the search keyword on the suffix array by dynamic programming; and
an output device that outputs the results retrieved by the speech search unit,
wherein the speech search unit comprises means for setting a first threshold used in the search, and
means for searching the search target by dynamic programming using the first threshold. - The speech search device according to claim 1, wherein
the speech search unit further comprises:
means for splitting the search keyword by phonemes when the search keyword is at least a predetermined length, and
means for determining, from the first threshold, a second threshold used for searching the keywords produced by the keyword splitting means, and
the means for searching the search target is means for searching the search target by dynamic programming using the second threshold. - The speech search device according to claim 1 or 2, wherein
the speech search unit comprises threshold adjustment means that searches repeatedly while incrementally increasing the first threshold and presents the search results as they are obtained. - The speech search device according to any one of claims 1 to 3, wherein
the speech search unit further comprises:
means for determining, from the length of the search keyword, whether to split the keyword, and
keyword splitting means for determining the number of phonemes in each segment after the split. - The speech search device according to any one of claims 1 to 4, wherein
the means for searching the search target by dynamic programming further comprises
means for computing the similarity between phonemes using an inter-phoneme distance based on phoneme discriminative features in the dynamic programming. - A speech search method that receives speech as input and searches speech data obtained by sampling the input speech, comprising:
a step of converting the speech data into a phoneme string and creating a suffix array;
a step of receiving a search keyword and converting it into a phoneme string;
a step of setting a first threshold used in the search;
a step of searching the search target by dynamic programming using the first threshold; and
a step of outputting the results retrieved by the searching step. - The speech search method according to claim 6, further comprising:
a step of splitting the search keyword by phonemes when the search keyword is at least a predetermined length, and
a step of determining, from the first threshold, a second threshold used for searching the keywords produced by the keyword splitting step,
wherein the step of searching the search target is a step of searching the search target by dynamic programming using the second threshold. - The speech search method according to claim 6 or 7, further comprising
a threshold adjustment step of searching repeatedly while incrementally increasing the first threshold. - The speech search method according to any one of claims 6 to 8, further comprising:
a step of determining, from the length of the search keyword, whether to split the keyword, and
a keyword splitting step of determining the number of phonemes in each segment after the split. - The speech search method according to any one of claims 6 to 9, wherein
the step of searching the search target includes a step of computing the similarity between phonemes using an inter-phoneme distance based on phoneme discriminative features in the dynamic programming.
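Claims 2, 4, 7, and 9 describe splitting a long search keyword into sub-keywords and deriving a second threshold, for the split keywords, from the first. One plausible reading is sketched below; the segment length of 6 phonemes and the scaling rule (treating the first threshold as a per-phoneme average and the second as an absolute per-segment budget) are assumptions for illustration, not values fixed by the claims.

```python
def split_keyword(phonemes, max_len=6):
    """Split a long keyword phoneme string into roughly equal segments."""
    if len(phonemes) < 2 * max_len:       # short keywords are not split
        return [phonemes]
    n_seg = -(-len(phonemes) // max_len)  # ceil: number of segments
    size = -(-len(phonemes) // n_seg)     # ceil: balanced segment size
    return [phonemes[i:i + size] for i in range(0, len(phonemes), size)]

def second_threshold(first_threshold, segment_len):
    """Absolute edit-distance budget for one segment, derived from the
    per-phoneme first threshold (one plausible derivation)."""
    return first_threshold * segment_len

segments = split_keyword("onseikensakusochi")  # toy 17-phoneme keyword
print(segments)  # -> ['onseik', 'ensaku', 'sochi']
print([second_threshold(0.2, len(s)) for s in segments])
```

Balancing the segment sizes keeps each sub-keyword long enough to be discriminative on the suffix array while bounding the DP cost per segment.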
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP10746090.9A EP2402868A4 (en) | 2009-02-26 | 2010-02-10 | VOICE SEARCH DEVICE AND VOICE SEARCH METHOD |
US13/203,371 US8626508B2 (en) | 2009-02-26 | 2010-02-10 | Speech search device and speech search method |
CN201080009141.XA CN102334119B (zh) | 2009-02-26 | 2010-02-10 | 声音检索装置及声音检索方法 |
JP2011501548A JP5408631B2 (ja) | 2009-02-26 | 2010-02-10 | 音声検索装置および音声検索方法 |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2009-044842 | 2009-02-26 | ||
JP2009044842 | 2009-02-26 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2010098209A1 true WO2010098209A1 (ja) | 2010-09-02 |
Family
ID=42665420
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2010/051937 WO2010098209A1 (ja) | 2009-02-26 | 2010-02-10 | 音声検索装置および音声検索方法 |
Country Status (5)
Country | Link |
---|---|
US (1) | US8626508B2 (ja) |
EP (1) | EP2402868A4 (ja) |
JP (1) | JP5408631B2 (ja) |
CN (1) | CN102334119B (ja) |
WO (1) | WO2010098209A1 (ja) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103282902A (zh) * | 2010-11-09 | 2013-09-04 | 泰必高软件公司 | 字尾数组候选选择和索引数据结构 |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2010008601A (ja) * | 2008-06-25 | 2010-01-14 | Fujitsu Ltd | 案内情報表示装置、案内情報表示方法及びプログラム |
US9311914B2 (en) * | 2012-09-03 | 2016-04-12 | Nice-Systems Ltd | Method and apparatus for enhanced phonetic indexing and search |
KR101537370B1 (ko) * | 2013-11-06 | 2015-07-16 | 주식회사 시스트란인터내셔널 | 녹취된 음성 데이터에 대한 핵심어 추출 기반 발화 내용 파악 시스템과, 이 시스템을 이용한 인덱싱 방법 및 발화 내용 파악 방법 |
WO2015143708A1 (zh) * | 2014-03-28 | 2015-10-01 | 华为技术有限公司 | 后缀数组的构造方法及装置 |
JP6400936B2 (ja) * | 2014-04-21 | 2018-10-03 | シノイースト・コンセプト・リミテッド | 音声検索方法、音声検索装置、並びに、音声検索装置用のプログラム |
JP6003971B2 (ja) * | 2014-12-22 | 2016-10-05 | カシオ計算機株式会社 | 音声検索装置、音声検索方法及びプログラム |
JP6585112B2 (ja) * | 2017-03-17 | 2019-10-02 | 株式会社東芝 | 音声キーワード検出装置および音声キーワード検出方法 |
KR101945234B1 (ko) | 2017-07-14 | 2019-02-08 | (주)인터버드 | 마지막 알파벳 제거 알고리즘을 이용한 반도체 부품 검색 방법 |
CN110970022B (zh) * | 2019-10-14 | 2022-06-10 | 珠海格力电器股份有限公司 | 一种终端控制方法、装置、设备以及可读介质 |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0535292A (ja) * | 1991-07-26 | 1993-02-12 | Fujitsu Ltd | 動的計画法照合装置 |
JP2005257954A (ja) * | 2004-03-10 | 2005-09-22 | Nec Corp | 音声検索装置、音声検索方法および音声検索プログラム |
Family Cites Families (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5749066A (en) * | 1995-04-24 | 1998-05-05 | Ericsson Messaging Systems Inc. | Method and apparatus for developing a neural network for phoneme recognition |
DE69613556T2 (de) * | 1996-04-01 | 2001-10-04 | Hewlett Packard Co | Schlüsselworterkennung |
CN1604185B (zh) * | 2003-09-29 | 2010-05-26 | 摩托罗拉公司 | 利用可变长子字的语音合成系统和方法 |
JP3945778B2 (ja) * | 2004-03-12 | 2007-07-18 | インターナショナル・ビジネス・マシーンズ・コーポレーション | 設定装置、プログラム、記録媒体、及び設定方法 |
KR100664960B1 (ko) * | 2005-10-06 | 2007-01-04 | 삼성전자주식회사 | 음성 인식 장치 및 방법 |
US7831425B2 (en) * | 2005-12-15 | 2010-11-09 | Microsoft Corporation | Time-anchored posterior indexing of speech |
KR100735820B1 (ko) * | 2006-03-02 | 2007-07-06 | 삼성전자주식회사 | 휴대 단말기에서 음성 인식에 의한 멀티미디어 데이터 검색방법 및 그 장치 |
JP4786384B2 (ja) * | 2006-03-27 | 2011-10-05 | 株式会社東芝 | 音声処理装置、音声処理方法および音声処理プログラム |
US8831943B2 (en) * | 2006-05-31 | 2014-09-09 | Nec Corporation | Language model learning system, language model learning method, and language model learning program |
JP4791984B2 (ja) * | 2007-02-27 | 2011-10-12 | 株式会社東芝 | 入力された音声を処理する装置、方法およびプログラム |
EP2135231A4 (en) * | 2007-03-01 | 2014-10-15 | Adapx Inc | SYSTEM AND METHOD FOR DYNAMIC LEARNING |
JP5072415B2 (ja) * | 2007-04-10 | 2012-11-14 | 三菱電機株式会社 | 音声検索装置 |
WO2008142836A1 (ja) * | 2007-05-14 | 2008-11-27 | Panasonic Corporation | 声質変換装置および声質変換方法 |
GB2453366B (en) * | 2007-10-04 | 2011-04-06 | Toshiba Res Europ Ltd | Automatic speech recognition method and apparatus |
US9959870B2 (en) * | 2008-12-11 | 2018-05-01 | Apple Inc. | Speech recognition involving a mobile device |
-
2010
- 2010-02-10 WO PCT/JP2010/051937 patent/WO2010098209A1/ja active Application Filing
- 2010-02-10 EP EP10746090.9A patent/EP2402868A4/en not_active Withdrawn
- 2010-02-10 US US13/203,371 patent/US8626508B2/en not_active Expired - Fee Related
- 2010-02-10 JP JP2011501548A patent/JP5408631B2/ja not_active Expired - Fee Related
- 2010-02-10 CN CN201080009141.XA patent/CN102334119B/zh not_active Expired - Fee Related
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0535292A (ja) * | 1991-07-26 | 1993-02-12 | Fujitsu Ltd | 動的計画法照合装置 |
JP2005257954A (ja) * | 2004-03-10 | 2005-09-22 | Nec Corp | 音声検索装置、音声検索方法および音声検索プログラム |
Non-Patent Citations (5)
Title |
---|
GO KURIKI ET AL.: "Renzoku Tango Onsei Ninshiki Kekka no Yomi Keiretsu o Riyo shita Jisho Mitorokugo no Onsei Bunsho Kensaku", IEICE TECHNICAL REPORT, vol. 108, no. 142, 10 July 2008 (2008-07-10), pages 61 - 66 * |
K. THAMBIRATNAM, S. SRIDHARAN: "Dynamic Match Phone-Lattice Searches For Very Fast And Accurate Unrestricted Vocabulary Keyword Spotting", ICASSP 2005, vol. 1, 2005, pages 465 - 468, XP010792075, DOI: doi:10.1109/ICASSP.2005.1415151 |
N. KANDA ET AL.: "Open-Vocabulary Keyword Detection from Super-Large Scale Speech Database", IEEE MMSP, 2008, pages 939 - 944, XP031356761, DOI: doi:10.1109/MMSP.2008.4665209 |
See also references of EP2402868A4 |
TATSUO YAMASHITA ET AL.: "Suffix Array o Mochiita Full Text Ruiji Yorei Kensaku", IPSJ SIG TECHNICAL REPORTS, vol. 97, no. 86, 12 September 1997 (1997-09-12), pages 23 - 30 * |
Also Published As
Publication number | Publication date |
---|---|
EP2402868A1 (en) | 2012-01-04 |
CN102334119A (zh) | 2012-01-25 |
JPWO2010098209A1 (ja) | 2012-08-30 |
US20120036159A1 (en) | 2012-02-09 |
CN102334119B (zh) | 2014-05-21 |
EP2402868A4 (en) | 2013-07-03 |
JP5408631B2 (ja) | 2014-02-05 |
US8626508B2 (en) | 2014-01-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5408631B2 (ja) | 音声検索装置および音声検索方法 | |
US8332205B2 (en) | Mining transliterations for out-of-vocabulary query terms | |
JP5059115B2 (ja) | 音声キーワードの特定方法、装置及び音声識別システム | |
Siu et al. | Unsupervised training of an HMM-based self-organizing unit recognizer with applications to topic classification and keyword discovery | |
JP5318230B2 (ja) | 認識辞書作成装置及び音声認識装置 | |
JP5257071B2 (ja) | 類似度計算装置及び情報検索装置 | |
JP4930379B2 (ja) | 類似文検索方法、類似文検索システム及び類似文検索用プログラム | |
KR20090130028A (ko) | 분산 음성 검색을 위한 방법 및 장치 | |
WO1996023298A2 (en) | System amd method for generating and using context dependent sub-syllable models to recognize a tonal language | |
US11978434B2 (en) | Developing an automatic speech recognition system using normalization | |
Bhati et al. | Self-expressing autoencoders for unsupervised spoken term discovery | |
KR102167157B1 (ko) | 발음 변이를 적용시킨 음성 인식 방법 | |
Xu et al. | Language independent query-by-example spoken term detection using n-best phone sequences and partial matching | |
JP7423056B2 (ja) | 推論器および推論器の学習方法 | |
JP5436307B2 (ja) | 類似文書検索装置 | |
KR100542757B1 (ko) | 음운변이 규칙을 이용한 외래어 음차표기 자동 확장 방법및 그 장치 | |
JP4270732B2 (ja) | 音声認識装置、音声認識方法、及び音声認識プログラムを記録したコンピュータ読み取り可能な記録媒体 | |
Toma et al. | MaRePhoR—An open access machine-readable phonetic dictionary for Romanian | |
JP2001312293A (ja) | 音声認識方法およびその装置、並びにコンピュータ読み取り可能な記憶媒体 | |
JP5669707B2 (ja) | 類似文書検索装置 | |
JP2938865B1 (ja) | 音声認識装置 | |
JP2000267693A (ja) | 音声処理装置及び索引作成装置 | |
Trung et al. | An image based approach for speech perception | |
US20230107475A1 (en) | Exploring Heterogeneous Characteristics of Layers In ASR Models For More Efficient Training | |
Viana-Cámara et al. | Evolutionary optimization of contexts for phonetic correction in speech recognition systems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WWE | Wipo information: entry into national phase |
Ref document number: 201080009141.X Country of ref document: CN |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 10746090 Country of ref document: EP Kind code of ref document: A1 |
|
DPE1 | Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101) | ||
WWE | Wipo information: entry into national phase |
Ref document number: 2011501548 Country of ref document: JP |
|
REEP | Request for entry into the european phase |
Ref document number: 2010746090 Country of ref document: EP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2010746090 Country of ref document: EP |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
WWE | Wipo information: entry into national phase |
Ref document number: 13203371 Country of ref document: US |