WO2014033855A1 - Speech search device, computer-readable storage medium, and audio search method - Google Patents

Speech search device, computer-readable storage medium, and audio search method

Info

Publication number
WO2014033855A1
Authority
WO
WIPO (PCT)
Prior art keywords
subword
search
string
keyword
score
Prior art date
Application number
PCT/JP2012/071850
Other languages
English (en)
Japanese (ja)
Inventor
龍 武田
直之 神田
康成 大淵
貴志 住吉
Original Assignee
株式会社日立製作所
Priority date
Filing date
Publication date
Application filed by 株式会社日立製作所 filed Critical 株式会社日立製作所
Priority to PCT/JP2012/071850 priority Critical patent/WO2014033855A1/fr
Priority to JP2014532631A priority patent/JP5897718B2/ja
Publication of WO2014033855A1 publication Critical patent/WO2014033855A1/fr

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L2015/088 - Word spotting

Definitions

  • the present invention relates to a voice search device for searching a portion corresponding to a keyword input by a user from voice data to be searched.
  • a large amount of audio data is accumulated in the audio database. For example, a call center records thousands of hours of audio data per day.
  • Voice data is recorded in the voice database for training of operators and confirmation of received contents, and the voice database is used as necessary.
  • time information at which the voice is recorded is given to the voice data, and desired voice data is searched based on the time information.
  • In a search based on time information, it is necessary to know in advance the time at which the desired speech was spoken. For this reason, a search based on time information is unsuitable for finding speech in which a specific utterance is made.
  • In the conventional search method, the user has to listen to the voice data from beginning to end.
  • a technology has been developed that searches the voice database for the location where a specific keyword is spoken.
  • a subword search method that is one of representative methods will be described.
  • voice data is converted into a subword string by a subword recognition process.
  • A subword is a general term for a unit (for example, a phoneme or a syllable) smaller than a word.
  • In the subword search method, the subword string converted from the input keyword is compared with the subword string of the speech data, and the distance between the subwords of the two subword strings is calculated as a score. By sorting and outputting the search results in descending order of the calculated scores, it becomes possible to detect the times at which the keyword is spoken in the voice data.
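  • As a rough illustration of this kind of subword matching (a minimal sketch, not the specific method claimed here), the following Python snippet slides a keyword subword string over a recognized subword string and scores each window by edit distance; the phoneme symbols and strings are hypothetical placeholders.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions to turn a into b."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(a)][len(b)]

# Hypothetical subword (phoneme) strings.
keyword_subwords = ["r", "i", "s", "3~", "tS"]            # keyword "research"
recognized = ["w", "i:", "r", "I", "s", "3~", "tS", "t"]  # subword recognition of speech

# Slide the keyword over the recognized string; a smaller distance means a better match.
best = min(
    (edit_distance(keyword_subwords, recognized[p:p + len(keyword_subwords)]), p)
    for p in range(len(recognized) - len(keyword_subwords) + 1)
)
print(best)  # (distance, start position) of the best-matching span
```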
  • Patent Literature 1: JP 2010-267012 A
  • Patent Literature 2: JP 2011-175046 A
  • Japanese Patent Application Laid-Open No. 2005-228561 discloses that "subword recognition converts speech data into a first subword string in units of phonemes, and converts a search keyword input by the user into a second subword string in units of phonemes; in the first subword string, at least one section having the smallest edit distance from the second subword string is determined as a search result of the search keyword, and the time at which the search keyword is uttered is detected; correct-answer or incorrect-answer data selected by the user is added to the search result, and the subword replacement probability is calculated based on the correct-answer or incorrect-answer data" (see the summary).
  • the phoneme / syllable recognition step assigns an index to at least one or more detection candidates by using a virtual distance between phonemes and between syllables to perform first detection. It has a function of presenting a detection candidate by a distance from the second detection candidate or the third detection candidate based on the acoustic similarity with the candidate.
  • the distance between a subword of a search keyword and a subword of a search candidate is calculated using a general recognition error tendency (misrecognition tendency) and a general acoustic similarity. Even if the misrecognition tendency is different for each keyword, since the distance is calculated using a general misrecognition tendency, erroneous detection of search candidates due to misrecognition of voice data increases. For this reason, there is a drawback in that the search accuracy is lowered, for example, search candidates that do not match the search keyword are sorted in the higher rank.
  • An object of the present invention is to provide a voice search device that improves keyword search accuracy by learning in advance the tendency of erroneous recognition of subwords for each search keyword.
  • A typical example of the invention disclosed in this application is as follows. A voice search device that searches first voice data to be searched for a portion corresponding to a keyword input by a user comprises: an acoustic/language model generation unit that generates, using second voice data, an acoustic model indicating acoustic features and a language model indicating language features; a first subword string conversion unit that converts the second voice data into a first subword string in units of subwords using the acoustic model and the language model; a second subword string conversion unit that converts an assumed keyword that may be designated as the keyword into a second subword string in units of subwords; a misrecognition tendency calculation unit that compares the first subword string with the second subword string and calculates a misrecognition tendency of the first subword string with respect to the second subword string; a third subword string conversion unit that converts the first voice data into a third subword string in units of subwords using the acoustic model and the language model; a fourth subword string conversion unit that converts the keyword into a fourth subword string in units of subwords; a candidate search unit that searches the first voice data for portions corresponding to the keyword as search candidates; a score calculation unit that calculates, based on the misrecognition tendency calculated by the misrecognition tendency calculation unit, a score based on a subword score of the third subword string of each search candidate searched by the candidate search unit with respect to the fourth subword string; and a search result output unit that outputs a search result including the score calculated by the score calculation unit and the search candidate corresponding to the score.
  • According to the present invention, it is possible to improve keyword search accuracy by learning in advance the misrecognition tendency of subwords for each search keyword.
  • FIG. 1 is a block diagram of the speech data retrieval apparatus 1 according to the first embodiment of the present invention.
  • the voice data search device 1 includes a pre-processing unit that calculates in advance a misrecognition tendency for each assumed keyword, and a search processing unit that searches voice data based on the input keyword.
  • The pre-processing unit includes learning-labeled speech data 101, an acoustic/language model learning unit 102, an acoustic model 103, a language model 104, a speech recognition unit 105, an assumed keyword generation unit 106, a query subword string error table learning unit 107, a query subword string error table 108, search target speech data 109, an indexing unit 110, and an index table 111.
  • the search processing unit includes a keyword input unit 112, a subword string conversion unit 113, a candidate search unit 114, a subword string distance evaluation unit 115, a search result integration unit 116, and a search result display unit 117.
  • the acoustic / language model learning unit 102, the speech recognition unit 105, the assumed keyword generation unit 106, the query subword sequence error table learning unit 107, the indexing unit 110, the subword sequence conversion unit 113, the candidate search unit 114, and the subword sequence distance evaluation unit 115, the search result integration unit 116, and the search result display unit 117 are realized by executing a program stored in a memory (not shown) by a CPU (not shown) of the voice data search device 1.
  • the learning-labeled speech data 101 stores text indicating speech content, speech waveforms, and the like.
  • the text indicating the utterance content may be, for example, a text transcribed from an audio track extracted from a television, a reading speech corpus, a normal conversation, and the like.
  • the learning-labeled voice data 101 may also store information indicating the identification information (ID) of the speaker and the presence or absence of noise.
  • the acoustic / language model learning unit 102 uses the learning-labeled speech data 101 to set parameters of a statistical model expressing speech features and parameters of a statistical model expressing language features.
  • the acoustic model 103 stores parameters of a statistical model that expresses speech features.
  • the language model 104 stores parameters of a statistical model expressing language features.
  • the speech recognition unit 105 refers to the acoustic model 103 and the language model 104, recognizes the speech data 101 with learning label, and outputs a subword sequence (for example, a speech-recognized phoneme sequence).
  • the assumed keyword generation unit 106 outputs a subword string of assumed keywords that can be search keywords.
  • the assumed keyword is set in advance.
  • The query subword string error table learning unit 107 uses the subword string of the learning-labeled speech data 101 output from the speech recognition unit 105 (hereinafter referred to as the learning subword string), the text indicating the utterance content included in the learning-labeled speech data 101, and the subword string of each assumed keyword output from the assumed keyword generation unit 106 (hereinafter referred to as the assumed keyword subword string), numerically evaluates, for each assumed keyword, the misrecognition tendency of the learning subword string with respect to the assumed keyword subword string, and records the result in the query subword string error table 108.
  • Search target voice data 109 is voice data to be searched.
  • the search target audio data 109 is, for example, audio data extracted from a television, audio data recorded at a conference, audio data recording a telephone line call, and the like. Note that the search target audio data 109 may be a plurality of files for each type.
  • the search target voice data 109 may be given information such as speaker identification information.
  • the indexing unit 110 converts the search target speech data 109 into a subword string using the acoustic model 103 and the language model 104.
  • The indexing unit 110 generates an index table 111 that includes the subword string of the search target speech data 109 (hereinafter referred to as the search target subword string), the acoustic likelihood of the search target speech data 109, an N-gram index based on the subwords of the search target speech data 109, and other information, and stores the generated index table 111 in a storage area (not shown).
  • the keyword input unit 112 receives a keyword input by the user.
  • the subword string conversion unit 113 converts the keyword received by the keyword input unit 112 into a subword string (hereinafter referred to as keyword subword string), and outputs the keyword subword string to the candidate search unit 114.
  • the candidate search unit 114 refers to the keyword subword string output from the subword string conversion unit 113 and the index table 111, specifies a part where the keyword is uttered in the search target speech data 109 as a search candidate, and selects the specified search candidate. The result is output to the subword string distance evaluation unit 115.
  • The subword string distance evaluation unit 115 calculates the distance (score) between the keyword subword string output from the subword string conversion unit 113 and the subword string corresponding to each search candidate (search candidate subword string) output from the candidate search unit 114, with reference to the query subword string error table 108 and the language model 104. Then, the subword string distance evaluation unit 115 outputs the search candidates and the calculated scores to the search result integration unit 116.
  • the search result integration unit 116 sorts the search candidates output by the subword string distance evaluation unit 115 based on the search candidate scores, and outputs the search results to the search result display unit 117 as search results.
  • The search result display unit 117 generates a search result display screen that includes display areas for the file name, time, and score of each search candidate output from the search result integration unit 116, with the search candidates sorted in score order, and sends the generated search result display screen to an output device.
  • each component of the speech data retrieval apparatus 1 has been described as being mounted on the same computer. However, each component may be mounted on another computer.
  • For example, the voice data search device 1 may be configured as a system including a terminal and a server, in which the terminal includes the keyword input unit 112 and the search result display unit 117 and the server includes the other components.
  • the pre-processing unit and the search processing unit may be implemented on different computers.
  • The search target speech data 109 may be stored in an external storage device.
  • the index table 111, the query subword string error table 108, the acoustic model 103, and the language model 104 are generated in advance by another computer, and the generated index table 111, the query subword string error table 108, the acoustic model 103, and the language model 104 are copied to the computer that executes the search process.
  • The speech data retrieval apparatus 1 executes a parameter setting process in which the acoustic/language model learning unit 102 sets the parameters of the statistical model expressing speech features (acoustic model) and the statistical model expressing language features (language model) that are used to recognize speech data.
  • FIG. 2 is a flowchart of the parameter setting process of the acoustic model and the language model according to the first embodiment of the present invention.
  • The problem of recognizing speech data is reduced to, for example, a posterior probability maximization search problem (maximum a posteriori problem).
  • In the posterior probability maximization search problem, a solution that is the recognition result of the speech data is obtained based on an acoustic model and a language model learned from a large amount of training speech data.
  • the acoustic / language model learning unit 102 sets parameters of the acoustic model and the language model using the learning-labeled speech data 101 (201), and ends the process.
  • For example, a hidden Markov model (HMM) can be adopted for setting the acoustic model parameters, and an N-Gram model can be adopted for setting the language model parameters.
  • the voice data recognition technique and the parameter setting technique of the acoustic model and the language model are widely known techniques, and thus description thereof is omitted.
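  • For reference, the posterior probability maximization decoding mentioned above is commonly written as follows (a standard formulation, not a formula given in this description), where X is the observed acoustic feature sequence, W a candidate word or subword sequence, P(X|W) the acoustic model, and P(W) the language model:

```latex
\hat{W} = \operatorname*{arg\,max}_{W} P(W \mid X)
        = \operatorname*{arg\,max}_{W} P(X \mid W)\, P(W)
```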
  • the voice data search device 1 executes index table generation processing for generating the index table 111 in order to enable the search target voice data 109 to be searched by the indexing unit 110.
  • FIG. 3 is a flowchart of the index table generation process of the first embodiment of the present invention.
  • The indexing unit 110 divides the audio data of the plurality of audio files constituting the search target audio data 109 into appropriate lengths (301). For example, when the audio power of the search target audio data 109 remains at or below a predetermined threshold θp for at least a predetermined time threshold θt, the indexing unit 110 divides the search target audio data 109 at that position. Note that information indicating the original file and information indicating the start time and end time of the divided audio section are given to each divided audio data (audio section).
  • the dividing method of the search target voice data 109 includes, for example, a method using the number of zero crossings, a method using GMM (Gaussian Mixture Model), a method using a voice recognition technique, and the like.
  • Various methods are widely known. In this embodiment, any of these methods may be used.
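  • A minimal sketch of the power-threshold segmentation described in step 301 is given below; the frame length, the power computation, and the example values of the thresholds θp and θt are illustrative assumptions, not values taken from this description.

```python
import numpy as np

def split_by_power(samples, rate, theta_p=1e-4, theta_t=0.3, frame_ms=10):
    """Split a 1-D numpy float array of audio samples into sections at pauses
    where the mean frame power stays at or below theta_p for at least theta_t
    seconds (all values here are illustrative)."""
    frame = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame
    power = np.array([np.mean(samples[i * frame:(i + 1) * frame] ** 2) for i in range(n_frames)])
    quiet = power <= theta_p
    min_quiet = int(theta_t * 1000 / frame_ms)   # required pause length in frames

    sections, start, run = [], 0, 0
    for i, q in enumerate(quiet):
        run = run + 1 if q else 0
        if run == min_quiet:                      # pause long enough: cut before it
            end = (i - run + 1) * frame
            if end > start:
                sections.append((start / rate, end / rate))   # (start_sec, end_sec)
            start = (i + 1) * frame
    if start < len(samples):
        sections.append((start / rate, len(samples) / rate))
    return sections
```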
  • The indexing unit 110 performs subword recognition processing on all speech sections divided in step 301 and converts all of them into subword strings (302, 303). Specifically, the indexing unit 110 converts each speech section divided in step 301 into subwords in units of syllables or phonemes and generates a subword string. The indexing unit 110 registers the converted subword string (subword recognition result) and the time corresponding to the subword string in the index table 111.
  • the indexing unit 110 registers N-Gram index information in the index table 111 for the purpose of speeding up the search (304), and ends the processing.
  • the N-Gram index information is a well-known method in the normal text search technique, and thus the description thereof is omitted.
  • the process of step 304 does not necessarily have to be executed.
  • the keyword search of the search target audio data 109 becomes possible.
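  • As a rough illustration of the N-Gram index in step 304, the following sketch maps each subword N-gram of a recognized speech section to the positions where it occurs; the table layout and the example data are assumptions, not the actual structure of the index table 111.

```python
from collections import defaultdict

def build_ngram_index(sections, n=3):
    """sections: list of (file_name, start_time, subword_list).
    Returns {ngram_tuple: [(file_name, start_time, position), ...]}."""
    index = defaultdict(list)
    for file_name, start_time, subwords in sections:
        for pos in range(len(subwords) - n + 1):
            gram = tuple(subwords[pos:pos + n])
            index[gram].append((file_name, start_time, pos))
    return index

# Hypothetical subword recognition result for one speech section.
index_table = build_ngram_index([("callA.wav", 12.5, ["r", "I", "s", "3~", "tS", "t"])])
```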
  • In the following, description is given on the assumption that only the so-called 1-best subword recognition result is registered in the index table 111, but a plurality of subword recognition results may be registered in the index table 111 in N-best format or network format.
  • index table generation processing only needs to be executed once at the time of the first operation, for example.
  • the assumed keyword generation unit 106 executes an assumed keyword subword string conversion process for converting an assumed keyword into a subword string.
  • the assumed keyword generation unit 106 can employ the process shown in FIG. 4 and the process shown in FIG.
  • FIG. 4 is a flowchart of an assumed keyword subword conversion process according to the first embodiment of the present invention.
  • the assumed keyword generation unit 106 converts all preset assumed keywords into subword strings (401, 402), and ends the process.
  • FIG. 6 is an explanatory diagram of conversion of an assumed keyword into a subword string according to the embodiment of this invention.
  • The conversion process to a subword string is executed based on preset conversion rules and a general dictionary. For example, if the conversion rules are set so that “re” is converted to “ri” and “search” is converted to “s-3 ⁇ -tS”, then “research” is converted to “r-i-s-3 ⁇ -tS”.
  • a predetermined word is converted into a subword string by adding a conversion rule and a dictionary manually.
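  • A minimal sketch of this kind of rule-based conversion is shown below; the rule table, the greedy longest-match strategy, and the phoneme symbols are illustrative assumptions rather than details of the actual dictionary.

```python
# Illustrative conversion rules; "3~" stands in for the garbled phoneme symbol above.
RULES = {"re": ["r", "i"], "search": ["s", "3~", "tS"]}

def to_subwords(word, rules=RULES):
    """Greedy longest-match conversion of a keyword into a subword string."""
    subwords, i = [], 0
    while i < len(word):
        for length in range(len(word) - i, 0, -1):   # try the longest substring first
            piece = word[i:i + length]
            if piece in rules:
                subwords.extend(rules[piece])
                i += length
                break
        else:
            subwords.append(word[i])   # unknown character: pass it through as-is
            i += 1
    return subwords

print(to_subwords("research"))  # ['r', 'i', 's', '3~', 'tS']
```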
  • FIG. 5 is a flowchart of an assumed keyword subword conversion process according to the first embodiment of the present invention.
  • The assumed keyword generation unit 106 converts all the preset assumed keywords into subword strings, refers to a related word dictionary stored in the speech data search device 1, adds related words associated with each assumed keyword as additional assumed keywords, and also converts the added assumed keywords into subword strings (501, 502). In this way, the keywords that can be searched can be expanded.
  • the assumed keyword generation unit 106 may statistically calculate the degree of association of each word from a large amount of text data, and may set the top N cases having a high degree of association of the assumed keyword set in advance as related words. Note that methods for statistically calculating the degree of association of each word from a large amount of text data have been widely studied in the field of natural language processing, and any method can be adopted.
  • FIG. 7 is a flowchart of the process of generating the query subword string error table 108 by the query subword string error table learning unit 107 according to the first embodiment of this invention.
  • The query subword string error table learning unit 107 executes its processing when the assumed keyword subword strings, obtained by the assumed keyword generation unit 106 converting the assumed keywords, and the learning subword strings, obtained by the speech recognition unit 105 converting the learning-labeled speech data 101, are input.
  • the query subword string error table learning unit 107 searches the occurrence position of each assumed keyword subword string from the learning subword string.
  • a subword string corresponding to an assumed keyword subword string starting from the searched appearance position of the learning subword string is referred to as a corresponding subword string.
  • the query subword string error table learning unit 107 sets the alignment so that the edit distance between each assumed keyword subword string and the corresponding subword string is minimized (701, 702).
  • the edit distance indicates how many times characters need to be inserted, deleted, and replaced in order to match one word with the other.
  • The query subword string error table learning unit 107 can efficiently calculate the edit distance between the assumed keyword subword string and the corresponding subword string by using dynamic programming. Since dynamic programming is a well-known technique, its description is omitted.
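  • A minimal sketch of this alignment step (702) using the standard Levenshtein dynamic-programming recurrence is given below; the traceback recovers aligned subword pairs, with "phi" marking an insertion or deletion, and the example strings are hypothetical.

```python
def align(ref, hyp, gap="phi"):
    """Align two subword strings so that the edit distance is minimized.
    Returns a list of (ref_subword, hyp_subword) pairs; gap marks ins/del."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    # Trace back through the table to recover one minimum-cost alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1):
            pairs.append((ref[i - 1], hyp[j - 1])); i, j = i - 1, j - 1   # match or substitution
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            pairs.append((ref[i - 1], gap)); i -= 1                       # deletion
        else:
            pairs.append((gap, hyp[j - 1])); j -= 1                       # insertion
    return list(reversed(pairs))

# Hypothetical example: one substitution ("i" misrecognized as "I").
print(align(["r", "i", "s", "3~", "tS"], ["r", "I", "s", "3~", "tS"]))
# [('r', 'r'), ('i', 'I'), ('s', 's'), ('3~', '3~'), ('tS', 'tS')]
```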
  • the query subword string error table learning unit 107 counts the number of subword errors for each assumed keyword based on the alignment set in step 702 (703). The processing in steps 701 to 703 will be described in detail with reference to FIGS.
  • the query subword string error table learning unit 107 calculates a subword error probability for each assumed keyword based on the number of subword errors counted in the processing of step 703, and registers it in the query subword string error table 108 (704). The process is terminated.
  • the processing in step 704 will be described in detail with reference to FIGS.
  • FIG. 8 is an explanatory diagram of the assumed keyword subword string and the corresponding subword string of the first embodiment of the present invention.
  • FIG. 8 illustrates an example in which the learning subword string (subword recognition result) output from the speech recognition unit 105 is a 1-best recognition result; however, this embodiment can also be applied when the learning subword string is an N-best recognition result.
  • the correct phoneme string “ris3 ⁇ tS” in FIG. 8 is an assumed keyword subword string, and the speech recognition results 1 to 3 are corresponding subword strings.
  • In speech recognition result 1, “i” in the assumed keyword subword string is replaced with “I”.
  • In speech recognition result 2, “tS” in the assumed keyword subword string is deleted; in speech recognition result 3, “t” and “r” are inserted, and “3 ⁇ ” in the assumed keyword subword string is replaced with “E”.
  • FIG. 9 is an explanatory diagram of the alignment of the assumed keyword subword string and the corresponding subword string in the first embodiment of the present invention.
  • FIG. 9 describes the alignment between the assumed keyword subword string “ris3 ⁇ tS” and the corresponding subword string that is the speech recognition result 1 shown in FIG.
  • the alignment between the assumed keyword subword string and the corresponding subword string is set so that the edit distance is minimized.
  • Specifically, alignment is set between “r” of the assumed keyword subword string and “r” of speech recognition result 1, between “i” and “I”, between “ ⁇ ” and “t”, between “3 ⁇ ” and “3 ⁇ ”, and between “tS” and “tS”.
  • the query subword string error table learning unit 107 compares the set alignment subwords and counts the number of alignments where the subwords do not match as the number of subword errors.
  • the query subword string error table learning unit 107 calculates a subword error probability in the process of step 704. This subword error probability calculation process will be described with reference to FIGS.
  • First, the outline of the processing in step 704 will be described.
  • Here, the case where the query subword string error table learning unit 107 calculates the subword error probability using the maximum likelihood estimation method is described as an example.
  • a method of calculating the subword error probability that one subword “a” in a certain assumed keyword subword string is erroneously recognized as the subword “b” in the corresponding subword string will be described.
  • The query subword string error table learning unit 107 calculates the number of occurrences “Na” of the subword “a” by multiplying the number of times the subword “a” appears in a certain assumed keyword subword string by the number of occurrences of that assumed keyword subword string.
  • The query subword string error table learning unit 107 also counts the number of times “Nb” that the subword “a” of the assumed keyword subword string is erroneously recognized as the subword “b” in the corresponding subword strings. Then, the query subword string error table learning unit 107 calculates Nb/Na as the subword error probability.
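  • A minimal sketch of this maximum likelihood estimate, computed separately for each assumed keyword from the alignments produced in step 702, is shown below; the alignment-pair input format (with "phi" marking insertions and deletions) is the assumption used in the earlier alignment sketch.

```python
from collections import defaultdict

def error_probabilities(aligned_pairs_list):
    """aligned_pairs_list: one alignment (list of (keyword_subword, recognized_subword))
    per occurrence of a single assumed keyword in the learning data.
    Returns {(a, b): probability that a is misrecognized as b} for this keyword."""
    appear = defaultdict(int)   # Na: occurrences of each keyword subword
    confuse = defaultdict(int)  # Nb: times subword a was recognized as b (b may be "phi")
    for pairs in aligned_pairs_list:
        for a, b in pairs:
            if a == "phi":      # insertion in the recognition result: no keyword subword
                continue
            appear[a] += 1
            if a != b:
                confuse[(a, b)] += 1
    return {(a, b): n / appear[a] for (a, b), n in confuse.items()}

# Two hypothetical occurrences of the same assumed keyword in the learning data.
occurrences = [
    [("r", "r"), ("i", "I"), ("s", "s")],
    [("r", "r"), ("i", "i"), ("s", "s")],
]
print(error_probabilities(occurrences))  # {('i', 'I'): 0.5}
```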
  • FIG. 10 is a specific explanatory diagram of the subword error probability calculation process of the first embodiment of the present invention.
  • In FIG. 10, the assumed keyword subword strings 1001 are “r-i-s-3 ⁇ -tS” and “f-O-r-k- ⁇ -s-t”. The misrecognition patterns of the subword “s” in the corresponding subword strings of each assumed keyword subword string 1001 are registered in 1002 shown in FIG. 10.
  • The total error probability is calculated by dividing the number of times the subword “s” is erroneously recognized as a given subword, counted regardless of the assumed keyword subword string, by the total number of occurrences of the subword “s”. Since the total number of appearances of “s” in FIG. 10 is 19, the denominator for calculating the total error probability 1003 is 19.
  • For example, the number of times the subword “s” is erroneously recognized as the subword “I” is four in the assumed keyword subword string “r-i-s-3 ⁇ -tS” and one in the assumed keyword subword string “f-O-r-k- ⁇ -s-t”, so the total subword error probability that the subword “s” is erroneously recognized as the subword “I” is 5/19.
  • the subword error probability is calculated for each assumed keyword.
  • In the assumed keyword subword string “r-i-s-3 ⁇ -tS”, the number of occurrences of the subword “s” is 9 and the number of times the subword “s” is erroneously recognized as the subword “I” is 4, so the probability that the subword “s” is mistaken for the subword “I” is 4/9, as indicated by 1004.
  • In the assumed keyword subword string “f-O-r-k- ⁇ -s-t”, the number of occurrences of the subword “s” is 10 and the number of times the subword “s” is erroneously recognized as the subword “I” is 1, so the probability that the subword “s” is mistaken for the subword “I” is 1/10, as indicated by 1004.
  • The point is that the subword error probability is calculated for each assumed keyword. Since the tendency of subword misrecognition differs from word to word and phrase to phrase, the difference in misrecognition tendency can be accurately captured by calculating the subword error probability for each assumed keyword.
  • Even when the learning subword string is an N-best recognition result, the query subword string error table learning unit 107 can calculate the subword error probability for each assumed keyword by the same procedure. In this case, the amount of data handled by the query subword string error table learning unit 107 is N times as large.
  • The query subword string error table learning unit 107 does not necessarily need to calculate the total error probability 1003 in FIG. 10, but it may calculate the total error probability 1003 and register it in the query subword string error table 108.
  • FIG. 11 is an explanatory diagram of the query subword string error table 108 according to the first embodiment of this invention.
  • In the query subword string error table 108, for each assumed keyword subword string, the subword error probability that each subword constituting the assumed keyword subword string is erroneously recognized as another subword is registered.
  • the query subword string error table 108 includes an assumed keyword subword string 1101, an assumed keyword subword 1102, and a subword 1103.
  • In the assumed keyword subword string 1101, the assumed keyword subword string is registered.
  • In the assumed keyword subword 1102, the subwords constituting the assumed keyword subword string are registered, and all subwords are registered in the subword 1103.
  • At the intersection of a certain assumed keyword subword 1102 and a certain subword 1103, the subword error probability that the assumed keyword subword 1102 is erroneously recognized as the subword 1103 is registered.
  • the subword error probability that the subword “r” of the assumed keyword subword string “ris3 ⁇ tS” is erroneously recognized as the subword “m” is 0.02.
  • FIG. 12 is an explanatory diagram of a process of calculating the number of subword errors between the assumed keyword subword string and the corresponding subword string using the joint 2-Gram according to the first embodiment of the present invention.
  • By counting the number of subword errors using a joint N-Gram, the query subword string error table learning unit 107 can count subword errors in consideration not only of whether the assumed keyword subword string and the corresponding subword string match, but also of the relationship with the N-1 preceding subwords.
  • Specifically, given a target subword of the assumed keyword subword string, the up to N-1 subwords preceding it in the assumed keyword subword string, and the up to N-1 subwords preceding the aligned subword in the corresponding subword string, the query subword string error table learning unit 107 counts which subword the target subword is erroneously recognized as, that is, what the subword aligned with it in the corresponding subword string is.
  • For example, “r, i, r → I” indicates that, when the target subword “i” of the assumed keyword subword string, the subword “r” immediately before the target subword in the assumed keyword subword string, and the subword “r” immediately before the aligned subword in the corresponding subword string are given, the target subword “i” is erroneously recognized as the subword “I”.
  • The query subword string error table learning unit 107 stores subword transitions such as “r, i, r → I” shown in FIG. 12, and counts the number of subword errors.
  • The query subword string error table learning unit 107 calculates a subword error probability based on the subword transitions. Specifically, to obtain the subword error probability of the subword transition “r, i, r → I”, the query subword string error table learning unit 107 counts, from the subword transitions of the corresponding subword strings of the assumed keyword subword string “ris3 ⁇ tS”, the number of occurrences Na of the set “r, i, r” and the number of times Nb that the subword aligned with the target subword “i” of the assumed keyword subword string becomes “I”, and calculates the subword error probability as Nb/Na.
  • the query subword string error table learning unit 107 may cluster the assumed keywords and share subword transitions between the same classes to calculate a subword error probability.
  • the query subword string error table learning unit 107 may cluster the assumed keywords based on the edit distance between the assumed keyword subword strings, or may cluster the assumed keywords using the k-means method or the like.
  • the query subword string error table learning unit 107 calculates a subword error probability based on the subword transition in the same class. Since clustering using the k-means method is well known, the description thereof is omitted.
  • The subword error probability of a subword transition that does not appear may be approximated by a subword error probability calculated using a joint (N-1)-Gram.
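  • A minimal sketch of counting joint 2-Gram subword transitions such as “r, i, r → I” from an alignment is given below; the boundary marker, the data structures, and the choice to skip pure insertions are illustrative assumptions, not details taken from this description.

```python
from collections import defaultdict

def count_joint_2gram(aligned_pairs, counts=None):
    """aligned_pairs: [(keyword_subword, recognized_subword), ...] for one occurrence.
    Counts transitions keyed by (prev_keyword_subword, keyword_subword,
    prev_recognized_subword) -> recognized_subword, e.g. ('r', 'i', 'r') -> 'I'."""
    if counts is None:
        counts = defaultdict(lambda: defaultdict(int))
    prev_kw, prev_rec = "#", "#"            # "#" marks the start of the keyword
    for kw, rec in aligned_pairs:
        if kw != "phi":                     # skip pure insertions for this count
            counts[(prev_kw, kw, prev_rec)][rec] += 1
        prev_kw, prev_rec = kw, rec
    return counts

counts = count_joint_2gram([("r", "r"), ("i", "I"), ("s", "s"), ("3~", "3~"), ("tS", "tS")])
# counts[("r", "i", "r")]["I"] == 1, matching the "r, i, r -> I" example above;
# the error probability would be this count divided by the total count for that key.
```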
  • FIG. 13 is an explanatory diagram of the query subword string error table 108 when all the assumed keywords of the embodiment of the present invention are assigned to one class.
  • the query subword string error table 108 includes subword transitions 1301 and recognition results 1302.
  • In the subword transition 1301, a set consisting of the target subword of the assumed keyword subword string, the subword immediately before the target subword of the assumed keyword subword string, and the corresponding preceding subword of the corresponding subword string is registered.
  • In the recognition result 1302, the recognition result, in the corresponding subword string, of the target subword of the assumed keyword subword string is registered.
  • At the intersection of a certain subword transition 1301 and a certain recognition result 1302, the subword error probability that the target subword of the assumed keyword subword string in the subword transition 1301 is recognized as the recognition result 1302 is registered.
  • this subword error probability is a subword error probability according to joint 1-Gram.
  • The total error probability 1003 shown in FIG. 10 may also be registered in the query subword string error table 108 shown in FIG. 13.
  • the speech data search apparatus 1 can accept keyword input from the user.
  • the keyword input unit 112 shown in FIG. 1 accepts a keyword input by the user.
  • the keyword input unit 112 may directly accept a keyword via an input device (for example, a keyboard and a touch pad), or may accept a keyword input by another computer via a network.
  • the keyword input unit 112 may accept a keyword input by voice and convert it into a keyword character string using voice recognition.
  • the keyword input unit 112 outputs the accepted keyword to the sub-word string conversion unit 113.
  • the subword string conversion unit 113 converts the keyword input from the keyword input unit 112 into a subword string (keyword subword string) and outputs it to the candidate search unit 114. Note that the method of converting a keyword into a subword string by the subword string converter 113 is the same as the method of converting the assumed keyword into a subword string by the assumed keyword generator 106, and thus the description thereof is omitted.
  • FIG. 14 is a flowchart of processing of the candidate search unit 114 according to the first embodiment of this invention.
  • When the keyword subword string is input from the subword string conversion unit 113, the candidate search unit 114 refers to the index table 111, searches the search target speech data 109 for candidate locations where the keyword is uttered (search candidates) (1401), and ends the process. For example, the candidate search unit 114 divides the keyword subword string into N-grams, allowing overlap, and sets the sections registered in the N-gram index of the index table 111 that correspond to the divided N-grams as search candidates.
  • FIG. 15 is an explanatory diagram of keyword subword strings divided every 3-gram according to the first embodiment of this invention.
  • the keyword subword string “r i s 3 ⁇ tS” is divided every 3-gram as “r i s”, “i s 3 ⁇ ”, and “s 3 ⁇ tS”.
  • the N-gram index of the index table 111 is a technique that is widely and generally used in the field of document search, and thus description thereof is omitted.
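  • A minimal sketch of this candidate lookup follows, reusing the index layout from the earlier N-gram index sketch; treating any indexed position that shares at least one N-gram with the keyword as a candidate is a simplifying assumption.

```python
def keyword_ngrams(subwords, n=3):
    """Overlapping N-grams of the keyword subword string."""
    return [tuple(subwords[i:i + n]) for i in range(len(subwords) - n + 1)]

def search_candidates(keyword_subwords, index, n=3):
    """Collect every indexed position that shares an N-gram with the keyword.
    index: {ngram_tuple: [(file_name, start_time, position), ...]}."""
    candidates = set()
    for gram in keyword_ngrams(keyword_subwords, n):
        for hit in index.get(gram, []):
            candidates.add(hit)
    return candidates

print(keyword_ngrams(["r", "i", "s", "3~", "tS"]))
# [('r', 'i', 's'), ('i', 's', '3~'), ('s', '3~', 'tS')]
```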
  • FIG. 16 is a flowchart of processing of the sub-word string distance evaluation unit 115 according to the first embodiment of this invention.
  • The subword string distance evaluation unit 115 refers to the query subword string error table 108, calculates the distances between the keyword subword string and the subword strings (search candidate subword strings) corresponding to all the search candidates found by the candidate search unit 114 (1601, 1602), and ends the process.
  • the distance calculation method using the query subword string error table 108 shown in FIG. 11 will be specifically described.
  • the subword string distance evaluation unit 115 sets the alignment between the keyword subword string and the search candidate subword string so that the edit distance is minimized.
  • the subword string distance evaluation unit 115 acquires a record corresponding to the keyword subword string from the records registered in the query subword string error table 108.
  • The subword string distance evaluation unit 115 selects one subword to be processed from the keyword subword string (first subword) and determines whether it matches the subword of the search candidate subword string aligned with the selected subword (second subword). When the first subword and the second subword match, the subword string distance evaluation unit 115 adds “1” to the score.
  • When the first subword and the second subword do not match, the subword string distance evaluation unit 115 obtains the subword error probability from the item at the row of the first subword and the column of the second subword in the record acquired from the query subword string error table 108. Then, the subword string distance evaluation unit 115 adds the acquired subword error probability to the score.
  • The subword string distance evaluation unit 115 ends the process when all the subwords in the keyword subword string have been processed; if not all subwords have been processed, it selects an unprocessed subword as the next processing target and executes the above processing on it.
  • a search candidate with a higher score is more likely to match the keyword.
  • Alternatively, the score may be defined so that a search candidate with a lower score is more likely to match the keyword.
  • In that case, when the first subword and the second subword match, the subword string distance evaluation unit 115 leaves the score as it is.
  • When they do not match, the subword string distance evaluation unit 115 adds to the score a value obtained by subtracting the subword error probability corresponding to these subwords from “1”. As a result, a search candidate with a lower score is more likely to match the keyword.
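  • A minimal sketch of the higher-is-better score accumulation described above (a match adds 1; a mismatch adds the keyword-specific subword error probability) is shown below; the alignment pairs can be produced by the earlier alignment sketch, and the flat dictionary layout of the error-table record is an assumption.

```python
def candidate_score(aligned_pairs, error_table):
    """aligned_pairs: [(keyword_subword, candidate_subword), ...] from the
    minimum-edit-distance alignment ("phi" marks an insertion or deletion).
    error_table: {(keyword_subword, recognized_subword): error probability}
    taken from this keyword's record in the query subword string error table."""
    score = 0.0
    for kw, cand in aligned_pairs:
        if kw == "phi":                      # insertion in the candidate: no keyword subword
            continue
        if kw == cand:
            score += 1.0                     # exact match
        else:
            score += error_table.get((kw, cand), 0.0)   # a likely misrecognition still scores
    return score

# Hypothetical example: one substitution i -> I with a learned error probability of 0.4.
pairs = [("r", "r"), ("i", "I"), ("s", "s"), ("3~", "3~"), ("tS", "tS")]
print(candidate_score(pairs, {("i", "I"): 0.4}))  # 4.4
```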
  • the subword string distance evaluation unit 115 may use an endpoint-free Viterbi algorithm or dynamic programming for calculating the scores of the keyword subword string and the search candidate subword string. The details of the endpoint-free Viterbi algorithm and the dynamic programming will be omitted.
  • The subword string distance evaluation unit 115 may also treat the whole set of keywords as one class, calculate a subword error probability based on a single subword error probability or a joint 1-, 2-, or 3-Gram, and use the calculated subword error probability for calculating the score.
  • methods for approximating the appearance probability of an unknown word with N-gram or a known subword error probability are widely known, and thus description thereof is omitted.
  • The subword string distance evaluation unit 115 may also approximate the appearance probability of the keyword subword string and the appearance probability of the subword string of the candidate section with N-gram probabilities, constrain the approximated N-gram probabilities as a prior probability and a normalization term, and calculate the score using the Viterbi algorithm.
  • When a plurality of recognition results are registered for a section, the subword string distance evaluation unit 115 calculates a score for each recognition result and uses the weighted sum of the calculated scores as the score of the section. In this way, a score based on the distance is given to each search candidate.
  • FIG. 17 is a flowchart illustrating the processing of the search result integration unit 116 according to the first embodiment of this invention. Based on the subword string score of each search candidate calculated by the subword string distance evaluation unit 115, the search result integration unit 116 sorts the search candidates according to the degree of matching with the keyword and outputs them to the search result display unit 117 as search results (1701), and ends the process.
  • the processing in step 1701 can use a well-known quick sort or radix sort.
  • the search result includes the file name, time, and score of each search candidate.
  • the search result integration unit 116 may output the search result to another application or may output it to another computer.
  • FIG. 18 is a flowchart showing processing of the search result display unit 117 according to the first embodiment of this invention.
  • The search result display unit 117 generates a search screen 1900 (see FIG. 19) for displaying the search results input from the search result integration unit 116 in descending order of matching with the keyword, outputs the generated search screen to a display device (not shown) (1801), and ends the process.
  • FIG. 19 is an explanatory diagram of a search screen 1900 according to the first embodiment of this invention.
  • the search screen 1900 includes a file name 1901, a time 1902, a score 1903, and a play button 1904.
  • the file name 1901 displays the name of the search candidate file
  • the time 1902 displays the time when the search candidate appears in the file
  • the score 1903 displays the search candidate score.
  • the voice data search apparatus 1 plays back the voice data near the time displayed at the time 1902 corresponding to the pressed playback button 1904.
  • the user can confirm the content of the sound near the search candidate by actually listening to the reproduced sound data.
  • the search screen 1900 may be output not to the display device but to other output devices (such as a printer or a storage device) and other computers.
  • As described above, by learning the misrecognition tendency of subwords for each assumed keyword in advance, the speech data retrieval apparatus 1 can calculate scores between the keyword subword string and the search candidate subword strings in consideration of the misrecognition tendency of each phrase, and the accuracy of voice data search can be improved.
  • In the second embodiment, the speech data search apparatus 1 calculates a score related to the acoustics of the keyword and each search candidate (acoustic score), and searches the search target audio data 109 for the keyword based on the calculated acoustic score and the subword-related score (subword score) calculated as in FIG. 16 of the first embodiment. Thereby, the speech data retrieval apparatus 1 can further improve the retrieval accuracy.
  • FIG. 20 is a block diagram of the speech data retrieval apparatus 1 according to the second embodiment of the present invention.
  • The voice data search apparatus 1 of this embodiment includes an acoustic distance evaluation unit 2016 in addition to the components of the voice data search apparatus 1 of the first embodiment, and its search result integration unit 2017 differs from the search result integration unit 116 of the first embodiment.
  • The acoustic distance evaluation unit 2016 refers to the acoustic model 103 and the language model 104, calculates an acoustic score indicating the acoustic distance (closeness) between the keyword and each search candidate, and outputs the calculated acoustic score to the search result integration unit 2017.
  • the acoustic score can be expressed using, for example, a ratio between the acoustic likelihood (or appearance probability) of the keyword and the acoustic likelihood (appearance probability) of the search candidate. Since various methods can be used for the calculation method of the acoustic score, the description is omitted.
  • The search result integration unit 2017 calculates a search score by integrating the subword score calculated by the subword string distance evaluation unit 115 and the acoustic score calculated by the acoustic distance evaluation unit 2016, sorts the search candidates in order of the degree of matching with the keyword based on the search score, and outputs the sorted search results to the search result display unit 117. Details of the search result integration unit 2017 will be described with reference to FIG. 21.
  • FIG. 21 is a flowchart showing the processing of the search result integration unit 2017 according to the second embodiment of the present invention.
  • The search result integration unit 2017 calculates a search score that integrates the subword score and the acoustic score by weighting and adding the subword score calculated by the subword string distance evaluation unit 115 and the acoustic score calculated by the acoustic distance evaluation unit 2016 (2101).
  • The search score is calculated by the search result integration unit 2017 using Formula 1.
  • S = A × w + B × (1 - w) (Formula 1)
  • In Formula 1, A is the subword score, B is the acoustic score, S is the search score, and w is the weighting coefficient. The weighting coefficient w is a preset value.
  • When a plurality of recognition results are registered, the search score is calculated by weighting and adding the subword score and the acoustic score for each recognition result.
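  • A minimal sketch of Formula 1 and the subsequent sorting is given below; the weighting coefficient value and the dictionary-based candidate representation are illustrative assumptions.

```python
def integrate(candidates, w=0.7):
    """candidates: list of dicts with 'subword_score' (A) and 'acoustic_score' (B).
    Computes S = A*w + B*(1 - w) and sorts the candidates by S in descending order."""
    for c in candidates:
        c["search_score"] = c["subword_score"] * w + c["acoustic_score"] * (1 - w)
    return sorted(candidates, key=lambda c: c["search_score"], reverse=True)

ranked = integrate([
    {"subword_score": 4.4, "acoustic_score": 0.2},
    {"subword_score": 3.9, "acoustic_score": 0.9},
])
```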
  • Since the acoustic score is calculated using the acoustic likelihood (or appearance probability) of the search candidate, it can be regarded as the appearance probability of the subword string of the search candidate.
  • the subword score can measure the distance between two subwords, but does not consider the appearance probability of the subword string.
  • For example, suppose that the possibility of being recognized as the subword string of search candidate A, that is, the appearance probability of the subword string of search candidate A, is low, and that the possibility of being recognized as the subword string of search candidate B, that is, the appearance probability of the subword string of search candidate B, is high. In this case, there is a high possibility that search candidate A is a misrecognition. For this reason, regarding the sorting order of search candidates A and B, search accuracy is more likely to improve when search candidate B is ranked higher than search candidate A.
  • the speech data search device 1 sorts the search candidates using only the acoustic score, the speech data search device 1 cannot take into account the misrecognition tendency by referring to the query subword string error table 108.
  • Since the speech data search apparatus 1 of the present embodiment sorts the search candidates based on both the acoustic score and the subword score, it can sort the search candidates in consideration of their appearance probability and misrecognition tendency, and search accuracy can be improved.
  • In the third embodiment, the speech data search apparatus 1 not only learns in advance a subword misrecognition tendency for each assumed keyword and uses it to search for a keyword from the search target speech data 109, but also receives from the user a determination of whether each search candidate is correct, compares the subword strings of the search candidates designated as correct with the keyword subword string to calculate a misrecognition tendency, and registers the calculated misrecognition tendency in the query subword string error table 108. Thereby, the misrecognition tendency becomes more accurate, and the search accuracy can be improved.
  • FIG. 22 is a block diagram of the speech data retrieval apparatus 1 according to the third embodiment of the present invention.
  • The speech data search apparatus 1 of this embodiment includes a search result display correction unit 2217 instead of the search result display unit 117, and further includes a phoneme string error table update unit 2218.
  • The search result display correction unit 2217 includes an interface for accepting the user's determination as to whether or not each search candidate matches the keyword, and assigns to each search candidate a label indicating the user's determination of whether or not that search candidate matches the keyword.
  • The phoneme string error table update unit 2218 calculates a subword error probability between the search candidate subword strings determined to match the keyword and the keyword subword string, and registers the calculated subword error probability in the query subword string error table 108. Details of the phoneme string error table update unit 2218 will be described with reference to FIG. 24.
  • FIG. 23 is an explanatory diagram of a search screen 2300 according to the third embodiment of this invention.
  • the search screen 2300 is displayed by the search result display correction unit 2217.
  • the search screen 2300 includes a file name 1901, a time 1902, a score 1903, a play button 1904, and a correct / incorrect determination button 2301.
  • the correct / incorrect determination button 2301 includes a first button indicating that the search candidate matches the keyword, and a second button indicating that the search candidate does not match the keyword.
  • After pressing the play button 1904 to play back the audio data corresponding to a search candidate, the user presses the first button if the search candidate matches the keyword.
  • If the search candidate does not match the keyword, the user presses the second button.
  • If no user operation is received on the search screen 2300 for a certain period of time, the search result display correction unit 2217 considers that the user's determination of whether each search candidate matches the keyword has ended, assigns the user's determination result to each search candidate as a label, and outputs the labeled search candidates to the phoneme string error table update unit 2218.
  • FIG. 24 is a flowchart showing processing of the phoneme string error table update unit 2218 according to the third embodiment of the present invention.
  • When labeled search candidates are input from the search result display correction unit 2217, the phoneme string error table update unit 2218 sets, for every search candidate subword string whose label indicates a match with the keyword, the alignment so that the edit distance between the keyword subword string and that search candidate subword string is minimized (2401, 2402).
  • the phoneme sequence error table update unit 2218 counts the number of subword errors according to the combination of subwords or joint N-grams according to the format of the query subword sequence error table 108 (2403).
  • the processing in step 2403 is the same as the processing in step 703 performed by the query subword string error table learning unit 107 shown in FIG.
  • The phoneme string error table update unit 2218 calculates a subword error probability based on the number of subword errors counted in step 2403, updates the corresponding subword error probability in the query subword string error table 108 based on the calculated subword error probability (2404), and ends the process.
  • a method of updating the subword error probability in the query subword string error table 108 based on the MAP estimation will be specifically described.
  • the subword error probability that a certain subword “r” included in the keyword subword string is erroneously recognized as the subword “s” in the search candidate subword string is calculated as 20/1420.
  • the denominator of the subword error probability indicates the number of appearances of the subword “r”, and the numerator indicates the number of times that “r” is erroneously recognized as “s”.
  • 0.05 is registered in the query subword string error table 108 as the subword error probability that the subword “r” in the assumed keyword matching the keyword is erroneously recognized as the subword “s”.
  • In this case, the phoneme string error table update unit 2218 updates the subword error probability in the query subword string error table 108 to the value calculated by Formula 2 using a preset value N: (20 + 0.05 × N) / (1420 + N) (Formula 2).
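  • A minimal sketch of the Formula 2 update is given below; `n_error` and `n_appear` stand for the counts obtained from the user-confirmed search candidates, `prior_prob` for the probability already registered in the query subword string error table 108, and `N` for the preset weight, with the concrete numbers taken from the example above.

```python
def map_update(n_error, n_appear, prior_prob, N):
    """MAP-style update of a subword error probability (Formula 2).
    Example from the text: (20 + 0.05 * N) / (1420 + N)."""
    return (n_error + prior_prob * N) / (n_appear + N)

print(map_update(20, 1420, 0.05, N=1000))  # about 0.0289
```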
  • Thereby, the misrecognition tendency of the actual search target speech data 109 can be reflected in the query subword string error table 108 calculated in advance, the subword string distance evaluation unit 115 can accurately use the misrecognition tendency of the search target speech data 109, and search accuracy can be improved.
  • The phoneme string error table update unit 2218 does not need to update the subword error probability every time search candidates are input from the search result display correction unit 2217; it may update the subword error probability when a predetermined number or more of search candidates have been input from the search result display correction unit 2217.
  • The phoneme string error table update unit 2218 may also perform the same processing by regarding the data as having increased N-fold.
  • As described above, the speech data search apparatus 1 not only learns in advance and uses the subword misrecognition tendency of the learning-labeled speech data 101 for each assumed keyword, but also calculates the misrecognition tendency of the search target speech data 109 based on the user's determination of whether each search candidate matches the keyword, and reflects the calculated misrecognition tendency in the query subword string error table 108, so that search accuracy can be improved.
  • this embodiment is applicable not only to the speech data search apparatus 1 of the first embodiment but also to the speech data search apparatus 1 of the second embodiment.
  • a fourth embodiment of the present invention will be described with reference to FIGS. 25 and 26.
  • FIG. 25 is a block diagram of the voice data search system of the fourth embodiment of the present invention.
  • the voice data search system includes a private branch exchange (PBX, Private Branch eXchange) device 2503, a call recording device 2504, a storage device 2506 for storing search target voice data 2505, and a computer group 2510 for executing search processing.
  • Each device is connected by a telephone line or a network, and the components in the computer are connected by a bus.
  • the PBX device 2503 is connected to a customer telephone device 2501 used by a customer via a public telephone line network N1.
  • the PBX device 2503 is connected to an operator telephone 2502 used by an operator in the call center.
  • the PBX device 2503 relays a call between the customer telephone device 2501 and the operator telephone device 2502 in the call center.
  • the configuration of the call recording device 2504 is the same as that of a general-purpose computer that includes a CPU and a memory and executes a control program for controlling itself.
  • the call recording device 2504 acquires the voice signal uttered by the customer from the PBX device 2503 or the operator telephone device 2502 and acquires the voice signal uttered by the operator from the operator telephone device 2502.
  • the voice signal uttered by the operator may be obtained from a headset and a recording device connected to the operator telephone 2502.
  • The call recording device 2504 performs A/D conversion on the acquired audio signal, converts it into digital data (audio data) in a predetermined format (for example, WAV format), and stores it as the search target audio data 2505 in the storage device 2506. Note that the conversion of the audio signal into audio data may be executed in real time.
  • FIG. 26 is an explanatory diagram showing an example of a format of audio data according to the fourth embodiment of the present invention.
  • The voice file storing the voice data includes an operator ID 2601, a speaker ID 2602, a time 2603, a time length 2604, and 16-bit signed binary waveform data 2605.
  • In the operator ID 2601, the ID of the operator is registered.
  • In the speaker ID 2602, the ID of the customer who has made a call with the operator is registered.
  • In the time 2603, the time at which the call between the operator and the customer is started is registered.
  • In the time length 2604, the time from the start to the end of the call is registered.
  • Audio data is registered in the 16-bit signed binary waveform data 2605.
  • the operator ID 2601, speaker ID 2602, and time length 2604 can be acquired from the PBX device 2503 or the like.
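A minimal sketch of how a record in the format of FIG. 26 might be represented and serialized follows; the field widths, byte order, and header layout are assumptions for illustration only, since the text above only states that the waveform is 16-bit signed binary data.

```python
import struct
from dataclasses import dataclass

@dataclass
class CallRecord:
    operator_id: int      # operator ID 2601
    speaker_id: int       # speaker (customer) ID 2602
    start_time: int       # time 2603, e.g. call start as UNIX seconds
    duration_sec: int     # time length 2604, seconds from call start to end
    waveform: bytes       # 16-bit signed binary waveform data 2605

    def to_bytes(self) -> bytes:
        # Assumed little-endian header followed by the raw PCM payload.
        header = struct.pack("<IIQI", self.operator_id, self.speaker_id,
                             self.start_time, self.duration_sec)
        return header + self.waveform

# Example: a 1-second, 8 kHz silent fragment of a recorded call.
silent_pcm = struct.pack("<8000h", *([0] * 8000))
record = CallRecord(operator_id=42, speaker_id=7, start_time=1346198400,
                    duration_sec=1, waveform=silent_pcm)
blob = record.to_bytes()
print(len(blob))  # 20-byte header plus 16000 bytes of PCM
```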
  • the computer group 2510 includes a computer 2540, storage devices 2520 and 2530, a keyboard 2550, and a display device 2551.
  • the computer 2540 is connected to the storage devices 2520 and 2530, the keyboard 2550, and the display device 2551.
  • the storage device 2520 stores a language model 2521, an acoustic model 2522, an index table 2523, and a query subword string error table 2524.
  • the storage device 2530 stores learning-labeled voice data 2531 and an assumed keyword 2532.
  • The language model 2521 corresponds to the language model 104 shown in FIG. 1, the acoustic model 2522 corresponds to the acoustic model 103 shown in FIG. 1, the index table 2523 corresponds to the index table 111 shown in FIG. 1, and the query subword string error table 2524 corresponds to the query subword string error table 108 shown in FIG. 1.
  • The learning labeled speech data 2531 corresponds to the learning labeled speech data 101 shown in FIG. 1, and the assumed keyword 2532 corresponds to an assumed keyword preset in the assumed keyword generation unit 106 shown in FIG. 1.
  • the language model 2521, the acoustic model 2522, and the query subword string error table 2524 may be calculated by a computer other than the computer 2540 using the learning-labeled speech data 2531.
  • the computer 2540 executes the search process of the third embodiment and has a CPU 2541 and a memory 2542.
  • the memory 2542 stores a speech recognition module 2543, an indexing module 2544, a search module 2545, and a query subword string error table learning module 2546.
  • the voice recognition module 2543 has the function of the acoustic / language model learning unit 102.
  • the indexing module 2544 has the function of the indexing unit 110.
  • the search module 2545 has functions of a keyword input unit 112, a subword string conversion unit 113, a candidate search unit 114, a subword string distance evaluation unit 115, a search result integration unit 116, and a search result display correction unit 2217.
  • The query subword string error table learning module 2546 has the functions of the query subword string error table learning unit 107 and the phoneme string error table update unit 2218. These modules are executed as appropriate under the control of the CPU 2541.
  • The computer 2540 operates in the same procedure as in the third embodiment.
  • the indexing module 2544 accesses the search target audio data 2505 at regular intervals, executes an indexing process on the difference of the search target audio data 2505, and adds the indexing process result to the index table 2523.
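One way to picture this periodic, differential indexing is the sketch below; the polling interval, the file-based change detection, and the build_index_entries helper are illustrative assumptions rather than the actual implementation of the indexing module 2544.

```python
import time
from pathlib import Path

def build_index_entries(audio_path: Path) -> list[tuple[str, float, str]]:
    """Hypothetical stand-in for subword recognition plus index entry creation.

    Returns (file name, utterance time, subword string) tuples for the file.
    """
    return [(audio_path.name, 0.0, "a k a i")]  # placeholder result

def incremental_indexing(audio_dir: Path, index_table: list, interval_sec: float = 60.0):
    """Poll the search target audio data and index only files not seen before."""
    indexed_files: set[str] = set()
    while True:
        for audio_path in sorted(audio_dir.glob("*.wav")):
            if audio_path.name in indexed_files:
                continue  # already indexed; only the difference is processed
            index_table.extend(build_index_entries(audio_path))
            indexed_files.add(audio_path.name)
        time.sleep(interval_sec)
```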
  • the voice data search device 1 of the third embodiment can be introduced into a call center.
  • the voice data search device 1 installed in the call center is not limited to the third embodiment, and may be the voice data search device 1 of the first embodiment and the second embodiment.
  • FIG. 27 is an explanatory diagram of a general content cloud system.
  • the content cloud system includes a storage 2704, an ETL (Extract Transform Load) module 2705, a content storage 2706, a search engine module 2709, a metadata server module 2711, a multimedia server module 2713, and an application program 2715.
  • The content cloud system operates on a general-purpose computer having one or more CPUs, memories, and storage devices, and includes various modules. Each module may be executed by an independent computer; in this case, the computers and modules are connected via a network or the like, and each module communicates data over the network to execute the processing in a distributed manner.
  • The content cloud system receives a request transmitted by the application program 2715 via a network or the like, and transmits information corresponding to the received request to the application program 2715.
  • data in an arbitrary format such as audio data 2701, medical data 2702, and mail data 2703 is input, and these data 2701 to 2703 are temporarily stored in the storage 2704.
  • the voice data 2701 may be a call voice of a call center
  • the medical data 2702 and the mail data 2703 may be document data.
  • these data 2701 to 2703 may be structured or may not be structured.
  • The ETL 2705 monitors the storage 2704; when new data 2701 to 2703 are stored in the storage 2704, it executes the information extraction processing module corresponding to the stored data 2701 to 2703 and extracts predetermined information (metadata) from the stored data 2701 to 2703.
  • The ETL 2705 then archives and stores the extracted metadata 2707 in the content storage 2706.
  • Examples of the information extraction processing modules of the ETL 2705 include an indexing module and an image recognition module.
  • Examples of metadata include a time, an N-gram index, an object name obtained as an image recognition result, an image feature amount, related words, and a speech recognition result.
  • As the information extraction processing module of the ETL 2705, any program that extracts some information from the data 2701 to 2703 stored in the storage 2704 can be adopted; since known techniques can be used for such programs, a description of the various information extraction modules is omitted here.
  • the metadata may be compressed in data size by a data compression algorithm.
  • After the information extraction processing module of the ETL 2705 extracts the metadata, the file name of the original data from which the metadata was extracted, the registration date of the original data in the storage, the type of the original data, the metadata text information, and the like may be registered in a relational database (RDB).
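The monitor-and-dispatch behavior of the ETL 2705 described above can be sketched as follows; the extractor registry keyed by file suffix and the dictionary-based stand-ins for the content storage and the RDB are assumptions made here for illustration only.

```python
from pathlib import Path
from typing import Callable

# Registry mapping a detected data type to its information extraction module.
EXTRACTORS: dict[str, Callable[[bytes], dict]] = {
    ".wav": lambda blob: {"type": "audio", "speech_recognition_result": "..."},
    ".txt": lambda blob: {"type": "document", "ngram_index": "..."},
}

def etl_process_new_item(path: Path, content_storage: dict, rdb_rows: list) -> None:
    """Extract metadata from newly stored data and archive it."""
    extractor = EXTRACTORS.get(path.suffix)
    if extractor is None:
        return  # no information extraction module registered for this data type
    blob = path.read_bytes()
    metadata = extractor(blob)
    # Archive both the original data and its metadata in the content storage.
    content_storage[path.name] = {"original": blob, "metadata": metadata}
    # Optionally register bookkeeping information in an RDB-like store.
    rdb_rows.append({"file": path.name, "type": metadata["type"]})
```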
  • The content storage 2706 stores the metadata 2707 extracted by the ETL 2705 and the data 2701 to 2703 that were temporarily stored in the storage 2704 before the information extraction processing by the ETL 2705.
  • the search engine module 2709 executes a text search process based on the index 2710 generated by the ETL 2705 and transmits the search result to the application program 2715.
  • a known technique can be applied to the search engine module 2709 and the search processing algorithm.
  • the search engine module 2709 includes a module that searches not only text but also data such as images and sounds.
  • The metadata server module 2711 manages the metadata stored in the RDB 2712. For example, if the ETL 2705 registers the file name of the original data from which the metadata was extracted, the registration date of the original data in the storage, the type of the original data, the metadata text information, and the like in the RDB 2712, the metadata server module 2711 transmits the information registered in the RDB 2712 that corresponds to a request from the application program 2715 to the application program 2715.
  • the multimedia server module 2713 stores a graph database (DB) 2714 structured in a graph format by associating metadata extracted by the ETL 2705 with each other. For example, the original audio file (or image data), related words, and the like are associated in a network format with the recognition result “apple” stored in the metadata 2707 of the content storage 2706.
  • the multimedia server module 2713 transmits meta information corresponding to the request from the application program 2715 to the application program 2715. For example, when the multimedia server module 2713 receives a request “apple”, the multimedia server module 2713 refers to the graph DB 2714, and transmits related metadata such as an apple image, an average market price, and an artist song name to the application program 2715.
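The network-format association handled by the multimedia server module 2713 can be pictured with a toy adjacency structure such as the one below; the node names and the metadata attached to the "apple" example are purely illustrative assumptions.

```python
# Toy graph DB: each recognition result maps to the metadata nodes linked to it.
graph_db = {
    "apple": [
        {"kind": "audio_file", "value": "call_20120829_0001.wav"},
        {"kind": "image", "value": "apple.png"},
        {"kind": "related_word", "value": "fruit"},
        {"kind": "average_market_price", "value": 120},
    ],
}

def related_metadata(request: str) -> list:
    """Return the metadata associated with the requested recognition result."""
    return graph_db.get(request, [])

print(related_metadata("apple"))  # what the multimedia server would send back
```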
  • FIG. 28 is an explanatory diagram of a content cloud system according to a fifth embodiment of the present invention.
  • various processes of the voice data search device 1 are modularized.
  • The indexing unit 110 of the voice data search device 1 is modularized into an indexing module 2801, and the keyword input unit 112, the subword string conversion unit 113, the candidate search unit 114, the subword string distance evaluation unit 115, and the search result integration unit 116 are modularized into a search module 2802.
  • the indexing module 2801 is mounted on the storage 2704, and the search module 2802 is mounted on the multimedia server module 2713.
  • The acoustic model 103, the language model 104, and the query subword string error table 108 are calculated in advance by another computer; the acoustic model 103 and the language model 104 are mounted in the storage 2704, and the query subword string error table 108 is mounted on the multimedia server module 2713.
  • When the audio data 2701 is input to the storage 2704, the indexing module 2801 is called by the ETL 2705 and executes an indexing process on the input audio data 2701. The indexing module 2801 then stores the index data generated by the indexing process in the content storage 2706.
  • The search module 2802 refers to the index data 2708 and the query subword string error table 108, searches the audio data 2701 for the part where the keyword is uttered, and outputs a search result including the file name in which the keyword is uttered, the time at which the keyword is uttered, and the score to the application program 2715 that input the keyword and to the multimedia server control program.
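The shape of such a search result can be sketched as follows; the field names, the search_keyword signature, and the overlap-based stand-in score are assumptions for illustration, with the real score coming from the subword string distance evaluation described in the earlier embodiments.

```python
from dataclasses import dataclass

@dataclass
class SearchHit:
    file_name: str   # audio file in which the keyword is uttered
    time_sec: float  # time at which the keyword is uttered
    score: float     # score from the subword string distance evaluation

def search_keyword(keyword_subwords: list[str], index_data) -> list[SearchHit]:
    """Illustrative wrapper: candidate search followed by a stand-in score.

    index_data is assumed to be (file name, utterance time, candidate subwords)
    tuples; a simple subword overlap ratio stands in for the real scoring.
    """
    hits = []
    for file_name, time_sec, candidate_subwords in index_data:
        overlap = len(set(keyword_subwords) & set(candidate_subwords))
        score = overlap / max(len(keyword_subwords), 1)
        hits.append(SearchHit(file_name, time_sec, score))
    return sorted(hits, key=lambda h: h.score, reverse=True)
```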
  • the details of the processing of the indexing module 2801 and the search module 2802 are the same as the processing of the speech data retrieval apparatus 1 of the first to third embodiments, and thus the description thereof is omitted.
  • the search module 2802 may be implemented in the search engine module 2709. In this case, when a voice data search request including a keyword is input from the application program 2715 to the search engine module 2709, the search module 2802 searches the voice data 2701 and outputs the search result to the search engine module 2709.
  • the voice data search device 1 can be applied to the content cloud system.

Abstract

The present invention relates to a speech search device that searches first speech data to be searched for a portion corresponding to a keyword input by a user. The speech search device generates, using second speech data, an acoustic model representing acoustic characteristics and a language model representing linguistic characteristics. The speech search device converts the second speech data into a first subword string, converts an assumed keyword into a second subword string, calculates a misrecognition tendency between the first subword string and the second subword string, converts the first speech data into a third subword string, and converts the keyword into a fourth subword string. The speech search device searches the first speech data for a portion corresponding to the keyword as a search candidate. Based on the misrecognition tendency, the speech search device calculates a score based on a subword score of the third subword string with respect to the fourth subword string for the search candidate found by a candidate search unit, and outputs a search result including the score and the search candidate corresponding to the score.
PCT/JP2012/071850 2012-08-29 2012-08-29 Dispositif de recherche de parole, support de stockage lisible par ordinateur et procédé de recherche audio WO2014033855A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2012/071850 WO2014033855A1 (fr) 2012-08-29 2012-08-29 Dispositif de recherche de parole, support de stockage lisible par ordinateur et procédé de recherche audio
JP2014532631A JP5897718B2 (ja) 2012-08-29 2012-08-29 音声検索装置、計算機読み取り可能な記憶媒体、及び音声検索方法

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2012/071850 WO2014033855A1 (fr) 2012-08-29 2012-08-29 Dispositif de recherche de parole, support de stockage lisible par ordinateur et procédé de recherche audio

Publications (1)

Publication Number Publication Date
WO2014033855A1 true WO2014033855A1 (fr) 2014-03-06

Family

ID=50182705

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2012/071850 WO2014033855A1 (fr) 2012-08-29 2012-08-29 Dispositif de recherche de parole, support de stockage lisible par ordinateur et procédé de recherche audio

Country Status (2)

Country Link
JP (1) JP5897718B2 (fr)
WO (1) WO2014033855A1 (fr)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005257954A (ja) * 2004-03-10 2005-09-22 Nec Corp Voice search device, voice search method, and voice search program
JP2009128508A (ja) * 2007-11-21 2009-06-11 Hitachi Ltd Voice data search system
JP2009216986A (ja) * 2008-03-11 2009-09-24 Hitachi Ltd Voice data search system and voice data search method
JP2010267012A (ja) * 2009-05-13 2010-11-25 Hitachi Ltd Voice data search system and voice data search method
JP2010277036A (ja) * 2009-06-01 2010-12-09 Mitsubishi Electric Corp Voice data search device
JP2011175046A (ja) * 2010-02-23 2011-09-08 Toyohashi Univ Of Technology Voice search device and voice search method
JP2011197410A (ja) * 2010-03-19 2011-10-06 Nippon Hoso Kyokai <Nhk> Speech recognition device, speech recognition system, and speech recognition program

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
NAOYUKI KANDA ET AL.: "Evaluation of multistage rescoring strategy for open-vocabulary spoken term detection", DAI 2 KAI PROCEEDINGS OF THE SPOKEN DOCUMENT PROCESSING WORKSHOP, 1 March 2008 (2008-03-01), pages 73 - 78 *
NAOYUKI KANDA ET AL.: "Spoken Term Detection from Large Scale Speech Database Using Multistage Rescoring", THE TRANSACTIONS OF THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS, vol. J95-D, no. 4, 1 April 2012 (2012-04-01), pages 969 - 981 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017168524A1 (fr) * 2016-03-28 2017-10-05 株式会社日立製作所 Dispositif de serveur d'analyse, système d'analyse de données et procédé d'analyse de données
US20220157311A1 (en) * 2016-12-06 2022-05-19 Amazon Technologies, Inc. Multi-layer keyword detection
US11646027B2 (en) * 2016-12-06 2023-05-09 Amazon Technologies, Inc. Multi-layer keyword detection
WO2020009027A1 (fr) * 2018-07-06 2020-01-09 株式会社 東芝 Système de recherche d'informations
JP2020009140A (ja) * 2018-07-06 2020-01-16 株式会社東芝 情報検索システム
JP7182923B2 (ja) 2018-07-06 2022-12-05 株式会社東芝 情報検索システム
CN112735412A (zh) * 2020-12-25 2021-04-30 北京博瑞彤芸科技股份有限公司 一种根据语音指令搜索信息的方法和系统
CN112735412B (zh) * 2020-12-25 2022-11-22 北京博瑞彤芸科技股份有限公司 一种根据语音指令搜索信息的方法和系统
CN116578677A (zh) * 2023-07-14 2023-08-11 高密市中医院 一种针对医疗检验信息的检索系统和方法
CN116578677B (zh) * 2023-07-14 2023-09-15 高密市中医院 一种针对医疗检验信息的检索系统和方法

Also Published As

Publication number Publication date
JP5897718B2 (ja) 2016-03-30
JPWO2014033855A1 (ja) 2016-08-08

Similar Documents

Publication Publication Date Title
US10347244B2 (en) Dialogue system incorporating unique speech to text conversion method for meaningful dialogue response
US20230317074A1 (en) Contextual voice user interface
US6839667B2 (en) Method of speech recognition by presenting N-best word candidates
CN107016994B (zh) 语音识别的方法及装置
US10339920B2 (en) Predicting pronunciation in speech recognition
US7668718B2 (en) Synchronized pattern recognition source data processed by manual or automatic means for creation of shared speaker-dependent speech user profile
US8731926B2 (en) Spoken term detection apparatus, method, program, and storage medium
US6910012B2 (en) Method and system for speech recognition using phonetically similar word alternatives
US8527272B2 (en) Method and apparatus for aligning texts
US8200490B2 (en) Method and apparatus for searching multimedia data using speech recognition in mobile device
US9418152B2 (en) System and method for flexible speech to text search mechanism
KR102390940B1 (ko) 음성 인식을 위한 컨텍스트 바이어싱
JP5440177B2 (ja) 単語カテゴリ推定装置、単語カテゴリ推定方法、音声認識装置、音声認識方法、プログラム、および記録媒体
WO2003010754A1 (fr) Systeme de recherche a entree vocale
US11605373B2 (en) System and method for combining phonetic and automatic speech recognition search
JP5326169B2 (ja) 音声データ検索システム及び音声データ検索方法
JP5897718B2 (ja) 音声検索装置、計算機読み取り可能な記憶媒体、及び音声検索方法
US6963834B2 (en) Method of speech recognition using empirically determined word candidates
WO2014203328A1 (fr) Système de recherche de données vocales, procédé de recherche de données vocales et support d'informations lisible par ordinateur
Suzuki et al. Music information retrieval from a singing voice using lyrics and melody information
KR101483947B1 (ko) 핵심어에서의 음소 오류 결과를 고려한 음향 모델 변별 학습을 위한 장치 및 이를 위한 방법이 기록된 컴퓨터 판독 가능한 기록매체
KR101424496B1 (ko) 음향 모델 학습을 위한 장치 및 이를 위한 방법이 기록된 컴퓨터 판독 가능한 기록매체
JP5590549B2 (ja) 音声検索装置および音声検索方法
Decadt et al. Transcription of out-of-vocabulary words in large vocabulary speech recognition based on phoneme-to-grapheme conversion
JP2010277036A (ja) 音声データ検索装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12883515

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2014532631

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12883515

Country of ref document: EP

Kind code of ref document: A1