WO2008150003A1 - Keyword extraction model learning system, method, and program

Info

Publication number: WO2008150003A1
Application number: PCT/JP2008/060506
Authority: WO
Inventors: Kentaro Nagatomo
Original assignee: NEC Corporation
Priority date: 2007-06-06
Filing date: 2008-06-02
Publication date: 2008-12-11
Also published as: JPWO2008150003A1; JP5360414B2

Abstract

Keyword extraction model learning means (110) receives, as input, an input to a linked system (120), voice data, and information correlating the two. The keyword extraction model learning means (110) regards the input to the linked system (120) as a keyword and, based on that input, the voice data, and the correlation information, learns a keyword extraction model for estimating a keyword, or its utterance expression, contained in the voice data.

Description

Keyword extraction model learning system, method, and program

The present invention relates to a keyword extraction model learning system, a keyword extraction system, an information input system, an information search system, a keyword extraction model learning method, a keyword extraction method, an information input method, an information search method, and a keyword extraction model learning program for learning a keyword extraction model for extracting keywords from speech. In particular, the present invention relates to such a system, method, and program characterized by learning using an input to a cooperation destination system and the voice corresponding to that input.

Background technology

When speech recognition technology is used as the front end of an information input system or information retrieval system that takes words, phrases (sets of words), sentences, and the like as input, a “keyword extraction” technique that extracts specific words and phrases from the speech data is often used. In the following, for convenience, not only words but also phrases and sentences that are subject to extraction as meaningful input to the system serving as the back end of the keyword extraction means (hereinafter referred to as the linked system) are all referred to as “keywords”. Conventional keyword extraction techniques have been implemented in two main ways. One is a method called “word spotting”, which determines whether or not a predetermined keyword is included in the speech. The other is to convert the entire speech into text by so-called speech recognition (speech-to-text conversion) and then extract keywords using text processing technology (hereinafter referred to as the text processing method).

An example of the “word spotting” method is described in R. C. Rose and D. B. Paul, "A hidden Markov model based keyword recognition system", in Proc. ICASSP 90, pp. 129-132 (hereinafter referred to as Non-Patent Document 1). Non-Patent Document 1 prepares, for each of the keywords listed in advance, a model that estimates whether or not a part of the input speech matches that keyword, and connects the prepared models in parallel (keyword network).

Non-Patent Document 1 also describes arranging non-keyword models in parallel (filler network). If, among the models placed in parallel, the input speech has the maximum likelihood for one of the keyword models, the keyword can be considered to have appeared. In the method described in Non-Patent Document 1, a background model is further arranged in parallel with the entire keyword and filler network. The background model is designed so that it has little linguistic bias toward any particular speech. The difference between the likelihood for the keyword model and the likelihood for the background model is then used as a normalized likelihood to decide whether to reject the extraction result. By adopting this structure, keywords can be extracted in a manner that is robust against the acoustic conditions of the input speech.
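As a non-limiting illustration of the parallel keyword, filler, and background models and of rejection by normalized likelihood, the decision can be sketched as follows. The function name `spot_keyword`, the use of log-likelihoods, and the threshold parameter are assumptions introduced for this sketch and are not taken from Non-Patent Document 1.

```python
from typing import Dict, Optional


def spot_keyword(
    keyword_loglik: Dict[str, float],
    filler_loglik: Dict[str, float],
    background_loglik: float,
    threshold: float,
) -> Optional[str]:
    """Return the spotted keyword, or None if the result is rejected.

    All scores are assumed to be log-likelihoods of the same speech section.
    """
    # The models run in parallel; the best-scoring keyword is the candidate.
    best_keyword = max(keyword_loglik, key=keyword_loglik.get)
    best_keyword_score = keyword_loglik[best_keyword]

    # If a filler (non-keyword) model scores at least as high, no keyword appeared.
    best_filler_score = max(filler_loglik.values(), default=float("-inf"))
    if best_filler_score >= best_keyword_score:
        return None

    # Normalize by the background model and reject low-confidence results.
    normalized = best_keyword_score - background_loglik
    return best_keyword if normalized >= threshold else None
```

In this sketch, adding a filler such as “Okayama” simply adds one more entry to `filler_loglik` that competes with the keyword models.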

A keyword model has the problem of absorbing non-keywords that are similar to the keyword. For this reason, likelihood normalization using a filler model or a background model was proposed. In particular, adding fillers is known as an easy tuning method. For example, when the model of the keyword “Wakayama” mistakenly extracts the utterance “Okayama”, a method of adding “Okayama” as a filler is known.

For example, Japanese Patent Application Laid-Open No. 2005-092310 (hereinafter referred to as Patent Document 1) discloses a technique for finding words similar to a keyword in a large vocabulary dictionary and adding them as fillers. In addition, “Speech Recognition Interface of Personal Robot PaPeRo” (Iwasawa, 13th AI Challenge Study Group, pp. 17-22, hereinafter referred to as Non-Patent Document 2) describes a technique for generating fillers from a continuous-syllable dictionary.

In practice, however, individual filler models do not work so precisely. For example, even when the voice “Wakayama” is input to the two models “Okayama” and “Wakayama”, the likelihood of the “Okayama” model may be the higher one. This problem occurs when the “Wakayama” model has not been sufficiently trained on the input “Wakayama” speech. For such cases, as in the technique described in Non-Patent Document 2, ad hoc countermeasures are known, such as adding a word (for example, “Akayama”) that sounds similar to “Wakayama” as a variant of “Wakayama”.

The text processing method, the other way of implementing keyword extraction, has come into use with the spread of so-called dictation technology. Because it can basically be built from a simple combination of large vocabulary continuous speech recognition and character string matching, it tends to be used when the focus is on the subsequent processing rather than on keyword extraction itself. In recent years, the recognition accuracy of dictation technology has improved, and combinations with more advanced natural language processing technology have been proposed. For example, D. Miller, R. Schwartz, R. Weischedel and R. Stone, "Named entity extraction from broadcast news", in Proc. DARPA Broadcast News Workshop, Herndon, Virginia, 1999, pp. 37-40 (hereinafter referred to as Non-Patent Document 3) describes a combination of dictation technology and named entity extraction, which is one of the natural language processing technologies. A named entity is text having a certain structure, such as a “person name” or “place name”, and is regarded here as a kind of keyword.

Disclosure of the invention

Problems to be solved by the invention

However, with the above-described conventional technology, it is difficult and very laborious to collect in advance the keywords appropriate for the cooperation destination system that uses the keyword extraction process.

The prior art has mainly focused on how to extract keywords accurately, and assumes that the keywords to be extracted are known or can be easily collected. This assumption holds when, for example, as in the technique described in Non-Patent Document 2, the back-end processing for each extracted keyword is clearly specified. However, many of the linked systems operating in the real world can handle a very large number of inputs, and the keywords to be extracted vary accordingly. Unless enough keywords can be collected for use by the cooperation destination system, a keyword extraction system cannot be called practical, no matter how high its extraction accuracy is.

The first reason why keyword collection is difficult is that the keywords to be collected differ completely depending on the system with which the keyword extraction system is linked. For example, if it is linked with a ticket reservation system, event names and ticket numbers must be extracted. On the other hand, if it is linked with a train transfer guidance system, station names must be collected.

The second reason why it is difficult to collect keywords is that collecting only the keywords themselves is not enough. Although it depends on the implementation method, the result is a system with low keyword extraction accuracy unless sufficient fillers (non-keywords) are also collected.

The third reason why it is difficult to collect keywords is that there are cases where it is practically impossible to collect sufficient keywords. For example, when keyword extraction technology is linked with a general-purpose search system such as Google (registered trademark) or Yahoo! (registered trademark), every word can be a keyword. In such cases, the keywords that can be extracted must be constrained under certain conditions. Commonly used are restrictions based on word attributes such as parts of speech, such as extracting only nouns. In practice, however, the user may wish to search for adjectives as well. Also, because the frequency of searching for the same noun is extremely low, there is no opportunity to extract it as a keyword, or it may be extracted as another word.

The fourth reason why it is difficult to collect keywords is that the collected keywords are not always spoken in the form in which they were collected. Keywords are usually collected based on what the cooperation destination system can accept. In the above example, when linking with the ticket reservation system, the keywords that the ticket reservation system can accept (ticket numbers and event names) are collected. However, a user may, for example, utter an event name as an abbreviation that the ticket reservation system does not anticipate. One user may utter a ticket number two digits at a time, and another user may read it out with “no” between each digit.

The problem that collected keywords are not actually spoken as they are is close to the problem of fillers (non-keywords). However, the conventional technology (for example, see Patent Document 1 and Non-Patent Document 2) clearly cannot solve it. This is because the assumption that such an utterance expression of a keyword (a modified expression arising when the keyword is uttered) is acoustically close to the original keyword does not hold.

An object of the present invention is to provide, as building blocks needed to construct a keyword extraction system capable of extracting keywords suitable as input to a cooperation destination system, a keyword extraction model learning system, a keyword extraction system, an information input system, an information retrieval system, a keyword extraction model learning method, a keyword extraction method, an information input method, an information retrieval method, and a keyword extraction model learning program that make it easy to construct a keyword extraction model usable for the above-described applications.

Another object of the present invention is to provide a keyword extraction model learning system, a keyword extraction system, an information input system, an information search system, a keyword extraction model learning method, a keyword extraction method, an information input method, an information search method, and a keyword extraction model learning program that make it easy to construct a keyword extraction model capable of extracting the modified expressions (utterance expressions) that arise when keywords are uttered.

Means for solving the problem

The keyword extraction model learning system according to the present invention is a keyword extraction model learning system for learning a keyword extraction model for extracting a keyword from voice, and is characterized by having keyword extraction model learning means that performs learning using an input to a cooperative system and a voice corresponding to the input.

The input to the linkage system may include at least text information.

The voice corresponding to the input to the cooperation system may include both a part corresponding to the input to the cooperation system and a part not corresponding to the input.

The keyword extraction model learning means may learn the keyword extraction model so as to return a high likelihood for the input to the cooperative system. The keyword extraction model learning means may learn the keyword extraction model so as to return a high likelihood for the voice corresponding to the input to the cooperation system or for a part of the corresponding voice.

The keyword extraction model learning means may learn the keyword extraction model so as to return a low likelihood for a voice that does not correspond to the input to the cooperation system or for a part of the voice that does not correspond to the input.

The keyword extraction model learning means may use speech corresponding to an input to the cooperative system as learning data for model learning related to another input similar to the input to the cooperative system.

The keyword extraction model learning means may use speech corresponding to an input to the cooperative system as negative-example learning data for model learning related to another input that is not similar to the input to the cooperative system.

The keyword extraction model learning means (for example, the keyword extraction model learning means 210) may classify the inputs to the cooperation system into one or more clusters based on constraints given in advance, and may perform learning for each cluster collectively.

The keyword extraction model learned by the keyword extraction model learning means (for example, the keyword extraction model learning means 210) may consist of two types of models: a keyword interval model (for example, the keyword extraction model) that returns the likelihood that a part of a certain speech is the utterance of one of the keywords, and a keyword recognition model that returns, for each keyword, the likelihood that a part of the speech is the utterance of that keyword. The keyword extraction model learning means may learn these two types of models.
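One possible way to combine the two types of models at extraction time is sketched below, purely as an illustration. The function `two_stage_extract`, the callable interfaces of the two models, and the threshold are assumptions of this sketch; the text above specifies only what each model returns.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# A speech section is represented here simply as a sequence of feature values (assumption).
Section = Sequence[float]


def two_stage_extract(
    section: Section,
    interval_model: Callable[[Section], float],
    recognition_model: Callable[[Section], Dict[str, float]],
    interval_threshold: float,
) -> Optional[Tuple[str, float]]:
    """Return (keyword, likelihood) for a section, or None if it is not a keyword section."""
    # Keyword interval model: is this section an utterance of *some* keyword?
    if interval_model(section) < interval_threshold:
        return None

    # Keyword recognition model: which keyword is it most likely to be?
    per_keyword = recognition_model(section)
    keyword = max(per_keyword, key=per_keyword.get)
    return keyword, per_keyword[keyword]
```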

There may be provided a keyword extraction model learning unit that finds an input to the cooperative system and a section of speech that is highly likely to correspond to the input, and performs learning using the speech of this section.

The keyword extraction model learning means may perform learning using speech corresponding to the input or transcription of the speech corresponding to the input.

The keyword extraction system according to the present invention is characterized in that a keyword extraction model learned by the keyword extraction model learning means according to any one of claims 1 to 11 is used. An information input system according to the present invention (for example, cooperation destination system 120) is characterized by using the keyword extraction system according to claim 12.

An information search system according to the present invention (for example, cooperation destination system 120) is characterized by using the keyword extraction system according to claim 12.

A keyword extraction model learning method according to the present invention is a keyword extraction model learning method for learning a keyword extraction model for extracting a keyword from speech, and is characterized by including a keyword extraction model learning step of performing learning using an input to the linkage system and a speech corresponding to the input.

The input to the linkage system may include at least text information.

In the keyword extraction model learning step, the keyword extraction model may be learned so as to return a high likelihood for the input to the cooperative system.

In the keyword extraction model learning step, the keyword extraction model may be learned so as to return a high likelihood for the speech corresponding to the input to the cooperative system or for a part of the corresponding speech.

In the keyword extraction model learning step, the keyword extraction model may be learned so as to return a low likelihood for a voice that does not correspond to the input to the cooperation system or for a part of the voice that does not correspond to the input.

In the keyword extraction model learning step, speech corresponding to an input to the cooperation system may be used as learning data for model learning related to another input similar to the input to the cooperation system.

In the keyword extraction model learning step, speech corresponding to a certain input to the cooperation system may be used as negative-example learning data for model learning related to another input that is not similar to the input to the cooperation system.

In the keyword extraction model learning step, the inputs to the cooperation system may be classified into one or more clusters based on a predetermined constraint, and learning for each cluster may be performed collectively.

The keyword extraction model learned in the keyword extraction model learning step may consist of two types of models: a keyword interval model that returns the likelihood that a part of a speech is an utterance of one of the keywords, and a keyword recognition model that returns, for each keyword, the likelihood that a part of the speech is an utterance of that keyword. These two types of models may be learned in the keyword extraction model learning step.

In the keyword extraction model learning step, an input to the cooperation system and a section of speech that is highly likely to correspond to the input may be found, and learning may be performed using the speech of this section.

In the keyword extraction model learning step, learning may be performed using speech corresponding to the input or a transcription of the speech corresponding to the input.

The keyword extraction method according to the present invention is characterized by using the keyword extraction model learned by the keyword extraction model learning method according to any one of claims 15 to 25.

An information input method according to the present invention is characterized by using the keyword extraction method according to claim 26.

The information search method according to the present invention is characterized by using the keyword extraction method according to claim 26.

A keyword extraction model learning program according to the present invention is a keyword extraction model learning program for learning a keyword extraction model for extracting a keyword from speech, and is characterized by causing a computer to execute a keyword extraction model learning process in which learning is performed using an input to the linkage system and speech corresponding to the input.

The input to the linkage system may include at least text information.

The computer may execute a keyword extraction model learning process to learn a keyword extraction model so as to return a high likelihood to the input to the cooperation system.

The computer may be caused to execute, in the keyword extraction model learning process, a process of learning the keyword extraction model so as to return a high likelihood for the speech corresponding to the input to the linkage system or for a part of the corresponding speech.

The computer may be caused to execute, in the keyword extraction model learning process, a process of learning the keyword extraction model so as to return a low likelihood for a voice that does not correspond to the input to the cooperation system or for a part of the voice that does not correspond to the input.

The computer may be caused to execute, in the keyword extraction model learning process, a process of using the voice corresponding to an input to the cooperation system as learning data for model learning related to another input similar to the input to the cooperation system.

The computer may be caused to execute, in the keyword extraction model learning process, a process of using speech corresponding to an input to the linked system as negative-example learning data for model learning related to another input that is not similar to the input to the linked system.

The computer may be caused to execute, in the keyword extraction model learning process, a process of classifying the inputs to the cooperation system into one or more clusters based on constraints given in advance and performing learning for each cluster collectively.

The keyword extraction model learned by the keyword extraction model learning process may consist of two types of models: a keyword interval model that returns the likelihood that a part of a certain voice is the utterance of one of the keywords, and a keyword recognition model that returns, for each keyword, the likelihood that a part of the voice is the utterance of that keyword. The computer may be caused to execute a process of learning these two types of models in the keyword extraction model learning process.

The computer may be caused to execute, in the keyword extraction model learning process, a process of finding an input to the linkage system and a section of speech that is highly likely to correspond to the input, and performing learning using the speech of this section.

The computer may be caused to execute, in the keyword extraction model learning process, a process of learning using the speech corresponding to an input or a transcription of the speech corresponding to the input.

A preferred embodiment of the keyword extraction system according to the present invention includes, for example, keyword extraction means and keyword extraction model learning means for learning a keyword extraction model that can be used by the keyword extraction means. The keyword extraction model learning means receives, as learning data, the input text to the cooperation destination system of the keyword extraction system and the speech corresponding to that input or a transcription of the speech. It regards the input text to the cooperation destination system, its speech or transcription, and unknown text inferred from them as keywords, and operates to learn, for each keyword or for each set of similar keywords, a keyword extraction model that returns the likelihood that a section of the input speech is that keyword.

Furthermore, another preferred embodiment of the keyword extraction system according to the present invention includes, for example, keyword identification means for identifying which of the keywords is contained in speech for which the keyword extraction model shows a high likelihood. The keyword extraction model learning means operates to learn a keyword identification model usable by the keyword identification means, using the same learning data as used for learning the keyword extraction model.

By adopting such a configuration and extracting the input to the cooperation destination system and the corresponding voice expression (utterance expression) and their variations as keywords, the object of the present invention can be achieved.

Effects of the invention

According to the present invention, keyword extraction suitable for the cooperation destination system can be easily realized. The reason is that the cooperation destination system accepts the text obtained by keyword extraction as its input in the first place; conversely, if text that the cooperation destination system can accept as input is extracted as keywords, then data expected to be meaningful at least to the cooperation destination system can be made the target of keyword extraction.

In addition, such text can be input to the linked system by key entry without relying on keyword extraction. In fact, it is common practice to provide the front end with an input I/F (interface) based on key input or on selection from multiple choices with a mouse, in parallel with voice input.

Further, according to the keyword extraction of the present invention, keywords expressed as utterance expressions can also be extracted. If speech corresponding to an input (that is, a keyword) to the linked system is obtained, that speech provides a sample of the utterance expressions into which the keyword can be transformed. By using such samples, a keyword extraction model that supports both keywords and their utterance expressions can be constructed.

The problem here is that a sufficiently flexible model cannot be constructed if only the inputs to the linked system and their speech are used as learning data. The keyword extraction system of the present invention addresses this problem by not only collecting the inputs to the linked system and their speech (and transcriptions) as keywords to be extracted, but also learning a keyword extraction model that can accept further variations of them.

Brief description of the drawings

FIG. 1 is a block diagram showing a configuration example of the first embodiment.

FIG. 2 is a flowchart showing an example of the operation of the keyword extraction means in the first embodiment.

FIG. 3 is a flowchart showing an example of operation of the keyword extraction model learning means in the first embodiment.

FIG. 4 is a block diagram showing a configuration example of the second embodiment.

FIG. 5 is a flowchart showing an example of the operation of the keyword extraction system according to the second embodiment.

FIG. 6 is a flowchart showing an example of the operation of the keyword extraction model learning means in the second exemplary embodiment.

FIG. 7 is a block diagram showing a configuration example of the keyword extraction system according to this embodiment.

Best Mode for Carrying Out the Invention

Embodiment 1

Hereinafter, a first embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram showing a configuration example of the first embodiment. The first embodiment shown in FIG. 1 includes a keyword extraction system 100 for extracting keywords from speech, keyword extraction model learning means 110, and a cooperation destination system 120 that receives keywords as input and performs a predetermined operation.

Specifically, the keyword extraction system 100 is realized by an information processing apparatus such as a personal computer that operates according to a program. The keyword extraction system 100 includes keyword extraction means 101 and a keyword extraction model 102 learned by the keyword extraction model learning means 110. The keyword extraction model 102 is a model for extracting keywords from speech.

The keyword extraction means 101 applies the keyword extraction model 102 to the input voice data. If the keyword extraction model 102 returns a likelihood greater than or equal to a predetermined threshold for a section of the voice data, the keyword extraction means 101 treats that section as a keyword section and outputs the keyword for which the keyword extraction model 102 returned the maximum likelihood for the section.

The number of keywords that the keyword extraction means 101 outputs as extracted from one section of speech is not necessarily one. Multiple keywords may be output for exactly the same section of speech. In this case, it is preferable that the keyword extraction means 101 output to the cooperation destination system 120 not only the keywords but also additional information, such as the likelihood, for each extracted keyword.
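The behavior of the keyword extraction means 101 described above can be sketched as follows. This is a minimal illustration under stated assumptions: the names `extract_keywords` and `ExtractedKeyword`, the representation of a section as a pair of frame indices, and the `max_per_section` parameter are all introduced here and do not appear in the embodiment.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

# A section is a (start_frame, end_frame) pair (assumption of this sketch).
Section = Tuple[int, int]


@dataclass(frozen=True)
class ExtractedKeyword:
    section: Section
    keyword: str
    likelihood: float  # additional information passed on to the cooperation destination system


def extract_keywords(
    sections: Sequence[Section],
    model: Callable[[Section], Dict[str, float]],
    threshold: float,
    max_per_section: int = 1,
) -> List[ExtractedKeyword]:
    """Apply the model to each section and keep keywords whose likelihood reaches the threshold."""
    results: List[ExtractedKeyword] = []
    for section in sections:
        scores = model(section)  # keyword -> likelihood for this section
        accepted = [(kw, lik) for kw, lik in scores.items() if lik >= threshold]
        accepted.sort(key=lambda item: item[1], reverse=True)
        # With max_per_section == 1 only the maximum-likelihood keyword is output;
        # larger values allow several keywords for the same section.
        for kw, lik in accepted[:max_per_section]:
            results.append(ExtractedKeyword(section, kw, lik))
    return results
```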

Specifically, the keyword extraction model learning means 110 is realized by an information processing apparatus such as a personal computer that operates according to a program. The keyword extraction model learning means 110 receives, as input, an input to the cooperation destination system 120, voice data, and information for associating them (correspondence information). In accordance with an algorithm described later, the keyword extraction model learning means 110 regards the input to the cooperation destination system 120 as a keyword and, based on the input to the cooperation destination system 120, the voice data, and the correspondence information, learns the keyword extraction model 102 for estimating a keyword, or its utterance expression, contained in the voice data. In the present embodiment, the keyword extraction model learning system is realized by the keyword extraction model learning means 110, means for inputting the input to the cooperation destination system 120, the voice data, and the information for associating them, and means for outputting the keyword extraction model 102.

Specifically, the cooperation destination system 120 is realized by an information processing apparatus such as a personal computer that operates according to a program. The cooperation destination system 120 receives text-based input from the keyword extraction means 101, or from other means for extracting keywords from speech, and performs some predetermined operation. The cooperation destination system 120 may be, for example, an information input system that executes various processes based on the keywords input by the keyword extraction system 100. Further, the cooperation destination system 120 may be, for example, an information search system that performs an information search based on a keyword input by the keyword extraction system 100.

Here, the input to the cooperation destination system 120 is, for example, text information input to the cooperation destination system 120. When the cooperation destination system 120 has inputs with a plurality of different attributes, the input may be given to the keyword extraction model learning means 110 together with its attribute. The information that associates the input to the cooperation destination system 120 with the voice data includes, for example, time information indicating which section of the speech is the utterance section corresponding to the input, a transcription of the utterance, and the like.
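For illustration only, one item of learning data together with its correspondence information might be represented as below. The record layout, the field names, and the helper `split_sections` are hypothetical; the embodiment states only that the correspondence information may include time information and a transcription, and that the input may carry an attribute.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class CorrespondenceRecord:
    system_input: str                    # text input to the cooperation destination system
    total_frames: int                    # length of the associated voice data
    keyword_span: Tuple[int, int]        # time information: [start, end) frames of the utterance
    attribute: Optional[str] = None      # e.g. "event name" or "ticket number"
    transcription: Optional[str] = None  # transcription of the utterance, if available


def split_sections(
    record: CorrespondenceRecord,
) -> Tuple[Tuple[int, int], List[Tuple[int, int]]]:
    """Split the voice data into the part corresponding to the input and the remaining parts."""
    start, end = record.keyword_span
    if not 0 <= start < end <= record.total_frames:
        raise ValueError("keyword_span must lie within the voice data")
    others: List[Tuple[int, int]] = []
    if start > 0:
        others.append((0, start))
    if end < record.total_frames:
        others.append((end, record.total_frames))
    return (start, end), others
```

The first returned section could serve as data for which a high likelihood should be returned, and the remaining sections as data for which a low likelihood should be returned.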

By applying a matching process based on a predetermined procedure to feature quantities extracted from speech by a predetermined procedure, the keyword extraction model 102 can calculate whether a certain section of the speech contains any known keyword or any utterance expression of a keyword. A keyword extraction model 102 may be prepared for each keyword and each keyword utterance expression, or a single model or a plurality of models that model all or part of them may be used.

The keyword extraction model 102 satisfies at least the following conditions. First, for a section of speech given by a predetermined procedure, the keyword extraction model 102 returns some value indicating a high likelihood for a character string, phoneme string, or acoustic feature string that matches a predetermined keyword or one of a plurality of keywords.

In addition, the keyword extraction model 102 returns some value indicating a high likelihood, according to the keyword, for a character string, phoneme string, or acoustic feature string given as an utterance expression corresponding to a keyword.

In addition, for a character string, phoneme string, or acoustic feature string that matches neither a known keyword nor the character string, phoneme string, or acoustic feature string of an utterance expression corresponding to a keyword, but resembles one of them, the keyword extraction model 102 returns some value indicating a somewhat high likelihood according to that keyword or utterance expression.

The keyword extraction model 102 returns a low likelihood for character strings, phoneme strings, and acoustic feature strings that do not fall under any of the above.
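The four conditions above can be illustrated with a toy model that operates on character strings. This is only a sketch: the numeric values 1.0, 0.9, 0.5, and 0.0, the similarity cutoff, and the use of `difflib.SequenceMatcher` as the similarity measure are assumptions chosen for illustration, and a model as described in the embodiment would be learned from data rather than hand-written.

```python
from difflib import SequenceMatcher
from typing import Sequence


def toy_likelihood(
    candidate: str,
    keyword: str,
    utterance_expressions: Sequence[str],
    near_cutoff: float = 0.7,
) -> float:
    """Return an illustrative likelihood that `candidate` is an utterance of `keyword`."""
    # Condition 1: exact match with the keyword -> high likelihood.
    if candidate == keyword:
        return 1.0
    # Condition 2: match with a known utterance expression -> high likelihood.
    if candidate in utterance_expressions:
        return 0.9
    # Condition 3: matches nothing known but resembles it -> somewhat high likelihood.
    targets = [keyword, *utterance_expressions]
    best = max(SequenceMatcher(None, candidate, t).ratio() for t in targets)
    if best >= near_cutoff:
        return 0.5
    # Condition 4: none of the above -> low likelihood.
    return 0.0
```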

The likelihood that the keyword extraction means 101 uses for a section of speech need not be the likelihood (= distance) that the keyword extraction model 102 returns for a keyword used as is; it may be normalized by some means. Further, the keyword extraction means 101 may be configured to perform the threshold-based rejection process using some rejection means. For example, rejection may be based on whether or not the keywords extracted from a plurality of utterances spoken within a certain time form a specific set. Further, the keyword extraction model learning means 110 may learn the keyword extraction model so as to return a high likelihood for the input to the cooperation system.

Next, the operation of the first embodiment will be described with reference to the drawings. First, the operation of the keyword extraction means 101 according to the first embodiment will be described. FIG. 2 is a flowchart showing an example of the operation of the keyword extraction means 101 in the first embodiment. It is assumed that an initial keyword extraction model, or a keyword extraction model previously learned by the keyword extraction model learning means 110, is given as the keyword extraction model 102.

The specific behavior of the keyword extraction means 101 differs depending on how the keyword extraction model 102 is selected.

If the keyword extraction model 102 is a model that calculates the likelihood for an acoustic feature sequence, the keyword extraction means 101 calculates acoustic features from the input speech signal (step S101). Next, the keyword extraction means 101 proceeds to step S105 and inputs the input acoustic feature sequence obtained in step S101 to the keyword extraction model 102.

If the keyword extraction model 102 is a model that calculates the likelihood for a phoneme sequence, the keyword extraction means 101 calculates acoustic features from the input speech signal (step S101). Next, the keyword extraction means 101 calculates which known phonemes the obtained input acoustic feature sequence is close to (step S102). Then, the process proceeds to step S105, and the obtained phoneme sequence and, for each phoneme of the sequence, its distance from the input acoustic features are input to the keyword extraction model 102.

If the keyword extraction model 102 is a model that calculates the likelihood for a character string, the keyword extraction means 101 calculates acoustic features from the input speech signal (step S101). As acoustic features, power, Δ power, ΔΔ power, pitch, cepstrum, Δ cepstrum, etc. can be used. Next, the keyword extraction means 101 calculates which known phonemes the obtained input acoustic feature sequence is close to (step S102). Further, it calculates which known syllable string or word string the phoneme sequence obtained in step S102 is close to (step S103). Then, the process proceeds to step S105, and the obtained syllable string or word string and its likelihood are input to the keyword extraction model 102.

If the keyword extraction model 102 is a model that calculates the likelihood for meta features, the meta features are obtained after steps S101 to S103 (step S104) and input to the keyword extraction model 102 (step S105). Meta features are, for example, feature values calculated based on one or more character strings obtained in step S103, such as part-of-speech information, recent keyword extraction results, phoneme posterior probabilities, and word posterior probabilities.

The keyword extraction model 102 may be a model that calculates the likelihood for a combination of one or more of the above-described acoustic feature string, phoneme string, character string, and meta feature string. In that case, the keyword extraction means 101 inputs the necessary information in step S105 after passing through steps S101 to S104 as appropriate.

Further, the keyword extraction means 101 may execute the processing of steps S101 to S105 as a pipeline as necessary. Pipelined execution shortens the keyword extraction processing time (improves throughput), and unnecessary computation can be reduced by combining it with an appropriate pruning process. For example, when the character string Y is obtained from the phoneme string X, the likelihood calculation for the character string Y can be skipped if the likelihood for the phoneme string X falls below a predetermined pruning threshold.
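The threshold-based skipping described above can be sketched as follows. This is only an illustrative sketch: the function name, the scoring callables, and the threshold are hypothetical and do not come from this description.

```python
from typing import Callable, Optional


def score_with_pruning(
    phoneme_string: str,
    phoneme_likelihood: Callable[[str], float],
    to_character_string: Callable[[str], str],
    string_likelihood: Callable[[str], float],
    pruning_threshold: float,
) -> Optional[float]:
    """Return the likelihood of string Y derived from phoneme string X.

    Returns None when X is pruned, in which case Y is never scored.
    All callables are hypothetical stand-ins for the real models.
    """
    # Likelihood of the phoneme string X (the earlier pipeline stage).
    if phoneme_likelihood(phoneme_string) < pruning_threshold:
        return None  # skip the more expensive character-string stage
    # Only strings derived from surviving phoneme strings are scored.
    y = to_character_string(phoneme_string)
    return string_likelihood(y)
```

Because the later stage is only invoked for surviving candidates, the saving grows with the cost of the character-string scoring.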

The keyword extraction means 101 calculates the necessary information from the input speech in steps S101 to S104 and then, in step S105, performs matching against the keyword extraction model 102. As a result, the likelihood given by the keyword extraction model 102 for one section of speech is calculated.

In step S106, the keyword extraction means 101 makes a rejection decision on the likelihood calculated in step S105. For example, the keyword extraction means 101 considers that a keyword has been extracted if a likelihood exceeding a predetermined threshold is obtained for any of the keywords represented by the keyword extraction model 102.
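A minimal sketch of such a threshold-based rejection decision is shown below. The function name and the dictionary of per-keyword likelihoods are assumptions made for illustration; the description does not prescribe a data structure.

```python
from typing import Dict, Optional


def extract_keyword(likelihoods: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the best-scoring keyword, or None if it is rejected.

    `likelihoods` maps each keyword to the likelihood the model returned
    for one section of speech (hypothetical representation).
    """
    if not likelihoods:
        return None
    best = max(likelihoods, key=likelihoods.get)
    # A keyword counts as extracted only if its likelihood exceeds the threshold.
    return best if likelihoods[best] > threshold else None
```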

Next, the operation of the keyword extraction model learning means 110 according to the first embodiment will be described. FIG. 3 is a flowchart showing an example of the operation of the keyword extraction model learning means 110 in the first embodiment.

The specific operation of the keyword extraction model learning means 110 differs depending on how the keyword extraction model 102 is selected.

First, in step S201, an initial keyword extraction model (initial model) is given. The keyword extraction model in the initial state is given, for example, as an initial value of the program. If some or all of the keywords are known in advance, or if text information that is likely to contain keywords is available to some extent, an initial model is built from them. If no such information is available, an empty initial model is built. An already learned model may also be given as the initial model. In this case, the keyword extraction model learning means 110 performs additional learning on the new learning data.

The keyword extraction model learning means 110 receives, as learning data, an input to the linked system 120, speech data corresponding to the input, and information associating the two (step S202). In the following, the text information that was input to the linked system 120 and is passed to the keyword extraction model learning means 110 as learning data is called a regular keyword (because it is regarded as the canonical expression of a keyword). Here, it is assumed that the speech contains at least one utterance that corresponds to the regular keyword. The information associating the speech with the regular keyword includes, for example, time information indicating which section of the speech signal is the utterance expression of the regular keyword. Alternatively, it may include a transcription character string of the utterance expression of the regular keyword.

A pair of a regular keyword and the speech associated with it can also be found automatically from the regular keyword. For example, if the keyword extraction model has been sufficiently learned, the variations of utterances that can yield a given regular keyword are available. Therefore, given speech that is known to contain a section corresponding to the regular keyword, the keyword extraction model learning means 110 can extract from that speech the section in which one of those variations is uttered. The learning data pairs obtained in this way can be used as learning data for increasing robustness against acoustic variation in speech (for example, variation derived from speaker characteristics).

When the keyword extraction model 102 is a model that calculates the likelihood for an acoustic feature sequence, the keyword extraction model learning means 110 calculates acoustic features from the speech signal input as learning data (step S203). When the keyword extraction model 102 is a model that calculates the likelihood for a phoneme sequence, the keyword extraction model learning means 110 calculates the phoneme sequence and its distance based on the acoustic features (step S204). Further, when the keyword extraction model 102 is a model that calculates the likelihood for a character string, the keyword extraction model learning means 110 calculates the character string and its likelihood based on this phoneme sequence (step S205). Furthermore, when the keyword extraction model 102 is a model that calculates the likelihood for meta features, the keyword extraction model learning means 110 calculates the meta features and their likelihood based on the character string (step S206).

The details of the processing of steps S203 to S206 are the same as the processing of steps S101 to S104 in the keyword extraction means 101.

Next, the keyword extraction model learning means 110 extends the keyword extraction model 102 so that it accepts the acoustic features, phoneme string, character string, meta features, etc. obtained for the section of speech corresponding to the utterance expression of the regular keyword (step S207). At this time, the model is extended so that the matching result it outputs is the regular keyword of the utterance expression, not the utterance expression that was the source of the extension.

For example, in the case of a network model in which each regular keyword is modeled by an HMM and the HMMs are arranged in parallel, the HMM for the utterance expression of a regular keyword is placed in parallel with the HMM of the original regular keyword. Here, the likelihood of passing through this utterance expression HMM is treated as the likelihood of passing through the regular keyword HMM. In the case of a keyword extraction model based on a tree structure dictionary, information indicating which regular keyword was matched is attached to the leaves of the tree structure. Here, the information attached to the leaf corresponding to the utterance expression of a certain regular keyword is information indicating the regular keyword on which the utterance expression is based.
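The tree structure dictionary case can be illustrated with a small trie whose terminal nodes store the regular keyword rather than the matched utterance expression. This is a simplified sketch under stated assumptions: the class names are hypothetical, symbols are plain characters, and likelihoods are omitted.

```python
from typing import Dict, Optional, Sequence


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        # Set only on terminal nodes: the regular keyword to report.
        self.regular_keyword: Optional[str] = None


class KeywordTrie:
    """Hypothetical tree structure dictionary keyed by symbol sequences."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def add(self, symbols: Sequence[str], regular_keyword: str) -> None:
        node = self.root
        for s in symbols:
            node = node.children.setdefault(s, TrieNode())
        node.regular_keyword = regular_keyword

    def lookup(self, symbols: Sequence[str]) -> Optional[str]:
        node = self.root
        for s in symbols:
            node = node.children.get(s)
            if node is None:
                return None
        return node.regular_keyword


trie = KeywordTrie()
trie.add("A-30C", "A-30C")      # the regular keyword itself
trie.add("30 C of A", "A-30C")  # an utterance expression of the same keyword
```

Looking up either "A-30C" or "30 C of A" yields the regular keyword "A-30C", which is the behavior the extension is meant to guarantee.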

If the keyword extraction model 102 calculates the likelihood for a non-acoustic input such as a character string, the model may also be extended in step S208 for the regular keyword itself given as learning data.

If the model is based on a character string, the character string of the regular keyword is used as it is. In the case of a model based on a syllable string, learning is performed after some reading-assignment process is applied to the regular keyword. For the reading assignment, for example, a method using a recognition dictionary or a method using a general-purpose morphological analyzer can be considered. In the case of a model based on a phoneme string, the reading information is similarly converted into a phoneme string by a predetermined method. Meta information is learned in the same way as long as it can be obtained from the regular keyword. For example, parts of speech and character types are information that can be extracted from regular keywords. Therefore, the regular keywords themselves can be learned if the model uses such meta information. On the other hand, a model that uses posterior probabilities, for example, cannot learn from regular keywords. In addition, if the information that associates the regular keyword with the speech data includes a transcription character string of the utterance expression of the regular keyword, the keyword extraction model 102 may be extended so that this character string expression is also accepted. The conditions under which this is possible, and the procedure, are the same as for the regular keyword.

In step S209, if the model extension made in step S207 can be propagated to other keywords, the keyword extraction model learning means 110 propagates the model extension further. For example, if the keyword extraction model 102 is a model based on a tree structure dictionary and an ε transition from a node at a certain depth of the tree to a node at another depth has been added, this extension is propagated (shared) to the subtrees that share the structure up to that depth. Furthermore, a similar ε transition may be added between other nodes located at the same depths as the nodes connected by this ε transition.

If the regular keyword given as learning data carries additional attributes beyond the character string itself, the propagation of the extension in step S209 may be limited to keywords with the same or similar attributes. For example, if the linked system 120 accepts ticket numbers and artist names as input, and a ticket number and its utterance expression are newly given as learning data, the above propagation need not be performed on the part of the keyword extraction model 102 related to artist names.

The model expansion in steps S207, S208, and S209 need not simply extend the model to accept the learning data; a procedure for adjusting the likelihood given to the regular keywords and utterance expressions accepted by the extension may be performed at the same time. For example, in a model based on a tree structure dictionary, no penalty may be given to a branch expanded in step S208, a light penalty may be given to a branch expanded in step S207, and a heavy penalty may be given to a branch expanded in step S209.
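As an illustration of such likelihood adjustment, the sketch below subtracts a step-dependent penalty from a log-likelihood. The penalty values and the use of log-likelihoods are assumptions made for this example; the description only states the relative ordering (none, light, heavy).

```python
# Hypothetical penalties, ordered as in the text: none < light < heavy.
PENALTY_BY_STEP = {
    "S208": 0.0,  # branch added for the regular keyword itself
    "S207": 1.0,  # branch added for an observed utterance expression
    "S209": 3.0,  # branch added by propagating an extension
}


def adjusted_log_likelihood(raw_log_likelihood: float, expanded_in_step: str) -> float:
    """Apply the penalty of the step in which the matched branch was added."""
    return raw_log_likelihood - PENALTY_BY_STEP[expanded_in_step]
```

With equal raw scores, a match through a branch added in step S208 then ranks above one added in step S207, which in turn ranks above one added in step S209.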

If the keyword extraction model 102 is learned using not only positive examples but also negative examples, the keyword extraction model learning means 110 may, in step S210, also perform learning on the portions of speech that do not correspond to the regular keyword. For example, in the case of a keyword network model in which a classifier such as an SVM is prepared for each keyword, providing negative examples, that is, the acoustic features, phoneme strings, character strings, and meta features obtained from speech that does not correspond to a regular keyword, prevents the model from erroneously returning a high likelihood for non-keywords.

Utterances corresponding to regular keywords, and the regular keywords themselves, may also be used as negative examples in step S210. For example, the regular keyword, its utterance expression, and the speech data entered as learning data for a keyword A can be used as negative examples for the classifier of another keyword B.
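The reuse of one keyword's learning data as negative examples for another keyword's classifier can be sketched as follows. The function name and the label convention (1 for positive, 0 for negative) are hypothetical, and the sketch treats every other keyword as dissimilar, which is a simplification.

```python
from typing import Dict, List, Tuple


def build_training_sets(
    positives: Dict[str, List[str]],
) -> Dict[str, List[Tuple[str, int]]]:
    """Build labeled examples for one classifier per keyword.

    `positives` maps each keyword to its own examples (regular keyword,
    utterance expressions, or features derived from its speech data).
    """
    training_sets: Dict[str, List[Tuple[str, int]]] = {}
    for keyword, own_examples in positives.items():
        examples = [(x, 1) for x in own_examples]
        # Positive examples of every other keyword become negatives here.
        for other, other_examples in positives.items():
            if other != keyword:
                examples.extend((x, 0) for x in other_examples)
        training_sets[keyword] = examples
    return training_sets
```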

Finally, in step S211, if the model expansion in steps S207, S208, and S209 makes it necessary to recalculate existing parts of the model, the keyword extraction model learning means 110 performs the recalculation. For example, a keyword extraction model 102 based on word N-grams requires recalculation of the back-off coefficients after model expansion (since the frequencies of previously unknown N-word sets increase).

After that, the keyword extraction system 100 executes the keyword extraction process using the keyword extraction model 102 learned by the keyword extraction model learning means 110. Because the keyword extraction model is learned through the above process in consideration of the input to the linked system 120 and its utterance expressions, the accuracy of keyword extraction can be improved.

Embodiment 2

Next, a second embodiment of the present invention will be described with reference to the drawings. FIG. 4 is a block diagram showing a configuration example of the second embodiment. The second embodiment shown in FIG. 4 comprises a keyword extraction system 200 that extracts keywords from speech, a keyword extraction model learning means 210, and a linked system 220 that performs a predetermined operation with the extracted keyword as input.

The keyword extraction system 200 includes a keyword segment extraction means 201 that extracts the speech segment corresponding to a keyword, a keyword recognition means 202 that determines which keyword is uttered in the extracted keyword segment, a keyword segment extraction model 203 learned by the keyword extraction model learning means 210, and a keyword recognition model 204 also learned by the keyword extraction model learning means 210.

Next, the operation of the second embodiment will be described with reference to the drawings. FIG. 5 is a flowchart showing an example of the operation of the keyword extraction system in the second embodiment. The keyword segment extraction means 201 applies the keyword segment extraction model 203 to the input speech data. If the keyword segment extraction model 203 returns a likelihood equal to or greater than a predetermined threshold for a certain segment of the speech data, the keyword segment extraction means 201 identifies that segment as a keyword segment (step S301).

Further, the keyword recognition means 202 performs keyword recognition processing on the identified keyword segment using the keyword recognition model 204, and outputs the keyword for which the keyword recognition model 204 returns the maximum likelihood for the speech segment (step S302).
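The two-stage flow of steps S301 and S302 can be sketched as below. The two model callables, the segment representation as (start, end) pairs, and the function name are all hypothetical placeholders for the learned models.

```python
from typing import Callable, Dict, List, Tuple

Segment = Tuple[int, int]  # hypothetical (start, end) of a candidate speech segment


def extract_keywords(
    segments: List[Segment],
    segment_model: Callable[[Segment], float],
    recognition_model: Callable[[Segment], Dict[str, float]],
    segment_threshold: float,
) -> List[Tuple[Segment, str]]:
    """Detect keyword segments first, then recognize which keyword was spoken."""
    results: List[Tuple[Segment, str]] = []
    for segment in segments:
        # Step S301: keep segments whose likelihood is at or above the threshold.
        if segment_model(segment) < segment_threshold:
            continue
        # Step S302: output the keyword with the maximum likelihood.
        scores = recognition_model(segment)
        if scores:
            results.append((segment, max(scores, key=scores.get)))
    return results
```

Separating the two decisions means the recognition model is only consulted for segments that were already judged to contain some keyword.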

FIG. 6 is a flowchart showing an example of the operation of the keyword extraction model learning means 210 in the second embodiment. The keyword extraction model learning means 210 receives the input to the linked system 220, the speech data, and information that associates them (step S401).

The keyword extraction model learning means 210 regards the input to the linked system 220 as a keyword and learns the keyword segment extraction model 203, which estimates whether the keyword or its utterance expression appears in a certain speech segment (step S402). In other words, the keyword segment extraction model 203 is a model that returns a likelihood indicating whether speech data includes a keyword.

At the same time, using the same input, the keyword extraction model learning means 210 pairs each segment of speech corresponding to a keyword with that keyword and learns the keyword recognition model 204, which recognizes which keyword or utterance expression appeared in the speech segment (step S403). In other words, the keyword recognition model 204 is a model that returns a likelihood indicating which keyword is included in the speech data.

The procedure by which the keyword extraction model learning means 210 learns the two models is almost the same as the learning procedure of the keyword extraction model learning means 110 of the first embodiment. The keyword segment extraction means 201 may also operate so as to select an optimum model from several prepared keyword recognition models 204 based on information returned by the keyword segment extraction model 203.

That is, when the keyword segment extraction model 203 has been learned so as to return a likelihood for each keyword group consisting of several keywords, the keyword extraction model learning means 210 learns a keyword recognition model 204 for each keyword group. In this way, each model can be trained with higher accuracy, and the keyword extraction accuracy is improved.

In the second embodiment, the two identification operations necessary for keyword extraction, namely identifying whether a certain speech segment is a keyword and identifying which keyword a certain speech segment is, are modeled separately, so a more accurate model can be built. In particular, when a discriminative model such as an SVM is used, the number of negative examples becomes relatively larger, so learning with higher accuracy than with the model of the first embodiment can be performed.

In the second embodiment, learning with higher generalization ability for similar keywords is possible.

If keywords are similar, their utterance expressions and the recognition results for the speech are also similar and may overlap. For example, keyword A1, which is a variant of keyword A, and keyword B1, which is a variant of keyword B, may have exactly the same form. In the first embodiment, such overlap may reduce the learning accuracy of the model. In the second embodiment, on the other hand, it is not a problem at least for the learning of the keyword segment extraction model 203. This is because, for a speech segment that matches keyword A1 (= keyword B1), the keyword segment extraction means 201 does not need to decide between "A" and "B"; it is sufficient to accurately estimate that either "A" or "B" appeared in that segment.

For the keyword recognition model 204 as well, the second embodiment may be able to learn a more accurate model. This is because, in the first embodiment, the keyword extraction model 102 must learn to reject the fillers before and after the keyword. In the keyword recognition model 204 of the second embodiment, by contrast, the fillers before and after the keyword need not be considered. Needless to say, an even more accurate model can be learned when a different keyword recognition model is used for each keyword group.

If it is known in advance that some of the keywords form a group, they can be grouped when the initial model is built. Otherwise, multiple keywords that overlap each other can be integrated. For example, if an utterance expression for a certain keyword is given and a high likelihood is obtained for a keyword group that does not contain that keyword, the keyword of that utterance expression may simply be integrated into the keyword group for which the high likelihood was obtained.

Example

Next, an example of the second embodiment will be described. FIG. 7 is a block diagram showing a configuration example of the keyword extraction system according to this example. As shown in FIG. 7, the case where the keyword extraction system 300 operates as a front end of the product information search system 320 will be described.

The product information search system 320 is given one or more search words and presents product information that includes the search words. For example, product information can be searched by entering formal names and abbreviations of products, product numbers in catalogs, product classifications (furniture, chairs, TVs, health equipment, etc.), or words that describe product characteristics (white, pipes, large screen, stiff shoulders, etc.). These search words can also be input using an input device such as a keyboard.

It is assumed that the user of the product information search system 320 searches for the necessary product information while responding to a customer by telephone or the like.

The keyword extraction system 300 includes a keyword segment extraction unit 301, a keyword recognition unit 302, N keyword cluster extraction models 303, and N keyword cluster recognition models 304.

The keyword cluster extraction models 303 are a plurality of discriminative models, such as SVMs or CRFs, arranged in parallel. Each keyword cluster extraction model 303 models the keywords belonging to a certain cluster and their utterance expressions. Each keyword cluster extraction model 303 is learned, based on features such as the acoustic features of a speech segment, its phoneme string, the word sequences of the top n recognition candidates, and the part-of-speech information of each word, so that positive examples (keywords and utterance expressions belonging to the cluster) and negative examples (keywords and utterance expressions that do not belong to the cluster, non-keywords, noise, etc.) can be discriminated with the highest accuracy.

The keyword segment extraction unit 301 calculates the various features required by the keyword cluster extraction models 303 from the input speech. By inputting the calculated features into a keyword cluster extraction model 303, a likelihood is obtained that indicates whether the speech segment is one of the keywords represented by that keyword cluster extraction model 303. The cluster represented by the keyword cluster extraction model 303 that returned the highest likelihood among the multiple keyword cluster extraction models 303 is the maximum likelihood cluster. If its likelihood exceeds a predetermined threshold, the keyword segment extraction unit 301 determines that one of the keywords belonging to that cluster has been uttered in the speech segment.

The keyword recognition unit 302 is activated when the keyword segment extraction unit 301 detects a speech segment corresponding to one of the keyword clusters. The keyword recognition unit 302 performs speech recognition processing on the extracted speech segment using the keyword cluster recognition model 304 of the maximum likelihood cluster. If the likelihood of the keyword that returns the highest likelihood exceeds a predetermined threshold, the keyword recognition unit 302 determines that the keyword (or its utterance expression) was uttered in the speech segment.

The keyword cluster recognition model 304 returns, for a speech segment corresponding to a keyword cluster, a likelihood indicating which of the keywords included in the cluster, or which of their utterance expressions, was uttered. It can be implemented using the HMM keyword network described in Non-Patent Document 1, a weighted tree structure dictionary, a character N-gram, etc. Here, the case of using a keyword network based on syllable HMMs is described.

Next, the operation of the keyword model learning unit 310 will be described. First, a learning data pair is input to the keyword model learning unit 310. The learning data pair consists of a search query (search word) entered into the product information search system 320 in the past, the user's speech at the time the search query was issued, and time information indicating where in that speech the utterance presumed to correspond to the search query occurs.

For example, suppose that the user utters "Can you give me the item number of the product you inquired about? Yes. A... 30-D?" and immediately afterwards issues the search query "A-30C" to the product information search system 320. At this time, the search query "A-30C", the speech of this entire utterance, and the relative time information of the utterance corresponding to "A-30C" are input to the keyword model learning unit 310 as a learning data pair. If the keyword model has already been sufficiently learned, it is possible to check whether any of the possible utterance expressions of the search query "A-30C" is present in the speech and, if one is found (in this case, "30 C of A" is found), to automatically obtain the time information of that utterance section, the entire utterance, and the search query "A-30C" as a learning data pair. When learning of the keyword model is insufficient, the user associates them manually while speaking (for example, if the utterance content is recognized and displayed on the screen, by selecting the corresponding speech part), or the user or a third party explicitly associates the learning data after the fact.

First, the keyword model learning unit 310 determines whether the new learning data belongs to any of the known keyword clusters. If the search query given as learning data belongs to a known keyword cluster, the keyword model learning unit 310 thereafter performs learning for that cluster. If it does not belong to any cluster, the keyword model learning unit 310 creates a new cluster.

Next, the keyword model learning unit 310 performs learning for the keyword cluster extraction model 303. For the keyword cluster extraction model 303 corresponding to the selected (or created) cluster, the keyword model learning unit 310 extracts the necessary feature information from the speech at the time of the utterance that appears to correspond to the search query. This is added as a positive example to the learning data for this keyword cluster extraction model. Furthermore, the necessary feature information is extracted in the same way from the speech at times other than that utterance, and this is added to the learning data as a negative example. The keyword model learning unit 310 learns the keyword cluster extraction model 303 using the augmented learning data. An appropriate learning algorithm is used according to the model type (SVM, CRF, etc.).

Next, the keyword model learning unit 310 performs learning for the keyword cluster recognition model 304. For the keyword cluster recognition model 304 corresponding to the selected (or created) cluster, the keyword model learning unit 310 extracts the necessary feature information from the speech of the utterance that is considered to correspond to the search query. When syllable HMMs are used, the keyword model learning unit 310 obtains a syllable string for which the acoustic features extracted from the speech show a high likelihood under a given acoustic model. The keyword model learning unit 310 generates an HMM for the keyword using the extracted feature information as learning data. In addition, the keyword model learning unit 310 converts the search query character string into a syllable string and also creates an HMM for it. Both of the two HMMs generated in this way are used to calculate the likelihood for the search query (keyword).

Of course, it is also possible to use a discriminative model such as an SVM as the keyword cluster recognition model 304, or to use an N-gram or keyword network as the keyword cluster extraction model 303.

The keyword model learning unit 310 further determines whether cluster integration is necessary. This is determined by how much overlap is seen among the keyword cluster extraction models. For example, for each cluster, the keyword model learning unit 310 counts the proportion of the positive examples in the learning data of its keyword cluster extraction model 303 that match positive examples of other clusters. If this proportion is above a predetermined threshold, it is determined that these clusters need to be integrated.
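The overlap test for cluster integration can be sketched as follows. The representation of positive examples as sets of strings, the pairwise comparison, and the function name are assumptions made for illustration only.

```python
from typing import Dict, List, Set, Tuple


def clusters_to_integrate(
    positives: Dict[str, Set[str]], threshold: float
) -> List[Tuple[str, str]]:
    """Return (cluster, other) pairs whose positive examples overlap heavily.

    For each cluster, the proportion of its positive examples that also
    appear among another cluster's positive examples is compared with
    the threshold.
    """
    pairs: List[Tuple[str, str]] = []
    for name, examples in positives.items():
        if not examples:
            continue  # avoid dividing by zero for empty clusters
        for other, other_examples in positives.items():
            if other == name:
                continue
            proportion = len(examples & other_examples) / len(examples)
            if proportion > threshold:
                pairs.append((name, other))
    return pairs
```

Note that the proportion is asymmetric: a small cluster fully contained in a large one exceeds the threshold in one direction even when the reverse proportion is low.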

Industrial applicability

The present invention can be applied to an information search device equipped with a voice input I/F, an information recording device that extracts necessary information from speech and fills it into a predetermined form, and a media search device that searches for speech related to predetermined content. It can also be applied to applications such as information home appliances and software operated by voice commands.

This application claims priority based on Japanese Patent Application No. 2007-150082 filed on June 6, 2007, the entire disclosure of which is incorporated herein.

Claims

1. A keyword extraction model learning system for learning a keyword extraction model for extracting keywords from speech, characterized by comprising keyword extraction model learning means for performing learning using an input to a linked system and speech corresponding to the input.

2. The keyword extraction model learning system according to claim 1, wherein the input to the linked system includes at least text information.

3. The keyword extraction model learning system according to claim 1 or 2, wherein the speech corresponding to the input to the linked system includes both a part corresponding to the input to the linked system and a part not corresponding to the input.

4. The keyword extraction model learning system according to any one of the above claims, wherein the keyword extraction model learning means learns the keyword extraction model so as to return a high likelihood for the input to the linked system.

5. The keyword extraction model learning system according to any one of the above claims, wherein the keyword extraction model learning means learns the keyword extraction model so as to return a high likelihood for the speech corresponding to the input to the linked system or for the corresponding part of the speech.

6. The keyword extraction model learning system according to any one of the above claims, wherein the keyword extraction model learning means learns the keyword extraction model so as to return a low likelihood for speech that does not correspond to an input to the linked system or for a part of the speech that does not correspond.

7. The keyword extraction model learning system according to any one of the above claims, wherein the keyword extraction model learning means uses speech corresponding to an input to the linked system as learning data for model learning related to another input similar to the input to the linked system.

8. The keyword extraction model learning system according to any one of claims 1 to 7, wherein the keyword extraction model learning means uses the speech corresponding to an input to the linked system as negative-example learning data for model learning related to another input that is not similar to that input.

9. The keyword extraction model learning system according to any one of the above claims, wherein the keyword extraction model learning means classifies the inputs to the linked system into one or more clusters based on a predetermined constraint and performs the learning related to each cluster collectively.

10. The keyword extraction model learning system according to claim 1, wherein the keyword extraction model learned by the keyword extraction model learning means consists of two types of models: a keyword interval model that returns the likelihood that a part of a given speech is an utterance of any one of the keywords, and a keyword recognition model that returns the likelihood that a part of a given speech is an utterance of each keyword, and wherein the keyword extraction model learning means learns the two types of models.

11. The keyword extraction model learning system according to any one of claims 1 to 10, wherein the keyword extraction model learning means finds a section of speech that is highly likely to correspond to the input to the linked system, and performs learning using the speech of that section and the input.

12. The keyword extraction model learning system according to any one of claims 1 to 11, wherein the keyword extraction model learning means performs learning using the speech corresponding to the input or a transcription of the speech corresponding to the input.

13. A keyword extraction system characterized by using a keyword extraction model learned by the keyword extraction model learning means according to claim 1.

14. An information input system using the keyword extraction system according to claim 13.

15. An information search system using the keyword extraction system according to claim 13.

16. A keyword extraction model learning method for learning a keyword extraction model for extracting keywords from speech, characterized by including a keyword extraction model learning step of performing learning using an input to a linked system and speech corresponding to the input.

17. The keyword extraction model learning method according to claim 16, wherein the input to the linked system includes at least text information.

18. The keyword extraction model learning method according to claim 16, wherein the speech corresponding to the input to the linked system includes both a part corresponding to the input to the linked system and a part not corresponding to the input.

19. The keyword extraction model learning method, wherein, in the keyword extraction model learning step, the keyword extraction model is learned so as to return a high likelihood for the input to the linked system.

20. The keyword extraction model learning method according to any one of claims 16 to 19, wherein, in the keyword extraction model learning step, the keyword extraction model is learned so as to return a high likelihood for the speech corresponding to the input to the linked system, or for the corresponding part of that speech.

21. The keyword extraction model learning method according to any one of claims 16 to 20, wherein, in the keyword extraction model learning step, the keyword extraction model is learned so as to return a low likelihood for speech that does not correspond to the input to the linked system, or for a part of the speech that does not correspond to it.

22. The keyword extraction model learning method according to any one of claims 16 to 21, wherein, in the keyword extraction model learning step, the speech corresponding to an input to the linked system is used as learning data for model learning related to another input that is similar to that input.

23. The keyword extraction model learning method according to any one of claims 16 to 22, wherein, in the keyword extraction model learning step, the speech corresponding to an input to the linked system is used as negative-example learning data for model learning related to another input that is not similar to that input.

24. The keyword extraction model learning method according to any one of the above claims, wherein, in the keyword extraction model learning step, the inputs to the linked system are classified into one or more clusters based on a predetermined constraint, and the learning related to each cluster is performed collectively.

25. The keyword extraction model learning method according to any one of claims 16 to 24, wherein the keyword extraction model learned in the keyword extraction model learning step consists of two types of models: a keyword interval model that returns the likelihood that a part of a given speech is an utterance of any one of the keywords, and a keyword recognition model that returns the likelihood that a part of a given speech is an utterance of each keyword, and wherein the two types of models are learned in the keyword extraction model learning step.

26. The keyword extraction model learning method according to claim 1, wherein, in the keyword extraction model learning step, a section of speech that is highly likely to correspond to the input to the linked system is found, and learning is performed using the speech of that section and the input.

27. The keyword extraction model learning method according to any one of claims 16 to 26, wherein, in the keyword extraction model learning step, learning is performed using the speech corresponding to the input or a transcription of the speech corresponding to the input.

28. A keyword extraction method characterized by using a keyword extraction model learned by the keyword extraction model learning method according to any one of claims 16 to 27.

29. An information input method using the keyword extraction method according to claim 28.

30. An information search method using the keyword extraction method according to claim 28.

31. A keyword extraction model learning program for learning a keyword extraction model for extracting keywords from speech, the program causing a computer to execute a keyword extraction model learning process that performs learning using an input to a linked system and speech corresponding to the input.

32. The keyword extraction model learning program according to claim 31, wherein the input to the linked system includes at least text information.

33. The keyword extraction model learning program according to claim 31, wherein the speech corresponding to the input to the linked system includes both a part corresponding to the input to the linked system and a part not corresponding to the input.

34. The keyword extraction model learning program according to any one of claims 31 to 33, wherein, in the keyword extraction model learning process, the program causes the computer to execute a process of learning the keyword extraction model so as to return a high likelihood for the input to the linked system.

35. The keyword extraction model learning program according to any one of claims 31 to 34, wherein, in the keyword extraction model learning process, the program causes the computer to execute a process of learning the keyword extraction model so as to return a high likelihood for the speech corresponding to the input to the linked system, or for the corresponding part of that speech.

36. The keyword extraction model learning program according to any one of claims 31 to 35, wherein, in the keyword extraction model learning process, the program causes the computer to execute a process of learning the keyword extraction model so as to return a low likelihood for speech that does not correspond to the input to the linked system, or for a part of the speech that does not correspond to it.

37. The keyword extraction model learning program according to any one of claims 31 to 36, wherein, in the keyword extraction model learning process, the program causes the computer to execute a process of using the speech corresponding to an input to the linked system as learning data for model learning related to another input that is similar to that input.

38. The keyword extraction model learning program according to any one of claims 31 to 37, wherein, in the keyword extraction model learning process, the program causes the computer to execute a process of using the speech corresponding to an input to the linked system as negative-example learning data for model learning related to another input that is not similar to that input.

39. The keyword extraction model learning program according to any one of claims 31 to 38, wherein, in the keyword extraction model learning process, the program causes the computer to execute a process of classifying the inputs to the linked system into one or more clusters based on a predetermined constraint and performing the learning related to each cluster collectively.

40. The keyword extraction model learning program according to any one of claims 31 to 39, wherein the keyword extraction model learned by the keyword extraction model learning process consists of two types of models: a keyword interval model that returns the likelihood that a part of a given speech is an utterance of any one of the keywords, and a keyword recognition model that returns the likelihood that a part of a given speech is an utterance of each keyword, and wherein, in the keyword extraction model learning process, the program causes the computer to execute a process of learning the two types of models.

41. The keyword extraction model learning program according to any one of claims 31 to 40, wherein, in the keyword extraction model learning process, the program causes the computer to execute a process of finding a section of speech that is highly likely to correspond to the input to the linked system and performing learning using the speech of that section and the input.

42. The keyword extraction model learning program according to any one of claims 31 to 41, wherein, in the keyword extraction model learning process, the program causes the computer to execute a process of performing learning using the speech corresponding to the input or a transcription of the speech corresponding to the input.
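
As an informal illustration only (not part of the claims), the way training examples could be assembled from paired data might be sketched as follows. The sketch assumes that a transcription of the speech is available, as the claims permit, and it substitutes a plain string-similarity score for the likelihood that a section corresponds to the input. The names find_keyword_span, build_examples, and min_score are hypothetical and do not appear in the specification.

```python
from difflib import SequenceMatcher


def find_keyword_span(keyword, words):
    """Return (start, end, score) for the word window most similar to keyword.

    String similarity is an assumed stand-in for the likelihood that a
    section of the speech corresponds to the linked-system input.
    """
    target = keyword.lower()
    max_width = len(target.split()) + 1  # allow one extra word of slack
    best = (0, 0, 0.0)
    for width in range(1, min(len(words), max_width) + 1):
        for start in range(len(words) - width + 1):
            candidate = " ".join(words[start:start + width]).lower()
            score = SequenceMatcher(None, target, candidate).ratio()
            if score > best[2]:
                best = (start, start + width, score)
    return best


def build_examples(pairs, min_score=0.8):
    """Split each transcript into positive and negative training examples.

    pairs: list of (linked_system_input, transcript) tuples.
    The section matching the input becomes a positive example; the
    remaining parts of the same transcript become negative examples.
    Pairs without a confident match are skipped (an assumed policy).
    """
    positives, negatives = [], []
    for keyword, transcript in pairs:
        words = transcript.split()
        start, end, score = find_keyword_span(keyword, words)
        if score < min_score:
            continue
        positives.append((keyword, " ".join(words[start:end])))
        for rest in (words[:start], words[end:]):
            if rest:
                negatives.append((keyword, " ".join(rest)))
    return positives, negatives
```

Calling build_examples with a list containing the pair ("Tokyo station", "I would like to go to Tokyo station tomorrow please") yields one positive example for the matched section and two negative examples for the surrounding words. An actual implementation would operate on acoustic features and model likelihoods rather than on text similarity.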