WO2002067244A1

WO2002067244A1 - Speech recognition method for speech interaction, speech recognition system and speech recognition program

Info

Publication number: WO2002067244A1
Application number: PCT/JP2001/001165
Authority: WO
Inventors: Tadamitsu Ryu; Masato Numabe; Shinichiro Kubo
Original assignee: Cai Co., Ltd
Priority date: 2001-02-19
Filing date: 2001-02-19
Publication date: 2002-08-29
Also published as: JPWO2002067244A1

Abstract

A speech recognition method for speech interaction, a speech recognition system and speech recognition program, for effectively using a memory and hard disk of limited capacities and performing a speech recognition while permitting continued interaction without unnatural pauses even when a candidate is not retrieved after a preset time or some words cannot be recognized, the method comprising the step (S1) of recording in a storage one or two or more speech recognition dictionaries prepared for respective scenes, the step (S4) of inputting speech of a speaker via a speech input unit, the step (S7) of performing a speech recognition on the speech of the speaker and speech data obtained from a word spot by using one or two or more speech recognition dictionaries, the step (S8) of preparing an interactive-sequence-based responding text when the recognition is made within a preset time, and preparing an answer-back text prompting a re-input to the speaker when it is not made within a preset time, and the step (S9) of speech-synthesizing the prepared responding text or answer-back text.

Description

Specification

Speech recognition method for speech dialogue, speech recognition system and speech recognition program

The present invention relates to a speech recognition method, a speech recognition system, and a speech recognition program for speech dialogue, in which a speaker's voice is recognized, a response sentence is created based on the obtained speech data, and the response sentence is output by speech synthesis.

Technology background

Speech recognition refers to processing a human utterance by computer and correctly recognizing its content. Using speech recognition, sentences can be input without input means such as a keyboard. The range of applications is wide, such as using recognition results to operate machines and devices as intended, and applied research is being carried out in various fields. Speech dialogue is one such application of speech recognition.

Speech dialogue is a dialogue in which a human and a computer converse as if talking to each other: a predetermined conversation program drives the exchange based on the result of recognizing, through speech recognition, the voice uttered by the human.

Speech recognition in conventional speech dialogue uses various methods to recognize words, such as “dictation,” which recognizes the words spoken by a human as they are, from the beginning of the utterance, and “word spot,” which extracts keywords from the words spoken by a human. The general mechanism of “dictation” is to first convert the spoken words, as input speech, into a phoneme string, replace the phoneme string with a word string, parse it, and then convert it to a character string. It then performs logical analysis and semantic analysis to generate text, which is synthesized and output. Since words also have homonyms, accurate recognition is achieved by adding attribute information to each word.

On the other hand, in “word spot,” a computer analyzes the words spoken by a human as speech, extracts features of the speech, and creates a time series of feature quantities. It then calculates the degree of similarity between that time series and each word in a speech recognition dictionary, provided in the computer in advance, that records the time series of feature quantities of each word. The words with a high degree of similarity are output as the recognition result.
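The matching step described above can be sketched as follows. The text does not name the similarity measure, so this sketch swaps in dynamic time warping (DTW) over one-dimensional feature sequences as an assumption; the function names `dtw_distance` and `spot_word` and the toy feature values are hypothetical.

```python
from math import inf

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    n, m = len(a), len(b)
    # cost[i][j] holds the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def spot_word(features, dictionary):
    """Return the dictionary word whose stored feature series is closest to the input."""
    return min(dictionary, key=lambda w: dtw_distance(features, dictionary[w]))

# Toy dictionary: word -> pre-recorded feature time series (made-up numbers).
dictionary = {
    "server": [1.0, 3.0, 2.0],
    "mouse": [5.0, 5.0, 1.0],
}
best = spot_word([1.0, 1.0, 3.0, 2.0], dictionary)
```

Note that a nearest-match rule like this always returns some word, which is exactly the behaviour the following paragraphs criticize in conventional systems.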

In general, whether "dictation" or "word spot" is used, it is considered that a huge number of words must be registered in advance in the speech recognition dictionary in order to increase the recognition rate. However, if the number of registered words is large, a correspondingly large memory capacity is required, and matching the input speech against the words recorded in the dictionary takes too much time. An unnecessary pause arises before the computer responds, which makes the approach impractical for speech dialogue. In addition, if too many words are registered in the speech recognition dictionary, the number of search targets increases and the recognition rate consequently decreases.

In addition, conventional speech recognition systems, especially those using "dictation," have the problem that the recognition rate drops because they try to recognize even sequences of sounds that have no meaning. For example, if a speaker gets stuck on a word or stutters, the system tries to recognize that as well. As a result, not only is the utterance recognized as a meaningless word, but it also induces incorrect recognition of the surrounding words.

Furthermore, conventional speech recognition calculates the similarity between the input speech and the words in the speech recognition dictionary and outputs a word with high similarity as the recognition result. Therefore, even when the words are not correctly recognized, some candidate word is output regardless. For this reason, the recognition rate declines and meaningless responses are returned.

Incidentally, even in conversation between humans, a listener who has no interest in what the other party is saying cannot take in the content no matter what is said, and is absent-minded. On the other hand, a listener who is willing to listen can understand the content even when there is a lot of noise and parts cannot be heard well. The difference arises because an interested listener assumes in advance the scene currently being talked about and performs recognition after predicting, to some extent, the words the other person will speak next. Therefore, if the conversation suddenly jumps to a topic different from the current one, the listener cannot immediately follow and, for a moment, suspects having misheard.

Therefore, an object of the present invention is to provide a speech recognition method, a speech recognition system, and a speech recognition program for speech dialogue in which the speech recognition dictionary used to recognize the words uttered by the speaker is created in advance for each scene by collecting the words used in that topical scene, and in which the dictionaries are switched according to the scene that has become the topic, so that speech recognition is performed quickly and efficiently while limited memory, hard disk capacity, and the like are used effectively.

Another object of the present invention is to provide a speech recognition method, a speech recognition system, and a speech recognition program for speech dialogue that perform speech recognition while maintaining a natural-feeling dialogue without unnatural pauses, by prompting the speaker to speak again when no candidate has been retrieved after a certain period of time or when there is a word that cannot be recognized.

Disclosure of the invention

The invention according to claim 1 provides a speech recognition method for speech dialogue in which a speaker's voice is recognized, a response sentence is created based on the obtained speech data, and the response sentence is speech-synthesized to carry on a dialogue. The method includes: a step of collecting predetermined words appearing in a topical scene as scene words and recording one or more speech recognition dictionaries, created for each scene, in storage such as a memory or a recording device; a step of inputting the speaker's voice from a speech input unit; a step of performing speech recognition, using the one or more speech recognition dictionaries, on the speech data obtained by sentence analysis and word decomposition of the input speaker's voice using a word spot; a step of creating, if the recognition is performed within a predetermined time, a response sentence based on a dialogue sequence that generates a sentence from the recognition result according to predetermined expressions and phrases, and creating, if the recognition is not performed within the predetermined time, a reply sentence prompting the speaker to re-enter the input; and a step of speech-synthesizing the created response sentence or reply sentence.

The present invention relates to a speech recognition method in which a computer and a human interact: the speaker's voice is recognized, and a response sentence created from the recognition result according to predetermined expressions and phrases is speech-synthesized. The speech recognition dictionary used for speech recognition is created for each topical scene. For example, dictionaries are created for business, politics and economy, computers, education, local information, movies and music, natural science, life and culture, sports, and so on. The words that appear in each scene are collected as scene words in the speech recognition dictionary for that scene. The speech recognition dictionaries created in this way are saved to storage such as a memory or hard disk. When a large number of words to be recognized are recorded comprehensively in one speech recognition dictionary, there are many candidates to match against the speech data of the input speaker's voice, which can lower the recognition rate or lengthen the recognition time. For this reason, a speech recognition dictionary is created for each topical scene to shorten the recognition time. In addition, since the words used in a topical scene are prepared in advance, the recognition rate is improved. Further, the storage capacity required of the memory, hard disk, and the like can be reduced.

When the speaker's voice is input from the speech input unit, it is subjected to sentence analysis and word decomposition using a word spot. The obtained speech data is then subjected to speech recognition using the speech recognition dictionary created as described above. When the recognition is completed within a predetermined time, a response sentence to the speaker's utterance is created based on a dialogue sequence that generates a sentence from the recognition result according to predetermined expressions and phrases. The created response sentence is then speech-synthesized and output, and the dialogue is advanced by speaking to the speaker.

On the other hand, if a word corresponding to the speech recognition dictionary is included but its recognition takes longer than the predetermined time, or if no word corresponding to the speech recognition dictionary is included and no recognition result can be obtained, the dialogue sequence creates a reply sentence prompting the speaker to re-enter the input. The created reply sentence is then speech-synthesized and put back to the speaker. As a result, whether or not speech recognition succeeds, there is always a response from the computer within the predetermined time, so a natural, well-paced conversation proceeds smoothly without unnecessary pauses.
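The time-limited behaviour described above can be illustrated with a minimal sketch. It assumes the recognizer is an ordinary function run in a worker thread; the names `respond`, `fast_recognizer`, and `slow_recognizer`, the time limits, and the wording of both sentences are hypothetical and not taken from the specification.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

REPLY_SENTENCE = "Sorry, could you say that again?"  # hypothetical wording

def respond(recognize, speech, time_limit_s):
    """Return a response sentence if recognition finishes in time, else a reply sentence."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(recognize, speech)
    try:
        words = future.result(timeout=time_limit_s)
    except TimeoutError:
        words = None  # recognition exceeded the predetermined time
    finally:
        pool.shutdown(wait=False)  # do not block the dialogue on a slow recognizer
    if not words:  # timed out, or no dictionary word matched
        return REPLY_SENTENCE
    return "You mentioned " + ", ".join(words) + ". Tell me more."

def fast_recognizer(speech):
    return ["Hawaii"]

def slow_recognizer(speech):
    time.sleep(0.3)  # simulates a dictionary search that exceeds the limit
    return ["Hawaii"]

print(respond(fast_recognizer, "(audio)", time_limit_s=1.0))
print(respond(slow_recognizer, "(audio)", time_limit_s=0.05))
```

Either way the caller gets a sentence to synthesize within roughly `time_limit_s`, which is the property the text relies on to avoid unnatural pauses.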

The invention according to claim 2 is the speech recognition method for speech dialogue according to claim 1, characterized in that the speech data obtained by sentence analysis and word decomposition of the speaker's voice using a word spot is compared with the scene words included in the one or more speech recognition dictionaries, and a predetermined speech recognition dictionary including at least one scene word corresponding to the speech data is selected and used for speech recognition.

In the present invention, one or more speech recognition dictionaries are prepared, one for each topical scene, and stored in advance in storage such as a computer memory or hard disk. The speech data obtained by sentence analysis and word decomposition of the speaker's voice using a word spot is compared with the scene words included in the speech recognition dictionaries held in that storage, and a predetermined speech recognition dictionary including at least one scene word corresponding to the speech data is selected. The selected dictionary is then recorded in, for example, a cache memory, and speech recognition is performed.

The invention according to claim 3 is the speech recognition method for speech dialogue according to claim 1 or 2, characterized in that the scene that has become the topic is identified from the speech data obtained by sentence analysis and word decomposition of the speaker's voice using a word spot, by means of a speech database created for each scene with scene words associated in advance, and the speech recognition dictionary corresponding to that scene is selected from the one or more speech recognition dictionaries and used for speech recognition.

In the present invention, the scene that has become the topic is identified from the speech data, and the speech recognition dictionary corresponding to that scene is selected from the one or more speech recognition dictionaries recorded in the storage and used for speech recognition. A speech database is created in advance by collecting the scene words that appear in each scene, associating them with that scene, and compiling them into a database, which is stored separately in a recording device or the like. The scene that has become the topic is identified, using the speech database, from the speech data obtained by processing the speaker's voice. Speech recognition is then performed using the speech recognition dictionary corresponding to the identified scene.

The invention according to claim 4 is the speech recognition method for speech dialogue according to any one of claims 1 to 3, characterized in that the speech recognition dictionary used for speech recognition is recorded in a cache memory and, when the speech data to be recognized is not included as a scene word in the speech recognition dictionary in use, or when a newly identified scene differs from the scene of the speech recognition dictionary in use, that dictionary is replaced with another speech recognition dictionary that includes the speech data as a scene word, or with the speech recognition dictionary corresponding to the newly identified scene.

In the present invention, the selected speech recognition dictionary is stored in the cache memory and used for speech recognition. If the speech data to be recognized is not included in the speech recognition dictionary in use, a search is performed to determine whether there is another speech recognition dictionary that includes the speech data as a scene word. If the search finds such a dictionary, the speech recognition dictionary recorded in the cache memory is replaced with the newly found one. Likewise, when the scene identified using the speech database differs from the scene of the speech recognition dictionary in use, the speech recognition dictionary corresponding to the newly identified scene is selected and replaces the dictionary in use.

The invention according to claim 5 is the speech recognition method for speech dialogue according to any one of claims 1 to 4, characterized in that the dialogue sequence puts a question to the speaker to prompt the first utterance, generates the response sentence to be asked next according to predetermined expressions and phrases based on the words obtained by recognizing the speaker's voice, and speech-synthesizes that response sentence and puts it to the speaker, so that speech recognition is performed while the conversation with the speaker is led by these questions.

According to the present invention, instead of waiting for the speaker to speak and advancing the conversation on the basis of the speaker's utterances, the computer always puts questions to the speaker and leads the conversation. The sentences put to the speaker are created by a dialogue sequence that generates a question prompting the speaker to speak and, from the recognized words, the response sentence to be asked next according to predetermined expressions and phrases. For example, when the computer is first started, the dialogue sequence generates a sentence that prompts the speaker to speak, such as "What is your business?", "Hello, what did you do yesterday?", or "Did you read today's newspaper?", and speech-synthesizes it to ask the speaker. The speaker utters a reply to the question, and the dialogue sequence generates the response sentence to be asked next from the recognized words according to the predetermined expressions and phrases. That response sentence is then speech-synthesized and put to the speaker, and the speaker's next utterance is awaited. In this way, the computer always leads the conversation, and the conversation proceeds smoothly.

The invention according to claim 6 provides a speech recognition system for speech dialogue in which a speaker's voice is recognized, a response sentence is created based on the obtained speech data, and the response sentence is speech-synthesized to carry on a dialogue. The system includes: storage, such as a memory or a recording device, for recording one or more speech recognition dictionaries created for each scene by collecting predetermined words appearing in a topical scene as scene words; a speech input unit for inputting the speaker's voice; means for performing speech recognition, using the one or more speech recognition dictionaries, on the speech data obtained by sentence analysis and word decomposition of the input speaker's voice using a word spot; means for creating, if the recognition is performed within a predetermined time, a response sentence based on a dialogue sequence that generates a sentence from the recognition result according to predetermined expressions and phrases, and creating, if the recognition is not performed within the predetermined time, a reply sentence prompting the speaker to re-enter the input; and means for speech-synthesizing the created response sentence or reply sentence.

The invention according to claim 7 is the speech recognition system for speech dialogue according to claim 6, characterized in that the speech data obtained by sentence analysis and word decomposition of the speaker's voice using a word spot is compared with the scene words included in the one or more speech recognition dictionaries, and a predetermined speech recognition dictionary including at least one scene word corresponding to the speech data is selected and used for speech recognition.

The invention according to claim 8 is the speech recognition system for speech dialogue according to claim 6 or 7, characterized in that the scene that has become the topic is identified from the speech data obtained by sentence analysis and word decomposition of the speaker's voice using a word spot, by means of a speech database created for each scene with scene words associated in advance, and the speech recognition dictionary corresponding to that scene is selected from the one or more speech recognition dictionaries and used for speech recognition.

The invention according to claim 9 is the speech recognition system for speech dialogue according to any one of claims 6 to 8, characterized in that the speech recognition dictionary used for speech recognition is recorded in a cache memory and, when the speech data to be recognized is not included as a scene word in the speech recognition dictionary in use, or when a newly identified scene differs from the scene of the speech recognition dictionary in use, that dictionary is replaced with another speech recognition dictionary that includes the speech data as a scene word, or with the speech recognition dictionary corresponding to the newly identified scene.

The invention according to claim 10 is the speech recognition system for speech dialogue according to any one of claims 6 to 9, characterized in that the dialogue sequence puts a question to the speaker to prompt the first utterance, generates the response sentence to be asked next according to predetermined expressions and phrases based on the words obtained by recognizing the speaker's voice, and speech-synthesizes that response sentence and puts it to the speaker, so that speech recognition is performed while the conversation with the speaker is led by these questions.

The invention according to claim 11 provides a speech recognition program for speech dialogue that causes a computer to execute a speech recognition method for speech dialogue in which a speaker's voice is recognized, a response sentence is created based on the obtained speech data, and the response sentence is speech-synthesized to carry on a dialogue. The program causes the computer to: perform speech recognition on the speech data obtained by sentence analysis and word decomposition, using a word spot, of the speaker's voice input from a speech input unit, using one or more speech recognition dictionaries that are created for each scene by collecting predetermined words appearing in a topical scene as scene words and are recorded in storage such as a memory or a recording device; create, if the recognition is performed within a predetermined time, a response sentence based on a dialogue sequence that generates a sentence from the recognition result according to predetermined expressions and phrases; create, if the recognition is not performed within the predetermined time, a reply sentence prompting the speaker to re-enter the input; and speech-synthesize the response sentence or the reply sentence.

The invention according to claim 12 is the speech recognition program for speech dialogue according to claim 11, characterized in that the speech data obtained by sentence analysis and word decomposition of the speaker's voice using a word spot is compared with the scene words included in the one or more speech recognition dictionaries, and a predetermined speech recognition dictionary including at least one scene word corresponding to the speech data is selected and used for speech recognition.

The invention according to claim 13 is the speech recognition program for speech dialogue according to claim 11 or 12, characterized in that the scene that has become the topic is identified from the speech data obtained by sentence analysis and word decomposition of the speaker's voice using a word spot, by means of a speech database created for each scene with scene words associated in advance, and the speech recognition dictionary corresponding to that scene is selected from the one or more speech recognition dictionaries and used for speech recognition.

The invention according to claim 14 is the speech recognition program for speech dialogue according to any one of claims 11 to 13, characterized in that, when the speech data to be recognized is not included as a scene word in the speech recognition dictionary in use, which is recorded in a cache memory, or when a newly identified scene differs from the scene of the speech recognition dictionary in use, that dictionary is replaced with another speech recognition dictionary that includes the speech data as a scene word, or with the speech recognition dictionary corresponding to the newly identified scene.

The invention according to claim 15 is the speech recognition program for speech dialogue according to any one of claims 11 to 14, characterized in that the dialogue sequence puts a question to the speaker to prompt the first utterance, generates the response sentence to be asked next according to predetermined expressions and phrases based on the words obtained by recognizing the speaker's voice, and speech-synthesizes that response sentence and puts it to the speaker, so that speech recognition is performed while the conversation with the speaker is led by these questions.

Brief description of the drawings

FIG. 1 is a block diagram of an embodiment of a speech recognition system for speech dialogue according to the present invention.

FIG. 2 is a block diagram of a computer for realizing the speech recognition system of FIG. 1.

FIG. 3 is an explanatory diagram showing the configuration of a scene dictionary.

FIG. 4 is a block diagram of a speech recognition system according to a second embodiment of the present invention.

FIG. 5 is an explanatory diagram showing the configuration of a voice database.

FIG. 6 is an explanatory diagram showing the configuration of a relational database.

FIG. 7 is an overall flowchart of the voice dialogue.

FIG. 8 is a flowchart showing the flow of selecting a scene dictionary and generating a reply sentence.

Best mode for carrying out the invention

An embodiment of a speech recognition method, a speech recognition system, and a speech recognition program for speech dialogue according to the present invention will be described with reference to the drawings. First, the basic configuration of the speech recognition system according to the present embodiment, shown in FIG. 1, will be described. The speech recognition system shown in FIG. 1 generally includes a speech input unit 3, a language processing unit 4, a speech recognition unit 5, a speech synthesis unit 6, a document creation unit 7, and a speech recognition dictionary unit 8. Such a system is realized by using, for example, a general computer 10 as shown in FIG. 2, which includes storage 11 such as a memory and a hard disk for recording and storing programs and processing results, an input device 13 such as a keyboard and a pointing device for inputting information and instructions, a central processing unit (CPU) 15 for executing and processing a given program, and a monitor 17 for displaying input information and processing results.

First, the speech input unit 3 converts the speaker's voice captured by the microphone 3a into a speech signal, an electrical signal that can be processed by the computer, and passes the result to the language processing unit 4.

The language processing unit 4 uses a conventionally known speech recognition engine to subject the speaker's speech signal sent from the speech input unit 3 to sentence analysis and word decomposition using a word spot, and thereby obtains the speech data that needs to be recognized.

The speech recognition unit 5 compares the speech data obtained by processing the speaker's voice in the language processing unit 4 with the speech recognition dictionary recorded and stored in advance in the speech recognition dictionary unit 8, and recognizes which word the data represents.

The speech recognition dictionary unit 8 holds one or more speech recognition dictionaries for recognizing the target words from the speech data obtained by sentence analysis and word decomposition of the speaker's voice in the language processing unit 4. Each speech recognition dictionary is formed by collecting predetermined words appearing in a topical scene as scene words, and one is created for each of various scenes (hereinafter, each speech recognition dictionary compiled for a scene is called a "scene dictionary 8a"). Scenes include, for example, "business," "politics and economy," "computer," "education," "local information," "movies and music," "natural science," "life and culture," and "sports." However, the present invention is not limited to these, and various categories can be adopted. Each scene can also be subdivided further; for example, "business" may be divided hierarchically into "insurance/finance," "food," "telecommunications," and so on. The words that appear in each scene are collected as the scene words for that scene. For example, in the scene dictionary 8a for "computer," in addition to the scene words shown in FIG. 3, words that appear when computers become the topic are recorded, such as "server," "client," "CPU," "program," "software," "video card," "task," "mouse," "display," "PC," "start-up," and "system." The one or more scene dictionaries 8a created in this way are stored in storage 11, for example a recording device such as the hard disk of the computer. If a large number of words to be recognized are recorded comprehensively in the speech recognition dictionary, there are many candidates to match against the speech data of the input speaker's voice, and recognition may take some time. Therefore, a speech recognition dictionary is created for each topical scene to shorten the recognition time.

The scene dictionary 8a used for speech recognition is chosen by comparing the speech data obtained by processing the speaker's voice in the language processing unit 4 with the scene words included in each of the stored scene dictionaries 8a, and selecting a specific scene dictionary 8a that contains at least one scene word corresponding to the speech data. For example, if the speaker says, "This PC has a high processing speed because of the high performance of its CPU," the "computer" scene dictionary 8a, which contains the words "PC," "CPU," "performance," and "processing speed" corresponding to the speech data obtained by the language processing unit 4, is selected. The selected scene dictionary 8a is then stored in the cache memory and used for subsequent speech recognition.

Note that the selection of the voice recognition dictionary is configured such that, when voice data is included in two or more scene dictionaries 8a, the scene dictionary 8a including scene words corresponding to more voice data is selected. Alternatively, the scene dictionary 8a recorded in the cache memory may be preferentially used.
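A minimal sketch of this selection rule follows. It assumes each scene dictionary 8a can be modelled as a set of scene words and that the cached dictionary is used only to break ties; the function name `select_scene_dictionary` and the sample vocabularies are hypothetical.

```python
def select_scene_dictionary(words, dictionaries, cached=None):
    """Pick the scene whose dictionary contains the most recognized words.

    Returns None when no dictionary contains any of the words. On a tie,
    the scene already held in the cache (if it is among the best) is preferred.
    """
    hits = {scene: len(set(words) & vocab) for scene, vocab in dictionaries.items()}
    best = max(hits.values(), default=0)
    if best == 0:
        return None
    candidates = [scene for scene, n in hits.items() if n == best]
    return cached if cached in candidates else candidates[0]

# Toy scene dictionaries (assumed contents, not the patent's actual word lists).
dictionaries = {
    "computer": {"PC", "CPU", "performance", "processing speed", "server"},
    "business": {"performance", "finance", "insurance"},
}
words = ["PC", "CPU", "performance", "processing speed"]
scene = select_scene_dictionary(words, dictionaries)
```

With `words = ["performance"]` both dictionaries tie at one hit, and passing `cached="business"` keeps the dictionary that is already in the cache.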

As shown in FIG. 4, in the second embodiment regarding the selection of the scene dictionary 8a, the topical scene may instead be identified from the speech data obtained by processing the speaker's voice in the language processing unit 4, using a speech database 9 in which scene words are associated in advance with each scene, and the scene dictionary 8a corresponding to the identified scene is selected from the one or more scene dictionaries 8a. Specifically, the scene words appearing in each scene are associated with that scene as shown in FIG. 5 and compiled into the speech database 9, and the scene that has become the topic is identified, using the speech database 9, from the speech data obtained by processing the speaker's voice in the language processing unit 4. The scene dictionary 8a to be used for speech recognition is then selected according to the identified scene, stored in a cache memory, and used for speech recognition. The speech database 9 may be created so as to correspond to the same scenes as the scene dictionaries 8a, but as long as it can specify the scene dictionary 8a to be used, it need not be classified into the same scenes as the scene dictionaries 8a. A collection of a plurality of scene dictionaries 8a may also be used as the speech database 9. The scene dictionary 8a used for speech recognition is preferably recorded in a cache memory so that responses can be made quickly.
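The database lookup of this second embodiment could be sketched as an inverted index from scene words to scenes. This is only one possible reading: the layout of database 9, the majority-vote rule, and the names `speech_database` and `identify_scene` are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical database 9: scene word -> scenes it is associated with.
speech_database = {
    "CPU": ["computer"],
    "server": ["computer"],
    "goal": ["sports", "business"],
    "match": ["sports"],
}

def identify_scene(words, database):
    """Vote for the scene most often associated with the recognized words."""
    votes = Counter(scene for w in words for scene in database.get(w, []))
    if not votes:
        return None  # none of the words is a known scene word
    return votes.most_common(1)[0][0]

scene = identify_scene(["goal", "match", "yesterday"], speech_database)
```

Here "goal" is ambiguous between two scenes, and the additional word "match" tips the vote, which is why the database need not share the dictionaries' own classification as long as it resolves to one scene.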

If the conversation progresses and the topic changes, the speech data to be recognized may no longer be included as a scene word in the scene dictionary 8a in use. In this case, the speech recognition unit 5 accesses the other scene dictionaries 8a in the speech recognition dictionary unit 8 and searches for a scene dictionary 8a that includes a scene word corresponding to the speech data. If such a scene dictionary 8a exists, it is selected, and the scene dictionary 8a in use that is recorded in the cache memory is replaced with the newly selected scene dictionary 8a. Similarly, in the second embodiment described above, when the topic changes and the scene identified from the audio database 9 differs from the scene of the scene dictionary 8a currently in use, the speech recognition unit 5 identifies the new scene from the audio database 9, selects the scene dictionary 8a corresponding to that scene, and replaces the previously used scene dictionary 8a in the cache memory with the newly selected scene dictionary 8a.
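The replacement of the cached dictionary on a topic change might look like the following sketch. The class `DictionaryCache` and its single-slot "cache" are hypothetical simplifications; an actual system would hold acoustic and language model data rather than plain word sets.

```python
class DictionaryCache:
    """Holds one scene dictionary at a time and swaps it on a miss."""

    def __init__(self, all_dictionaries, initial_scene):
        self.all_dictionaries = all_dictionaries
        self.scene = initial_scene
        self.words = all_dictionaries[initial_scene]

    def lookup(self, word):
        """Return the scene whose dictionary contains the word, or None."""
        if word in self.words:
            return self.scene  # hit in the dictionary currently cached
        # Miss: search the other stored dictionaries.
        for scene, words in self.all_dictionaries.items():
            if scene != self.scene and word in words:
                # Replace the cached dictionary with the newly selected one.
                self.scene, self.words = scene, words
                return scene
        return None  # no dictionary contains the word; the cache is unchanged


# Illustrative data only.
dictionaries = {
    "computer": {"PC", "CPU", "processing speed"},
    "travel": {"Hawaii", "Kyoto", "sightseeing"},
}
cache = DictionaryCache(dictionaries, "computer")
first = cache.lookup("CPU")
second = cache.lookup("Hawaii")
```

After the second lookup the cache holds the "travel" dictionary, so later travel-related words are found without searching the other dictionaries again.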

The document creation unit 7 extracts the topic from the result recognized by the speech recognition unit 5 and creates a response sentence to the speaker's utterance on the basis of a dialogue sequence, which is a program that generates sentences according to predetermined expressions and phrasing.

The dialogue sequence first asks the speaker a question that prompts an initial utterance. It then generates a response sentence, following predetermined expressions and phrasing, based on the words obtained by recognizing the speaker's voice; the response sentence is voice-synthesized and put to the speaker as the next question. In this way, speech recognition is performed while the system takes the initiative in the conversation with the speaker.

Specifically, a relational database as shown in FIG. 6 is prepared. Each row of the relational database shown in FIG. 6 is assigned to a record, and each column is assigned to a scheme indicating a data attribute of that record. For example, in a database related to "travel", items such as "purpose", "place", "number of people", "time", "departure date", and "number of days" are recorded as records. The schemes S1 to Sn, which are the attributes of each record, hold the possible values: for example, "sightseeing", "business", "training", "diving", "skiing", and the like are recorded for "purpose", and "Hokkaido", "Tokyo", "Kyoto", "Okinawa", "Hawaii", "UK", "China", and the like are recorded for "place". The words obtained by recognizing the speaker's voice are then applied to the relational database created in this way. If the speaker says, "I want to go to Hawaii for summer vacation," the "place" is identified as "Hawaii" and the "time" as "summer vacation", but the other records, such as "purpose", "number of people", and "departure date", remain unknown. A response sentence suited to the dialogue scene is therefore generated from an unknown record and what the speaker has already said. For example, the 〇〇 in the example sentence prepared for the case where the purpose is unknown is filled with the previously recognized "place" word, generating the sentence "What are you going to Hawaii for?" Similarly, when the number of people is unknown, the example sentence "How many people are going to 〇〇?" is used to generate the sentence "How many people are going to Hawaii?"
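The record-filling and question-generation logic described above can be illustrated with a small sketch. It assumes the relational database of FIG. 6 is reduced to a dictionary of attribute values and that each unknown record has one example sentence; the names `TRAVEL_VALUES`, `QUESTION_TEMPLATES`, `fill_records`, and `next_question`, as well as the exact template wording, are hypothetical.

```python
# Possible values per record (a small excerpt of the examples in the text).
TRAVEL_VALUES = {
    "place": {"Hokkaido", "Tokyo", "Kyoto", "Hawaii", "UK", "China"},
    "time": {"summer vacation"},
    "purpose": {"sightseeing", "business", "training", "diving", "skiing"},
}

# Assumed example sentences; {place} plays the role of the blank in the text.
QUESTION_TEMPLATES = {
    "place": "Where would you like to go?",
    "purpose": "What are you going to {place} for?",
    "number of people": "How many people are going to {place}?",
}


def fill_records(recognized_words, values_by_record):
    """Assign each recognized word to the record whose values contain it."""
    records = {}
    for record, values in values_by_record.items():
        for word in recognized_words:
            if word in values:
                records[record] = word
                break
    return records


def next_question(records, templates):
    """Return the example sentence for the first record still unknown."""
    for record, template in templates.items():
        if record not in records:
            return template.format(place=records.get("place", ""))
    return None


records = fill_records(["Hawaii", "summer vacation"], TRAVEL_VALUES)
question = next_question(records, QUESTION_TEMPLATES)
```

Because the "place" template comes first, the later templates are only reached once a place is known, so the blank is never filled with an empty value.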

When the computer is first started, the dialogue sequence also generates a sentence that prompts the speaker to speak, such as "What can I do for you?", "Hello, what did you do yesterday?", or "Have you read today's newspaper?"

On the other hand, if the speech contains a word corresponding to the scene dictionary 8a but its recognition takes longer than a predetermined time (including the case where no corresponding word is found in the other scene dictionaries 8a either and no recognition result can be output), the dialogue sequence creates a return sentence, such as "What?", that prompts the speaker to re-enter the utterance. By creating a return sentence and sending it back to the speaker, the conversation can proceed smoothly without awkward pauses, and the speaker can be encouraged to rephrase in different, more easily recognized words rather than repeating the words used the first time. The predetermined time allowed for recognition, which is the condition for generating a return sentence, is preferably about 1 to 3 seconds.
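The time limit on recognition could be enforced, for instance, by running the recognizer in a worker thread and waiting for its result with a timeout. The following is a sketch under that assumption, using Python's standard `concurrent.futures` module; `respond`, `slow_recognizer`, `fast_recognizer`, and the wording of `RETURN_SENTENCE` are hypothetical, and the source does not say how the time limit is implemented.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as RecognitionTimeout

RETURN_SENTENCE = "What? Please say that again."  # assumed wording


def respond(recognize, speech_data, make_response, timeout_s=2.0):
    """Return a response sentence, or the return sentence on timeout or failure."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(recognize, speech_data)
    try:
        word = future.result(timeout=timeout_s)
    except RecognitionTimeout:
        # Recognition exceeded the predetermined time: ask the speaker again.
        return RETURN_SENTENCE
    finally:
        # Do not block here; a still-running recognizer finishes in the background.
        executor.shutdown(wait=False)
    if word is None:
        # No recognition result could be output.
        return RETURN_SENTENCE
    return make_response(word)


def slow_recognizer(speech_data):
    """Stand-in for a recognizer that takes too long."""
    time.sleep(0.3)
    return "Hawaii"


def fast_recognizer(speech_data):
    """Stand-in for a recognizer that answers immediately."""
    return "Hawaii"
```

The default of 2.0 seconds is chosen from the 1 to 3 second range given above. Note that the worker thread is not cancelled on timeout; a production system would need a recognizer that can be interrupted.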

In addition, the dialogue sequence can be configured to create a response sentence using conversation patterns, which record what the speaker previously answered to a response sentence created by recognizing the speaker's voice. In other words, the created response sentence is not synthesized and output as it is; it is first taken back into the system, and the response sentence is created again. Specifically, the speaker's answer to a response sentence created from the words obtained by recognizing the speaker's voice is recorded in storage such as a hard disk as a conversation pattern. A response sentence generated from the words obtained by speech recognition is then compared with the previously recorded conversation patterns, and if the same response sentence is found, the next response sentence is generated with reference to the speaker's earlier answer to it. Configuring the dialogue sequence in this way gives the impression of conversing with a human, rather than producing response sentences in a fixed pattern for each word.
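A very small sketch of the conversation-pattern idea follows. The source does not specify how the earlier answer is worked into the next response sentence, so the rewording used here is purely an assumption, as are the names `remember_answer` and `refine_response` and the in-memory dictionary standing in for hard-disk storage.

```python
def remember_answer(patterns, response_sentence, speaker_answer):
    """Record what the speaker answered to a given response sentence."""
    patterns[response_sentence] = speaker_answer


def refine_response(patterns, response_sentence):
    """Re-create the response sentence using an earlier answer, if one exists."""
    earlier_answer = patterns.get(response_sentence)
    if earlier_answer is None:
        return response_sentence  # no matching pattern: use the sentence as is
    # Assumed phrasing: refer back to the earlier answer instead of repeating
    # the fixed question.
    return 'Last time you answered "' + earlier_answer + '". Is it the same this time?'


patterns = {}
question = "What are you going to Hawaii for?"
before = refine_response(patterns, question)
remember_answer(patterns, question, "diving")
after = refine_response(patterns, question)
```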

The speech synthesizer 6 synthesizes speech from the greeting sentence that opens the conversation, the response sentence to the speaker's utterance, and the return sentence created by the document creation unit 7, and outputs it from the speaker 6a.

Next, the speech recognition method according to the present invention will be described together with the operation of the speech recognition system described above.

First, one or more scene dictionaries 8a, each created for a scene by collecting predetermined words that appear in that topical scene as scene words, are recorded in storage such as a computer's memory or a recording device (step S1).

When the system is started, a sentence that prompts the speaker to input speech, for example "What can I do for you?", "Hello, what did you do yesterday?", or "Have you read today's newspaper?", is first created on the basis of the dialogue sequence, which is a program recorded in the document creation unit 7 (step S2). The sentence is then synthesized by the speech synthesizer 6 and put to the speaker as a question via the speaker 6a (step S3). When the speaker speaks in response to the question, the speech is picked up by the microphone 3a and converted by the voice input unit 3 into an electrical speech signal (step S4). The result is then passed to the language processing unit 4 (step S5).

The language processing unit 4 obtains the speech data to be recognized by analyzing the speech signal through word-spot sentence analysis and word decomposition using a conventionally known speech recognition engine (step S6). The obtained speech data is then compared with the scene dictionary 8a recorded in advance in the speech recognition dictionary unit 8 (step S1) to recognize which word the speech data represents (step S7). When the word is recognized, the response sentence to be asked next is generated in accordance with the expressions and phrasing predetermined by the dialogue sequence (step S8). The response sentence is then synthesized by the speech synthesizer 6 and output from the speaker 6a as a question to the speaker, and the system waits for the speaker's next utterance (step S9). In this way, the system always takes the lead in the conversation with the speaker, and the dialogue proceeds smoothly.
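Steps S2 to S9 can be summarized as a simple loop. The sketch below replaces speech input and synthesis with plain strings, so it only illustrates the control flow; `run_dialogue`, the toy `recognize` function, and the sample sentences are hypothetical stand-ins for the units described in the text.

```python
def run_dialogue(utterances, recognize, make_response,
                 opening="What can I do for you?",
                 return_sentence="What? Please say that again."):
    """Return the sentences the system would speak, in order."""
    spoken = [opening]                      # steps S2-S3: opening question
    for utterance in utterances:            # steps S4-S5: speech input
        word = recognize(utterance)         # steps S6-S7: analysis and recognition
        if word is None:
            sentence = return_sentence      # nothing recognized: ask again
        else:
            sentence = make_response(word)  # step S8: next question
        spoken.append(sentence)             # step S9: synthesize, wait for reply
    return spoken


scene_words = {"Hawaii", "Kyoto"}  # illustrative scene dictionary


def recognize(utterance):
    """Toy word spotting: return the first token found in the scene dictionary."""
    for token in utterance.replace(".", "").split():
        if token in scene_words:
            return token
    return None


transcript = run_dialogue(
    ["I want to go to Hawaii.", "Mumble mumble."],
    recognize,
    lambda place: "What are you going to " + place + " for?",
)
```

Passing the recognizer and the response generator in as functions keeps the loop independent of any particular speech recognition engine or dialogue sequence.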

If the conversation progresses, the topic changes, and the obtained speech data includes speech data that is not included as a scene word in the current scene dictionary 8a, then, as shown in FIG. 8, the other scene dictionaries 8a in the speech recognition dictionary unit 8 are accessed to search for a scene dictionary 8a that includes a scene word corresponding to the speech data (step S10). If such a scene dictionary 8a exists, it is selected, and the scene dictionary 8a in use that is recorded in the cache memory is replaced with the newly selected scene dictionary 8a (step S11). Similarly, if the topic changes and the scene identified from the audio database 9 differs from the scene of the scene dictionary 8a currently in use, a new scene is identified from the audio database 9, the scene dictionary 8a corresponding to that scene is selected, and the previously used scene dictionary 8a in the cache memory is replaced with the newly selected scene dictionary 8a. On the other hand, if the speech contains a word corresponding to the scene dictionary 8a but its recognition takes longer than the predetermined time (including the case where no corresponding word is found in the other scene dictionaries 8a either and no recognition result can be output), the dialogue sequence creates a return sentence, such as "What? Please say that again," that prompts the speaker to re-enter the utterance (step S12). The return sentence is then synthesized by the speech synthesizer 6 and output from the speaker 6a as a question to the speaker, and the system waits for the speaker's next utterance (step S9). This process is repeated thereafter.

The present invention can be realized as a program that causes a computer to execute the speech recognition method for speech dialogue described above. The program can be executed by recording it on a recording medium such as a floppy disk, CD-ROM, DVD, or MO and reading it into a computer.

The program can also be loaded into a computer and executed, without using such a recording medium, through communication such as the Internet. Any arrangement in which the above speech recognition method is carried out on a computer, by whatever means, falls within the concept of the present invention.

Industrial applicability

According to the speech recognition method, speech recognition system, and speech recognition program of the present invention, speech recognition is performed using a speech recognition dictionary corresponding to the scene that is the topic, and the speech recognition dictionary in use is selected and switched efficiently. As a result, the recognition time can be shortened, the recognition rate can be improved, and the required memory capacity can be reduced.

Further, according to the speech recognition method, speech recognition system, and speech recognition program of the present invention, the speaker is prompted to speak again when no candidate is found within a certain period of time or when a word cannot be recognized. With this configuration, speech recognition can be performed while the conversation continues naturally, without unnatural pauses.

Claims

The scope of the claims

1. A speech recognition method for speech dialogue in which a speaker's speech is recognized, a response sentence is created based on the obtained speech data, and the response is uttered by speech synthesis, the method comprising:

a step of recording, in storage such as a memory or a recording device, one or more speech recognition dictionaries each created for a scene by collecting predetermined words appearing in that topic scene as scene words;

a step of inputting the speaker's speech from a voice input unit;

a step of performing speech recognition, using the one or more speech recognition dictionaries, on speech data obtained by sentence analysis by word spotting and word decomposition of the input speech;

a step of creating, if the recognition is completed within a predetermined time, a response sentence from the recognition result based on a dialogue sequence that generates sentences according to predetermined expressions and phrasing, and creating, if the recognition is not completed within the predetermined time, a return sentence that prompts the speaker to re-enter the speech; and

a step of speech-synthesizing the created response sentence or return sentence.

2. The speech recognition method for speech dialogue according to claim 1, wherein

the speech recognition dictionary used for speech recognition is selected by comparing the speech data, obtained by sentence analysis by word spotting and word decomposition of the speaker's speech, with the scene words included in the one or more speech recognition dictionaries, and choosing a predetermined speech recognition dictionary that contains at least one scene word corresponding to the speech data.

3. The speech recognition method for speech dialogue according to claim 1 or 2, wherein the speech recognition dictionary used for speech recognition is selected by identifying the scene that is the topic, from the speech data obtained by sentence analysis by word spotting and word decomposition of the speaker's speech, using a speech database created for each scene by associating the scene words with one another in advance, and choosing the speech recognition dictionary corresponding to that scene from the one or more speech recognition dictionaries.

4. The speech recognition method for speech dialogue according to any one of claims 1 to 3, wherein

the speech recognition dictionary used for speech recognition is recorded in a cache memory for use, and if the speech data to be recognized is not included as a scene word in the speech recognition dictionary in use, or if a newly identified scene differs from the scene of the speech recognition dictionary in use, the speech recognition dictionary in use is replaced with another speech recognition dictionary that includes the speech data as a scene word or with the speech recognition dictionary corresponding to the newly identified scene.

5. The speech recognition method for speech dialogue according to any one of claims 1 to 4, wherein

the dialogue sequence asks the speaker a question that prompts an initial utterance, generates a response sentence to be asked next in accordance with predetermined expressions and phrasing based on the words obtained by recognizing the speaker's speech, and puts the speech-synthesized response sentence to the speaker as a question, so that speech recognition is performed while the conversation with the speaker is led proactively.

6. A speech recognition system for speech dialogue in which a speaker's speech is recognized, a response sentence is created based on the obtained speech data, and the response is uttered by speech synthesis, the system comprising:

storage such as a memory or a recording device that records one or more speech recognition dictionaries each created for a scene by collecting predetermined words appearing in that topic scene as scene words;

a voice input unit for inputting the speaker's speech;

means for performing speech recognition, using the one or more speech recognition dictionaries, on speech data obtained by sentence analysis by word spotting and word decomposition of the input speech;

means for creating, if the recognition is completed within a predetermined time, a response sentence from the recognition result based on a dialogue sequence that generates sentences according to predetermined expressions and phrasing, and creating, if the recognition is not completed within the predetermined time, a return sentence that prompts the speaker to re-enter the speech; and

means for speech-synthesizing the created response sentence or return sentence.

7. The speech recognition system for speech dialogue according to claim 6, wherein the speech recognition dictionary used for speech recognition is selected by comparing the speech data, obtained by sentence analysis by word spotting and word decomposition of the speaker's speech, with the scene words included in the one or more speech recognition dictionaries, and choosing a predetermined speech recognition dictionary that contains at least one scene word corresponding to the speech data.

8. The speech recognition system for speech dialogue according to claim 6 or 7, wherein the speech recognition dictionary used for speech recognition is selected by identifying the scene that is the topic, from the speech data obtained by sentence analysis by word spotting and word decomposition of the speaker's speech, using a speech database created for each scene by associating the scene words with one another in advance, and choosing the speech recognition dictionary corresponding to that scene from the one or more speech recognition dictionaries.

9. The speech recognition system for speech dialogue according to any one of claims 6 to 8, wherein

the speech recognition dictionary used for speech recognition is recorded in a cache memory for use, and if the speech data to be recognized is not included as a scene word in the speech recognition dictionary in use, or if a newly identified scene differs from the scene of the speech recognition dictionary in use, the speech recognition dictionary in use is replaced with another speech recognition dictionary that includes the speech data as a scene word or with the speech recognition dictionary corresponding to the newly identified scene.

10. The speech recognition system for speech dialogue according to any one of claims 6 to 9, wherein

the dialogue sequence asks the speaker a question that prompts an initial utterance, generates a response sentence to be asked next in accordance with predetermined expressions and phrasing based on the words obtained by recognizing the speaker's speech, and puts the speech-synthesized response sentence to the speaker as a question, so that speech recognition is performed while the conversation with the speaker is led proactively.

11. A speech recognition program for speech dialogue that causes a computer to execute a speech recognition method for speech dialogue in which a speaker's speech is recognized, a response sentence is created based on the obtained speech data, and the response is uttered by speech synthesis, wherein

the program is executed so as to: perform sentence analysis by word spotting and word decomposition on the speaker's speech input from a voice input unit; perform speech recognition on the resulting speech data using one or more speech recognition dictionaries that are each created for a scene by collecting predetermined words appearing in that topic scene as scene words and that are recorded in storage such as a memory or a recording device; create, if the recognition is completed within a predetermined time, a response sentence from the recognition result based on a dialogue sequence that generates sentences according to predetermined expressions and phrasing; create, if the recognition is not completed within the predetermined time, a return sentence that prompts the speaker to re-enter the speech; and speech-synthesize the response sentence or return sentence.

12. The speech recognition program for speech dialogue according to claim 11, wherein the program is executed so that the speech recognition dictionary used for speech recognition is selected by comparing the speech data, obtained by sentence analysis by word spotting and word decomposition of the speaker's speech, with the scene words included in the one or more speech recognition dictionaries, and choosing a predetermined speech recognition dictionary that contains at least one scene word corresponding to the speech data.

13. The speech recognition program for speech dialogue according to claim 11 or 12, wherein

the program is executed so that the speech recognition dictionary used for speech recognition is selected by identifying the scene that is the topic, from the speech data obtained by sentence analysis by word spotting and word decomposition of the speaker's speech, using a speech database created for each scene by associating the scene words with one another in advance, and choosing the speech recognition dictionary corresponding to that scene from the one or more speech recognition dictionaries.

14. The speech recognition program for speech dialogue according to any one of claims 11 to 13, wherein

the program is executed so that, if the speech data to be recognized is not included as a scene word in the speech recognition dictionary in use that is recorded in the cache memory, or if a newly identified scene differs from the scene of the speech recognition dictionary in use, the speech recognition dictionary in use is replaced with another speech recognition dictionary that includes the speech data as a scene word or with the speech recognition dictionary corresponding to the newly identified scene.

15. The speech recognition program for speech dialogue according to any one of claims 11 to 14, wherein

the dialogue sequence asks the speaker a question that prompts an initial utterance, generates a response sentence to be asked next in accordance with predetermined expressions and phrasing based on the words obtained by recognizing the speaker's speech, and puts the speech-synthesized response sentence to the speaker as a question, so that speech recognition is performed while the conversation with the speaker is led proactively.