JP2022107933A

JP2022107933A - dialogue system

Info

Publication number: JP2022107933A
Application number: JP2021002644A
Authority: JP
Inventors: 智久末重; Tomohisa Sueshige
Original assignee: Individual
Current assignee: Individual
Priority date: 2021-01-12
Filing date: 2021-01-12
Publication date: 2022-07-25
Anticipated expiration: 2041-01-12
Also published as: JP7339615B2

Abstract

PROBLEM TO BE SOLVED: To provide a dialogue system that makes a user feel as if the user is actually conversing with the person the user wishes to talk to. SOLUTION: A registrant inputs voice and moving images as response patterns by using an input device 1. A voice and moving image processing unit 2 records the input voice as text data in a text data recording device 3 and assigns an identifier to it, and records the voice and moving images in a voice and moving image recording device 4 and assigns the same identifier to them. A response instance selection unit 8 selects, via a voice recognition unit 6 and a language understanding unit 7, the text data in the text data recording device 3 that fits or is closest to the response pattern for the voice input by the user, and passes the assigned identifier to a voice and moving image designation unit 9. The voice and moving image designation unit 9 extracts the voice and moving images matching the identifier from the voice and moving image recording device 4 and reproduces them. SELECTED DRAWING: Figure 1

Description

The present invention relates to a dialogue system that plays back pre-recorded audio and video in response to a user's utterances.

In recent years, many dialogue systems that return responses to user utterances have been developed. Examples include systems that respond to the user's utterance by voice alone, systems that display a picture image and text as described in Patent Document 1, and systems that deform a still image of a deceased person or celebrity and output it together with audio as described in Patent Document 2.

Patent Document 1: JP-A-5-216618
Patent Document 2: Japanese Patent No. 6656447

Non-Patent Document 1: "Structure and Future of Spoken Dialogue Systems," Monthly Patent (月刊パテント), July 2019 issue

However, in conventional dialogue systems the dialogue partner is a mechanically synthesized voice, a picture image, or, even when the partner is a real person, a still image deformed so that it appears to be conversing. As a result, the user does not feel that they are conversing with the person they actually want to talk to.

An object of the present invention is to solve the above problem by providing a dialogue system that makes the user feel as though they are conversing with the person they actually want to talk to.

The main feature of the present invention is that audio and video corresponding to the user's utterance are extracted from pre-recorded audio and video and played back. Alternatively, the audio and video of a plurality of people may be recorded; after the user selects the person they want to talk to from among them, audio and video corresponding to the user's utterance are extracted and played back.

Because the dialogue system of the present invention plays back the audio and video of the person the user actually wants to talk to, the user can enjoy a dialogue with that person rather than with a mechanical partner.

Furthermore, when the present invention is used to handle inquiries to local governments or companies, a reduction in personnel costs can be expected. In addition, for people with worries, talking with well-known figures or the deceased, with whom they normally cannot speak, can be expected to foster a desire to live positively and to help prevent depression and suicide.

FIG. 1 is a block diagram showing the overall configuration (Example 1).
FIG. 2 is a block diagram showing the overall configuration (Example 2).

Embodiments of the present invention are described in the following examples.

FIG. 1 is a block diagram of the present invention. The input device 1 includes a camera 101 and a microphone 102, and the registrant uses the input device 1 to input audio and video as a response pattern. The audio/video processing unit 2 converts the input audio into text data and records it in the text data recording device 3, and records the audio and video in the audio/video recording device 4. The audio and video recorded in the audio/video recording device 4 are stored together as a single unit and given a unique identifier. The text data recorded in the text data recording device 3 is given the same identifier as the corresponding audio and video. It is advisable to record a plurality of response patterns covering a variety of dialogue situations. An identifier is assigned to each response pattern.
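The registration step above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the class `ResponseStore`, its field names, and the use of a simple counter for identifiers are all assumptions, and the speech-to-text conversion performed by the audio/video processing unit 2 is assumed to have happened already, so the transcript is passed in as a string.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, Iterator


@dataclass
class ResponseStore:
    """Hypothetical stand-in for recording devices 3 and 4."""

    # identifier -> transcript (plays the role of text data recording device 3)
    text_records: Dict[int, str] = field(default_factory=dict)
    # identifier -> combined audio/video payload (audio/video recording device 4)
    av_records: Dict[int, bytes] = field(default_factory=dict)
    # Assumed identifier scheme: a counter starting at 1.
    _next_id: Iterator[int] = field(
        default_factory=lambda: itertools.count(1), repr=False
    )

    def register(self, transcript: str, av_payload: bytes) -> int:
        """Store one response pattern under a single shared identifier."""
        identifier = next(self._next_id)
        self.text_records[identifier] = transcript
        self.av_records[identifier] = av_payload
        return identifier


store = ResponseStore()
greeting_id = store.register("Good morning.", b"<clip 1>")
toast_id = store.register("Cheers, everyone.", b"<clip 2>")
```

Because both dictionaries are keyed by the same identifier, a transcript chosen later can be mapped back to its audio and video with a single lookup.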

The input/output device 5 includes a microphone 501, and the user uses the input/output device 5 to input voice. The speech recognition unit 6 converts the voice input by the user into text data. The language understanding unit 7 identifies the user's intention from the text data obtained by the speech recognition unit 6. The result of the language understanding unit 7 is passed to the response example selection unit 8, which selects from the text data recording device 3 the text data that matches or is closest to the appropriate response example. The response example selection unit 8 passes the identifier assigned to the selected text data to the audio/video designation unit 9, and the audio/video designation unit 9 extracts the audio and video matching this identifier from the audio/video recording device 4 and sends them to the input/output device 5. The input/output device 5 includes a display unit 502 and plays back the audio and video it receives. Speech recognition and language understanding are known techniques, so a detailed description of them is omitted.
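The selection and lookup steps above might look like the sketch below. The patent does not say how "closest" is measured, so the character-level similarity from Python's standard `difflib.SequenceMatcher` is used here purely as a placeholder. The function names `select_identifier` and `respond` are hypothetical, and the output of the language understanding unit 7 is assumed to be a plain string.

```python
from difflib import SequenceMatcher
from typing import Dict


def select_identifier(intent_text: str, text_records: Dict[int, str]) -> int:
    """Return the identifier whose transcript is most similar to intent_text.

    Similarity is a placeholder metric (an assumption, not from the patent).
    Raises ValueError if text_records is empty.
    """

    def similarity(item):
        _, transcript = item
        return SequenceMatcher(
            None, intent_text.lower(), transcript.lower()
        ).ratio()

    identifier, _ = max(text_records.items(), key=similarity)
    return identifier


def respond(
    intent_text: str, text_records: Dict[int, str], av_records: Dict[int, bytes]
) -> bytes:
    """Role of units 8 and 9: pick a transcript, then fetch the matching clip."""
    return av_records[select_identifier(intent_text, text_records)]


texts = {1: "good morning", 2: "cheers, everyone"}
clips = {1: b"<clip 1>", 2: b"<clip 2>"}
clip = respond("Cheers, everyone", texts, clips)
```

A real system would replace the similarity function with whatever intent matching the language understanding stage provides; the identifier handoff between selection and lookup is the part that follows the text.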

Possible registrants include experts and counselors employed by local governments. A registrant registers audio and video in advance as response patterns to anticipated questions, so that the audio and video closest to an appropriate response to the user's question can be played back.

In Example 2, a registrant who registers response patterns first inputs unique information about the registrant, such as the registrant's name, using the keyboard 103 provided in the input device 1, before inputting audio and video. The input unique information is recorded in the unique information recording device 10. The registrant then uses the input device 1 to input audio and video as a response pattern. The input audio and video are passed to the audio/video processing unit 2 and recorded together as a single unit in the audio/video recording device 4. The audio is also converted into text data and recorded in the text data recording device 3. At this time, as in Example 1, the text data and the audio and video are given the same unique identifier, but the text data is additionally given the unique information recorded in the unique information recording device 10.
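One way to picture the registration flow with registrant information is the sketch below. The names `TaggedText` and `TaggedStore`, the use of a registrant name as the unique information, and the identifier scheme are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TaggedText:
    """A transcript carrying both the shared identifier and registrant info."""

    identifier: int
    registrant: str
    transcript: str


class TaggedStore:
    """Hypothetical stand-in for devices 10, 3 and 4."""

    def __init__(self) -> None:
        self.unique_info: List[str] = []            # unique information recording device 10
        self.text_records: List[TaggedText] = []    # text data recording device 3
        self.av_records: Dict[int, bytes] = {}      # audio/video recording device 4

    def register_registrant(self, name: str) -> None:
        """Record the registrant's unique information before any clips."""
        if name not in self.unique_info:
            self.unique_info.append(name)

    def register_response(self, name: str, transcript: str, av_payload: bytes) -> int:
        """Store one response pattern; the text record is tagged with the name."""
        if name not in self.unique_info:
            raise KeyError(f"unknown registrant: {name}")
        identifier = len(self.av_records) + 1  # assumed identifier scheme
        self.av_records[identifier] = av_payload
        self.text_records.append(TaggedText(identifier, name, transcript))
        return identifier


store = TaggedStore()
store.register_registrant("Alice")
clip_id = store.register_response("Alice", "Good morning.", b"<clip 1>")
```

Only the text record carries the registrant tag, mirroring the description: the audio and video are reached through the identifier alone.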

When starting a dialogue, the user designates the dialogue partner by inputting the partner's unique information using the keyboard 503 provided in the input/output device 5. The input unique information is temporarily stored in the designated information temporary storage device 11. Next, the user inputs voice using the input/output device 5 and starts the dialogue. The speech recognition unit 6 converts the voice input by the user into text data. The language understanding unit 7 identifies the user's intention from the text data obtained by the speech recognition unit 6. The result of the language understanding unit 7 is passed to the response example selection unit 8. The response example selection unit 8 selects from the text data recording device 3 the text data that matches the unique information temporarily stored in the designated information temporary storage device 11 and that matches or is closest to the result of the language understanding unit 7. The response example selection unit 8 passes the identifier assigned to the selected text data to the audio/video designation unit 9, and the audio/video designation unit 9 extracts the audio and video matching this identifier from the audio/video recording device 4 and sends them to the input/output device 5. The input/output device 5 plays back the audio and video it receives.
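The two-stage selection described above (filter by the designated partner, then pick the closest transcript) can be sketched as follows. As before, `difflib.SequenceMatcher` is only a placeholder for an unspecified matching method, and `TextRecord` and `select_for_partner` are hypothetical names.

```python
from difflib import SequenceMatcher
from typing import List, NamedTuple, Optional


class TextRecord(NamedTuple):
    identifier: int
    registrant: str   # unique information attached at registration
    transcript: str


def select_for_partner(
    intent_text: str, designated: str, records: List[TextRecord]
) -> Optional[int]:
    """Return the identifier of the closest transcript by the designated partner.

    `designated` plays the role of the value held in the designated
    information temporary storage device 11. Returns None if the partner
    has no registered response patterns.
    """
    candidates = [r for r in records if r.registrant == designated]
    if not candidates:
        return None
    best = max(
        candidates,
        key=lambda r: SequenceMatcher(
            None, intent_text.lower(), r.transcript.lower()
        ).ratio(),
    )
    return best.identifier


records = [
    TextRecord(1, "Alice", "cheers, everyone"),
    TextRecord(2, "Bob", "cheers, everyone"),
    TextRecord(3, "Bob", "good morning"),
]
chosen = select_for_partner("Cheers, everyone", "Bob", records)
```

Filtering before scoring guarantees that a better-matching transcript from a different registrant is never returned.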

The means for entering the unique information at registration (keyboard 103) and for designating it at dialogue time (keyboard 503) are not limited to keyboards; tablets may be used instead. The system may also be configured to accept this input by voice.

Instead of typing the unique information on the keyboard 503, the user may select from the unique information recorded in the unique information recording device 10, for example a registrant's name. If attributes such as gender and age are also registered as unique information, the system may be configured so that the user designates a group, such as men in their 20s or women in their 50s, and converses with that group.
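Designating a group by attributes could be sketched as a simple filter over the registered unique information. The record layout `RegistrantInfo`, the string values used for gender, and the decade-based age grouping are assumptions for illustration; the patent only mentions gender and age as examples.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RegistrantInfo:
    """Hypothetical unique-information record with extra attributes."""

    name: str
    gender: str
    age: int


def matching_registrants(
    infos: List[RegistrantInfo],
    gender: Optional[str] = None,
    decade: Optional[int] = None,
) -> List[str]:
    """Return names of registrants in the designated group.

    decade=20 means ages 20 to 29; None leaves that attribute unconstrained.
    """
    return [
        info.name
        for info in infos
        if (gender is None or info.gender == gender)
        and (decade is None or (info.age // 10) * 10 == decade)
    ]


infos = [
    RegistrantInfo("A", "male", 24),
    RegistrantInfo("B", "female", 53),
    RegistrantInfo("C", "male", 58),
]
men_in_20s = matching_registrants(infos, gender="male", decade=20)
```

The resulting names could then be used as the set of permitted registrants when selecting a response.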

In the examples above, the user interacts by voice through the input/output device 5, but the system may also be configured to accept non-voice input, such as a keyboard. This allows people who have difficulty with spoken dialogue to enjoy the dialogue as well.

If the registrant is an entertainer, the user may designate that entertainer and play back audio and video of the entertainer leading a toast at a banquet or similar event. The entertainer is designated first; then, in response to the user saying "Please lead the toast," the audio and video of the designated entertainer leading the toast are selected and played back. A brief exchange of greetings may precede the toast.

If a very large number of audio and video recordings are stored, the user can enjoy ordinary conversation not only with entertainers but also with well-known figures, ordinary people, and the deceased, which broadens the range of applications.

1 Input device
101 Camera
102 Microphone
103 Keyboard
2 Audio/video processing unit
3 Text data recording device
4 Audio/video recording device
5 Input/output device
501 Microphone
502 Display unit
503 Keyboard
6 Speech recognition unit
7 Language understanding unit
8 Response example selection unit
9 Audio/video designation unit
10 Unique information recording device
11 Designated information temporary storage device

Claims

A dialogue system comprising: an input device with which a registrant inputs voice and video as response patterns; an audio/video processing unit that converts the input voice into text data and records it in a text data recording device, and records the input voice and video in an audio/video recording device; an input/output device with which a user inputs voice; a speech recognition unit that converts the input voice into text data; a language understanding unit that identifies the user's intention from the text data converted by the speech recognition unit; a response example selection unit that selects, from the text data recording device, the text data that matches or is closest to the identified user's intention; and an audio/video designation unit, wherein the input/output device reproduces the audio and video sent from the audio/video designation unit, the audio/video processing unit assigns, for each response pattern, the same identifier to the text data recorded in the text data recording device and to the audio and video recorded in the audio/video recording device, and the audio/video designation unit extracts from the audio/video recording device the audio and video assigned the same identifier as the identifier assigned to the text data selected by the response example selection unit.

The dialogue system, comprising a unique information recording device that records unique information of a registrant and a designated information temporary storage device that records unique information designated by the user, wherein the audio/video processing unit adds the unique information recorded in the unique information recording device to the text data to be recorded in the text data recording device, and the response example selection unit, when selecting from the text data recording device the text data that matches or is closest to the user's intention, selects text data matching the unique information recorded in the designated information temporary storage device.