JP2016156993A

JP2016156993A - Information processing unit, control method of information processing unit, control program and storage medium

Info

Publication number: JP2016156993A
Application number: JP2015035326A
Authority: JP
Inventors: 史彦鈴木; Fumihiko Suzuki; 誠悟伊藤; Seigo Ito
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2015-02-25
Filing date: 2015-02-25
Publication date: 2016-09-01
Anticipated expiration: 2035-02-25
Also published as: JP6543047B2

Abstract

PROBLEM TO BE SOLVED: To detect a voice including a name of a registration object person to instruct the registration object person to perform a predetermined operation.

SOLUTION: An information processing unit (1) includes: a person's name extraction part (14) configured to extract a name of a registration object person from voice data of a voice including the name of the registration object person; a voice instruction generation part (15) configured to generate a voice instruction to the registration object person including the person's name extracted by the person's name extraction part (14); and a voice output part (32) configured to present the voice instruction to the registration object person.

SELECTED DRAWING: Figure 1

Description

The present invention relates to an information processing apparatus that presents an instruction to a target person so that the target person performs a specific action, and to a control method for the information processing apparatus, a program, and a recording medium.

Robot apparatuses that use a function for dialogue with humans to treat a dialogue partner as a recognition target and register that person's name and face are known in the related art.

For example, Patent Document 1 describes a robot apparatus that searches its data using a face photograph, a person name, and voice features. If the speaker's voice is new, the speaker becomes a recognition target, and the apparatus registers the speaker's name and face. The apparatus speaks to the recognition target, has the person state his or her name, and then instructs the person to perform a specific action, such as saying a favorite food.

JP 2003-255989 A (published September 10, 2003)

However, the robot apparatus described in Patent Document 1 determines whether a speaker is a new recognition target based on whether the speaker's voice is new. Consequently, when the speaker's voice is not new, a person whose name is included in that voice does not become a recognition target, even if that person is someone other than the speaker.

For example, suppose that a speaker who is not new (that is, who has already been recognized) introduces a new recognition target or speaks to that recognition target. With the technique described in Patent Document 1, because the speaker's voice is not new, the apparatus may fail to determine correctly whether the person introduced or addressed by the speaker is a new recognition target or an already recognized person. If this determination is wrong, the name of the new recognition target cannot be recognized.

The present invention has been made in view of the above points. Its object is to realize an information processing apparatus, a control method for the information processing apparatus, and the like that, upon detecting a voice including the person name of a target person, can instruct the target person to perform a predetermined action.

In order to solve the above problem, an information processing apparatus according to one aspect of the present invention is an information processing apparatus including a voice input unit that accepts voice input, and includes: an identification information extraction unit that extracts identification information of a target person from voice data of a voice that is input from the voice input unit and includes the identification information of the target person; an instruction generation unit that generates an action instruction, which instructs the target person to perform a predetermined action, so as to include the identification information extracted by the identification information extraction unit; and an instruction presentation unit that presents the action instruction to the target person.

Also in order to solve the above problem, a control method for an information processing apparatus according to one aspect of the present invention is a control method for an information processing apparatus including a voice input unit that accepts voice input, and includes: a voice input step of accepting, from the voice input unit, input of a voice including identification information of a target person; an identification information extraction step of extracting the identification information of the target person from voice data of the voice accepted in the voice input step; an instruction generation step of generating an action instruction, which instructs the target person to perform a predetermined action, so as to include the identification information extracted in the identification information extraction step; and an instruction presentation step of presenting the action instruction generated in the instruction generation step to the target person.
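The flow of the steps in this control method can be pictured with a minimal Python sketch. It assumes that the voice input step has already produced text; the class name `InstructionPipeline` and its three callables are hypothetical names introduced only for illustration and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class InstructionPipeline:
    """Hypothetical wiring of the extraction, generation, and presentation steps."""
    extract_identifier: Callable[[str], Optional[str]]  # identification information extraction step
    generate_instruction: Callable[[str], str]          # instruction generation step
    present: Callable[[str], None]                      # instruction presentation step

    def run(self, utterance_text: str) -> Optional[str]:
        # Assumption: the voice input step has already converted the voice to text.
        identifier = self.extract_identifier(utterance_text)
        if identifier is None:
            return None  # no target person was named, so nothing is presented
        instruction = self.generate_instruction(identifier)
        self.present(instruction)
        return instruction


# Illustrative use with stand-in callables.
presented = []
pipeline = InstructionPipeline(
    extract_identifier=lambda text: "Suzuki" if "Suzuki" in text else None,
    generate_instruction=lambda name: f"Mr. {name}, please bring your face closer.",
    present=presented.append,
)
pipeline.run("This is Mr. Suzuki.")
```

Passing the steps in as callables keeps the sketch independent of how extraction and presentation are actually implemented in a given apparatus.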

According to one aspect of the present invention, when a voice including the person name of a target person is detected, the target person can be instructed to perform a specific action.

FIG. 1 is a block diagram showing an example of the schematic configuration of the information processing apparatus according to the present invention.
FIG. 2 is a block diagram showing an example of the hardware configuration of the information processing apparatus.
FIG. 3 is a diagram illustrating the process in which the information processing apparatus registers the person name and feature data of a registration target person in association with each other.
FIG. 4 is a diagram showing an example of the process in which the information processing apparatus extracts the text of a person name from text data.
FIG. 5 is a diagram showing an example of a template for the voice output by the information processing apparatus.
FIG. 6 is a flowchart illustrating an example of the flow of the process in which the information processing apparatus registers the person name and feature data of a registration target person in association with each other.

[Embodiment 1]
Hereinafter, embodiments of the present invention will be described in detail.

(Process in which the information processing apparatus 1 registers a registration target person)
First, with reference to FIG. 3, a description is given of a process in one embodiment of the present invention in which the information processing apparatus 1 registers the person name (identification information) and feature data (unique information) of a registration target person (target person) in association with each other. FIG. 3 is a diagram illustrating this process.

The information processing apparatus 1 accepts input of a voice including the person name (surname, given name, or the like) of a registration target person, extracts the person name from the voice, and issues a voice instruction that includes the person name, thereby instructing the registration target person to perform a predetermined action.

The person name of the registration target person may be extracted when the apparatus accepts a voice uttered by the registration target person, or when it accepts a voice uttered by a person other than the registration target person (such as the user U). Examples of a voice uttered by the registration target person include a self-introduction and a greeting in which the person states his or her name to another person. Examples of a voice uttered by a person other than the registration target person include a voice in which the user U introduces the registration target person to another person, a voice in which the user U introduces the registration target person to the information processing apparatus 1, and a voice in which the user U asks a question of, or speaks to, the registration target person. When a voice uttered by the registration target person or by another person includes the person name of the registration target person, the voice has a sentence pattern of a predetermined form. The person name extracted by the information processing apparatus 1 is not limited to a surname or given name; it may be a nickname, an alias, a stage name, or any other name that the registration target person recognizes as referring to himself or herself.

The following description uses, as an example, an information processing apparatus 1 that captures an image of the face of a registration target person who has performed a predetermined action in accordance with an instruction, extracts feature data from the resulting face image F, and generates and manages registration data in which the feature data is associated with the person name. The "feature data" of the registration target person is not limited to feature data extracted from the face image F; it may be feature data extracted from fingerprint information, retina information, voiceprint information, or the like.

The following description takes as an example a case where the "predetermined action" is bringing the face closer, which the registration target person is instructed to do so that a face image F of that person can be acquired. The predetermined action is not limited to this, however. That is, the information processing apparatus 1 instructs the registration target person to perform a predetermined action, but the "predetermined action" is not limited to an action by which the information processing apparatus 1 acquires unique information specific to the registration target person, and may be any action. For example, when the instruction is for the information processing apparatus 1 to acquire fingerprint information of the registration target person, "pressing a finger against the scanner surface" may be instructed as the predetermined action. Alternatively, "having the registration target person perform a predetermined operation input on the information processing apparatus 1" or "moving to a predetermined position" may be instructed as the predetermined action. In such cases, the "registration target person" is not limited to a person to be registered in the information processing apparatus 1, and may be an instruction target person (target person) who performs a predetermined action in accordance with an instruction issued by the information processing apparatus 1.

For example, as shown in (a) of FIG. 3, the information processing apparatus 1 includes a voice input unit 31 that accepts voice input and a voice output unit 32 (instruction presentation unit), and acquires the voice "This is Mr. Suzuki." (こちらは鈴木さんです。) uttered by the user U, who is with Mr. Suzuki and Mr. Sato. Using "Suzuki", the person name of the registration target person included in the voice uttered by the user U, the information processing apparatus 1 outputs from the voice output unit 32 the voice instruction "Mr. Suzuki, please bring your face closer." (鈴木さん、お顔を近づけてください。), which instructs Mr. Suzuki to bring his face closer.

The information processing apparatus 1 further includes an imaging device 33 and captures an image of the face that is brought closer after the voice instruction is output. For example, as shown in (b) of FIG. 3, when Mr. Suzuki, having heard the voice instruction, approaches the information processing apparatus 1 and brings his face closer, the information processing apparatus 1 captures an image of the face and acquires a face image F such as that shown in (c) of FIG. 3.

In this way, by issuing a voice instruction that includes the person name (for example, "Suzuki") of the registration target person contained in the voice uttered by the user U, the information processing apparatus 1 causes the registration target person with that name to perform a predetermined action, such as bringing his or her face closer. In the example shown in (a) of FIG. 3, several people (the user U, Mr. Suzuki, and Mr. Sato) hear the voice instruction from the information processing apparatus 1, but because the instruction addresses the registration target person by name, a specific registration target person can be made to perform the predetermined action. Having the person whose name is included in the voice instruction (for example, Mr. Suzuki) bring his or her face closer allows the face of the registration target person to be imaged correctly without being mistaken for another person.

The information processing apparatus 1 extracts one or more items of feature data from the face image F of Mr. Suzuki, who brought his face closer in accordance with the voice instruction, generates registration data by associating the extracted feature data with the person name "Suzuki", and stores the registration data in the registration database 24 (see FIG. 1). Feature data is extracted from the face image F by, for example, calculating the position coordinates of points P1 and P2 corresponding to the eyes, point P3 corresponding to the nose, and points P4 (center of the mouth), P5 (right end of the mouth), and P6 (left end of the mouth) corresponding to the mouth, as well as the distances between these points. Here, the position coordinates are the coordinates of each point in a coordinate system whose origin is a predetermined position in the face image F (for example, the lower-left corner). The feature data to be extracted is not limited to these; any feature data, such as the image data of the face image F itself, the color of the face, or the shape of the ears, may be used as registration data. The face image F is not limited to a planar image and may be a stereoscopic image.
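The calculation of position coordinates and inter-point distances described above can be sketched as follows. This is only an illustration: the function name `extract_feature_data`, the dictionary layout, and the sample coordinates are assumptions, and detection of the points P1 to P6 in the image is taken as given.

```python
import math
from itertools import combinations


def extract_feature_data(points):
    """Build feature data from landmark coordinates.

    points maps a label such as "P1" to (x, y) coordinates measured from a
    fixed origin of the face image F (for example, its lower-left corner).
    """
    # Euclidean distance for every unordered pair of landmark points.
    distances = {
        (a, b): math.dist(points[a], points[b])
        for a, b in combinations(sorted(points), 2)
    }
    return {"positions": dict(points), "distances": distances}


# Made-up coordinates, for illustration only.
sample_points = {
    "P1": (30.0, 70.0),  # eye
    "P2": (70.0, 70.0),  # eye
    "P3": (50.0, 50.0),  # nose
    "P4": (50.0, 30.0),  # center of the mouth
    "P5": (35.0, 30.0),  # one end of the mouth
    "P6": (65.0, 30.0),  # other end of the mouth
}
features = extract_feature_data(sample_points)
```

With six points, the sketch yields fifteen pairwise distances in addition to the six positions.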

The above description takes as an example a case where the information processing apparatus 1 includes the imaging device 33 and issues an instruction voice (action instruction) to have the registration target person bring his or her face closer so that the face can be imaged; however, the instruction voice is not limited to this. For example, after imaging Mr. Suzuki's face, the apparatus may further issue the voice instruction "Mr. Suzuki, please show me your ear.", image Mr. Suzuki's ear as well, and add the shape of the ear to the feature data. When the information processing apparatus 1 includes a fingerprint sensor, it may issue the voice instruction "Mr. Suzuki, please press your finger against the scanner surface." and acquire Mr. Suzuki's fingerprint information as feature data.

Although an example has been described in which the action instruction for imaging the face of the registration target person is issued as a voice, the output of the action instruction is not limited to voice output; any output method that can present to the registration target person that a predetermined action should be performed may be used. For example, the action instruction may be displayed as a character string on a display device (not shown).

(Hardware configuration of the information processing apparatus 1)
First, the hardware configuration of the information processing apparatus 1 will be described with reference to FIG. 2. FIG. 2 is a block diagram showing an example of the hardware configuration of the information processing apparatus 1. For convenience of explanation, other functions that the information processing apparatus 1 may have, such as a motion control unit that controls posture and the like and a communication unit that performs data communication, are not described.

The information processing apparatus 1 includes a voice input unit 31, a voice output unit 32, an imaging device 33, a control unit 10, and a storage unit 20. The control unit 10 controls execution of the processing of each function of the information processing apparatus 1, and acquires the voice input from the voice input unit 31 and the captured image R captured by the imaging device 33. The control unit 10 also executes predetermined processing while referring as appropriate to the recognition dictionary 21, the morphological analysis dictionary 22, the instruction template database 23 (see FIG. 1), and the like stored in the storage unit 20. In addition, the control unit 10 stores the generated registration data in the registration database 24.

(Configuration of the information processing apparatus 1)
Next, the configuration of the information processing apparatus 1 will be described with reference to FIG. 1. FIG. 1 is a block diagram showing an example of the schematic configuration of the information processing apparatus 1. As already described with reference to FIG. 2, the information processing apparatus 1 includes the voice input unit 31, the voice output unit 32, the imaging device 33, the control unit 10, and the storage unit 20. The control unit 10, which controls each function of the information processing apparatus 1, further includes a text conversion unit 11, a morphological analysis unit 12, an introduction sentence extraction unit 13, a person name extraction unit 14 (identification information extraction unit), a voice instruction generation unit 15 (instruction generation unit), an imaging device control unit 16, a feature data extraction unit 17 (unique information extraction unit), and a registration data management unit 18. The storage unit 20 includes a recognition dictionary 21, a morphological analysis dictionary 22, an instruction template database 23, and a registration database 24 (storage unit).

The voice input unit 31 is a microphone that accepts voice input. Voice data of the voice input from the voice input unit 31 is sent to the text conversion unit 11.

The voice output unit 32 is a speaker that outputs voice data as sound. The voice output unit 32 outputs the voice data of the voice instruction acquired from the voice instruction generation unit 15.

The imaging device 33 is a digital camera that acquires image data, for example a camera module that images an object using a CCD sensor, a CMOS sensor, or the like. The image captured by the imaging device 33 is not limited to a two-dimensional image and may be a three-dimensional image.

The text conversion unit 11 calculates acoustic features of the voice data input from the voice input unit 31 and converts the voice data into text data while referring to the recognition dictionary 21. The recognition dictionary 21 defines an acoustic model and a language model, and the text conversion unit 11 converts the voice data into text data by assigning the corresponding text to each acoustic feature. Specifically, the recognition dictionary 21 represents the frequency characteristics of the speech contained in voice data and defines the correspondence between a large number of speech waveform samples and the corresponding text data. The text conversion unit 11 searches the recognition dictionary 21 to identify the text corresponding to the speech that matches the calculated acoustic features. The text conversion unit 11 sends the text data to the morphological analysis unit 12.

The morphological analysis unit 12 decomposes the text data acquired from the text conversion unit 11 into morphemes. A morpheme is the smallest meaningful unit of language and is a constituent that indicates grammatical relationships. The morphological analysis unit 12 refers to the morphological analysis dictionary 22 to decompose the text data into its grammatical constituents and identifies the grammatical relationship of each constituent. The morphological analysis unit 12 outputs the analysis result, together with the analyzed text data, to the introduction sentence extraction unit 13. A specific example of the morphological analysis that the morphological analysis unit 12 performs on text data obtained from a voice in which the user U introduces a registration target person will be described later.
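As a rough illustration of dictionary-based decomposition into morphemes, the following Python sketch segments a sentence by greedy longest match against a toy dictionary. The dictionary contents, the tag strings, and the function name `analyze` are assumptions made for this example; a practical morphological analyzer relies on a far larger dictionary and more elaborate disambiguation, and the specification does not state which algorithm is used.

```python
# Toy stand-in for the morphological analysis dictionary (hypothetical contents).
TOY_DICTIONARY = {
    "こちら": "pronoun",
    "私": "pronoun",
    "は": "particle",
    "鈴木": "noun, proper noun, person name, surname",
    "佐藤": "noun, proper noun, person name, surname",
    "さん": "suffix",
    "です": "auxiliary verb",
    "。": "symbol",
}


def analyze(text, dictionary=TOY_DICTIONARY):
    """Split text into (surface, tag) pairs by greedy longest match."""
    longest = max(map(len, dictionary))
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(min(longest, len(text) - i), 0, -1):
            surface = text[i:i + size]
            if surface in dictionary:
                tokens.append((surface, dictionary[surface]))
                i += size
                break
        else:
            # No dictionary entry starts here: emit one unknown character.
            tokens.append((text[i], "unknown"))
            i += 1
    return tokens


# "こちらは鈴木さんです。" means "This is Mr. Suzuki."
tokens = analyze("こちらは鈴木さんです。")
```

Characters that match no dictionary entry are emitted one at a time with the tag "unknown", so the sketch never stalls on unseen input.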

Based on the result of the morphological analysis, the introduction sentence extraction unit 13 extracts, from the text data (text) generated by speech recognition, a sentence that includes the person name of the registration target person. The introduction sentence extraction unit 13 may extract an introduction sentence or the like from a voice in which the registration target person introduces himself or herself to another person, or in which a person other than the registration target person introduces the registration target person to a third party or to the information processing apparatus 1. For example, an introduction sentence in which the registration target person introduces himself, such as "I am Suzuki." (私は鈴木です。), and an introduction sentence in which another person (for example, the user U) introduces the registration target person, such as "This is Mr. Suzuki." (こちらは鈴木さんです), have a predetermined form characteristic of introduction sentences: the person name "Suzuki" is placed after a first-person or third-person pronoun (私, 僕, 彼, こちら, and so on). The form of an introduction sentence that includes the person name of the registration target person is not limited to this. For example, text data generated from voices such as "I'm Suzuki. Nice to meet you, everyone." (鈴木です。皆さん、よろしく。) or "Mr. Sato, nice to meet you. My name is Suzuki." (佐藤さん、はじめまして。鈴木といいます。), which Mr. Suzuki, the registration target person, utters with the first-person pronoun omitted when introducing himself to the user U or Mr. Sato, also has a sentence pattern of a predetermined form that includes the person name of the registration target person, and may be extracted as an introduction sentence. The introduction sentence extraction unit 13 extracts text data having a sentence pattern of a predetermined form that includes the person name of the registration target person.
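One way to picture the sentence-pattern check is a rule over tagged morphemes. The sketch below is an illustration under stated assumptions: the tag strings, the rule "person name, optional honorific, then a copula-like ending", and the function name `is_introduction_sentence` are all hypothetical, and the specification does not limit the sentence patterns to this rule.

```python
PERSON_TAG = "noun, proper noun, person name, surname"
HONORIFICS = {"さん"}
# "です" ("is ...") and "といいます" ("my name is ..."); the latter is treated
# as a single token here purely to keep the example short.
INTRO_ENDINGS = {"です", "といいます"}


def is_introduction_sentence(tokens):
    """tokens is a list of (surface, tag) pairs from morphological analysis."""
    for i, (_, tag) in enumerate(tokens):
        if tag != PERSON_TAG:
            continue
        j = i + 1
        if j < len(tokens) and tokens[j][0] in HONORIFICS:
            j += 1  # skip an honorific such as "-san"
        if j < len(tokens) and tokens[j][0] in INTRO_ENDINGS:
            return True
    return False


# "こちらは鈴木さんです。" ("This is Mr. Suzuki.")
introduced_by_user = [
    ("こちら", "pronoun"), ("は", "particle"), ("鈴木", PERSON_TAG),
    ("さん", "suffix"), ("です", "auxiliary verb"), ("。", "symbol"),
]
# "鈴木です。" ("I'm Suzuki.", with the first-person pronoun omitted)
self_introduction = [("鈴木", PERSON_TAG), ("です", "auxiliary verb"), ("。", "symbol")]
# "鈴木さん、お元気ですか。" ("Mr. Suzuki, how are you?") is not an introduction.
question = [
    ("鈴木", PERSON_TAG), ("さん", "suffix"), ("、", "symbol"),
    ("お元気", "noun"), ("です", "auxiliary verb"), ("か", "particle"), ("。", "symbol"),
]
```

Because the rule keys on the name rather than on a leading pronoun, it also accepts self-introductions in which the pronoun is omitted.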

In text data that the introduction sentence extraction unit 13 has confirmed to be an introduction sentence, the person name extraction unit 14 extracts, as the text data of a person name, the noun that the morphological analysis unit 12 has identified as "noun, proper noun, person name, surname". The extracted text data of the person name is output to the voice instruction generation unit 15 and the registration data management unit 18.
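Selecting the person name from the analysis result amounts to filtering morphemes by their part-of-speech features. The following sketch assumes that each morpheme is represented as a pair of a surface string and a sequence of features; this representation and the function name `extract_person_names` are hypothetical.

```python
SURNAME_FEATURES = ("noun", "proper noun", "person name", "surname")


def extract_person_names(tokens):
    """Return the surfaces of morphemes tagged as a surname, in order of appearance."""
    return [
        surface
        for surface, features in tokens
        if tuple(features) == SURNAME_FEATURES
    ]


# Analysis result for "こちらは鈴木さんです。" ("This is Mr. Suzuki."), written by hand.
analyzed = [
    ("こちら", ("noun", "pronoun")),
    ("は", ("particle",)),
    ("鈴木", SURNAME_FEATURES),
    ("さん", ("noun", "suffix")),
    ("です", ("auxiliary verb",)),
    ("。", ("symbol",)),
]
names = extract_person_names(analyzed)
```

The function returns a list because a sentence may contain more than one name; how to choose among several names is left open in this sketch.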

The voice instruction generation unit 15 acquires the text of the person name (for example, "Suzuki") extracted by the person name extraction unit 14 and inserts the person name at a predetermined position in an instruction template stored in the instruction template database 23, thereby generating a voice instruction that includes the person name. A specific example of the process in which the voice instruction generation unit 15 generates a voice instruction including the person name of a registration target person introduced by the user U will be described later.
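Inserting a person name at a predetermined position in a template can be sketched with ordinary string formatting. The template wording below follows the example utterances in this description, while the dictionary keys, the `{name}` placeholder, and the function name `generate_voice_instruction` are assumptions for illustration; converting the resulting text into audio is outside the scope of the sketch.

```python
# Hypothetical stand-in for the instruction template database 23.
INSTRUCTION_TEMPLATES = {
    # "..., please bring your face closer."
    "face": "{name}さん、お顔を近づけてください。",
    # "..., please show me your ear."
    "ear": "{name}さん、お耳を見せてください。",
    # "..., please press your finger against the scanner surface."
    "fingerprint": "{name}さん、指をスキャナ面に密着させてください。",
}


def generate_voice_instruction(person_name, action="face"):
    """Insert the extracted person name into the template for the given action."""
    return INSTRUCTION_TEMPLATES[action].format(name=person_name)


instruction = generate_voice_instruction("鈴木")
```

Keeping one template per action lets the same extracted name be reused for follow-up instructions, such as the ear or fingerprint examples.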

The imaging device control unit 16 controls the imaging device 33 so that it captures an image when the registration target person, having heard the voice instruction output from the voice output unit 32, performs the predetermined action indicated by the voice instruction. The imaging device control unit 16 may be instructed to start imaging by the imaging device 33 when the text data of the person name extracted by the person name extraction unit 14 is output from the person name extraction unit 14 to the voice instruction generation unit 15. Alternatively, the voice instruction generation unit 15 may instruct the imaging device control unit 16 to start imaging by the imaging device 33 when it outputs the generated voice instruction to the voice output unit 32.

The imaging device control unit 16 may also determine whether the captured image R captured by the imaging device 33 satisfies the conditions required for extracting feature data and, if the conditions are not satisfied, instruct the voice instruction generation unit 15 to output the same voice instruction or a voice instruction that indicates a related action (related voice instruction). The conditions required for extracting feature data are, for example, that the area of the face image F in the captured image R is at least a certain size, and that the position coordinates of the eyes, nose, and mouth (see points P1 to P6 in (c) of FIG. 3) can be determined in the captured face image F.
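The two example conditions can be expressed as a simple check that decides whether to proceed or to repeat the instruction. In the sketch below, the threshold value, the representation of undetected points as `None`, and the names `captured_image_is_usable` and `next_step` are all assumptions; the specification states only that the face area must be at least a certain size and that the point coordinates must be determinable.

```python
REQUIRED_POINTS = ("P1", "P2", "P3", "P4", "P5", "P6")


def captured_image_is_usable(face_area, landmarks, min_face_area=10_000):
    """Check the two example conditions for extracting feature data.

    face_area is the area of the face image F within the captured image R
    (in pixels); landmarks maps point labels to coordinates, with None for
    any point whose position could not be determined.
    """
    large_enough = face_area >= min_face_area
    all_points_found = all(landmarks.get(p) is not None for p in REQUIRED_POINTS)
    return large_enough and all_points_found


def next_step(face_area, landmarks, min_face_area=10_000):
    """Decide whether to extract feature data or to repeat the voice instruction."""
    if captured_image_is_usable(face_area, landmarks, min_face_area):
        return "extract_feature_data"
    return "repeat_voice_instruction"


complete = {p: (0.0, 0.0) for p in REQUIRED_POINTS}
missing_mouth_corner = dict(complete, P6=None)
```

A caller could map the "repeat_voice_instruction" result either to the same instruction or to a related one, as described above.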

The feature data extraction unit 17 extracts, as feature data, the position coordinates of points P1 and P2 corresponding to the eyes of the face image F, point P3 corresponding to the nose, and points P4 (center of the mouth), P5 (right end of the mouth), and P6 (left end of the mouth) corresponding to the mouth, together with quantities such as the distances between these points.
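As a rough sketch of this kind of feature data, the snippet below collects the landmark coordinates and every pairwise Euclidean distance between them. The function name `extract_feature_data` and the dictionary layout are assumptions made for illustration; the text does not specify which distances are used or how the data is stored.

```python
import math
from itertools import combinations
from typing import Dict, Tuple

Point = Tuple[float, float]


def extract_feature_data(landmarks: Dict[str, Point]) -> dict:
    """Build feature data from landmark positions such as P1 to P6.

    Returns the positions themselves plus the Euclidean distance for
    every unordered pair of points, keyed like "P1-P2".
    """
    distances = {
        f"{a}-{b}": math.dist(landmarks[a], landmarks[b])
        for a, b in combinations(sorted(landmarks), 2)
    }
    return {"positions": dict(landmarks), "distances": distances}
```

With six landmarks this yields 15 distances. Raw pixel distances depend on how close the face is to the camera, so a practical system would probably normalize them (for example by the distance between the eyes); that step is omitted here.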

The registration data management unit 18 generates registration data by associating the feature data extracted by the feature data extraction unit 17 with the person name, and stores this registration data in the registration database 24.

In this way, on receiving voice input from the user U introducing a person to be registered, the information processing apparatus 1 issues a voice instruction that includes that person's name so that the person performs a predetermined action. Because the face of the registration target person can then be captured without confusing that person with anyone else, the registration data for that person can be managed accurately.

The speech from which the registration target person's name is extracted is not limited to speech in which the user U introduces the registration target person to a third party or to the information processing apparatus 1, or speech in which the registration target person introduces himself or herself to another person. It may also be speech in which the user U addresses or asks a question of the registration target person. In the situation shown in FIG. 3, for example, it may be the user U asking "Suzuki-san, how are you?" or saying "Sato-san, please come this way." In such cases, the introduction sentence extraction unit 13 (target sentence extraction unit) uses the result of the morphological analysis to extract, from the text data generated by speech recognition, the text data of speech that has a predetermined sentence pattern used when addressing a person other than the speaker (a target sentence). The person name extraction unit 14 can then extract, as the text data of a person name, the noun that the morphological analysis unit 12 has identified as "noun, proper noun, person name, surname" in the text data confirmed by the introduction sentence extraction unit 13 to have the predetermined sentence pattern.

(Extraction of person names from text data)
Next, the process by which the person name extraction unit 14 extracts a person name contained in text data, based on the morphological analysis of that text data by the morphological analysis unit 12, is described with reference to FIG. 4. FIG. 4 shows an example of a process for extracting the text of a person name from text data. Part (a) of FIG. 4 illustrates the morphological analysis of, and person name extraction from, the text data "こちらは鈴木さんです。" ("This is Suzuki-san."), and part (b) does the same for the text data "こちらは鈴木さんと佐藤さんです。" ("This is Suzuki-san and Sato-san.").

As shown in (a) of FIG. 4, the text data "こちらは鈴木さんです。" is decomposed into the morphemes "こちら" (this), "は" (topic particle), "鈴木" (Suzuki), "さん" (-san), and "です" (is). "こちら" is attributed as "noun, pronoun, general (noun)"; "は" as "particle, binding particle"; "鈴木" as "noun, proper noun, person name, surname"; "さん" as "noun, suffix, person name"; and "です" as "auxiliary verb". The person name extraction unit 14 extracts the text data "鈴木" (Suzuki), which the morphological analysis unit 12 attributed as "noun, proper noun, person name, surname", as a person name.

As shown in (b) of FIG. 4, the text data "こちらは鈴木さんと佐藤さんです。" is decomposed into the morphemes "こちら", "は", "鈴木", "さん", "と", "佐藤", "さん", and "です". Both "鈴木" (Suzuki) and "佐藤" (Sato) are attributed as "noun, proper noun, person name, surname". The person name extraction unit 14 extracts the text data that the morphological analysis unit 12 attributed as "noun, proper noun, person name, surname", here "鈴木" and "佐藤", as person names.

In this way, the person name extraction unit 14 may extract the text data of one or more person names from each piece of text data. Thus, even when the user U inputs speech introducing several people, the person name extraction unit 14 can extract the names of all the registration target persons introduced in that speech.
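The extraction step can be illustrated with a small sketch. It assumes the morphological analysis has already produced a list of (surface form, part-of-speech tags) pairs. The tag tuples below are English renderings of the attributions in FIG. 4, the tag given to "と" is an assumption (the text does not state it), and the function name `extract_person_names` is hypothetical.

```python
from typing import List, Sequence, Tuple

Morpheme = Tuple[str, Tuple[str, ...]]  # (surface form, part-of-speech tags)

# Tag sequence that marks a surname, as described for FIG. 4.
SURNAME_TAG = ("noun", "proper noun", "person name", "surname")


def extract_person_names(morphemes: Sequence[Morpheme]) -> List[str]:
    """Return every surface form attributed as a surname, in order of appearance."""
    return [surface for surface, tags in morphemes if tags == SURNAME_TAG]


# Analysis of "こちらは鈴木さんと佐藤さんです。" as in (b) of FIG. 4.
sentence_b = [
    ("こちら", ("noun", "pronoun", "general")),
    ("は", ("particle", "binding particle")),
    ("鈴木", SURNAME_TAG),
    ("さん", ("noun", "suffix", "person name")),
    ("と", ("particle", "parallel particle")),  # assumed tag, not given in the text
    ("佐藤", SURNAME_TAG),
    ("さん", ("noun", "suffix", "person name")),
    ("です", ("auxiliary verb",)),
]

print(extract_person_names(sentence_b))  # ['鈴木', '佐藤']
```

Filtering on the full tag sequence, rather than on "noun" alone, is what keeps the pronoun "こちら" and the suffix "さん" out of the result.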

(Generation of the output voice)
Next, examples of voice instructions that direct the registration target person to perform a predetermined action are described with reference to FIG. 5. FIG. 5 shows examples of templates for the voice output by the information processing apparatus 1. The voice instruction generation unit 15 reads instruction templates from the instruction template database 23 and inserts the person name obtained from the person name extraction unit 14 into each template to generate a voice instruction.

The voice instruction "Suzuki-san, please bring your face closer." output from the information processing apparatus 1 in (a) of FIG. 3 can be generated by inserting "Suzuki", the person name obtained from the person name extraction unit 14, into the "(person name)" slot of the instruction template for voice instruction A1, "(person name)-san, please bring your face closer."

The imaging device control unit 16 may determine whether the captured image R captured by the imaging device 33 satisfies the conditions required for extracting feature data and, if they are not satisfied, instruct the voice instruction generation unit 15 to output the same voice instruction or a voice instruction specifying a related action (a related voice instruction), so that the face image is captured again.

For example, when the imaging device control unit 16 determines that the conditions for extracting feature data are not satisfied because the area of the face image F in the captured image R is smaller than the required size, a voice instruction that has the registration target person bring his or her face even closer may be generated by inserting the person name obtained from the person name extraction unit 14 into the "(person name)" slot of the instruction template for voice instruction A2, "(person name)-san, please bring your face a little closer." Similarly, a voice instruction that has the registration target person turn his or her face toward the apparatus may be generated by inserting the person name into the "(person name)" slot of the instruction template for voice instruction A3, "(person name)-san, please turn your face this way."
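Template filling of this kind can be sketched in a few lines. The dictionary below stands in for the instruction template database 23 and holds English renderings of templates A1 to A3. The `{name}` placeholder syntax and the function name `generate_voice_instruction` are assumptions made for illustration, and speech synthesis of the resulting text is out of scope.

```python
# Hypothetical stand-in for the instruction template database 23.
INSTRUCTION_TEMPLATES = {
    "A1": "{name}-san, please bring your face closer.",
    "A2": "{name}-san, please bring your face a little closer.",
    "A3": "{name}-san, please turn your face this way.",
}


def generate_voice_instruction(template_id: str, person_name: str) -> str:
    """Insert the extracted person name into the chosen instruction template."""
    return INSTRUCTION_TEMPLATES[template_id].format(name=person_name)


print(generate_voice_instruction("A1", "Suzuki"))
# Suzuki-san, please bring your face closer.
```

A retry after a failed capture would simply call the same function with "A2" or "A3" instead of "A1".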

Before a voice instruction is generated, an auxiliary phrase voice such as those shown in (b) of FIG. 5 may be generated and output. The auxiliary phrase voice is generated, for example, by inserting the registration target person's name obtained from the person name extraction unit 14 into the "(person name)" slot of the auxiliary phrase X1 template, "(person name)-san, nice to meet you." Outputting such a voice prompts the registration target person to get ready to listen and raises that person's attention to the voice instruction that follows.

An auxiliary phrase voice such as those shown in (b) of FIG. 5 may also be generated and output when the face has been captured successfully. This auxiliary phrase voice is generated, for example, by inserting the registration target person's name obtained from the person name extraction unit 14 into the "(person name)" slot of the auxiliary phrase X2 template, "(person name)-san, thank you for your effort.", or the auxiliary phrase X3 template, "(person name)-san, thank you." Outputting such a voice informs both the user U who introduced the registration target person and the registration target person that capturing the face is complete, that is, that processing has moved on to extracting feature data and generating registration data.

(Flow of the process in which the information processing apparatus 1 registers the registration target person's name in association with feature data)
The flow of the process in which the information processing apparatus 1 registers the name of the registration target person in association with that person's feature data is described here with reference to FIG. 6. FIG. 6 is a flowchart showing an example of this process.

First, the information processing apparatus 1 starts accepting voice input from the voice input unit 31 and receives speech from the user U introducing the registration target person (voice input step). The text conversion unit 11 calculates the acoustic features of the voice data of the input speech (S1). The text conversion unit 11 then converts the voice data into text data based on the correspondence between acoustic features and text defined in the recognition dictionary 21 (S2). Next, the morphological analysis unit 12 decomposes the text data into its grammatical components and, referring to the morphological analysis dictionary 22, identifies the grammatical role of each component (S3). When the introduction sentence extraction unit 13 confirms from the result of the morphological analysis that the text data is an introduction sentence introducing the registration target person (S4), the person name extraction unit 14 extracts from that text data, as a person name, the word (noun) that the morphological analysis unit 12 attributed as, for example, "noun, proper noun, person name, surname" (S5: identification information extraction step). The person name extraction unit 14 also sends the extracted person name to the voice instruction generation unit 15 and the registration data management unit 18.

Having obtained the text of the person name, the voice instruction generation unit 15 reads instruction templates from the instruction template database 23, inserts the person name obtained from the person name extraction unit 14 into each template, and generates a voice instruction directing the registration target person to perform a predetermined action (S6: instruction generation step). Before generating the voice instruction, or immediately after generating it, the voice instruction generation unit 15 sends the imaging device control unit 16 an instruction to start imaging with the imaging device 33, and the imaging device 33 starts imaging (S7). The voice instruction is then output from the voice output unit 32 (S8: instruction presentation step). The order of steps S6 to S8 is only an example. For instance, imaging with the imaging device 33 may instead be started immediately after the voice instruction is output from the voice output unit 32.

Because the voice instruction output by the information processing apparatus 1 includes the name of the registration target person, it addresses only that person when directing the predetermined action. For example, as shown in (a) of FIG. 3, Suzuki-san, the registration target person, hears the voice instruction "Suzuki-san, please bring your face closer." and moves his or her face toward the information processing apparatus 1, whereas the user U and Sato-san, who have received no instruction, do not. The information processing apparatus 1 acquires the captured image R of the face brought close to it as the face image F of Suzuki-san, the registration target person.

The imaging device control unit 16 detects the face image F in the captured image R (S9) and determines whether the conditions required for extracting feature data are satisfied (S10). FIG. 6 shows, as an example, the case in which the condition is that the area of the face image F in the captured image R is at least a threshold size. If the condition is not satisfied (NO in S10), the process returns to S8, and the imaging device control unit 16 instructs the voice instruction generation unit 15 to output the same voice instruction or a voice instruction specifying a related action (a related voice instruction). If feature data can be extracted (YES in S10), the captured image R is sent from the imaging device control unit 16 to the feature data extraction unit 17, which extracts the feature data of the face image F (S11).

Finally, the registration data management unit 18 stores in the registration database 24 the registration data of the registration target person, in which the feature data obtained from the feature data extraction unit 17 is associated with the person name obtained from the person name extraction unit 14 (S12: registration data management step).
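The part of the flow from S6 to S12, including the return to S8 when the face image is too small, can be sketched as a simple loop. Everything here is an illustrative assumption rather than the patent's implementation: the function name `register_person`, the callables passed in for speaking, capturing, and feature extraction, the dictionary used in place of the registration database 24, and the `max_attempts` limit (the flowchart as described simply loops until the condition is met).

```python
from typing import Callable, Dict, Optional


def register_person(
    name: str,
    capture: Callable[[], Optional[dict]],       # returns {"area": ..., ...} or None
    speak: Callable[[str], None],                # stands in for the voice output unit
    extract_features: Callable[[dict], object],  # stands in for feature extraction
    database: Dict[str, object],                 # stands in for the registration database
    min_face_area: int,
    max_attempts: int = 3,                       # assumed limit, not in the source
) -> bool:
    """Prompt the named person, capture a face, and register name with features."""
    instruction = f"{name}-san, please bring your face closer."  # S6 (template A1)
    for _ in range(max_attempts):
        speak(instruction)                                       # S8
        face = capture()                                         # S7 and S9
        if face is not None and face["area"] >= min_face_area:   # S10
            database[name] = extract_features(face)              # S11 and S12
            return True
        # Condition not met: retry with a related instruction (template A2).
        instruction = f"{name}-san, please bring your face a little closer."
    return False
```

Passing the device-facing operations in as callables keeps the loop testable without a camera or speaker, for example by supplying a function that returns prerecorded detection results.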

In this way, on receiving speech introducing a person to be registered, the information processing apparatus 1 issues a voice instruction that includes that person's name so that the person performs a predetermined action. Because the face of the registration target person can then be captured without confusing that person with anyone else, the registration data for that person can be managed accurately.

[Embodiment 2]
The example above described the information processing apparatus 1 as including the voice input unit 31, the voice output unit 32, and the imaging device 33, and as generating and managing registration data for the registration target person introduced by the user U. The configuration is not limited to this. For example, as long as data can be exchanged between the information processing apparatus 1 and the voice input unit 31, the voice output unit 32, and the imaging device 33, these may be configured as units separate from the information processing apparatus 1.

In this case, the voice data of the speech input to the voice input unit 31 is transmitted to the information processing apparatus 1. The information processing apparatus 1 converts the received voice data into text data, performs morphological analysis on the text data, and extracts the name of the registration target person contained in it. The information processing apparatus 1 then transmits the voice data of a voice instruction generated using that name to the voice output unit 32 and transmits an instruction to start imaging to the imaging device 33.

The captured image R captured by the imaging device 33 is transmitted to the information processing apparatus 1, which extracts the registration target person's feature data from the face image F in the captured image R, generates registration data by associating the feature data with the person's name, and stores the registration data in the storage unit 20.

The voice input unit 31, the voice output unit 32, and the imaging device 33 can thus be installed at a location away from where the control unit 10 of the information processing apparatus 1 is installed.

[Embodiment 3]
The control blocks of the information processing apparatus 1 (in particular, the text conversion unit 11, the morphological analysis unit 12, the introduction sentence extraction unit 13, the person name extraction unit 14, the voice instruction generation unit 15, the imaging device control unit 16, the feature data extraction unit 17, and the registration data management unit 18) may be implemented by logic circuits (hardware) formed in an integrated circuit (IC chip) or the like, or by software using a CPU (Central Processing Unit).

In the latter case, the information processing apparatus 1 includes a CPU that executes the instructions of a program, which is software implementing each function; a ROM (Read Only Memory) or storage device (referred to as a "recording medium") on which the program and various data are recorded so as to be readable by a computer (or the CPU); a RAM (Random Access Memory) into which the program is loaded; and so on. The object of the present invention is achieved when the computer (or CPU) reads the program from the recording medium and executes it. A "non-transitory tangible medium" such as a tape, a disk, a card, a semiconductor memory, or a programmable logic circuit can be used as the recording medium. The program may also be supplied to the computer via any transmission medium capable of transmitting it (a communication network, a broadcast wave, or the like). The present invention can also be realized in the form of a data signal embedded in a carrier wave, in which the program is embodied by electronic transmission.

[Summary]
An information processing apparatus 1 according to aspect 1 of the present invention is an information processing apparatus including a voice input unit 31 that accepts voice input, and includes: an identification information extraction unit (person name extraction unit 14) that extracts identification information of a target person from voice data of speech that is input from the voice input unit and contains that identification information; an instruction generation unit (voice instruction generation unit 15) that generates an action instruction directing the target person to perform a predetermined action, the action instruction including the identification information extracted by the identification information extraction unit; and an instruction presentation unit (voice output unit 32) that presents the action instruction to the target person.

With this configuration, the identification information of the target person is extracted from speech containing it, and an action instruction including that identification information is presented to the target person so that the target person performs a predetermined action. The correct target person can thus be made to perform the predetermined action without being confused with anyone else.

In an information processing apparatus according to aspect 2 of the present invention, in aspect 1, the speech containing the identification information of the target person may be speech uttered by a person other than the target person.

The identification information of the target person can thus be extracted from speech that contains it and is uttered by a person other than the target person. For example, it can be extracted from speech in which another person addresses or asks a question of the target person, or from speech introducing the target person.

An information processing apparatus according to aspect 3 of the present invention may, in aspect 2, further include an introduction sentence extraction unit 13 that extracts, from text generated by speech recognition of the voice data input from the voice input unit, an introduction sentence in which a person other than the target person introduces the target person, and outputs the introduction sentence to the identification information extraction unit.

With this configuration, the identification information of the target person is extracted from an introduction sentence in which a person other than the target person introduces the target person. For example, an introduction sentence can be extracted from speech in which another person introduces the target person to a third party or to the information processing apparatus. Because such an introduction sentence has a predetermined sentence pattern that contains the target person's identification information, selectively extracting introduction sentences from the speech allows the identification information to be extracted efficiently.

In an information processing apparatus according to aspect 4 of the present invention, in any of aspects 1 to 3, the identification information may be a person name indicating the target person, and the instruction generation unit may generate the action instruction including the person name that the identification information extraction unit extracted as the identification information.

With this configuration, an instruction including the target person's name is generated. The action instruction can thus be output in a way that makes clear to whom it is addressed.

An information processing apparatus according to aspect 5 of the present invention may, in any of aspects 1 to 4, further include a unique information extraction unit (feature data extraction unit 17) that extracts unique information specific to the target person from a face image capturing the target person's face, and may store in a storage unit (storage unit 20, registration database 24) registration data for the target person in which the identification information of the target person who performed the predetermined action is associated with the unique information of that target person extracted by the unique information extraction unit.

With this configuration, unique information specific to the target person is extracted from a face image capturing the target person's face, and the target person's identification information and unique information are stored in association with each other. Since faces generally differ from person to person, information unique to each target person can be extracted from the captured face image. Because the target person's face can be captured without confusing that person with anyone else, the registration data for the target person can be managed accurately.

A control method for the information processing apparatus 1 according to aspect 6 of the present invention is a control method for an information processing apparatus including a voice input unit 31 that accepts voice input, and includes: a voice input step (S1) of accepting, from the voice input unit, input of speech containing identification information of a target person; an identification information extraction step (S5) of extracting the identification information of the target person from the voice data of the speech accepted in the voice input step; an instruction generation step (S6) of generating an action instruction directing the target person to perform a predetermined action, the action instruction including the identification information extracted in the identification information extraction step; and an instruction presentation step (S8) of presenting the action instruction generated in the instruction generation step to the target person. This configuration provides the same effects as aspect 1.

The information processing apparatus according to each aspect of the present invention may be realized by a computer. In that case, a control program for the information processing apparatus that realizes the information processing apparatus on the computer by causing the computer to operate as each unit (software element) of the information processing apparatus, and a computer-readable recording medium on which the control program is recorded, also fall within the scope of the present invention.

The present invention is not limited to the embodiments described above, and various modifications are possible within the scope of the claims. Embodiments obtained by appropriately combining technical means disclosed in different embodiments are also included in the technical scope of the present invention. Furthermore, new technical features can be formed by combining the technical means disclosed in the respective embodiments.

The present invention can be used in electronic devices, robots, and the like that have a function for communicating with humans.

DESCRIPTION OF SYMBOLS
1 Information processing apparatus
13 Introduction sentence extraction unit
14 Person name extraction unit (identification information extraction unit)
15 Voice instruction generation unit (instruction generation unit)
17 Feature data extraction unit (unique information extraction unit)
18 Registration data management unit
20 Storage unit
24 Registration database (storage unit)
31 Voice input unit
32 Voice output unit (instruction presentation unit)
33 Imaging device

Claims

An information processing apparatus including a voice input unit that receives voice input, the information processing apparatus comprising:
an identification information extraction unit that extracts identification information of a target person from voice data of a voice that is input from the voice input unit and includes the identification information of the target person;
an instruction generation unit that generates an operation instruction instructing the target person to perform a predetermined operation, the operation instruction including the identification information extracted by the identification information extraction unit; and
an instruction presentation unit that presents the operation instruction to the target person.

The information processing apparatus according to claim 1, wherein the voice including the identification information of the target person is a voice uttered by a person different from the target person.

The information processing apparatus according to claim 2, further comprising an introduction sentence extraction unit that extracts, from text generated by speech recognition of the voice data input from the voice input unit, an introductory sentence in which a person different from the target person introduces the target person, and outputs the introductory sentence to the identification information extraction unit.

The information processing apparatus according to any one of claims 1 to 3, wherein the identification information is a person's name indicating the target person, and
the instruction generation unit generates the operation instruction including the person's name extracted as the identification information by the identification information extraction unit.

The information processing apparatus according to claim 1, further comprising:
a unique information extraction unit that extracts unique information specific to the target person from a face image obtained by capturing the face of the target person; and a storage unit that stores registration data relating to the target person, in which the identification information of the target person who has performed the predetermined operation is associated with the unique information of the target person extracted by the unique information extraction unit.

A method for controlling an information processing apparatus including a voice input unit that receives voice input, the method comprising:
a voice input step of receiving, from the voice input unit, input of a voice including identification information of a target person;
an identification information extraction step of extracting the identification information of the target person from voice data of the voice received in the voice input step;
an instruction generation step of generating an operation instruction that instructs the target person to perform a predetermined operation and that includes the identification information extracted in the identification information extraction step; and
an instruction presentation step of presenting the operation instruction generated in the instruction generation step to the target person.

A control program for causing a computer to function as the information processing apparatus according to claim 1, wherein the control program causes the computer to function as the identification information extraction unit and the instruction generation unit.

A computer-readable recording medium on which the control program according to claim 7 is recorded.