JP6543047B2

JP6543047B2 - Information processing apparatus, control program, recording medium

Info

Publication number: JP6543047B2
Application number: JP2015035326A
Authority: JP
Inventors: 史彦鈴木; 誠悟伊藤
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2015-02-25
Filing date: 2015-02-25
Publication date: 2019-07-10
Anticipated expiration: 2035-02-25
Also published as: JP2016156993A

Description

The present invention relates to an information processing apparatus that presents an instruction to a subject so that the subject performs a specific action, and to a control method for the information processing apparatus, a program, and a recording medium.

Robot apparatuses and the like that use a function for conversing with humans to treat the conversation partner as a recognition target and register that person's name and face are known in the prior art.

For example, Patent Document 1 describes a robot apparatus that searches data by face photograph, person name, and voice features. If the speaker's voice is new, the speaker becomes a recognition target and the speaker's name and face are registered. The robot apparatus speaks to the recognition target and instructs the target to perform specific actions, such as stating his or her name and then naming a favorite food.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2003-255989 (published September 10, 2003)

However, the robot apparatus described in Patent Document 1 determines whether a speaker is a new recognition target based on whether the speaker's voice is new. Consequently, when the speaker's voice is not new, another person does not become a recognition target even if the speech contains that other person's name.

For example, suppose a speaker who is not new (that is, an already recognized speaker) introduces a new recognition target or speaks to that target. Because the speaker's voice is not new, the technique described in Patent Document 1 may fail to determine correctly whether the person introduced or addressed by the speaker is a new recognition target or an already recognized person. If this determination is wrong, the name of the new recognition target cannot be recognized.

The present invention has been made in view of the above points, and its object is to realize an information processing apparatus, a control method for the information processing apparatus, and the like that can instruct a subject to perform a predetermined action when speech containing the subject's name is detected.

To solve the above problem, an information processing apparatus according to one aspect of the present invention is an information processing apparatus including a voice input unit that accepts voice input, the apparatus including: an identification information extraction unit that extracts identification information of a subject from voice data of speech that is input from the voice input unit and contains the identification information of the subject; an instruction generation unit that generates an action instruction, which instructs the subject to perform a predetermined action, so that the action instruction includes the identification information extracted by the identification information extraction unit; and an instruction presentation unit that presents the action instruction to the subject.

Further, to solve the above problem, a control method for an information processing apparatus according to one aspect of the present invention is a control method for an information processing apparatus including a voice input unit that accepts voice input, the method including: a voice input step of accepting, from the voice input unit, input of speech containing identification information of a subject; an identification information extraction step of extracting the identification information of the subject from voice data of the speech accepted in the voice input step; an instruction generation step of generating an action instruction, which instructs the subject to perform a predetermined action, so that the action instruction includes the identification information extracted in the identification information extraction step; and an instruction presentation step of presenting the action instruction generated in the instruction generation step to the subject.

According to one aspect of the present invention, when speech containing a subject's name is detected, the subject can be instructed to perform a specific action.

A block diagram showing an example of the schematic configuration of the information processing apparatus according to the present invention.
A block diagram showing an example of the hardware configuration of the information processing apparatus.
A diagram explaining the process in which the information processing apparatus registers the name of a registration target person in association with feature data.
A diagram showing an example of the process in which the information processing apparatus extracts the text of a person name from text data.
A diagram showing an example of a template for the speech output by the information processing apparatus.
A flowchart explaining an example of the flow of the process in which the information processing apparatus registers the name of a registration target person in association with feature data.

[Embodiment 1]
Embodiments of the present invention are described in detail below.

(Process in which the information processing apparatus 1 registers a registration target person)
First, the process in one embodiment of the present invention by which the information processing apparatus 1 registers the person name (identification information) of a registration target person (subject) in association with feature data (unique information) is described with reference to FIG. 3. FIG. 3 is a diagram explaining the process in which the information processing apparatus 1 registers the name of the registration target person in association with feature data.

The information processing apparatus 1 accepts input of speech containing the person name (surname, given name, or the like) of a registration target person, extracts the person name contained in the speech, and issues a voice instruction containing that name to instruct the registration target person to perform a predetermined action.

The name of the registration target person may be extracted when speech uttered by the registration target person is accepted, or when speech uttered by a person other than the registration target person (such as the user U) is accepted. Examples of speech uttered by the registration target person include speech in which the person introduces himself or herself, or greets another person by stating his or her own name. Examples of speech uttered by a person other than the registration target person include speech in which the user U introduces the registration target person to another person, speech in which the user U introduces the registration target person to the information processing apparatus 1, and speech in which the user U asks the registration target person a question or speaks to that person. When speech uttered by the registration target person or by another person contains the registration target person's name, the speech has a sentence pattern of a predetermined form. The name extracted by the information processing apparatus 1 is not limited to a personal name such as a surname or given name. It may be a nickname (common name), an alias, a stage name, or the like, as long as the registration target person recognizes it as a name that refers to himself or herself.

The following description uses as an example an information processing apparatus 1 that images the face of a registration target person who has performed the predetermined action as instructed, extracts feature data from the resulting face image F, and generates and manages registration data that associates the feature data with the person's name. The "feature data" of the registration target person is not limited to feature data extracted from the face image F. It may be feature data extracted from fingerprint information, retina information, voiceprint information, or the like.

The description below uses as an example the case where the registration target person is instructed to bring his or her face closer so that a face image F of the person can be acquired, but the "predetermined action" is not limited to this. That is, the information processing apparatus 1 instructs the registration target person to perform a predetermined action, but the "predetermined action" is not limited to an action that lets the information processing apparatus 1 acquire unique information specific to the registration target person, and may be any action. For example, when the instruction is for the information processing apparatus 1 to acquire fingerprint information of the registration target person, "pressing a finger against the scanner surface" may be instructed as the "predetermined action". Alternatively, having the registration target person perform a predetermined operation input on the information processing apparatus 1, or having the person move to a predetermined position, may be instructed as the "predetermined action". In such cases, the "registration target person" is not limited to a person to be registered in the information processing apparatus 1, and may be an instruction target person (subject) who performs a predetermined action in accordance with an instruction issued by the information processing apparatus 1.

For example, as shown in (a) of FIG. 3, the information processing apparatus 1 includes a voice input unit 31 that accepts voice input and a voice output unit 32 (instruction presentation unit), and acquires the speech "This is Mr. Suzuki." uttered by the user U, who is together with Mr. Suzuki and Mr. Sato. Using "Suzuki", the name of the registration target person contained in the speech uttered by the user U, the information processing apparatus 1 then outputs from the voice output unit 32 the voice instruction "Mr. Suzuki, please bring your face closer.", which instructs Mr. Suzuki to bring his face closer.

The information processing apparatus 1 further includes an imaging device 33, and images the face that is brought closer after the voice instruction is output. For example, as shown in (b) of FIG. 3, when Mr. Suzuki hears the voice instruction, approaches the information processing apparatus 1, and brings his face closer, the information processing apparatus 1 images the face and acquires a face image F such as the one shown in (c) of FIG. 3.

In this way, by issuing a voice instruction that contains the name of the registration target person (for example, "Suzuki") contained in the speech uttered by the user U, the information processing apparatus 1 causes the registration target person with that name to perform a predetermined action, such as bringing his or her face closer. In the example shown in (a) of FIG. 3, several people (that is, the user U, Mr. Suzuki, and Mr. Sato) hear the voice instruction from the information processing apparatus 1, but because the voice instruction addresses the registration target person by name, a specific registration target person can be made to perform the predetermined action. By having the person whose name is contained in the voice instruction (for example, Mr. Suzuki) bring his or her face closer, the face of the registration target person can be imaged correctly without being mistaken for another person.

The information processing apparatus 1 extracts one or more pieces of feature data from the face image F of Mr. Suzuki, who brought his face closer in accordance with the voice instruction, generates registration data by associating the extracted feature data with the person name "Suzuki" of the registration target person, and stores the registration data in the registration database 24 (see FIG. 1). Feature data is extracted from the face image F by, for example, calculating the position coordinates of the points P1 and P2 corresponding to the eyes, the point P3 corresponding to the nose, and the points P4 (center of the mouth), P5 (right end of the mouth), and P6 (left end of the mouth) corresponding to the mouth, as well as the distances between these points. Here, the position coordinates are the coordinates of each point in a coordinate system whose origin is a predetermined position in the face image F (for example, the lower-left corner). The feature data to be extracted is not limited to these. Any feature data, such as the image data of the face image F itself, the color of the face, or the shape of the ears, may be used as registration data. The face image F is also not limited to a planar image and may be a stereoscopic image.
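The landmark-based extraction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `extract_feature_data`, the dictionary layout, and the sample coordinates are all assumptions made for the example, and the landmark positions are taken as already detected.

```python
from itertools import combinations
from math import hypot


def extract_feature_data(landmarks):
    """Build feature data from facial landmark coordinates.

    `landmarks` maps a point label (P1..P6) to an (x, y) position in a
    coordinate system whose origin is the lower-left corner of the face
    image. The returned layout is hypothetical.
    """
    distances = {}
    # Distance between every pair of landmark points.
    for a, b in combinations(sorted(landmarks), 2):
        (xa, ya), (xb, yb) = landmarks[a], landmarks[b]
        distances[(a, b)] = hypot(xa - xb, ya - yb)
    return {"positions": dict(landmarks), "distances": distances}


# Made-up sample coordinates for the six points named in the text.
example_landmarks = {
    "P1": (30.0, 70.0),  # eye
    "P2": (70.0, 70.0),  # eye
    "P3": (50.0, 50.0),  # nose
    "P4": (50.0, 30.0),  # center of the mouth
    "P5": (35.0, 30.0),  # right end of the mouth
    "P6": (65.0, 30.0),  # left end of the mouth
}
features = extract_feature_data(example_landmarks)
```

Six points yield fifteen pairwise distances, so even this small landmark set gives a reasonably descriptive vector.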

The above description used as an example the case where the information processing apparatus 1 includes the imaging device 33 and issues an instruction voice (action instruction) to have the registration target person bring his or her face closer so that the face can be imaged, but the instruction voice is not limited to this. For example, after imaging Mr. Suzuki's face, the apparatus may additionally issue the voice instruction "Mr. Suzuki, please show your ear.", image Mr. Suzuki's ear as well, and add the shape of the ear to the feature data. When the information processing apparatus 1 includes a fingerprint sensor, it may issue the voice instruction "Mr. Suzuki, please press your finger against the scanner surface." and acquire Mr. Suzuki's fingerprint information as feature data.

Although an example has been described in which the action instruction for imaging the face of the registration target person is issued as speech, output of the action instruction is not limited to voice output. Any output method may be used as long as it can present to the registration target person that a predetermined action is to be performed. For example, the action instruction may be displayed as a character string on a display device (not shown).

(Hardware configuration of the information processing apparatus 1)
First, the hardware configuration of the information processing apparatus 1 is described with reference to FIG. 2. FIG. 2 is a block diagram showing an example of the hardware configuration of the information processing apparatus 1. For convenience of explanation, other functions that the information processing apparatus 1 may have, such as an operation control unit that controls posture and the like and a communication unit that performs data communication, are not described.

The information processing apparatus 1 includes a voice input unit 31, a voice output unit 32, an imaging device 33, a control unit 10, and a storage unit 20. The control unit 10 performs control so that the processing of each function of the information processing apparatus 1 is executed, and acquires the speech input from the voice input unit 31 and the captured image R captured by the imaging device 33. The control unit 10 also refers as appropriate to the recognition dictionary 21, the morphological analysis dictionary 22, the instruction template database 23 (see FIG. 1), and the like stored in the storage unit 20, and executes predetermined processing. The control unit 10 further stores the generated registration data in the registration database 24.

(Configuration of the information processing apparatus 1)
Next, the configuration of the information processing apparatus 1 is described with reference to FIG. 1. FIG. 1 is a block diagram showing an example of the schematic configuration of the information processing apparatus 1. As already described with reference to FIG. 2, the information processing apparatus 1 includes the voice input unit 31, the voice output unit 32, the imaging device 33, the control unit 10, and the storage unit 20. The control unit 10, which controls each function of the information processing apparatus 1, further includes a text conversion unit 11, a morphological analysis unit 12, an introduction sentence extraction unit 13, a person name extraction unit 14 (identification information extraction unit), a voice instruction generation unit 15 (instruction generation unit), an imaging device control unit 16, a feature data extraction unit 17 (unique information extraction unit), and a registration data management unit 18. The storage unit 20 includes a recognition dictionary 21, a morphological analysis dictionary 22, an instruction template database 23, and a registration database 24 (storage unit).

The voice input unit 31 is a microphone that accepts voice input. The voice data of the speech input from the voice input unit 31 is sent to the text conversion unit 11.

The voice output unit 32 is a speaker that outputs voice data as sound. The voice output unit 32 outputs the voice data of the voice instruction acquired from the voice instruction generation unit 15.

The imaging device 33 is a digital camera that acquires image data, for example a camera module that images an imaging target using a CCD sensor, a CMOS sensor, or the like. The image captured by the imaging device 33 is not limited to a two-dimensional image and may be a three-dimensional image.

The text conversion unit 11 calculates acoustic features of the voice data of the speech input from the voice input unit 31 and converts the voice data into text data while referring to the recognition dictionary 21. The recognition dictionary 21 defines an acoustic model and a language model, and the text conversion unit 11 converts the voice data into text data by assigning the corresponding text to each acoustic feature. Specifically, the recognition dictionary 21 represents the frequency characteristics of the speech contained in voice data, and defines the correspondence between a large number of speech waveform samples and the corresponding text data. The text conversion unit 11 searches the recognition dictionary 21 and identifies the text corresponding to the speech that matches the calculated acoustic features. The text conversion unit 11 sends the text data to the morphological analysis unit 12.

The morphological analysis unit 12 breaks the text data acquired from the text conversion unit 11 down into morphemes. A morpheme is the smallest linguistic unit that carries meaning, and is a constituent that indicates a grammatical relationship. The morphological analysis unit 12 refers to the morphological analysis dictionary 22 to break the text data down into its grammatical constituents and identifies the grammatical relationship of each constituent. The morphological analysis unit 12 outputs the analysis result, together with the analyzed text data, to the introduction sentence extraction unit 13. A specific example of morphological analysis by the morphological analysis unit 12 on text data obtained by converting speech in which the user U introduces the registration target person is described later.

Based on the result of the morphological analysis, the introduction sentence extraction unit 13 extracts, from the text data (text) generated by speech recognition, a sentence that contains the name of the registration target person. The introduction sentence extraction unit 13 may extract an introduction sentence or the like from speech in which the registration target person introduces himself or herself to another person, or in which a person other than the registration target person introduces the registration target person to a third party or to the information processing apparatus 1. For example, an introduction sentence in which the registration target person introduces himself, such as "I am Suzuki.", and an introduction sentence in which a person other than the registration target person (for example, the user U) introduces the registration target person, such as "This is Mr. Suzuki.", have a predetermined form characteristic of introduction sentences: the person name "Suzuki" is placed after a first-person or third-person pronoun (such as 私 or 僕 "I", 彼 "he", or こちら "this person"). The form of an introduction sentence containing the name of the registration target person is not limited to this. For example, text data generated from speech that Mr. Suzuki, the registration target person, utters without a first-person pronoun when introducing himself to the user U or Mr. Sato, such as "Suzuki here. Pleased to meet you all." or "Mr. Sato, nice to meet you. My name is Suzuki.", also has a sentence pattern of a predetermined form containing the name of the registration target person, and may be extracted as an introduction sentence. The introduction sentence extraction unit 13 extracts text data that has a sentence pattern of a predetermined form containing the name of the registration target person.
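One way to picture this pattern check is the sketch below. It is only an illustration under stated assumptions: the morphemes are supplied by hand as `(surface, tags)` pairs rather than produced by a real morphological analyzer, and the honorific set, the ending sequences, and the function name `is_introduction` are hypothetical choices, not rules taken from the patent.

```python
# Part-of-speech label the text gives for a surname morpheme.
PERSON_NAME_TAGS = ("noun", "proper noun", "person name", "surname")

# Hypothetical rule set: an optional honorific after the name, followed by
# one of these ending sequences, is treated as an introduction pattern.
HONORIFICS = {"さん", "様", "君"}
ENDING_SEQUENCES = [("です",), ("と", "いい", "ます")]


def is_introduction(tokens):
    """Return True if a person name is followed by an introduction ending."""
    surfaces = [surface for surface, _ in tokens]
    for i, (_, tags) in enumerate(tokens):
        if tags != PERSON_NAME_TAGS:
            continue
        j = i + 1
        if j < len(surfaces) and surfaces[j] in HONORIFICS:
            j += 1  # skip the honorific suffix
        for ending in ENDING_SEQUENCES:
            if tuple(surfaces[j:j + len(ending)]) == ending:
                return True
    return False


# Hand-written morphemes for 「こちらは鈴木さんです。」 ("This is Mr. Suzuki.")
introduction = [
    ("こちら", ("noun", "pronoun")),
    ("は", ("particle",)),
    ("鈴木", PERSON_NAME_TAGS),
    ("さん", ("noun", "suffix")),
    ("です", ("auxiliary verb",)),
    ("。", ("symbol",)),
]

# Hand-written morphemes for 「鈴木さん、お元気ですか。」 ("Mr. Suzuki, how are you?")
question = [
    ("鈴木", PERSON_NAME_TAGS),
    ("さん", ("noun", "suffix")),
    ("、", ("symbol",)),
    ("お", ("prefix",)),
    ("元気", ("noun",)),
    ("です", ("auxiliary verb",)),
    ("か", ("particle",)),
    ("。", ("symbol",)),
]
```

With these inputs, `is_introduction(introduction)` is true and `is_introduction(question)` is false. Sentences that merely address a person would need additional patterns.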

In text data that the introduction sentence extraction unit 13 has confirmed to be an introduction sentence, the person name extraction unit 14 extracts, as the text data of a person name, the noun that the morphological analysis unit 12 has identified as "noun, proper noun, person name, surname". The extracted person name text data is output to the voice instruction generation unit 15 and the registration data management unit 18.
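As a rough illustration of this step, the sketch below filters hand-written morphemes by the part-of-speech label quoted in the text. The `(surface, tags)` token layout and the function name `extract_person_names` are assumptions for the example. A real system would obtain the tokens from a morphological analyzer.

```python
# Part-of-speech label the text gives for a surname morpheme.
PERSON_NAME_TAGS = ("noun", "proper noun", "person name", "surname")


def extract_person_names(tokens):
    """Return the surface forms of all morphemes tagged as a surname."""
    return [surface for surface, tags in tokens if tags == PERSON_NAME_TAGS]


# Hand-written morphemes for 「こちらは鈴木さんです。」 ("This is Mr. Suzuki.")
tokens = [
    ("こちら", ("noun", "pronoun")),
    ("は", ("particle",)),
    ("鈴木", PERSON_NAME_TAGS),
    ("さん", ("noun", "suffix")),
    ("です", ("auxiliary verb",)),
    ("。", ("symbol",)),
]
names = extract_person_names(tokens)
```

Here `names` contains only the surname 鈴木 (Suzuki). The honorific さん is tagged as a suffix and is therefore left out.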

The voice instruction generation unit 15 acquires the text of the person name extracted by the person name extraction unit 14 (for example, "Suzuki"), inserts the person name at a predetermined position in an instruction template stored in the instruction template database 23, and generates a voice instruction containing the person name. A specific example of the process in which the voice instruction generation unit 15 generates a voice instruction containing the name of the registration target person introduced by the user U is described later.
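The template insertion can be sketched as below. The template keys, the English wording, and the function name `generate_instruction` are hypothetical. The actual templates stored in the instruction template database 23 are not reproduced here.

```python
# Hypothetical stand-in for the instruction template database 23.
# Each template has a {name} placeholder at the insertion position.
INSTRUCTION_TEMPLATES = {
    "approach_face": "Mr. {name}, please bring your face closer.",
    "show_ear": "Mr. {name}, please show your ear.",
    "fingerprint": "Mr. {name}, please press your finger against the scanner surface.",
}


def generate_instruction(template_key, person_name):
    """Insert the extracted person name into the selected template."""
    return INSTRUCTION_TEMPLATES[template_key].format(name=person_name)


instruction = generate_instruction("approach_face", "Suzuki")
```

The resulting text would then be passed to speech synthesis and output from the voice output unit 32, or shown on a display when the instruction is presented as text.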

The imaging device control unit 16 controls the imaging device 33 so that it captures an image when the registration target person, having heard the voice instruction output from the voice output unit 32, performs the predetermined action specified by the voice instruction. When the text data of the person name extracted by the person name extraction unit 14 is output from the person name extraction unit 14 to the voice instruction generation unit 15, the imaging device control unit 16 may be instructed to start imaging with the imaging device 33. Alternatively, when the voice instruction generation unit 15 outputs the generated voice instruction to the voice output unit 32, the imaging device control unit 16 may be instructed to start imaging with the imaging device 33.

The imaging device control unit 16 may also determine whether the captured image R captured by the imaging device 33 satisfies the conditions required for extracting feature data and, if the conditions are not satisfied, instruct the voice instruction generation unit 15 to output the same voice instruction or a voice instruction that instructs a related action (related voice instruction). The conditions required for extracting feature data are, for example, that the area of the face image F in the captured image R is at least a certain size, and that the position coordinates of the eyes, nose, and mouth (see points P1 to P6 in (c) of FIG. 3) can be determined in the captured face image F.
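A check of this kind might look like the sketch below. The threshold value, the input layout (a face area in pixels plus a dictionary of detected landmarks), and the names `meets_extraction_conditions` and `next_step` are assumptions for illustration. The text does not specify concrete values.

```python
REQUIRED_POINTS = ("P1", "P2", "P3", "P4", "P5", "P6")
MIN_FACE_AREA = 10_000  # hypothetical threshold, in pixels


def meets_extraction_conditions(face_area, landmarks):
    """Check the two example conditions named in the text.

    1. The face region is at least a certain size.
    2. The positions of the eyes, nose, and mouth can be determined.
    """
    if face_area < MIN_FACE_AREA:
        return False
    return all(landmarks.get(point) is not None for point in REQUIRED_POINTS)


def next_step(face_area, landmarks):
    """Decide whether to extract features or repeat the instruction."""
    if meets_extraction_conditions(face_area, landmarks):
        return "extract_features"
    return "repeat_instruction"


all_points = {point: (0.0, 0.0) for point in REQUIRED_POINTS}
mouth_hidden = dict(all_points, P4=None)
```

For example, `next_step(25_000, all_points)` selects feature extraction, while a face that is too small or has an undetected mouth point leads to the instruction being repeated.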

The feature data extraction unit 17 extracts, as feature data, the position coordinates of the points P1 and P2 corresponding to the eyes in the face image F, the point P3 corresponding to the nose, and the points P4 (center of the mouth), P5 (right end of the mouth), and P6 (left end of the mouth) corresponding to the mouth, as well as the distances between these points.

The registration data management unit 18 generates registration data by associating the feature data extracted by the feature data extraction unit 17 with the person name, and stores the registration data in the registration database 24.

As described above, when the information processing apparatus 1 receives voice input from the user U introducing a person to be registered, it issues a voice instruction that includes the person name of the registration target person so that the registration target person performs a predetermined action. Because the face of the registration target person can then be imaged without being mistaken for another person, the registration data for that person can be managed accurately.

The voice from which the person name of the registration target person is extracted is not limited to voice in which the user U introduces the registration target person to a third party or to the information processing apparatus 1, or voice in which the registration target person introduces himself or herself to another person. It may also be voice in which the user U speaks to or asks a question of the registration target person. In the case shown in FIG. 3, for example, it may be the user U asking Mr. Suzuki "Suzuki-san, how are you?" or saying to Mr. Sato "Sato-san, please come this way." In such cases, the introduction sentence extraction unit 13 (target sentence extraction unit) extracts, based on the result of morphological analysis, the text data of voice having a predetermined sentence pattern addressed to a person other than the speaker (a target sentence) from the text data generated by speech recognition. The person name extraction unit 14 can then extract, as the text data of a person name, the noun that the morphological analysis unit 12 has identified as "noun, proper noun, person name, surname" in the text data confirmed by the introduction sentence extraction unit 13 to have the predetermined sentence pattern.

(Extraction of a person name from text data)
Next, the process by which the person name extraction unit 14 extracts a person name contained in text data, based on the morphological analysis of that text data by the morphological analysis unit 12, is described with reference to FIG. 4. FIG. 4 shows an example of the process of extracting the text of a person name from text data. Part (a) of FIG. 4 illustrates morphological analysis and person name extraction for the text data 「こちらは鈴木さんです。」 ("This is Mr. Suzuki."), and part (b) illustrates the same for the text data 「こちらは鈴木さんと佐藤さんです。」 ("This is Mr. Suzuki and Mr. Sato.").

As shown in (a) of FIG. 4, the text data 「こちらは鈴木さんです。」 is decomposed into the morphemes 「こちら」 (kochira), 「は」 (wa), 「鈴木」 (Suzuki), 「さん」 (san), and 「です」 (desu). 「こちら」 is tagged as "noun, pronoun, general", 「は」 as "particle, binding particle", 「鈴木」 as "noun, proper noun, person name, surname", 「さん」 as "noun, suffix, person name", and 「です」 as "auxiliary verb". The person name extraction unit 14 extracts the text data 「鈴木」, which the morphological analysis unit 12 tagged as "noun, proper noun, person name, surname", as a person name.

As shown in (b) of FIG. 4, the text data 「こちらは鈴木さんと佐藤さんです。」 is decomposed into the morphemes 「こちら」 (kochira), 「は」 (wa), 「鈴木」 (Suzuki), 「さん」 (san), 「と」 (to), 「佐藤」 (Sato), 「さん」 (san), and 「です」 (desu). Both 「鈴木」 and 「佐藤」 are tagged as "noun, proper noun, person name, surname". The person name extraction unit 14 extracts the text data 「鈴木」 and 「佐藤」, which the morphological analysis unit 12 tagged as "noun, proper noun, person name, surname", as person names.

In this way, the person name extraction unit 14 may extract the text data of one or more person names from each piece of text data. Even when voice introducing several persons is input from the user U, the person name extraction unit 14 can therefore extract the person names of all registration target persons introduced by that voice.
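The extraction illustrated in FIG. 4 can be sketched as a filter over tagged morphemes. The following Python example assumes the morphological analysis result is already available as (surface form, tag tuple) pairs; the `Morpheme` type, the `extract_person_names` function, and the tag assigned to 「と」 are hypothetical and are not taken from the specification.

```python
from typing import List, Tuple

# (surface form, part-of-speech tags) as produced by some morphological analyzer.
Morpheme = Tuple[str, Tuple[str, ...]]

SURNAME_TAG = ("noun", "proper noun", "person name", "surname")


def extract_person_names(morphemes: List[Morpheme]) -> List[str]:
    """Return every morpheme tagged as a surname, in order of appearance."""
    return [surface for surface, tags in morphemes if tags == SURNAME_TAG]


# Hand-written analysis of 「こちらは鈴木さんと佐藤さんです。」 following (b) of FIG. 4.
tokens_b: List[Morpheme] = [
    ("こちら", ("noun", "pronoun", "general")),
    ("は", ("particle", "binding particle")),
    ("鈴木", SURNAME_TAG),
    ("さん", ("noun", "suffix", "person name")),
    ("と", ("particle",)),  # assumed tag; the source does not give one
    ("佐藤", SURNAME_TAG),
    ("さん", ("noun", "suffix", "person name")),
    ("です", ("auxiliary verb",)),
]

print(extract_person_names(tokens_b))  # ['鈴木', '佐藤']
```

Matching on the full tag tuple keeps the suffix 「さん」, which is also tagged with "person name", out of the result.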

(Generation of the voice to be output)
Next, examples of voice instructions that direct the registration target person to perform a predetermined action are described with reference to FIG. 5. FIG. 5 shows examples of templates for the voice output by the information processing apparatus 1. The voice instruction generation unit 15 reads instruction templates from the instruction template database 23 and generates a voice instruction by inserting the person name acquired from the person name extraction unit 14 into each instruction template.

The voice instruction 「鈴木さん、お顔を近づけてください。」 ("Suzuki-san, please bring your face closer.") output from the information processing apparatus 1 in (a) of FIG. 3 can be generated by inserting 「鈴木」 (Suzuki), the person name acquired from the person name extraction unit 14, into the "(person name)" slot of the instruction template for voice instruction A1, 「（人物名）さん、お顔を近づけてください。」 ("(person name)-san, please bring your face closer.").
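The slot filling described here amounts to simple string substitution. The sketch below is a minimal Python illustration in which a dictionary stands in for the instruction template database 23; the names `INSTRUCTION_TEMPLATES` and `generate_voice_instruction` and the `{person_name}` placeholder syntax are assumptions made for this example, and speech synthesis of the resulting text is not shown.

```python
# Hypothetical in-memory stand-in for the instruction template database 23.
INSTRUCTION_TEMPLATES = {
    "A1": "{person_name}さん、お顔を近づけてください。",
}


def generate_voice_instruction(template_id: str, person_name: str) -> str:
    """Insert the extracted person name into the "(person name)" slot."""
    return INSTRUCTION_TEMPLATES[template_id].format(person_name=person_name)


print(generate_voice_instruction("A1", "鈴木"))  # 鈴木さん、お顔を近づけてください。
```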

The imaging device control unit 16 may determine whether the captured image R captured by the imaging device 33 satisfies the conditions required for extracting feature data and, if the conditions are not satisfied, instruct the voice instruction generation unit 15 to output the same voice instruction or a voice instruction directing a related action (a related voice instruction), so that the face image is captured again.

For example, when the imaging device control unit 16 determines that the conditions for extracting feature data are not satisfied because the area of the face image F in the captured image R is smaller than a certain size, a voice instruction that makes the registration target person bring the face still closer may be generated by inserting the person name acquired from the person name extraction unit 14 into the "(person name)" slot of the instruction template for voice instruction A2, 「（人物名）さん、もう少し、お顔を近づけてください。」 ("(person name)-san, please bring your face a little closer."). Similarly, a voice instruction that makes the registration target person turn the face toward the apparatus may be generated by inserting the person name into the "(person name)" slot of the instruction template for voice instruction A3, 「（人物名）さん、お顔をこちらに向けてください。」 ("(person name)-san, please turn your face this way.").
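One way to choose between these related voice instructions is sketched below in Python. The rule that template A2 is used when the face area is too small follows the text; the rule that template A3 is used when the facial points cannot be located is an assumption made for this example, since the specification does not say when A3 is selected. The function name `choose_related_instruction` and its parameters are likewise hypothetical.

```python
from typing import Optional

TEMPLATE_A2 = "{person_name}さん、もう少し、お顔を近づけてください。"
TEMPLATE_A3 = "{person_name}さん、お顔をこちらに向けてください。"


def choose_related_instruction(
    person_name: str,
    face_area: int,
    min_area: int,
    landmarks_found: bool,
) -> Optional[str]:
    """Return a follow-up instruction, or None when no retry is needed."""
    if face_area < min_area:
        # Face too small in the captured image R: ask the person to come closer.
        return TEMPLATE_A2.format(person_name=person_name)
    if not landmarks_found:
        # Assumed trigger for A3: eyes, nose, or mouth could not be located.
        return TEMPLATE_A3.format(person_name=person_name)
    return None
```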

Before the voice instruction is generated, auxiliary phrase voice such as that shown in (b) of FIG. 5 may be generated and output. The auxiliary phrase voice is generated, for example, by inserting the person name of the registration target person acquired from the person name extraction unit 14 into the "(person name)" slot of the auxiliary phrase X1 template 「（人物名）さん、はじめまして。」 ("(person name)-san, nice to meet you."). Outputting such voice prompts the registration target person to get ready to listen and raises interest in the voice instruction that follows.

Furthermore, when the face has been imaged successfully, auxiliary phrase voice such as that shown in (b) of FIG. 5 may be generated and output. The auxiliary phrase voice is generated, for example, by inserting the person name of the registration target person acquired from the person name extraction unit 14 into the "(person name)" slot of the auxiliary phrase X2 template 「（人物名）さん、お疲れ様でした。」 ("(person name)-san, thank you for your effort.") or the auxiliary phrase X3 template 「（人物名）さん、ありがとう。」 ("(person name)-san, thank you."). Outputting such voice informs both the user U who introduced the registration target person and the registration target person that imaging of the face is complete, that is, that processing has moved on to extracting feature data and generating registration data.

(Flow of the process by which the information processing apparatus 1 registers the person name and feature data of a registration target person in association with each other)
The flow of the process by which the information processing apparatus 1 registers the person name of a registration target person in association with feature data is described here with reference to FIG. 6. FIG. 6 is a flowchart showing an example of this process.

First, the information processing apparatus 1 starts accepting voice input from the voice input unit 31 and receives, from the user U, voice input introducing a registration target person (voice input step). The text conversion unit 11 calculates the acoustic features of the voice data of the input voice (S1). Next, the text conversion unit 11 converts the voice data into text data based on the correspondence between acoustic features and text defined in the recognition dictionary 21 (S2). The morphological analysis unit 12 then decomposes the text data into its grammatical constituents and, referring to the morphological analysis dictionary 22, identifies the grammatical role of each constituent (S3). When the introduction sentence extraction unit 13 confirms from the result of the morphological analysis that the text data is an introduction sentence introducing a registration target person (S4), the person name extraction unit 14 extracts from that text data, as a person name, each word (noun) that the morphological analysis unit 12 has tagged as, for example, "noun, proper noun, person name, surname" (S5: identification information extraction step). The person name extraction unit 14 also sends the extracted person name to the voice instruction generation unit 15 and the registration data management unit 18.

Having acquired the text of the person name, the voice instruction generation unit 15 reads instruction templates from the instruction template database 23 and inserts the person name acquired from the person name extraction unit 14 into each instruction template, thereby generating a voice instruction directing the registration target person to perform a predetermined action (S6: instruction generation step). Before or immediately after generating the voice instruction, the voice instruction generation unit 15 sends the imaging device control unit 16 an instruction to start imaging with the imaging device 33, and imaging by the imaging device 33 starts (S7). The voice instruction is then output from the voice output unit 32 (S8: instruction presentation step). The order of S6 to S8 is only an example and is not limiting. For example, imaging by the imaging device 33 may be started immediately after the voice instruction is output from the voice output unit 32.

Because the voice instruction output by the information processing apparatus 1 includes the person name of the registration target person, it addresses only that person and directs only that person to perform the predetermined action. For example, as shown in (a) of FIG. 3, Mr. Suzuki, the registration target person, hears the voice instruction "Suzuki-san, please bring your face closer." and brings his face close to the information processing apparatus 1, whereas the user U and Mr. Sato, who have not been instructed, do not. The information processing apparatus 1 acquires the captured image R of the approaching face as the face image F of Mr. Suzuki, the registration target person.

The imaging device control unit 16 detects the face image F in the captured image R (S9) and determines whether the conditions required for extracting feature data are satisfied (S10). FIG. 6 illustrates the case where the condition is that the area of the face image F in the captured image R is at least a threshold size. If this condition is not satisfied (NO in S10), the imaging device control unit 16 instructs the voice instruction generation unit 15 to output the same voice instruction or a voice instruction directing a related action (a related voice instruction), and the process returns to S8. If feature data can be extracted (YES in S10), the captured image R is sent from the imaging device control unit 16 to the feature data extraction unit 17, which extracts the feature data of the face image F (S11).

Finally, the registration data management unit 18 stores in the registration database 24 the registration data of the registration target person, in which the feature data acquired from the feature data extraction unit 17 is associated with the person name acquired from the person name extraction unit 14 (S12: registration data management step).
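Steps S6 to S12, including the return to S8 when the condition in S10 fails, can be summarized in the following Python sketch. It is only an outline under stated assumptions: the hardware and analysis components are passed in as plain callables, a dictionary stands in for the registration database 24, and the `max_attempts` bound is an addition for this example (the flowchart as described loops back without a stated limit). All function and parameter names are hypothetical.

```python
from typing import Any, Callable, Dict


def register_person(
    person_name: str,
    speak: Callable[[str], None],            # stands in for the voice output unit 32
    capture: Callable[[], Any],              # stands in for the imaging device 33
    detect_face_area: Callable[[Any], int],  # face detection on the captured image R
    extract_features: Callable[[Any], Any],  # stands in for the feature data extraction unit 17
    database: Dict[str, Any],                # stands in for the registration database 24
    min_area: int = 10_000,                  # hypothetical threshold
    max_attempts: int = 3,                   # assumption: bound the retry loop
) -> bool:
    """Instruct the named person, capture the face, and store name with features."""
    speak(f"{person_name}さん、お顔を近づけてください。")        # S6, S8
    for _ in range(max_attempts):
        image = capture()                                          # S7
        if detect_face_area(image) >= min_area:                    # S9, S10
            database[person_name] = extract_features(image)        # S11, S12
            return True
        # NO in S10: issue a related voice instruction and try again (back to S8).
        speak(f"{person_name}さん、もう少し、お顔を近づけてください。")
    return False
```

Passing the components as callables keeps the sketch independent of any particular camera, speech, or face recognition library.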

As described above, when the information processing apparatus 1 receives voice input introducing a person to be registered, it issues a voice instruction that includes the person name of the registration target person so that the registration target person performs a predetermined action. Because the face of the registration target person can then be imaged without being mistaken for another person, the registration data for that person can be managed accurately.

[Embodiment 2]
The example above described the information processing apparatus 1 as including the voice input unit 31, the voice output unit 32, and the imaging device 33, and as generating and managing registration data for the registration target person introduced by the user U, but the configuration is not limited to this. For example, as long as data can be exchanged between the information processing apparatus 1 and the voice input unit 31, the voice output unit 32, and the imaging device 33, these units may be configured separately from the information processing apparatus 1.

In this case, the voice data of the voice input to the voice input unit 31 is transmitted to the information processing apparatus 1. The information processing apparatus 1 converts the received voice data into text data, performs morphological analysis on the text data, and extracts the person name of the registration target person contained in the text data. The information processing apparatus 1 transmits the voice data of the voice instruction generated using that person name to the voice output unit 32, and also transmits an instruction to start imaging to the imaging device 33.

The captured image R captured by the imaging device 33 is transmitted to the information processing apparatus 1. The information processing apparatus 1 extracts the feature data of the registration target person from the face image F in the captured image R, generates registration data by associating the feature data with the person name of the registration target person, and stores the registration data in the storage unit 20.

In this way, the voice input unit 31, the voice output unit 32, and the imaging device 33 can also be installed at a location away from where the control unit 10 of the information processing apparatus 1 is installed.

[Embodiment 3]
The control blocks of the information processing apparatus 1 (in particular, the text conversion unit 11, the morphological analysis unit 12, the introduction sentence extraction unit 13, the person name extraction unit 14, the voice instruction generation unit 15, the imaging device control unit 16, the feature data extraction unit 17, and the registration data management unit 18) may be realized by a logic circuit (hardware) formed in an integrated circuit (IC chip) or the like, or may be realized by software using a CPU (Central Processing Unit).

In the latter case, the information processing apparatus 1 includes a CPU that executes the instructions of a program, which is software realizing each function; a ROM (Read Only Memory) or a storage device (referred to as a "recording medium") on which the program and various data are recorded so as to be readable by a computer (or the CPU); a RAM (Random Access Memory) into which the program is loaded; and the like. The object of the present invention is achieved when the computer (or the CPU) reads the program from the recording medium and executes it. As the recording medium, a "non-transitory tangible medium" such as a tape, a disk, a card, a semiconductor memory, or a programmable logic circuit can be used. The program may also be supplied to the computer via any transmission medium capable of transmitting it (a communication network, a broadcast wave, or the like). The present invention can also be realized in the form of a data signal embedded in a carrier wave, in which the program is embodied by electronic transmission.

[Summary]
An information processing apparatus 1 according to aspect 1 of the present invention is an information processing apparatus including a voice input unit 31 that receives voice input, and includes: an identification information extraction unit (person name extraction unit 14) that extracts identification information of a target person from voice data of voice that is input from the voice input unit and contains the identification information of the target person; an instruction generation unit (voice instruction generation unit 15) that generates an action instruction directing the target person to perform a predetermined action, the action instruction including the identification information extracted by the identification information extraction unit; and an instruction presentation unit (voice output unit 32) that presents the action instruction to the target person.

According to the above configuration, the identification information of the target person is extracted from voice containing that identification information, and an action instruction including the identification information is presented to the target person so that the target person performs a predetermined action. The correct target person can thus be made to perform the predetermined action without being mistaken for another person.

In an information processing apparatus according to aspect 2 of the present invention, in aspect 1 above, the voice containing the identification information of the target person may be voice uttered by a person different from the target person.

The identification information of the target person can thus be extracted from voice that contains it and is uttered by a person different from the target person. The identification information can therefore be extracted from, for example, voice in which a person different from the target person speaks to or asks a question of the target person, or voice introducing the target person.

An information processing apparatus according to aspect 3 of the present invention may, in aspect 2 above, further include an introduction sentence extraction unit 13 that extracts, from text generated by speech recognition of the voice data input from the voice input unit, an introduction sentence in which a person different from the target person introduces the target person, and outputs the introduction sentence to the identification information extraction unit.

According to the above configuration, the identification information of the target person is extracted from an introduction sentence in which a person different from the target person introduces the target person. For example, an introduction sentence can be extracted from the voice uttered when a person different from the target person introduces the target person to a third party or to the information processing apparatus. Because such an introduction sentence has a predetermined sentence pattern that includes the identification information of the target person, selectively extracting introduction sentences from the voice makes it possible to extract the identification information of the target person efficiently.

In an information processing apparatus according to aspect 4 of the present invention, in any of aspects 1 to 3 above, the identification information may be a person name indicating the target person, and the instruction generation unit may generate the action instruction including the person name extracted by the identification information extraction unit as the identification information.

According to the above configuration, an instruction including the person name of the target person is generated. The action instruction can thus be output in a way that makes clear to whom it is addressed.

An information processing apparatus according to aspect 5 of the present invention may, in any of aspects 1 to 4 above, further include a unique information extraction unit (feature data extraction unit 17) that extracts unique information specific to the target person from a face image obtained by imaging the face of the target person, and may store, in a storage unit (storage unit 20, registration database 24), registration data on the target person in which the identification information of the target person who performed the predetermined action is associated with the unique information of that target person extracted by the unique information extraction unit.

According to the above configuration, unique information specific to the target person is extracted from a face image obtained by imaging the face of the target person, and the identification information and the unique information of the target person are stored in association with each other. Because faces generally differ from person to person, unique information specific to each target person can be extracted from the captured face image. The face of the target person can thus be imaged without being mistaken for another person, so the registration data on that target person can be managed accurately.

A control method of an information processing apparatus 1 according to aspect 6 of the present invention is a control method of an information processing apparatus including a voice input unit 31 that receives voice input, and includes: a voice input step (S1) of receiving, from the voice input unit, input of voice containing identification information of a target person; an identification information extraction step (S5) of extracting the identification information of the target person from the voice data of the voice received in the voice input step; an instruction generation step (S6) of generating an action instruction directing the target person to perform a predetermined action, the action instruction including the identification information extracted in the identification information extraction step; and an instruction presentation step (S8) of presenting the action instruction generated in the instruction generation step to the target person. This configuration provides the same effect as aspect 1 above.

The information processing apparatus according to each aspect of the present invention may be realized by a computer. In that case, a control program of the information processing apparatus that realizes the information processing apparatus on the computer by causing the computer to operate as each unit (software element) of the information processing apparatus, and a computer-readable recording medium on which the program is recorded, also fall within the scope of the present invention.

The present invention is not limited to the embodiments described above, and various modifications are possible within the scope of the claims. Embodiments obtained by appropriately combining the technical means disclosed in different embodiments are also included in the technical scope of the present invention. Furthermore, new technical features can be formed by combining the technical means disclosed in the embodiments.

The present invention can be used in electronic devices, robots, and the like that have a function for communicating with humans.

1 Information processing apparatus
13 Introduction sentence extraction unit
14 Person name extraction unit (identification information extraction unit)
15 Voice instruction generation unit (instruction generation unit)
17 Feature data extraction unit (unique information extraction unit)
18 Registration data management unit
20 Storage unit
24 Registration database (storage unit)
31 Voice input unit
32 Voice output unit (instruction presentation unit)
33 Imaging device

Claims

An information processing apparatus comprising a voice input unit that receives voice input, the information processing apparatus comprising:
an identification information extraction unit that extracts identification information of a target person from voice data of voice that is input from the voice input unit and includes the identification information of the target person;
an instruction generation unit that generates an operation instruction instructing the target person to perform a predetermined operation, the operation instruction including the identification information extracted by the identification information extraction unit; and
an instruction presentation unit that presents the operation instruction to the target person,
the information processing apparatus further comprising an introduction sentence extraction unit that extracts, from text generated by speech recognition of the voice data input from the voice input unit, an introduction sentence in which a person other than the target person introduces the target person, and outputs the introduction sentence to the identification information extraction unit.

The information processing apparatus according to claim 1, wherein the identification information is a person's name indicating the target person, and
the instruction generation unit generates the operation instruction including, as the identification information, the person's name extracted by the identification information extraction unit.

The information processing apparatus according to claim 1 or 2, further comprising a unique information extraction unit that extracts unique information specific to the target person from a face image obtained by imaging the face of the target person,
wherein registration data on the target person, in which the identification information of the target person who has performed the predetermined operation is associated with the unique information of the target person extracted by the unique information extraction unit, is stored in a storage unit.

A control program for causing a computer to function as the information processing apparatus according to claim 1, wherein the control program causes the computer to function as the identification information extraction unit, the instruction generation unit, and the introduction sentence extraction unit.

A computer-readable recording medium on which the control program according to claim 4 is recorded.