WO2019146422A1

WO2019146422A1 - Information processing device, information processing method, program, and robot

Info

Publication number: WO2019146422A1
Application number: PCT/JP2019/000607
Authority: WO
Inventors: 井手　直紀; アンドリューシン
Original assignee: ソニー株式会社
Priority date: 2018-01-25
Filing date: 2019-01-10
Publication date: 2019-08-01

Abstract

The present invention enables a person's behavior to be satisfactorily imitated without user registration. It is provided with a processing unit that obtains a per-person score for action generation on the basis of data of a person detected from image or voice input data. The score includes, for example, an impression score obtained by evaluating whether a person's behavior is good or bad, or a relationship score obtained by evaluating relationships between persons. The processing unit, for example, generates an action corresponding to each person using the per-person score.

Description

INFORMATION PROCESSING APPARATUS, INFORMATION PROCESSING METHOD, PROGRAM, AND ROBOT

The present technology relates to an information processing apparatus, an information processing method, a program, and a robot, and more particularly to an information processing apparatus suitable for being applied to a robot aiming at interaction with people.

Robots that aim to interact with people are expected to behave like people. One example of imitating human behavior is to act on the basis of the impression the robot has of the person it is dealing with.

For example, Patent Document 1 describes a robot that determines an action using a positivity index. This robot requires the user to perform a user registration operation, so it responds naturally to a person who has been registered but not to a person who has not. For example, when the robot is purchased by a family, it does not react naturally to a family member who forgot to register. In addition, the user registration operation itself is not human-like.

Patent Document 1: JP 2009-266200 A

The purpose of the present technology is to better simulate human behavior without performing user registration.

The concept of the present technology is an information processing apparatus including a processing unit that obtains a score for action generation for each person based on data of a person detected from input data of an image or a sound.

In the present technology, the processing unit obtains a score for action generation for each person based on the data of the person detected from the input data of the image or the sound. For example, the score may be made to include an impression score obtained by evaluating the quality of the person's behavior. Also, for example, the score may be made to include a relationship score obtained by evaluating a person relationship.

Further, for example, the processing unit may collate the data of the detected person with the data of a predetermined number of stored persons. When there is no match, the data of the detected person and the score obtained based on that data are stored as data of a new person; when there is a match, the stored score of the corresponding person is updated using the score obtained based on the data of the detected person.

Also, for example, the processing unit may perform the collation using an identification module trained with a metric learning network configuration that realizes semi-supervised learning and few-shot learning. In this case, for example, the identification module converts the data of the detected person into a feature amount, calculates the component of the converted feature amount along the feature amount of each of the predetermined number of stored persons, obtains from the calculated components a score of each stored person's data with respect to the detected person's data, and, based on these scores, outputs either information on the matching person or information indicating that no person matches. In this case, for example, the component may be calculated by converting the feature amount of the stored person's data into a unit vector and taking the inner product of the feature amount of the detected person's data with that unit vector.

Also, for example, the processing unit may be configured to further generate an action corresponding to each person using the score for each person. In this case, for example, the score includes an impression score obtained by evaluating a person's behavior and a relationship score obtained by evaluating relationships between persons, and the processing unit may generate an action based on an integrated score calculated from the impression score and the relationship score.

As described above, in the present technology, a score for action generation for each person is obtained based on data of a person detected from input data of images or sounds. Therefore, it is possible to better simulate human behavior without performing user registration. In this case, for example, for a person who is frequently encountered, a person who often makes a positive impression, or a person who has good relationships with others, interaction is performed based on actions that naturally accompany a positive impression evaluation. Also, for example, for a person met for the first time or a person the robot does not get along with, interaction is performed based on actions that accompany the corresponding impression evaluation.

According to the present technology, it is possible to better simulate human behavior without performing user registration. The effect described here is not necessarily limiting, and any effect described in the present disclosure may be obtained.

FIG. 1 is a block diagram showing a configuration example of an information processing apparatus as an embodiment.
FIG. 2 is a diagram showing a storage example of the per-person score table.
FIG. 3 is a flowchart showing the processing procedure in the action generation module.
FIG. 4 is a diagram showing a schematic operation flow of the information processing apparatus.
FIG. 5 is a diagram for explaining the person identification module.
FIG. 6 is a diagram for explaining learning of the person identification module.
FIG. 7 is a diagram showing an embedding network.
FIG. 8 is a diagram showing a metric learning network that realizes semi-supervised learning and few-shot learning.
FIG. 9 is a diagram showing a network with normalized support features.
FIG. 10 is a diagram showing comparison results on the data set called "Omniglot", known as a performance evaluation benchmark for few-shot learning.
FIG. 11 is a diagram showing comparison results on the data set called "miniImagenet", known as a performance evaluation benchmark for few-shot learning.
FIG. 12 is a diagram showing a network to which known/unknown determination is added.

Hereinafter, modes for carrying out the invention (hereinafter referred to as “embodiments”) will be described. The description will be made in the following order.
1. Embodiment
2. Modified example

<1. First embodiment>
[Information processing device]
FIG. 1 shows a configuration example of an information processing apparatus 100 as an embodiment. The information processing apparatus 100 is included in a robot (agent) intended to interact with people. The information processing apparatus 100 includes an input unit 101, a person area/section detection module 102, a person identification module 103, an impression score calculation module 104, a relationship score calculation module 105, a per-person score update module 106, a per-person score table 107, an action generation module 108, and an output unit 109.

The input unit 101 includes an image sensor and a microphone, obtains image data with the image sensor, and obtains audio data with the microphone. In this case, the image sensor functions as the eye of the robot (agent), and the microphone functions as the ear of the robot (agent). Here, the image sensor and the microphone are not limited to one each, and only one of the image sensor and the microphone may be provided.

The person area/section detection module 102 detects an area where a person exists from the image data obtained by the input unit 101 using, for example, image recognition technology, and treats the image data of that area as detected person data. It also detects a voice section of a person from the voice data obtained by the input unit 101 using, for example, voice recognition technology, and treats the voice data of that section as detected person data. The detected person data is sent to the person identification module 103, the impression score calculation module 104, and the relationship score calculation module 105.

The person identification module 103 collates the detected person data with the data (identification data) of a predetermined number of stored persons held in the per-person score table 107, and outputs either information on the matching stored person or information indicating that the data matches none of the stored persons. In this embodiment, the person identification module 103 is an identification module that has learned the collation using a prototype network configuration for few-shot learning. Details of this identification module will be described later.

The impression score calculation module 104 uses a learner trained on a data set labeled with good and bad behavior to generate a score that evaluates how good or bad the behavior in the detected person data is, and outputs this score as an impression score.

For example, information on good feeling (good) and aversion (evil) obtained from images includes the following.
Good feeling (good):
・ Is often seen by the robot (agent), especially smiling and approaching the agent
・ Is together with a person with whom the robot (agent) has high familiarity
・ A person with high familiarity is smiling and conversing animatedly with the person
・ You will get rid of it (tactile sensor)
Aversion (evil):
・ Approaches the robot (agent) with a hostile expression
・ A person with high familiarity shows a disgusted expression or is being deceived
・ Betrayal

Also, for example, information on good feeling (good) and aversion (evil) obtained from speech includes the following.
Good feeling (good):
・ Complains in the conversation, asks good questions, talks infrequently
Aversion (evil):
・ Asks inappropriate questions (sexual harassment, power harassment)

The relationship score calculation module 105 uses a learner trained on a data set labeled with relationships among a plurality of persons to generate a score that evaluates the relationship among the plurality of persons in the detected person data, and outputs this score as a relationship score.

For example, information on good and bad relationships obtained from images includes the following.
Good relationship:
・ Several people are talking with smiles
・ Several people are walking holding hands
Bad relationship:
・ Several people are jealous
・ Several people are walking away from each other

Also, for example, information on good and bad relationships obtained from speech includes the following.
Good relationship:
・ Two or more people are laughing and having a conversation
Bad relationship:
・ Several people are talking in loud voices

When the detected person data detected by the person area/section detection module 102 contains data of a plurality of persons, the person identification module 103 performs person identification on the data of each person, and the impression score calculation module 104 performs the impression score calculation process on the data of each person. The relationship score calculation module 105 performs the relationship score calculation process only in this case.

The per-person score update module 106 updates the per-person score table 107 based on the output information of the person identification module 103. In the per-person score table 107, identification data, an impression score, and a relationship score are stored for each person, identified by a person ID. FIG. 2 shows a storage example of the per-person score table 107. In the illustrated example, there are entries for three persons identified by the person IDs A, B, and C, and identification data, an impression score, and a relationship score are stored in each entry.

When the person identification module 103 outputs information indicating that the data matches none of the stored persons, the per-person score update module 106 creates a new person's entry in the per-person score table 107 so that it can be identified by a person ID, and stores identification data, an impression score, and a relationship score in that entry.

In this case, detected person data relating to the above-described collation is stored as identification data. As the impression score, the impression score obtained by the impression score calculation module 104 based on the detected person data related to the above-mentioned matching is stored. In addition, as the relationship score, the relationship score obtained by the relationship score calculation module 105 based on the detected person data related to the above-mentioned matching is stored.

In addition, when the person identification module 103 outputs information on a matching stored person, the per-person score update module 106 updates the impression score and the relationship score in the entry of the corresponding person in the per-person score table 107 using the impression score obtained by the impression score calculation module 104 and the relationship score obtained by the relationship score calculation module 105, respectively. This updating can be performed, for example, by adding the new score to the existing score with appropriate weighting.
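
The registration and update behavior of the per-person score table can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the names `PersonEntry` and `update_table`, the ID scheme, and the use of an exponential moving average as the "appropriate weighting" (with a hypothetical `weight` parameter) are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class PersonEntry:
    """One entry of the per-person score table (cf. FIG. 2)."""
    identification_data: list      # stored detected person data
    impression: float              # impression score, e.g. impr_A
    relationship: dict = field(default_factory=dict)  # other ID -> score


def update_table(table, matched_id, detected, impr, relv, weight=0.2):
    """Register a new person or blend new scores into an existing entry."""
    if matched_id is None:
        # No stored person matched: create a new entry under a fresh ID.
        new_id = f"P{len(table)}"
        table[new_id] = PersonEntry(detected, impr, dict(relv))
        return new_id
    entry = table[matched_id]
    # Weighted addition of the new score to the existing score
    # (assumed here to be an exponential moving average).
    entry.impression = (1 - weight) * entry.impression + weight * impr
    for other, score in relv.items():
        old = entry.relationship.get(other, 0.0)
        entry.relationship[other] = (1 - weight) * old + weight * score
    return matched_id
```

Calling `update_table(table, None, ...)` corresponds to the case where no stored person matches, and passing an existing ID corresponds to the update case.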

The action generation module 108 determines an action for the person identified by the person identification module 103 using the scores related to that person stored in the per-person score table 107 (in this embodiment, the impression score and the relationship score), and sends the action information to the output unit 109.

The flowchart of FIG. 3 shows the processing procedure in the action generation module 108. First, in step ST1, person identification information is received from the person identification module 103. Next, in step ST2, the impression score is read from the per-person score table 107 through the per-person score update module 106. For example, in the example of the table in FIG. 2, in the case of the person indicated by the person ID A, “impr_A” is read as the impression score.

Next, in step ST3, the relationship score is read from the per-person score table 107 through the per-person score update module 106. For example, in the example of the table in FIG. 2, in the case of the person indicated by the person ID A, “relv_AB” and “relv_AC” are read as the relationship scores.

Next, in step ST4, a correction score "comp_X" is calculated from the relationship scores and impression scores according to the following equation (1). In this equation, "X" indicates the person indicated by the person identification information received in step ST1, and "S" indicates each person on the other side of a relationship with X.
comp_X = Σ_{S} relv_XS * impr_S (1)

Next, in step ST5, an integrated score "comb_X" is formed from the impression score and the correction score as shown in the following equation (2). Next, in step ST6, the action score "vX" is calculated from the contents of the integrated score based on the following equation (3). In Equation (3), a and b are constants or coefficients that can be learned.
comb_X = {impr_X, comp_X} (2)
vX = a * impr_X + b * comp_X (3)

Next, in step ST7, the action score is used to determine an action as shown in the following equation (4), and in step ST8, information on the determined action is sent to the output unit 109.
if (vA_min < vX < vA_max)
    select action_A (4)

In the formula (4), “vA_min” and “vA_max” indicate threshold values for determining the action A (action_A). A plurality of such threshold ranges are prepared, and one action is determined from the plurality of action candidates depending on which threshold range the action score falls within.
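
Steps ST4 to ST7 can be sketched as follows. This is a minimal illustration under assumptions: the function names, the dictionary layout of the scores, the threshold ranges, the action names, and all numeric values are hypothetical and chosen only to exercise Equations (1), (3), and (4).

```python
def action_score(person, impr, relv, a=1.0, b=1.0):
    """Steps ST4 to ST6: Equations (1) and (3)."""
    # Equation (1): correction score, summed over relationship partners S.
    comp = sum(r * impr[s] for s, r in relv.get(person, {}).items())
    # Equation (3): weighted sum of impression and correction scores.
    return a * impr[person] + b * comp


def select_action(v, ranges, default="no_action"):
    """Step ST7: Equation (4), pick the action whose range contains v."""
    for name, (v_min, v_max) in ranges.items():
        if v_min < v < v_max:
            return name
    return default


# Hypothetical scores for persons A, B, C (cf. FIG. 2).
impr = {"A": 0.5, "B": 1.0, "C": -0.5}
relv = {"A": {"B": 0.5, "C": 1.0}}          # relv_AB, relv_AC
ranges = {"avoid": (-10.0, 0.0), "greet": (0.0, 10.0)}
v_a = action_score("A", impr, relv)
chosen = select_action(v_a, ranges)
```

With these values, the good impression of B and the bad impression of C cancel in the correction score, so the action for A is driven by A's own impression score.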

Referring back to FIG. 1, the output unit 109 executes an action based on the action information sent from the action generation module 108. The output unit 109 includes, for example, an actuator, a display, a speaker, and the like.

A specific example of the action will be described.
In the case of a robot dog:
Intimacy: toss, wag its tail, turn around, blink somewhere
Aversion: move away, back off, threaten, bark
In the case of a dialogue agent:
Intimacy: strong intonation, toned, call the person by name, try to continue the conversation
Aversion: no intonation, overly polite language, clamshell tone
Other (medium intimacy): deliberately ignore the person to attract the user's attention, randomly alternate between intimacy and aversion

FIG. 4 shows a schematic operation flow of the information processing apparatus 100 shown in FIG. 1 described above. First, in step ST21, the person area/section detection module 102 of the information processing apparatus 100 receives image and sound data from the input unit 101. Next, in step ST22, the person area/section detection module 102 detects an area where a person exists from the received image data and treats the image data of that area as detected person data; it likewise detects a voice section of a person from the voice data and treats the voice data of that section as detected person data.

Next, in step ST23, the information processing apparatus 100 causes the person identification module 103 to identify a person based on the detected person data. Further, in step ST24, the information processing apparatus 100 causes the impression score calculation module 104 to calculate an impression score based on the detected person data. Further, in step ST25, the information processing apparatus 100 causes the relationship score calculation module 105 to calculate a relationship score based on the detected person data. The processes from step ST23 to step ST25 do not have to be sequentially performed, but may be performed in parallel.

Next, in step ST26, the information processing apparatus 100 causes the per-person score update module 106 to update the per-person score table 107 based on the processing results of steps ST23 to ST25.

When the person identification module 103 outputs information indicating that the data matches no stored person, a new person's entry is created in the per-person score table 107 so that it can be identified by a person ID, and identification data, an impression score, and a relationship score are stored in that entry. In this case, the detected person data is used as the identification data, the impression score obtained by the impression score calculation module 104 is used as the impression score, and the relationship score obtained by the relationship score calculation module 105 is used as the relationship score.

On the other hand, when the person identification module 103 outputs information on a matching stored person, the impression score and the relationship score in the entry of the corresponding person in the per-person score table 107 are updated using the impression score obtained by the impression score calculation module 104 and the relationship score obtained by the relationship score calculation module 105, respectively.

Next, in step ST27, the information processing apparatus 100 causes the action generation module 108 to determine an action for the person identified by the person identification module 103. In this case, the action is determined (selected) based on the scores (impression score, relationship score) related to the person stored in the per-person score table 107.

Next, in step ST28, the information processing apparatus 100 sends the information of the determined action from the action generation module 108 to the output unit 109. Thus, the output unit 109 executes an action based on the action information.

In the information processing apparatus 100 shown in FIG. 1 described above, the following basic operations (1) to (4) are performed.
(1) A predetermined action is executed with high probability for a user (person) to whom either of the following applies.
a. A user who is identified as a "person" from image and sound data within a predetermined period after the power is first turned on
b. A user who is determined to have been identified in the past by "matching" the input image data and voice data against past data

(2) The probability of the predetermined action changes based on the following inputs.
a. The time from first power-on to detection of the user
b. The number of persons already stored in memory
c. The number of times the input data has been judged identical to the data registered first
d. The scores stored for each user (impression score, relationship score, etc.)

(3) The scores are calculated for each user under the following conditions.
a. The impression score is calculated by an impression score evaluation module prepared in advance
b. The relationship score is calculated by a relationship score evaluation module prepared in advance
(4) Based on the inputs, the action execution probability is learned so as to maximize feedback from the user

In the information processing apparatus 100 shown in FIG. 1 described above, the following operations (1) and (2) can be considered.
(1) Decrease the score of people who have not met very often

(2) Store the number of days since registration, and control the action by rules according to the number of days
a. Within a first number of days: good impression
b. At or beyond a second number of days: forgetting
c. Between the first and second numbers of days: take attention-seeking actions, for example, probabilistically ignoring the person on purpose or becoming more familiar with other people

Next, use cases of the information processing apparatus 100 shown in FIG. 1 will be described. First, the use case of “Interaction with User Family” is described in bullet form.
(1) Father buys a robot dog
(2) A child turns on the power
a. Start the camera and detect a person (child 1) from the image
b. Activate the microphone and detect the person (child 1) from the voice
c. Store the image and sound data of the person (child 1)

(3) Child 2 appears in the image at the same time
・ Because this is within the predetermined period (the same day), child 2 is stored in the same way as child 1
(4) For children 1 and 2, interaction actions are selected with high probability

(5) The next day, the mother appears
a. At first, only the image and sound are stored in the database; the impression score that determines the action is 0
b. After that, the score rises as they meet several times
・ Doing good deeds such as "house chores" raises the "impression score"
c. Once the score exceeds a certain level, predetermined interactions (such as jumping up) appear

(6) Father comes home drunk
a. Impression score is 0 at first sight
b. The score changes at the next encounter
・ Bad behavior such as "run up" lowers the "impression score"
c. Once the score passes a certain level, predetermined interactions (such as aversion) appear

Next, the use case of “Interaction with User Friends” is described in bullet form.
(1) Friend 1 of child 1 comes (friendship)
a. Input data in which child 1 and friend 1 appear together (for example, an image of them holding hands)
b. The module identifies A (child 1) and E (friend 1) as persons (E may be a first encounter for the robot)
c. At the same time, calculate the relationship score of this image (they are holding hands, so the relationship is intimate)
d. Update the relationship score of A and E in the table (in this case, update in the positive direction)

(2) Child 2's friend 2 comes (dislikes)
a. Input data in which child 1 and friend 2 appear together (for example, an image of one grabbing the other by the chest)
b. The module identifies A (child 1) and F (friend 2) as persons (F may be a first encounter for the robot)
c. At the same time, calculate the relationship score of this image (one is grabbing the other by the chest, so the relationship is hostile)
d. Update the relationship score between A and F in the table (in this case, update in the negative direction)

(3) Determine the action from the relationship score
a. Determine the action to take when friend 1 is found
b. First, for each relationship score held by friend 1, calculate the product of that relationship score and the impression score of the related person
c. Next, sum these products and add the impression score of friend 1 to obtain an action score (action reference score)
d. Determine the action based on the obtained action score

Next, the person identification module 103 in the information processing apparatus 100 of FIG. 1 will be described. As shown in FIG. 5, the person identification module 103 is a module for identifying the category of input data from very few data registered per category (here, per person). This module is a neural network identification module trained using deep learning.
Several methods are known for training, with deep learning, a module that identifies a category from only very few registered data, for example, the following methods (1) and (2).
(1) Method using a Siamese network
(2) Method using a triplet network
These methods are often classified as techniques for learning a distance measure in feature space, called metric learning or embedding learning.

In either case, as shown in FIG. 6, a learner acquires in advance, from a large amount of labeled data, the function (identification module) necessary for identifying categories, and this function is used at runtime to determine the category of input data. These methods can realize high identification performance in face identification, speaker identification, and the like. If a module generated by these methods is used for person identification, the person identification module in the present technology can be realized.

On the other hand, in recent years, one-shot learning and few-shot learning have attracted attention as techniques for realizing high-performance learning from only a small amount of data.

In the following, one-shot learning and few-shot learning are used in the following sense.
・ One-shot learning:
How to learn, and how to use the learned module, when there is only one datum for each category
・ Few-shot learning:
How to learn, and how to use the learned module, when there are only a few data for each category

When the purpose of few-shot learning is limited to the task of “identification”, it is very similar to the task that the Siamese net and the triplet net are trying to solve. That is, the task is to estimate the category of the input data using the small number of data registered for each category. Although various methods have been proposed for few-shot learning, the prototype net is a typical example.

The Siamese net, triplet net, and prototype net are all configured to perform metric learning, that is, to learn a mapping of multiple data into a feature space that preserves given distance information. Each network consists of the following two parts.
(1) Mapping of data to a feature space suitable for identification
(2) Design of a network configuration and an objective function with a mathematical structure suitable for identification

A neural network that maps data to feature space is called an embedding function. FIG. 7 shows an embedding network, and FIG. 8 shows a prototype network.

In triplet nets and prototype nets, a plurality of embedding functions sharing the same neural net parameters are used. Different data are input to each embedding function to generate the corresponding feature vectors. These feature vectors are then combined to form a loss, that is, an objective function.

In the case of a triplet net, three data are combined and input. The three data are x_a, x_p, and x_n below.
x_a: Anchor data, the reference for picking the other data
x_p: Positive data, data of the same category as the anchor
x_n: Negative data, data of a category different from the anchor

Also, in the case of a prototype net, more data are combined and input. These input data can be divided into the following two types, x_s and x_q.
x_s: Support data (data representative of each category)
x_q: Query data (data belonging to one of the support data categories)

The triplet loss function (see Equation (5)) and the loss function of a prototype net for one-shot learning (see Equation (6)) are shown below.
(1) Triplet loss function

Variable definitions
x_a: Input data (a stands for anchor)
x_p: Registered data (p stands for positive: registered data in the same category as the input data)
x_n: Registered data (n stands for negative: registered data in a category different from the input data)
α: Margin parameter
θ: Generic name for the neural net parameters
f_θ: Function represented by the neural net
l: Objective function ("l" stands for loss; this is the objective function for a single input datum, and optimization uses an objective function that combines those of a large number of input data)

(2) Loss function of prototype net

Variable definitions
x_q: Input data (q stands for query)
y_q: Category of the input data
x_s: Registered data of category s (s stands for support)
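
The bodies of Equations (5) and (6) are not reproduced in this text. For orientation, the forms of these two losses commonly used in the literature are written below with the variables defined above. Here d denotes a distance function (squared Euclidean distance is a typical choice) and x_{y_q} denotes the registered data of category y_q; both are notational assumptions, and the exact forms of the original equations may differ.

```latex
% Triplet loss (commonly used form)
l = \max\bigl(0,\; \| f_\theta(x_a) - f_\theta(x_p) \|^2
                 - \| f_\theta(x_a) - f_\theta(x_n) \|^2 + \alpha \bigr)

% Prototype net loss for one-shot learning (commonly used form)
l = -\log \frac{\exp\bigl(-d(f_\theta(x_q), f_\theta(x_{y_q}))\bigr)}
               {\sum_{s} \exp\bigl(-d(f_\theta(x_q), f_\theta(x_s))\bigr)}
```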

If it is limited to one shot, that is, one data per category, loss of prototype net is similar to triplet loss. For example, when the support data of the prototype net is only two categories, the loss function of the prototype net is expressed by the following equation (7).

The query data of the prototype net is regarded as the anchor of the triplet net, one of the support data of the prototype net is regarded as the positive of the triplet net, and the rest are regarded as negatives. The prototype net can thus be understood as extending the triplet net's concepts of anchor, positive, and negative to the concepts of query and support.
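
This correspondence can be made concrete under one assumption: that the prototype net loss is the usual negative log-softmax over negative distances (the body of Equation (7) is not reproduced in this text, so the form below may differ from it). With two support categories, one positive x_p and one negative x_n, and a distance function d:

```latex
l = -\log \frac{e^{-d_p}}{e^{-d_p} + e^{-d_n}}
  = \log\bigl(1 + e^{\,d_p - d_n}\bigr),
\qquad
d_p = d\bigl(f_\theta(x_q), f_\theta(x_p)\bigr),\quad
d_n = d\bigl(f_\theta(x_q), f_\theta(x_n)\bigr)
```

The right-hand side is a smooth (softplus) counterpart of the hinge \max(0, d_p - d_n + \alpha) of the triplet loss with \alpha = 0: both penalize the query for being closer to the negative than to the positive.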

Furthermore, in the present technology, the distance function that constitutes the loss function of the prototype net is changed to normalize the support feature as shown in the following equation (8) to improve the identification performance. FIG. 9 shows the network with the support feature normalized.

FIGS. 10 and 11 show comparison results on the data sets called “Omniglot” and “miniImagenet”, known as benchmarks for few-shot learning. From these results, it can be seen that the configuration that normalizes the support feature has better identification performance.

This change replaces the original distance function, the Euclidean distance, with a function similar to cosine similarity. Here, the Euclidean distance is expressed by the following expression (9), the cosine similarity is expressed by the following expression (10), and the support feature normalization is expressed by the following expression (11). The support normalization concept is similar to "Weight Normalization".

"Weight Normalization" is a method of normalizing the weighting factor to the same unit. Here, in the loss function created from the query and support feature vectors, the support feature vector f_θ (x_s) is regarded as the weight of the neural network. When this weight is normalized according to "Weight Normalization", the normalized weight is as shown in Equation (12) below.

Then, the loss function is equivalent to adding a bias-free linear layer with “Weight Normalization”, as shown in the following Equation (13). There is no bias because, in few-shot learning, a term that depends only on the classes seen during learning is unnecessary at test time.

In the method using support normalization, the features representative of a category are not gathered at a single point of the feature space, as with a Euclidean prototype, but along a straight line passing through the origin, which serves as a representative axis. The degree of freedom of the parameters to be learned is therefore increased by one, and high performance can be realized more easily than with the Euclidean distance method. In addition, unlike with Euclidean prototypes, the evaluation of similarity is not one-dimensional, so calculation errors are less likely to occur. Higher discrimination performance can also be realized than with prototypes based on the cosine distance. Besides the view that the feature representing a category is expressed by a representative axis, the present method can also be regarded as representing each category by a single point constrained to lie on the same hypersphere.

One-shot learning and few-shot learning improve performance in part by using, as prior knowledge, the fact that the input data belongs to one of the categories of the registered data. In face identification and speaker identification, however, the input data in the actual task may not belong to any category of the registered data, and currently known few-shot learning does not deal with such a case. Therefore, a method of determining, in few-shot learning, that the input data does not belong to any of the registered data will be described below.

Consider extending a known few-shot learning method to include cases where the input data does not belong to any category of the registered data. Determining whether the input data belongs to one of the categories of the registered data or to none of them is called known/unknown determination. Here, the following two examples are considered.
(1) Add margins to the loss function of the prototype network following the triplet loss.
(2) Add a known / unknown threshold to the loss function of the prototype network.

The starting point in the present technology is that the loss function can be expressed as shown in the following equation (14) using the similarity (or distance function) included in the loss function of the prototype network.

In this equation, Sq is represented by the following equation (15) in the case of the prototype network, and by the following equation (16) in the case of the support feature normalization method.

First, consider how to add a margin to the loss function. As mentioned above, triplet networks and prototype networks are very similar. Here, in the prototype network, a margin β is introduced with respect to the similarity as shown in the following equation (17).

In this case, as described above, when the support data of the prototype net contains only two categories, the loss function is given by the following equation (18). When this function is approximated by its asymptotes at positive and negative infinity, the loss function is expressed by the following equation (19), and when the similarity is the sign-inverted Euclidean distance, a result equivalent to the triplet loss is obtained. The actual known/unknown determination is performed based on whether the similarity computed without the margin is equal to or greater than a threshold. The threshold is determined using the data used for learning.
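
For example, the relation to the triplet loss may be checked numerically as follows. This sketch assumes that the loss is a softmax cross-entropy over two categories and that the margin β is subtracted from the similarity of the correct category; where exactly the margin enters is an assumption, since equations (17) to (19) are not reproduced here. Under these assumptions the two-category loss reduces to log(1 + exp(s_neg - s_pos + β)), whose asymptotes are 0 and s_neg - s_pos + β.

```python
import numpy as np

def two_class_margin_loss(s_pos, s_neg, beta):
    """Two-category softmax cross-entropy with the margin taken off s_pos.

    -log(exp(s_pos - beta) / (exp(s_pos - beta) + exp(s_neg)))
      = log(1 + exp(s_neg - s_pos + beta))
    """
    return float(np.logaddexp(0.0, s_neg - s_pos + beta))

def asymptotic_loss(s_pos, s_neg, beta):
    """Approximation of the loss by its asymptotes (a hinge function)."""
    return max(0.0, s_neg - s_pos + beta)

def triplet_loss(d_pos, d_neg, beta):
    """Standard triplet loss written with distances."""
    return max(0.0, d_pos - d_neg + beta)

beta = 0.5
# Similarity taken as the sign-inverted Euclidean distance: s = -d.
d_pos, d_neg = 30.0, 1.0   # hypothetical distances, a badly violated triplet
exact = two_class_margin_loss(-d_pos, -d_neg, beta)
approx = asymptotic_loss(-d_pos, -d_neg, beta)
```

Far from the margin boundary the exact loss and its asymptotic approximation agree closely, and with s = -d the approximation coincides with the triplet loss.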

Another way is to write the loss function assuming that there is a category called unknown class in addition to existing classes. In this case, a loss function is represented as shown in the following equation (20). The unknown class discrimination parameter u may be obtained by learning. FIG. 12 shows a network to which known / unknown determination is added.

In this case, the multiclass classification is extended so that the unknown class is estimated automatically. The actual known/unknown determination evaluates the similarity values and determines the input to be known if the largest value is the similarity of one of the existing categories, and unknown otherwise.
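
For example, the unknown-class extension may be sketched as follows, assuming that the unknown class discrimination parameter u is simply appended to the category similarities as one additional logit of a softmax. The function names are hypothetical, and this form is an assumption rather than a reproduction of equation (20).

```python
import numpy as np

def classify_with_unknown(similarities, u):
    """Return the index of the matching category, or None if unknown wins."""
    logits = np.append(similarities, u)   # the last entry is the unknown class
    best = int(np.argmax(logits))
    return best if best < len(similarities) else None

def unknown_class_loss(similarities, u, target):
    """Softmax cross-entropy over the existing categories plus the unknown class.

    target is a category index, or None when the sample is unknown.
    """
    logits = np.append(similarities, u)
    index = len(similarities) if target is None else target
    shifted = logits - np.max(logits)     # subtract the max for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -float(log_probs[index])

u = 1.0  # hypothetical value; in practice u may be obtained by learning
known_result = classify_with_unknown(np.array([2.0, 0.5]), u)
unknown_result = classify_with_unknown(np.array([0.2, 0.5]), u)
```

In this sketch an input is judged known only when some existing category has a similarity larger than u.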

Hereinafter, features of the person identification module 103 according to the present technology will be summarized.
(1) An identification module learned using the configuration of a prototype network of few-shot learning is used.
(2) The similarity used in calculating the loss function differs from the Euclidean distance of the prototype network and uses a component projected onto a normalized support feature.
(3) The prototype net is extended to determine known or unknown by adding a margin or an unknown class discrimination parameter.

As described above, in the information processing apparatus 100 shown in FIG. 1, the score for action generation for each person is obtained based on the data of the person detected from the input data of the image and / or the sound. Therefore, it is possible to better simulate human behavior without performing user registration.

The network configuration of FIG. 9 may also be used for supervised learning and semi-supervised learning. In this case, the labeled portion of the learning data is used as support data in both supervised learning and semi-supervised learning. This is because, unlike few-shot learning, supervised learning and semi-supervised learning do not consider new categories when inference is performed with the learned network.
The loss function at the time of learning uses the following equation (21) when the input data is labeled data.

Here, y is the label converted to a one-hot vector, and the equation is a different expression equivalent to equation (14). When the input data is unlabeled, the following equation (22) is used so that the loss can be calculated without a label.
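
For example, the labeled loss with a one-hot label may be sketched as follows, assuming a softmax cross-entropy over the similarities. The form of equation (22) is not reproduced in this text; the entropy of the predicted distribution is shown below only as one commonly used label-free objective and is purely an assumption. All function names are hypothetical.

```python
import numpy as np

def log_softmax(similarities):
    """Log of the softmax over the category similarities."""
    shifted = similarities - np.max(similarities)
    return shifted - np.log(np.sum(np.exp(shifted)))

def labeled_loss(similarities, one_hot):
    """Cross-entropy written with a one-hot label vector y."""
    return -float(np.dot(one_hot, log_softmax(similarities)))

def indexed_loss(similarities, target):
    """The same loss written with the index of the correct category."""
    return -float(log_softmax(similarities)[target])

def entropy_loss(similarities):
    """Assumed label-free stand-in: entropy of the predicted distribution."""
    log_p = log_softmax(similarities)
    return -float(np.dot(np.exp(log_p), log_p))

similarities = np.array([3.0, 1.0])
y = np.array([1.0, 0.0])            # one-hot label for category 0
loss_one_hot = labeled_loss(similarities, y)
loss_index = indexed_loss(similarities, 0)
```

The one-hot form and the index form give the same value, which is the sense in which the two expressions are equivalent.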

Furthermore, semi-supervised learning may be used in the meta-learning process of few-shot learning. Equation (22) above is used for unlabeled data. In addition, since it is not known whether unlabeled data belongs to a known category or an unknown category, this is combined with the unknown-category handling of equations (18) and (19) or of equation (20).

<2. Modified example>
In the above embodiment, an example using an impression score and a relationship score has been shown; however, the present technology is not limited to these scores, and other scores may also be used. For example, the frequency of encountering a user (person), that is, the number of encounters, or the number of times two persons are together may be calculated as a score.
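
For example, such counts may be accumulated as follows. The identifiers "A", "B", and "C" and the function name are hypothetical, and each call is assumed to correspond to one observation in which the listed persons were identified together.

```python
from collections import Counter
from itertools import combinations

encounters = Counter()   # number of encounters per person
together = Counter()     # number of times each pair of persons was seen together

def record_observation(person_ids):
    """Update both counts from the persons identified in one observation."""
    present = sorted(set(person_ids))          # ignore duplicates, fix the pair order
    encounters.update(present)
    together.update(combinations(present, 2))  # every unordered pair that is present

record_observation(["A", "B"])
record_observation(["A"])
record_observation(["B", "A", "C"])
```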

Furthermore, the present technology can also be configured as follows.
(1) An information processing apparatus including a processing unit that obtains a score for action generation for each person based on data of a person detected from input data of an image or a sound.
(2) The information processing apparatus according to (1), wherein the score includes an impression score obtained by evaluating the quality of the person's behavior.
(3) The information processing apparatus according to (1) or (2), wherein the score includes a relationship score obtained by evaluating a personal relationship.
(4) The information processing apparatus according to any one of (1) to (3), wherein the processing unit collates the data of the detected person with the data of a predetermined number of stored persons; when there is no match, stores the data of the detected person and the score obtained based on the data of that person as data of a new person; and when there is a match, updates the stored score of the corresponding person using the score obtained based on the data of the detected person.
(5) The information processing apparatus according to (4), wherein the processing unit performs the collation using an identification module learned using a neural network, the collation being realized by calculating the similarity between representative features of labeled teacher data and the features of the input data.
(6) The information processing apparatus according to (5), wherein the identification module converts the detected input data into a feature amount; calculates, for the converted feature amount, a component with respect to the feature amount of each of a predetermined number of stored labeled data; obtains, based on the calculated components, scores of the predetermined number of stored classes for the detected input data; and outputs, based on the obtained scores, information of the matching class or information indicating that none of the classes matches.
(7) The information processing apparatus according to (6), wherein the component of the feature amount is calculated by converting the feature amounts of the stored labeled data into unit vectors and taking the inner product of the feature amount of the detected class data and the converted unit vectors.
(8) The information processing apparatus according to any one of (1) to (7), wherein the processing unit further generates an action corresponding to each person using the score for each person.
(9) The information processing apparatus according to (8), wherein the score includes an impression score obtained by evaluating the quality of the person's behavior and a relationship score obtained by evaluating a personal relationship, and the processing unit generates the action based on an integrated score calculated from the impression score and the relationship score.
(10) An information processing method, comprising: a processing step of obtaining a score for action generation for each person based on data of a person detected from input data of an image or a sound.
(11) A program that causes a computer to function as processing means for obtaining a score for action generation for each person based on data of a person detected from input data of an image or a sound.
(12) A robot comprising: a processing unit that obtains a score for generating an action for each person based on data of a person detected from input data of an image or a sound.
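
For example, the collation, storage, and update described in (4) to (7) above may be sketched as follows. This is a minimal sketch under stated assumptions: the class name, the threshold, and the update rate are hypothetical, the stored score is updated by a simple moving average (the update rule is not specified above), and no limit on the number of stored persons is enforced. The similarity is the inner product of the detected feature with each stored feature converted into a unit vector.

```python
import numpy as np

class PersonScoreTable:
    """Hypothetical per-person score table that needs no user registration."""

    def __init__(self, threshold, rate=0.5):
        self.features = []          # one stored feature per known person
        self.scores = []            # one stored score per known person
        self.threshold = threshold  # minimum similarity counted as a match
        self.rate = rate            # update rate (an assumption)

    def match(self, feature):
        """Return the index of the best-matching stored person, or None."""
        best_id, best_sim = None, self.threshold
        for person_id, stored in enumerate(self.features):
            unit = stored / np.linalg.norm(stored)   # stored feature as a unit vector
            sim = float(np.dot(feature, unit))       # component along that vector
            if sim >= best_sim:
                best_id, best_sim = person_id, sim
        return best_id

    def observe(self, feature, score):
        """Store a new person on no match; otherwise update the stored score."""
        person_id = self.match(feature)
        if person_id is None:
            self.features.append(np.asarray(feature, dtype=float))
            self.scores.append(float(score))
            return len(self.features) - 1
        self.scores[person_id] += self.rate * (score - self.scores[person_id])
        return person_id

table = PersonScoreTable(threshold=0.8)
first = table.observe(np.array([1.0, 0.0]), 0.2)    # no stored person yet: stored as new
second = table.observe(np.array([0.0, 1.0]), -0.4)  # no match: stored as new
again = table.observe(np.array([0.9, 0.1]), 0.6)    # matches the first person
```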

100 ... information processing apparatus
101 ... input unit
102 ... human area / section detection module
103 ... person identification module
104 ... impression score calculation module
105 ... relationship score calculation module
106 ... person score update module
107 ... person score table
108 ... action generation module
109 ... output part

Claims

An information processing apparatus comprising: a processing unit that obtains a score for action generation for each person based on data of a person detected from input data of an image or a sound.
The information processing apparatus according to claim 1, wherein the score includes an impression score obtained by evaluating a person's behavior.
The information processing apparatus according to claim 1, wherein the score includes a relationship score obtained by evaluating a personal relationship.
The information processing apparatus according to claim 1, wherein the processing unit collates the data of the detected person with the data of a predetermined number of stored persons; when there is no match, stores the data of the detected person and the score obtained based on the data of that person as data of a new person; and when there is a match, updates the stored score of the corresponding person based on the score obtained based on the data of the detected person.
The information processing apparatus according to claim 4, wherein the processing unit performs the collation using an identification module learned using a neural network, the collation being realized by calculating a similarity between representative features of labeled teacher data and features of the input data.
The information processing apparatus according to claim 5, wherein the identification module converts the detected input data into a feature amount; calculates, for the converted feature amount, a component with respect to the feature amount of each of a predetermined number of stored labeled data; obtains, based on the calculated components, scores of the stored predetermined number of classes for the detected input data; and outputs, based on the obtained scores, information of a matching class or information indicating that none of the classes matches.
The information processing apparatus according to claim 6, wherein the component of the feature amount is calculated by converting the feature amounts of the stored labeled data into unit vectors and taking the inner product of the feature amount of the detected class data and the converted unit vectors.
The information processing apparatus according to claim 1, wherein the processing unit further generates an action corresponding to each person using the score for each person.
The information processing apparatus according to claim 8, wherein the score includes an impression score obtained by evaluating the quality of the person's behavior and a relationship score obtained by evaluating a personal relationship, and the processing unit generates the action based on an integrated score calculated from the impression score and the relationship score.
An information processing method comprising: a processing step of obtaining a score for action generation for each person based on data of a person detected from input data of an image or a sound.
A program that causes a computer to function as processing means for obtaining a score for action generation for each person based on data of a person detected from input data of an image or a sound.
A robot comprising: a processing unit that obtains a score for action generation for each person based on data of a person detected from input data of an image or a sound.