WO2020100532A1 - Information processing device, information processing method, and information processing program - Google Patents

Information processing device, information processing method, and information processing program

Info

Publication number
WO2020100532A1
Authority
WO
WIPO (PCT)
Prior art keywords
question
image
correct answer
information
robot apparatus
Prior art date
Application number
PCT/JP2019/041218
Other languages
French (fr)
Japanese (ja)
Inventor
アンドリュー シン (Andrew Shin)
Original Assignee
ソニー株式会社 (Sony Corporation)
Priority date
Filing date
Publication date
Application filed by ソニー株式会社 (Sony Corporation)
Publication of WO2020100532A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis

Definitions

  • The present disclosure relates to an information processing device, an information processing method, and an information processing program.
  • Information processing using machine learning is used in various technical fields, and with the development of deep learning, household agents and robots have become able to learn and identify many types of objects. For example, systems have been implemented that answer questions about images.
  • In such systems, a question about an image is answered using information about the image and the question.
  • However, the conventional technology is not always able to make an appropriate response to a question about an image.
  • In the conventional technology, the number of identifiable answers (classes) is limited, and answers outside that set cannot be identified.
  • It is therefore difficult to respond appropriately to a question regarding an image unless the answer falls within the range of correct answers prepared in advance.
  • the present disclosure proposes an information processing device, an information processing method, and an information processing program capable of appropriately responding to a question regarding an image.
  • An information processing device according to the present disclosure includes an acquisition unit that acquires an image, a question related to the image, and a correct answer corresponding to the question, and a registration unit that registers the combination of the image, the question, and the correct answer acquired by the acquisition unit as support information used to determine a response to query information including one image and one question related to the one image.
  • A diagram illustrating an example of the support information storage unit according to the first embodiment of the present disclosure.
  • A diagram illustrating an example of the model information storage unit according to the first embodiment of the present disclosure.
  • A diagram illustrating an example of the mode information storage unit according to the first embodiment of the present disclosure.
  • A diagram illustrating an example of identification according to the present disclosure.
  • A block diagram showing an example of a configuration from input to output according to the present disclosure.
  • A flowchart showing a procedure of information processing according to the first embodiment of the present disclosure.
  • A flowchart showing a procedure of correct answer registration processing through a dialog with a user according to the first embodiment of the present disclosure.
  • A diagram illustrating an example of information processing according to a second embodiment of the present disclosure.
  • A diagram illustrating an example of the mode information storage unit according to the second embodiment of the present disclosure.
  • A flowchart illustrating a procedure of correct answer registration processing through a dialog with a user according to the second embodiment of the present disclosure.
  • A diagram showing a configuration example of an information processing system according to a modification of the present disclosure.
  • A diagram showing a configuration example of an information processing device according to a modification of the present disclosure.
  • A diagram showing an example of camera viewpoint adjustment processing according to the present disclosure.
  • A block diagram showing an example of a configuration from input to output relating to the adjustment of the camera viewpoint of the present disclosure.
  • A flowchart illustrating a procedure of the camera viewpoint adjustment processing of the present disclosure.
  • A hardware configuration diagram showing an example of a computer that implements the functions of the information processing device.
  • 1. First Embodiment
       1-1. Outline of information processing according to the first embodiment of the present disclosure
       1-2. Configuration of the robot device according to the first embodiment
       1-3. Information processing procedure according to the first embodiment
    2. Second Embodiment
       2-1. Overview of information processing according to the second embodiment of the present disclosure
       2-2. Configuration of the robot apparatus according to the second embodiment
       2-3. Information processing procedure according to the second embodiment
    3. Other Embodiments
       3-1. Other configuration examples
       3-2. Adjustment of the camera viewpoint
    4. Hardware configuration
  • FIG. 1 is a diagram illustrating an example of information processing according to the first embodiment of the present disclosure.
  • the information processing according to the first embodiment of the present disclosure is realized by the robot device 100 shown in FIG.
  • the robot device 100 is an information processing device that executes information processing according to the first embodiment.
  • The robot apparatus 100 is an information processing apparatus that registers a combination of an image, a question associated with the image, and a correct answer corresponding to the question through a dialog with a user.
  • The robot apparatus 100 detects an image with a camera (corresponding to the sensor unit 16 in FIG. 3) and a user's utterance (question) with a microphone (corresponding to the input unit 12 in FIG. 3).
  • The robot apparatus 100 outputs a response corresponding to the detected image and the user's question through a speaker (corresponding to the output unit 13 in FIG. 3).
  • the robot device 100 may be any device as long as it can realize the processing in the first embodiment.
  • The robot device 100 may be a robot that interacts with a human (user), such as an entertainment robot or a household robot.
  • FIG. 1 shows a case where the robot device 100 as an agent registers a combination of an image, a question, and a correct answer (hereinafter also referred to as “support information”) through a dialogue with the user U1.
  • the robot apparatus 100 acquires an image (step S11).
  • the robot apparatus 100 detects the image IM1 by capturing images of two ice creams (also simply referred to as “ices”) with a camera.
  • an image is input to the robot apparatus 100 through a camera attached to the robot apparatus 100.
  • the image (visual information) acquired through the camera may have various forms, but for example, the camera may detect RGB information as image information.
  • the robot device 100 may acquire the image IM1 detected by the external device from the external device.
  • the user U1 designates the mode of the robot apparatus 100 as the question registration mode in order to register the question in the robot apparatus 100 (step S12).
  • the user U1 makes an input to the robot apparatus 100, which indicates that the question registration mode is designated.
  • the user U1 inputs the command MD1 that specifies the question registration mode registered in the robot device 100 in advance.
  • the user U1 inputs the question registration mode to the robot apparatus 100 by speaking “question” indicating that a question is to be asked.
  • the command MD1 is not limited to “question”, but may be set appropriately such as “attention”, “listen”, “hey XXX (robot name)”.
  • a microphone provided in the robot device 100 receives an input by detecting a user's utterance. As a result, the robot apparatus 100 receives an input designating the question registration mode.
  • the robot apparatus 100 converts voice information into text (character information) using various voice recognition techniques.
  • the robot device 100 may be able to acquire information from a voice recognition server that provides a voice recognition service.
  • the robot device 100 may acquire the character information obtained by converting the voice information from the voice recognition server by transmitting the voice information to the voice recognition server.
  • The robot device 100 is assumed to have a voice recognition function; it recognizes the user's utterance and identifies (estimates) the user who made the utterance by appropriately using various conventional techniques, and detailed description of these processes is omitted as appropriate.
  • The robot apparatus 100 may store, in the storage unit 120 (see FIG. 3), association information that associates a mode ID with character information (a keyword) for shifting to the mode corresponding to that mode ID. The robot apparatus 100 may then compare the character string corresponding to the user's utterance with the character information in the association information and shift to the mode of the matching character information.
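  • As a minimal sketch of such association information and mode lookup (the mode ID "MO2" follows the example of FIG. 6, while the keywords, the second mode entry, and the function name are hypothetical and only for illustration):

```python
# Hypothetical association information: mode ID -> keywords that trigger the mode.
# "MO2" (question registration mode) follows FIG. 6; the other entry is illustrative.
ASSOCIATION_INFO = {
    "MO2": {"mode": "question registration mode", "keywords": ["question", "attention", "listen"]},
    "MO3": {"mode": "correct answer registration mode", "keywords": ["no", "incorrect answer", "wrong"]},
}

def find_mode(utterance_text: str) -> str | None:
    """Return the mode ID whose keyword appears in the recognized utterance, if any."""
    text = utterance_text.lower()
    for mode_id, entry in ASSOCIATION_INFO.items():
        if any(keyword in text for keyword in entry["keywords"]):
            return mode_id
    return None  # no keyword matched; the current mode is kept

print(find_mode("question"))  # -> "MO2"
```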
  • the mode designation is not limited to voice, and various modes may be used.
  • The shift to the question registration mode may also be performed by the user operating a button (which may be implemented in hardware or software) provided on the robot apparatus 100 itself.
  • the robot apparatus 100 changes the mode according to the user's input (step S13).
  • The robot apparatus 100 changes the mode to the question registration mode (see FIG. 6) corresponding to the command MD1 "question" of the user U1.
  • the user U1 inputs a question to the robot device 100 (step S14).
  • the user U1 inputs the question QS1 “What color is the right?”
  • The robot device 100 may acquire the image IM1 and the question QS1 at any timing as long as it can respond to them.
  • the robot apparatus 100 may acquire an image after inputting a question.
  • the robot apparatus 100 may perform step S11 after step S14.
  • a microphone provided in the robot device 100 receives an input by detecting a user's utterance.
  • The robot apparatus 100 uses various techniques as appropriate to determine whether the question registration is completed. For example, after the robot apparatus 100 enters the question registration mode and the voice input of the question from the user U1 starts, if no voice is input for an interval equal to or longer than a certain threshold, the robot apparatus 100 ends the question registration mode. Then, the robot apparatus 100 converts the voice question input in the question registration mode into character information and performs natural language processing based on the character information. For example, the robot apparatus 100 may determine (estimate) the meaning or content of the question by analyzing the character information, appropriately using a natural language processing technique such as morphological analysis.
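  • A minimal sketch of the silence-based end-of-input determination described above (the threshold value and the interface for obtaining the time of the last detected voice activity are assumptions for illustration):

```python
import time

SILENCE_THRESHOLD_SEC = 2.0  # assumed threshold; the actual value is implementation dependent

def input_finished(last_voice_activity_time: float, now: float | None = None) -> bool:
    """Return True when no voice has been detected for at least the threshold interval."""
    if now is None:
        now = time.time()
    return (now - last_voice_activity_time) >= SILENCE_THRESHOLD_SEC

# Example: the last speech frame was detected 3 seconds ago, so the question input
# is treated as complete and the question registration mode can be ended.
print(input_finished(last_voice_activity_time=time.time() - 3.0))  # -> True
```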
  • The robot apparatus 100 accepts the question QS1 "What color is the right?".
  • The robot apparatus 100 is not limited to closed questions that are answered with "yes", "no", or a choice such as "A, B, or C"; it also accepts various types of questions such as open questions.
  • the color of the ice cream package on the right side in the image IM1 shown in FIG. 1 is light purple.
  • the robot device 100 responds to the question QS1 using various techniques.
  • the robot apparatus 100 identifies the correct answer based on the input image IM1 and the input question QS1.
  • the robot apparatus 100 uses the input image IM1 and the question QS1 as query information (hereinafter also simply referred to as “query”) to identify the correct answer corresponding to the query.
  • the robot device 100 outputs the result to the user via a speaker, a monitor, or the like.
  • For example, the robot apparatus 100 performs identification processing as shown in FIG. 8 to determine support information, and uses the correct answer of the support information determined by the identification processing as a response.
  • the robot apparatus 100 outputs the correct answer of the support information determined by the identification processing as a correct answer candidate (step S15).
  • The robot device 100 does not have any support information registered whose correct answer is "light purple". Therefore, the robot apparatus 100 outputs "red", which is different from the correct answer "light purple", as the correct answer candidate AA1.
  • the robot apparatus 100 may respond to the question QS1 of the user U1 by appropriately using any technique as long as it can respond to the question QS1 of the user U1.
  • the robot apparatus 100 may recognize an object and make a response using a technique related to object recognition.
  • Since the robot device 100 cannot by itself determine whether the correct answer candidate it responded with should be registered as the correct answer, the user U1 provides the robot device 100 with information regarding the correct answer. The user U1 inputs, into the robot apparatus 100, a reaction to the correct answer candidate output by the robot apparatus 100 (step S16). Because the correct answer candidate AA1 "red" output by the robot device 100 does not correspond to the color of the ice cream package on the right side of the image IM1, the user U1 inputs a negative reaction to the robot device 100.
  • Specifically, the user U1 inputs a negative command NG1, registered in advance in the robot apparatus 100, that denies the correct answer candidate. The user U1 designates the correct answer registration mode by speaking "no" to the robot apparatus 100, indicating that the correct answer candidate is incorrect. It should be noted that the negative command NG1 is not limited to "no", and may be set appropriately, for example "incorrect answer" or "wrong".
  • a microphone provided in the robot device 100 receives an input by detecting a user's utterance. Thereby, the robot apparatus 100 accepts a negative reaction of the user U1 as an input for designating the correct answer registration mode.
  • The robot apparatus 100 may also shift to the correct answer registration mode when the user operates a button provided on the robot apparatus 100 itself.
  • the robot device 100 may request the user U1 for a response when the robot device 100 cannot determine whether the reaction of the user U1 is positive or negative.
  • For example, the robot device 100 may output a voice that prompts the user U1 to give an affirmative or negative response, such as "Is the answer correct?".
  • For example, the robot apparatus 100 may determine that the reaction is negative when the user U1 responds with "no" or a similar negative expression, and affirmative when the user U1 responds with "yes" or "good".
  • The robot apparatus 100 may make this determination by comparing the user's response with negative response list information, which is a list of negative responses, and positive response list information, which is a list of positive responses, stored in the storage unit 120 (see FIG. 3).
  • the robot apparatus 100 changes the mode according to the user's input (step S17).
  • the robot apparatus 100 changes the mode to the correct answer registration mode (see FIG. 6) corresponding to the negative command NG1 of “NO” of the user U1.
  • the robot apparatus 100 enters the correct answer registration mode, and waits until the correct answer input from the user U1 is received.
  • the user U1 provides the correct answer to the robot device 100 (step S18).
  • the user U1 inputs the correct answer AS1 “light purple” to the question QS1 “what color is the right?” By speaking “light purple” to the robot apparatus 100.
  • The robot apparatus 100 accepts the input received up to that point as the "correct answer".
  • the robot apparatus 100 registers, as support information, a combination of an image, a question related to the image, and a correct answer corresponding to the question (step S19). For example, the robot apparatus 100 shifts to the support information registration mode, and registers a combination of an image, a question related to the image, and a correct answer corresponding to the question as the support information.
  • The robot apparatus 100 registers the combination of the image IM1, the question QS1, and the correct answer AS1, as shown in the additional registration information RINF1, as the support information identified by the support information ID "SP1" (support information SP1). In this way, the robot apparatus 100 registers the three elements of the input image, the input question, and the correct answer as one set. For example, the robot apparatus 100 stores the support information SP1 in the support information storage unit 141 (see FIG. 4).
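  • A minimal sketch of registering the three elements as one set of support information (the class and field names are hypothetical; the image, question, and correct answer are represented here by a file path and character strings for simplicity):

```python
from dataclasses import dataclass, field

@dataclass
class SupportInfo:
    """One set of support information: an image, a question about it, and the correct answer."""
    support_id: str
    image: str      # e.g. a file path to the captured image
    question: str   # character information obtained from the user's utterance
    answer: str     # the registered correct answer

@dataclass
class SupportInfoStorage:
    """Conceptually corresponds to the support information storage unit 141."""
    records: dict[str, SupportInfo] = field(default_factory=dict)

    def register(self, info: SupportInfo) -> None:
        self.records[info.support_id] = info

# Registering the combination from FIG. 1 as support information SP1.
storage = SupportInfoStorage()
storage.register(SupportInfo("SP1", "IM1.png", "What color is the right?", "light purple"))
```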
  • the robot apparatus 100 registers the correct answer AS1 provided by the user U1 as the correct answer corresponding to the image IM1 and the question QS1 instead of the correct answer candidate AA1 output by the robot apparatus 100 itself.
  • Before this registration, the robot apparatus 100 did not have registered any support information including the correct answer "light purple", which is the correct answer AS1. That is, the robot apparatus 100 had not acquired the concept of "light purple", a concept related to color, at the time of outputting the correct answer candidate AA1 for the image IM1 and the question QS1. Therefore, the robot apparatus 100 could not respond to the image IM1 and the question QS1 with the correct answer "light purple".
  • VQA (Visual Question Answering)
  • In VQA, the feature amount of an image and the feature amount of a question are projected onto a common space, and the correct answer is identified based on those feature amounts. That is, in VQA, since the number of identifiable answers is limited (generally within the range of 1000 to 1500), answers that are not included therein cannot be identified.
  • The robot apparatus 100 becomes able to respond "light purple" to the input image and question by using the newly added support information SP1 in responses after step S19 in FIG. 1. As described above, the robot apparatus 100 has the advantage, compared with VQA, that it can learn not only the 1000 to 1500 or so answers that appear with high frequency in a data set but also new answers.
  • As a result, the robot apparatus 100 can output the correct answer candidate "light purple" for the image IM1 and the question QS1. That is, the robot apparatus 100 acquires the concept of "light purple", which is a concept related to color, by additionally registering the support information SP1. In other words, the robot apparatus 100 can acquire a concept regarding a property of the object, namely the ice cream on the right side of the image IM1. In this way, the robot apparatus 100 can continuously acquire new concepts through dialogs with the user, and can thus acquire a concept that was unknown at the time the robot apparatus 100 was provided, for example, a concept corresponding to a new word. Therefore, the robot device 100 can enable an appropriate response to a question regarding an image.
  • As a technique related to the robot device 100, there is a learning method such as one-shot learning that enables class identification from a small number of samples.
  • In one-shot learning, however, identification is based mainly on the visual element, that is, only on the image, and it is difficult to appropriately learn a new concept, in particular a concept such as the property of an object, in the way the robot device 100 does.
  • By using the combination of an image, a question related to the image, and a correct answer corresponding to the question as support information, the robot apparatus 100 can acquire any concept corresponding to the image, and can therefore enable appropriate responses to questions regarding images.
  • Compared with one-shot learning, the robot device 100 has the advantage that it can learn, for the same object or the same image, different concepts depending on the context given by the question, rather than relying on visual similarity alone.
  • The robot apparatus 100 can identify an image and a question that have the same "answer" by adding the question to the image and considering not only the visual similarity but also the context given by the question.
  • Even for images of the same object (target object), the robot apparatus 100 can acquire attributes including various properties and relationships, such as color, size, number, and purpose, depending on the context obtained from the question. In this way, the robot apparatus 100 can acquire attributes including the property of the target object and the relationship between the target object and another target object.
  • the robot apparatus 100 can acquire various concepts including the attribute of the target object in addition to the name of the target object itself. For example, the robot apparatus 100 can acquire a concept regarding a property or a state of an object included in an image. Thus, the robot apparatus 100 can register support information corresponding to concepts such as various attributes from the same image or the same object by repeating the question and answer. As a result, the robot apparatus 100 can acquire knowledge corresponding to the concept of support information.
  • the robot apparatus 100 can acquire a concept regarding the amount, color, temperature, hardness, etc. of the object included in the image.
  • the robot apparatus 100 can also acquire a concept (relative concept) regarding the relationship between the target object included in the image and another target object. For example, when a concept based on a relationship with another object such as “large” or “small” is used as a correct answer, the robot apparatus 100 can also acquire a concept to be related. In this way, the robot apparatus 100 can acquire as many concepts as the number of questions even for images of the same object (object).
  • the robot apparatus 100 can learn a new concept regarding an image through a context obtained from a question and answer session. That is, the robot apparatus 100 can learn a new concept through images and questions.
  • an object (object) in the real world has not only a name (label) but also various attributes.
  • the target object has various attributes such as properties and relationships with other objects.
  • the attributes of an object are often affected by the situation or context.
  • any object (object) has an attribute of size, and the size is a relative concept.
  • the agent or robot recognizes a sample of an object corresponding to the object through a camera or a file.
  • the user may input the information of the object verbally or by characters, but in most cases, the information is limited to a simple label or name of the object.
  • The robot apparatus 100 can acquire, through a question and answer session with the user, not only images but also various concepts that differ depending on the context for one object, and can therefore appropriately respond to a question about an image.
  • The robot apparatus 100 can appropriately add information for learning a new concept to an entity (for example, a computer) other than a human that performs information processing.
  • the robot apparatus 100 can improve the practicality by applying the one-shot learning methodology to the setting of VQA so that a wider range of concepts can be learned.
  • FIG. 2 is a diagram showing another example of information processing according to the first embodiment of the present disclosure. Note that steps S21 to S24 in FIG. 2 are the same as steps S11 to S14 in FIG. 1, so description thereof will be omitted.
  • The robot apparatus 100 that has received the question QS1 "What color is the right?" responds to the question QS1 using various techniques.
  • the robot apparatus 100 uses the input image IM1 and the question QS1 as a query and identifies the correct answer corresponding to the query.
  • the robot apparatus 100 outputs the correct answer of the support information determined by the identification processing as a correct answer candidate (step S25).
  • the robot apparatus 100 determines the support information by the identification processing as shown in FIG. 8 and uses the support information whose correct answer is “light purple” for the response to the question QS1.
  • the robot apparatus 100 outputs the correct answer “light purple” as the correct answer candidate AA2.
  • the user U1 inputs, into the robot apparatus 100, a reaction to the correct answer that the robot apparatus 100 has responded to (step S26).
  • the user U1 inputs a positive reaction to the robot device 100 because the correct answer candidate AA2 “light purple” output by the robot device 100 corresponds to the color of the ice cream package on the right side in the image IM1.
  • The user U1 inputs an affirmative command OK1, registered in advance in the robot apparatus 100, that affirms the correct answer candidate. Specifically, the user U1 inputs an affirmative reaction by speaking "correct answer" to the robot apparatus 100, indicating that the correct answer candidate is correct.
  • the affirmative command OK1 is not limited to “correct answer”, but may be set appropriately such as “matched” or “yes”. For example, when the robot device 100 receives the affirmative command OK1, it shifts to the support information registration mode.
  • the robot apparatus 100 registers, as support information, the combination of the image, the question related to the image, and the correct answer corresponding to the question (step S27).
  • For example, the robot apparatus 100 registers the combination of the image IM1, the question QS1, and the correct answer candidate AA2, which is the correct answer, as shown in the additional registration information RINF2, as the support information identified by the support information ID "SP1" (support information SP1).
  • the robot apparatus 100 registers the three elements of the input image, the input question, and the correct answer as one set.
  • the robot apparatus 100 stores the support information SP1 in the support information storage unit 141 (see FIG. 4).
  • the robot apparatus 100 registers the correct answer candidate AA2 output by the robot apparatus 100 itself as the correct answer corresponding to the image IM1 and the question QS1. Thereby, the robot apparatus 100 can enable a more appropriate response to the question regarding the image.
  • FIG. 3 is a diagram showing a configuration example of the robot apparatus 100 according to the first embodiment of the present disclosure.
  • the robot apparatus 100 includes a communication unit 11, an input unit 12, an output unit 13, a storage unit 14, a control unit 15, a sensor unit 16, and a drive unit 17.
  • the communication unit 11 is realized by, for example, a NIC (Network Interface Card) or a communication circuit.
  • the communication unit 11 is connected to a network N (Internet or the like) by wire or wirelessly, and transmits / receives information to / from other devices or the like via the network N.
  • the user inputs various operations to the input unit 12.
  • the input unit 12 receives an input from the user.
  • the input unit 12 receives a response to the correct candidate by the user.
  • The input unit 12 accepts a correct answer, different from the correct answer candidate, input by the user.
  • the input unit 12 has a function of detecting voice.
  • the input unit 12 has a microphone that detects voice.
  • the input unit 12 receives a user's utterance as an input.
  • the input unit 12 may receive various operations from the user via buttons or a touch panel provided on the robot apparatus 100.
  • the output unit 13 outputs various information.
  • the output unit 13 has a function of outputting voice.
  • the output unit 13 has a speaker that outputs sound.
  • the output unit 13 outputs correct answer candidates corresponding to the question.
  • the output unit 13 outputs the question.
  • the output unit 13 outputs a question when the user is detected by the sensor unit 16.
  • The output unit 13 outputs a response using the support information determined by the decision unit 156 (the decision support information).
  • the output unit 13 outputs a voice requesting a correct answer from the user.
  • the output unit 13 outputs the correct answer included in the decision support information.
  • The output unit 13 may output various information by displaying it on a display unit such as a display provided in the robot apparatus 100.
  • the storage unit 14 is realized by, for example, a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory (Flash Memory), or a storage device such as a hard disk or an optical disk.
  • the storage unit 14 includes a support information storage unit 141, a model information storage unit 142, and a mode information storage unit 143.
  • the support information storage unit 141 stores various information regarding support.
  • FIG. 4 is a diagram illustrating an example of the support information storage unit according to the first embodiment of the present disclosure.
  • FIG. 4 shows an example of the support information storage unit 141 according to the first embodiment.
  • the support information storage unit 141 has items such as “support information ID”, “image”, “question”, and “correct answer”.
  • “Support information ID” indicates identification information for identifying support information.
  • "Image” indicates an image registered as support information.
  • Although FIG. 4 shows an example in which conceptual information such as "IM11" and "IM12" is stored in "image", in reality, image information, moving image information, or a file path name indicating the storage location thereof is stored.
  • “Question” indicates a question registered as support information.
  • Although FIG. 4 shows an example in which conceptual information such as "QS11" or "QS12" is stored in "question", in reality, character information or voice information indicating the question, or a file path name indicating the storage location thereof, is stored.
  • “Correct answer” indicates a correct answer registered as support information.
  • Although FIG. 4 shows an example in which conceptual information such as "AS11" or "AS12" is stored in "correct answer", in reality, character information or voice information indicating the correct answer, or a file path name indicating the storage location thereof, is stored.
  • the support information (support information SP11) identified by the support information ID “SP11” is a combination of the image IM11, the question QS11, and the correct answer AS11. That is, it is indicated that the support information SP11 includes information on the image IM11, information on the question QS11, and information on the correct answer AS11.
  • the model information storage unit 142 stores information about the model.
  • the model information storage unit 142 stores the model information (model data) learned (generated) by the learning process.
  • FIG. 5 is a diagram illustrating an example of the model information storage unit according to the first embodiment of the present disclosure.
  • FIG. 5 shows an example of the model information storage unit 142 according to the first embodiment.
  • the model information storage unit 142 includes items such as “model ID” and “model data”.
  • Model ID indicates identification information for identifying the model.
  • The model identified by the model ID "M1" corresponds to the model M1 that identifies (determines) the support information corresponding to the query, as illustrated in FIG. 7.
  • Model data indicates model data.
  • Although FIG. 5 shows an example in which conceptual information such as "MDT1" is stored in "model data", in reality, various information that constitutes the model, such as information about the networks included in the model and functions, is included.
  • the mode information storage unit 143 stores information about the mode of the robot device 100.
  • FIG. 6 is a diagram illustrating an example of the mode information storage unit according to the first embodiment of the present disclosure.
  • FIG. 6 shows an example of the mode information storage unit 143 according to the first embodiment.
  • the mode information storage unit 143 has items such as “mode ID”, “mode”, and “flag”.
  • Mode ID indicates information for identifying the mode.
  • “Mode” indicates the content of the mode identified by the mode ID.
  • The "flag" is a flag indicating which mode is selected from the settable modes, that is, the mode in the current state.
  • The mode whose "flag" value is "1" is the selected mode. That is, in FIG. 6, the mode "normal" identified by the mode ID "MO1" is selected, and the mode of the robot apparatus 100 in the current state is the normal mode.
  • the mode identified by the mode ID “MO2” (mode MO2) is the question registration mode. Further, the mode MO2 has a flag of "0", which indicates that the mode is not the selected mode.
  • The control unit 15 is realized by, for example, a CPU (Central Processing Unit) or an MPU (Micro Processing Unit) executing a program stored in the robot apparatus 100 (for example, an information processing program according to the present disclosure) using a RAM (Random Access Memory) or the like as a work area.
  • the control unit 15 is a controller and may be realized by an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).
  • The control unit 15 includes an acquisition unit 151, a determination unit 152, a generation unit 153, a registration unit 154, a learning unit 155, and a decision unit 156, and realizes or executes the functions and actions of the information processing described below.
  • the internal configuration of the control unit 15 is not limited to the configuration shown in FIG. 3, and may be another configuration as long as it is a configuration for performing information processing described later.
  • the acquisition unit 151 acquires various information.
  • the acquisition unit 151 acquires various types of information from an external information processing device.
  • the acquisition unit 151 acquires various information from the storage unit 14.
  • the acquisition unit 151 acquires the input information received by the input unit 12.
  • the acquisition unit 151 acquires the sensor information detected by the sensor unit 16.
  • the acquisition unit 151 acquires an image, a question related to the image, and a correct answer corresponding to the question.
  • the determination unit 152 makes various determinations.
  • the determination unit 152 determines various information based on the information acquired by the acquisition unit 151.
  • the determination unit 152 determines various information based on the information stored in the storage unit 14.
  • the determination unit 152 appropriately uses various techniques to determine whether the question registration by the user is completed.
  • the determination unit 152 determines (estimates) the meaning and content of the question by analyzing the question (character information) using various techniques related to natural language processing.
  • the determination unit 152 determines whether the input by the user is a question. The determination unit 152 determines whether the input by the user is a question by analyzing the character information obtained by converting the voice information of the user. The determination unit 152 determines whether or not there is a response from the user. The determination unit 152 determines whether or not there is a response from the user in response to the acceptance of the input by the user. When the utterance by the user is detected by the microphone, the determination unit 152 determines that the user has reacted.
  • the determination unit 152 determines whether the user's reaction is positive or negative. For example, the determination unit 152 determines whether the user's reaction is affirmative or negative by analyzing the character information in which the user's reaction (voice information) is converted.
  • the generation unit 153 performs various generations.
  • the generation unit 153 generates various information based on the information acquired by the acquisition unit 151.
  • the generation unit 153 generates various information based on the information stored in the storage unit 14.
  • The generation unit 153 generates an episode whose query information is the combination of the input image detected by the sensor unit 16 and the question input by the user. For example, the generation unit 153 generates an episode including query information that is a combination of the input image and the input question, and the support information stored in the support information storage unit 141. For example, the generation unit 153 generates an episode including the support information group stored in the support information storage unit 141 as information for determining the response to the query information.
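  • Building on the hypothetical SupportInfoStorage sketched earlier, episode generation could be illustrated as follows (the Episode structure and the function name are assumptions for illustration):

```python
from dataclasses import dataclass

@dataclass
class Episode:
    """Query information plus the support information group used to decide the response."""
    query_image: str
    query_question: str
    support_set: list  # list of SupportInfo records from the earlier sketch

def generate_episode(query_image: str, query_question: str, storage) -> Episode:
    """Combine the current query with all registered support information into one episode."""
    return Episode(query_image, query_question, list(storage.records.values()))
```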
  • the registration unit 154 performs various registrations.
  • the registration unit 154 registers various information based on the information acquired by the acquisition unit 151.
  • the registration unit 154 registers the information acquired by the acquisition unit 151 in the storage unit 14.
  • the registration unit 154 functions as an image registration unit that registers an image.
  • the registration unit 154 functions as a question registration unit that registers a question.
  • the registration unit 154 functions as a correct answer registration unit that registers a correct answer.
  • The registration unit 154 registers the combination of the image, the question, and the correct answer acquired by the acquisition unit 151 as support information used for determining a response to query information including one image and one question related to the one image.
  • the registration unit 154 registers the three elements of the input image, the input question, and the correct answer as one set. In the example of FIG. 1, the registration unit 154 registers the combination of the image IM1, the question QS1 and the correct answer AS1 as the support information SP1. In the example of FIG. 2, the registration unit 154 registers the combination of the image IM1, the question QS1, and the correct answer candidate AA2, which is the correct answer, as the support information SP1.
  • the learning unit 155 performs various kinds of learning.
  • the learning unit 155 learns various information based on the information acquired by the acquisition unit 151.
  • the learning unit 155 learns various information based on the information stored in the storage unit 14.
  • the learning unit 155 learns (generates) a model.
  • the learning unit 155 learns (generates) a model based on the information acquired by the acquisition unit 151.
  • the learning unit 155 learns (generates) a model based on the information stored in the storage unit 14.
  • The learning unit 155 learns the model using various machine learning techniques. For example, the learning unit 155 learns a model having a network structure as shown in FIG. 8. The learning unit 155 learns a model that identifies support information corresponding to query information including an image and a question. For example, the learning unit 155 learns the model M1 that identifies the support information corresponding to the query information including the image and the question. The learning unit 155 may generate a model by performing a learning process using episodes as learning sets. The learning unit 155 may generate a model by performing a learning process using each of the episodes EP1 to EP3 shown in FIG. 7 as a learning set.
  • the learning unit 155 may generate a model by performing a learning process based on various learning methods.
  • the learning unit 155 may generate a model by performing a learning process based on a method related to one-shot learning.
  • the learning unit 155 may generate the model M1 by performing a learning process based on a method related to one-shot learning. Note that the above is an example, and the learning unit 155 may generate a model by any learning method as long as it can generate a model that identifies support information corresponding to query information including an image and a question.
  • the decision unit 156 makes various decisions.
  • The decision unit 156 determines various information based on the information acquired by the acquisition unit 151.
  • The decision unit 156 determines various information based on the information stored in the storage unit 14.
  • The decision unit 156 may make various estimations.
  • For example, the decision unit 156 may estimate the surrounding space as shown in FIG.
  • The decision unit 156 determines one correct answer corresponding to the one question of the query information based on the query information and the support information. The decision unit 156 determines the one correct answer based on the one image and the one question included in the query information and the image and the question included in the support information.
  • The decision unit 156 identifies the support information including the correct answer corresponding to the query information, using an episode including the query information and the support information.
  • The decision unit 156 identifies support information including a correct answer corresponding to the query information, and determines the identified support information as the decision support information including the correct answer corresponding to the query information.
  • The decision unit 156 uses the input image IM1 and the question QS1 as query information and identifies the correct answer corresponding to the query information.
  • The decision unit 156 determines the mode.
  • The decision unit 156 changes the mode based on the determined mode.
  • The decision unit 156 compares the character string corresponding to the user's utterance with the character information in the association information and shifts to the mode of the matching character information.
  • The decision unit 156 changes the mode based on the input by the user. In the example of FIG. 1, the decision unit 156 changes the mode to the question registration mode corresponding to the command MD1 "question" of the user U1.
  • The decision unit 156 changes the mode to the correct answer registration mode corresponding to the negative command NG1 "no" of the user U1. For example, when the decision unit 156 receives the affirmative command OK1, it shifts to the support information registration mode.
  • the sensor unit 16 detects predetermined information.
  • the sensor unit 16 has a function as an image capturing unit that captures an image.
  • the sensor unit 16 has a function of an image sensor and detects image information.
  • the sensor unit 16 functions as an image input unit that receives an image as an input.
  • the sensor unit 16 is not limited to the above, and may have various sensors.
  • the sensor unit 16 includes a position sensor, an acceleration sensor, a gyro sensor, a temperature sensor, a humidity sensor, an illuminance sensor, a pressure sensor, a proximity sensor, a sensor for acquiring biological information such as odor, sweat, heartbeat, pulse and brain wave. It may have various sensors. Further, the sensor for detecting the above various information in the sensor unit 16 may be a common sensor or may be realized by different sensors.
  • the drive unit 17 has a function of driving the physical configuration of the robot device 100.
  • The drive unit 17 has a function of driving the joints of the robot apparatus 100, such as the neck, hands, and feet.
  • the drive unit 17 is, for example, an actuator.
  • the driving unit 17 may have any configuration as long as the robot apparatus 100 can realize a desired operation.
  • the drive unit 17 may have any configuration as long as it can drive the joints of the robot apparatus 100, move the positions, and the like.
  • the drive unit 17 drives the tracks and tires.
  • the drive unit 17 drives the joint of the neck of the robot apparatus 100 to change the viewpoint of the camera provided on the head of the robot apparatus 100.
  • The drive unit 17 drives the joint of the neck of the robot apparatus 100 so as to capture an image in the direction determined by the decision unit 156, thereby changing the viewpoint of the camera provided on the head of the robot apparatus 100. Further, the drive unit 17 may change only the orientation of the camera or the imaging range. The drive unit 17 may thus change the viewpoint of the camera.
  • FIG. 7 is a diagram showing an example of identification according to the present disclosure.
  • In FIG. 7, three episodes EP1 to EP3 are illustrated, but the episodes EP1 to EP3 are examples for explaining the identification processing and the learning processing, and the robot device 100 may use various other episodes.
  • FIG. 8 is a diagram showing an example of a learning network structure according to the present disclosure.
  • the robot apparatus 100 performs the identification process on the episode EP1 including the query QINF1 and the support information SP11 to SP13.
  • the query QINF1 includes an image IM15 showing a balloon and a question QS15 “What is the color of the balloon?”.
  • the support information SP11 includes an image IM11 showing two ice creams, a question QS11 “what color is the ice cream on the right side?”, And a correct answer AS11 “light purple”.
  • the support information SP12 includes an image IM12 showing a balloon, a question QS12 "What is this?", And a correct answer AS12 "balloon”.
  • the support information SP13 includes an image IM13 showing the sky and three balloons, a question QS13 "What color is the sky?”, And a correct answer AS13 "blue”.
  • the robot apparatus 100 identifies the support information corresponding to the query QINF1 among the support information SP11 to SP13. For example, the robot apparatus 100 identifies the support information corresponding to the query QINF1 using the model having the network structure shown in FIG.
  • The robot apparatus 100 projects the feature amount of the image and the feature amount of the question included in the query or the support information onto a common space, and identifies the support information corresponding to the query by comparing distances in that space.
  • the processing group PS1 in FIG. 8 corresponds to the above identification processing.
  • the robot apparatus 100 performs processing corresponding to the partial processing PT1, the partial processing PT2, and the processing group PS1 including distance comparison and identification.
  • the robot apparatus 100 learns a model (identification model) that performs the processing group PS1 in FIG.
  • the identification model is the model M1.
  • the robot device 100 performs a query-related process by the partial process PT1.
  • the robot apparatus 100 extracts the feature amount (for example, vector) of the image (input image) included in the query.
  • the robot apparatus 100 extracts the feature amount of the input image (hereinafter also referred to as “image feature amount”) by inputting the input image into a network for extracting the image feature amount (hereinafter also referred to as “image feature extraction network”).
  • The robot apparatus 100 inputs the input image into the image feature extraction network and causes the image feature extraction network to output the image feature amount of the input image.
  • the robot apparatus 100 causes the image feature extraction network to output a vector indicating the image feature amount.
  • the robot apparatus 100 extracts the image feature amount of the image IM15 by inputting the image IM15 to the image feature extraction network.
  • the robot apparatus 100 also extracts the feature amount of the question (input question) included in the query.
  • The robot apparatus 100 extracts the feature amount of the input question (hereinafter also referred to as "question feature amount") by inputting the input question into a network for extracting the question feature amount (hereinafter also referred to as "question feature extraction network").
  • The robot apparatus 100 inputs the input question into the question feature extraction network and causes the question feature extraction network to output the question feature amount of the input question.
  • the robot apparatus 100 causes the question feature extraction network to output a vector indicating the question feature amount.
  • the robot apparatus 100 extracts the question feature amount of the question QS15 by inputting the question QS15 to the question feature extraction network.
  • the robot apparatus 100 projects the image feature amount extracted from the image included in the query and the question feature amount extracted from the question included in the query onto a common space (for example, an N-dimensional space).
  • the robot apparatus 100 inputs the image feature amount and the question feature amount into a network that projects the image feature amount and the question feature amount onto a common space (hereinafter also referred to as “projection network”).
  • The robot apparatus 100 integrates the image feature amount and the question feature amount, and projects them onto the common space.
  • For example, the robot apparatus 100 causes the projection network to output a feature amount obtained by integrating the image feature amount and the question feature amount (hereinafter also referred to as "integrated feature amount") projected in the common space.
  • For example, the projection network may output an integrated feature amount (vector) obtained by simply concatenating the image feature amount (vector) and the question feature amount (vector).
  • The robot apparatus 100 inputs the image feature amount of the image IM15 and the question feature amount of the question QS15 to the projection network, and causes the projection network to output the integrated feature amount of the image IM15 and the question QS15 (hereinafter also referred to as "integrated feature amount FT15").
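  • A minimal sketch of this projection step, assuming the image feature amount and the question feature amount have already been extracted as vectors (NumPy is used for illustration; the actual image and question feature extraction networks are not shown, and a learned projection could replace the simple concatenation):

```python
import numpy as np

def integrated_feature(image_feature: np.ndarray, question_feature: np.ndarray) -> np.ndarray:
    """Project the image feature and the question feature onto a common space.
    Here the projection simply concatenates the two vectors, which is one form the
    projection network may take according to the description above."""
    return np.concatenate([image_feature, question_feature])

# e.g. a 512-dimensional image feature and a 256-dimensional question feature
# give a 768-dimensional integrated feature such as FT15.
ft15 = integrated_feature(np.random.rand(512), np.random.rand(256))
print(ft15.shape)  # -> (768,)
```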
  • the robot apparatus 100 performs processing related to support information by the partial processing PT2. For example, the robot apparatus 100 performs the partial process PT2 for each piece of support information. In the example of episode EP1 in FIG. 7, the robot apparatus 100 performs the partial process PT2 for each of the support information SP11 to SP13.
  • the robot device 100 extracts the feature amount (for example, vector) of the image (input image) included in the support information.
  • the robot apparatus 100 extracts the feature amount (image feature amount) of the input image by inputting the input image to a network (image feature extraction network) for extracting the image feature amount.
  • The robot apparatus 100 inputs the input image into the image feature extraction network and causes the image feature extraction network to output the image feature amount of the input image.
  • the robot apparatus 100 causes the image feature extraction network to output a vector indicating the image feature amount.
  • the robot apparatus 100 extracts the image feature amount of the image IM11 by inputting the image IM11 of the support information SP11 to the image feature extraction network.
  • the robot apparatus 100 extracts the feature amount of the question (input question) included in the support information.
  • The robot apparatus 100 extracts the feature amount (question feature amount) of the input question by inputting the input question into a network (question feature extraction network) for extracting the question feature amount.
  • The robot apparatus 100 inputs the input question into the question feature extraction network and causes the question feature extraction network to output the question feature amount of the input question.
  • the robot apparatus 100 causes the question feature extraction network to output a vector indicating the question feature amount.
  • the robot apparatus 100 extracts the question feature quantity of the question QS11 by inputting the question QS11 of the support information SP11 to the question feature extraction network.
  • the robot apparatus 100 projects the image feature amount extracted from the image included in the support information and the question feature amount extracted from the question included in the support information onto a common space (for example, an N-dimensional space).
  • the robot apparatus 100 inputs the image feature amount and the question feature amount to a network (projection network) that projects the image feature amount and the question feature amount on a common space.
  • the robot apparatus 100 integrates the image feature amount and the question feature amount, and projects them in the common space.
  • For example, the robot apparatus 100 causes the projection network to output an integrated feature amount obtained by integrating the image feature amount and the question feature amount projected in the common space.
  • The robot apparatus 100 inputs the image feature amount of the image IM11 of the support information SP11 and the question feature amount of the question QS11 to the projection network, and causes the projection network to output the integrated feature amount of the image IM11 and the question QS11 (hereinafter also referred to as "integrated feature amount FT11").
  • Similarly, the robot apparatus 100 inputs the image feature amount of the image IM12 of the support information SP12 and the question feature amount of the question QS12 to the projection network, and causes it to output the integrated feature amount of the image IM12 and the question QS12 (hereinafter also referred to as "integrated feature amount FT12").
  • The robot apparatus 100 also inputs the image feature amount of the image IM13 of the support information SP13 and the question feature amount of the question QS13 to the projection network, and causes it to output the integrated feature amount of the image IM13 and the question QS13 (hereinafter also referred to as "integrated feature amount FT13").
  • Note that a common network (the image feature extraction network, the question feature extraction network, and the projection network) is used for the partial processes PT1 and PT2.
  • the robot apparatus 100 compares the distance between the query and each piece of support information based on the information projected in the common space.
  • the robot apparatus 100 compares the distance between the query QINF1 and each of the support information SP11 to SP13 based on the information projected in the common space.
  • the robot apparatus 100 compares the distance between the integrated feature quantity FT15 of the query QINF1 and the integrated feature quantities FT11 to FT13 of each of the support information SP11 to SP13.
  • the robot apparatus 100 identifies the support information corresponding to the query based on the result of comparison of the distance between the query and each support information. For example, the robot apparatus 100 identifies support information that approximates the query as support information that corresponds to the query. For example, the robot apparatus 100 identifies the support information having the shortest distance from the query as the support information corresponding to the query. In the example of the episode EP1 in FIG. 7, the robot apparatus 100 identifies the support information of the integrated feature amount that is the closest to the integrated feature amount FT15 of the query QINF1 as the support information corresponding to the query QINF1. The robot apparatus 100 identifies the support information SP11 of the integrated feature quantity FT11 closest to the integrated feature quantity FT15 of the query QINF1 as the support information corresponding to the query QINF1.
  • the robot apparatus 100 identifies the support information SP11 among the support information SP11 to SP13 as the support information corresponding to the query QINF1. Then, the robot apparatus 100 may determine the correct answer corresponding to the query QINF1 as the correct answer AS11 of the support information SP11. Specifically, the robot apparatus 100 may determine the correct answer corresponding to the query QINF1 to be the correct answer AS15 which is the same “light purple” as the correct answer AS11. Then, the robot apparatus 100 may register, as support information, a combination of the image IM15 and the question QS15 included in the query QINF1 and the correct answer AS15.
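As one way to read the identification step described above, the following sketch projects the query and each piece of support information into the common space and adopts the correct answer of the nearest support information; the Euclidean distance is an assumption, since the text only speaks of comparing distances:

```python
# Illustrative sketch of the identification step: compare the integrated feature
# amount of the query with that of each piece of support information and adopt
# the correct answer of the nearest one.
import torch

def identify(query, supports, image_net, question_net, projection_net):
    """query: (image_tensor, token_tensor); supports: list of (image, tokens, answer)."""
    q_feat = projection_net(image_net(query[0]), question_net(query[1]))   # e.g. FT15
    best_answer, best_dist = None, float("inf")
    for img, tokens, answer in supports:                                   # e.g. SP11..SP13
        s_feat = projection_net(image_net(img), question_net(tokens))      # e.g. FT11..FT13
        dist = torch.dist(q_feat, s_feat).item()
        if dist < best_dist:
            best_dist, best_answer = dist, answer
    return best_answer   # correct answer candidate for the query
```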
  • the robot device 100 may perform the learning process based on the identification result when the correct answer AS15 of the query QINF1 has been acquired.
  • the robot apparatus 100 may learn an identification model including the image feature extraction network, the question feature extraction network, and the projection network described above.
  • for example, when the identification model identifies support information other than the support information SP11 as the support information corresponding to the query QINF1, the robot apparatus 100 may train the identification model so that the support information corresponding to the query QINF1 is identified as the support information SP11.
  • for such learning, for example, an arbitrary learning method such as back propagation or stochastic gradient descent can be adopted.
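The text does not fix a concrete loss; one common choice consistent with "identify the nearest support information" is a matching-network-style objective, sketched below under that assumption (softmax over negative distances, cross-entropy against the index of the correct support information such as SP11):

```python
# Minimal training sketch (an assumption, not the patent's specified loss):
# closer support items get larger logits; the model is trained so the correct
# support information becomes the nearest one.
import torch
import torch.nn.functional as F

def episode_loss(query_feat, support_feats, correct_index):
    # query_feat: (D,), support_feats: (S, D), correct_index: int
    dists = torch.cdist(query_feat.unsqueeze(0), support_feats).squeeze(0)  # (S,)
    logits = -dists                       # shorter distance -> larger logit
    target = torch.tensor(correct_index)
    return F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))

# loss.backward() followed by an optimizer step realizes the learning by
# back propagation and stochastic gradient descent mentioned above.
```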
  • the robot device 100 performs the identification process on the episode EP2 including the query QINF2 and the support information SP21 to SP23.
  • the query QINF2 includes an image IM25 showing two balloons and a question QS25 asking "How many balloons?".
  • the support information SP21 includes an image IM21 showing two ice creams, a question QS21 "How many ice creams?", and a correct answer AS21 "two".
  • the support information SP22 includes an image IM22 showing a balloon, a question QS22 "What color is this?", and a correct answer AS22 "blue".
  • the support information SP23 includes an image IM23 showing the sky and three balloons, a question QS23 "How many balloons?", and a correct answer AS23 "3".
  • the robot apparatus 100 identifies the support information corresponding to the query QINF2 among the support information SP21 to SP23. For example, the robot apparatus 100 identifies the support information corresponding to the query QINF2 using the model having the network structure as shown in FIG. The robot apparatus 100 identifies the support information SP21 of the integrated feature quantity closest to the integrated feature quantity of the query QINF2 as the support information corresponding to the query QINF2.
  • the robot apparatus 100 identifies the support information SP21 among the support information SP21 to SP23 as the support information corresponding to the query QINF2. Then, the robot apparatus 100 may determine the correct answer corresponding to the query QINF2 as the correct answer AS21 of the support information SP21. Specifically, the robot apparatus 100 may determine the correct answer corresponding to the query QINF2 to be the correct answer AS25, which is the same "two" as the correct answer AS21. Then, the robot apparatus 100 may register the combination of the image IM25 and the question QS25 included in the query QINF2 and the correct answer AS25 as support information. Thereby, the robot apparatus 100 can acquire the concept of numbers.
  • the robot apparatus 100 may perform the learning process based on the identification result when the correct answer AS25 of the query QINF2 has been acquired. For example, when the identification model shown in FIG. 8 identifies the support information SP23 as the support information corresponding to the query QINF2, the robot apparatus 100 may train the identification model so that the support information corresponding to the query QINF2 is identified as the support information SP21.
  • the robot apparatus 100 performs the identification process on the episode EP3 including the query QINF3 and the support information SP31 to SP33.
  • the query QINF3 includes an image IM35 showing ice and a question QS35 "What does it feel like to touch?"
  • the support information SP31 includes an image IM31 showing two ice creams, a question QS31 "What does it feel like to touch?", and a correct answer AS31 "cold".
  • the support information SP32 includes an image IM32 showing a balloon, a question QS32 "What shape is this?", and a correct answer AS32 "round".
  • the support information SP33 includes an image IM33 showing the sky and three balloons, a question QS33 "What kind of atmosphere?", and a correct answer AS33 "fluffy".
  • the robot apparatus 100 identifies the support information corresponding to the query QINF3 among the support information SP31 to SP33. For example, the robot apparatus 100 identifies the support information corresponding to the query QINF3 using the model having the network structure as shown in FIG. The robot apparatus 100 identifies the support information SP31 of the integrated feature quantity closest to the integrated feature quantity of the query QINF3 as the support information corresponding to the query QINF3.
  • the robot apparatus 100 identifies the support information SP31 among the support information SP31 to SP33 as the support information corresponding to the query QINF3. Then, the robot apparatus 100 may determine the correct answer corresponding to the query QINF3 as the correct answer AS31 of the support information SP31. Specifically, the robot apparatus 100 may determine the correct answer corresponding to the query QINF3 to be the correct answer AS35 that is the same "cold" as the correct answer AS31. Then, the robot apparatus 100 may register the combination of the image IM35 and the question QS35 included in the query QINF3 and the correct answer AS35 as support information. Thereby, the robot apparatus 100 can acquire the concept regarding the impression of the user.
  • the robot apparatus 100 can acquire the concept regarding the temperature of the object.
  • the robot device 100 can acquire various concepts such as hardness as well as temperature as long as the concept is related to the impression of the user.
  • the robot apparatus 100 can acquire the concept of hardness by registering support information including correct answers such as “hard” and “soft”.
  • the robot apparatus 100 may perform the learning process based on the identification result when the correct answer AS35 of the query QINF3 has been acquired. For example, when the identification model shown in FIG. 8 identifies the support information SP33 as the support information corresponding to the query QINF3, the robot apparatus 100 may train the identification model so that the support information corresponding to the query QINF3 is identified as the support information SP31.
  • the robot apparatus 100 may select one piece of support information from the support information group and perform learning using the selected support information as query information.
  • FIG. 9 is a block diagram showing an example of a configuration from input to output according to the present disclosure.
  • the robot apparatus 100 extracts the feature amount (image feature amount) of the image detected by the camera from the query. Further, the robot apparatus 100 extracts the feature amount (question feature amount) of the voice of the question input through the microphone from the query. Then, the robot apparatus 100 projects the image feature amount and the question feature amount onto the common space. Thereby, the robot device 100 completes the preparation of the information corresponding to the query.
  • the robot device 100 generates an episode.
  • the robot apparatus 100 generates an episode by adding the support information.
  • the robot apparatus 100 uses the support information stored in the support information storage unit 141 to generate an episode.
  • the robot apparatus 100 may generate an episode by using a part of the support information stored in the support information storage unit 141, or may generate an episode by using all the support information stored in the support information storage unit 141.
  • the robot apparatus 100 performs identification processing based on the episode. For example, the robot apparatus 100 identifies the support information corresponding to the query from the support information of the episode. The robot apparatus 100 determines the identified support information as the support information used to determine the response to the query information.
  • the robot apparatus 100 outputs the correct answer of the determined support information as a correct answer candidate corresponding to the query by the speaker.
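A rough end-to-end sketch of this input-to-output flow is given below; the `support_store`, `identify_fn`, and `speaker` interfaces are placeholders for the robot's actual components, which the text does not detail:

```python
# Sketch of the FIG. 9 pipeline: query features -> episode generation ->
# identification -> speaker output. All interfaces are illustrative placeholders.
def answer_query(camera_image, question_tokens, support_store, identify_fn, speaker):
    # 1. Generate an episode by adding (part of) the registered support information.
    episode = support_store.sample_episode()       # list of (image, tokens, answer)
    # 2. Identify the support information whose integrated feature amount is
    #    closest to that of the query in the common space.
    answer = identify_fn((camera_image, question_tokens), episode)
    # 3. Output the correct answer of the determined support information by the speaker.
    speaker.say(answer)
    return answer
```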
  • FIG. 10 is a flowchart showing a procedure of information processing according to the first embodiment of the present disclosure.
  • the robot apparatus 100 acquires an image (step S101).
  • the robot device 100 acquires an image captured by a camera.
  • the robot apparatus 100 acquires a question related to the image (step S102). For example, the robot apparatus 100 acquires the user's question input by the microphone.
  • the robot apparatus 100 acquires the correct answer corresponding to the question (step S103). For example, the robot apparatus 100 outputs the correct answer candidate corresponding to the question, and acquires the correct answer according to the user's reaction to the correct answer candidate. For example, the robot apparatus 100 acquires the correct answer candidate as the correct answer when the user's reaction to the correct answer candidate is positive. For example, the robot apparatus 100 acquires the correct answer provided by the user when the user's reaction to the correct answer candidate is negative.
  • the robot apparatus 100 registers the combination of the acquired image, question, and correct answer as support information (step S104). For example, if the user's reaction to the correct answer candidate is affirmative, the robot apparatus 100 registers the combination of the image, the question, and the correct answer candidate that is the correct answer as support information. For example, if the user's reaction to the correct answer candidate is negative, the robot apparatus 100 registers the combination of the image, the question, and the correct answer provided by the user as support information. For example, the robot apparatus 100 stores the acquired combination of the image, the question, and the correct answer in the support information storage unit 141 in association with the unallocated support information ID.
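A minimal sketch of this registration procedure (steps S101 to S104) might look as follows, assuming a simple in-memory support information store; the ID scheme and helper names are illustrative:

```python
# Sketch of the registration procedure in FIG. 10 (S101-S104), with a
# dict-based support information store and an assumed ID scheme.
import itertools

class SupportStore:
    def __init__(self):
        self._items = {}
        self._ids = (f"SP{i}" for i in itertools.count(1))

    def register(self, image, question, answer):
        sid = next(self._ids)                 # unallocated support information ID
        self._items[sid] = {"image": image, "question": question, "answer": answer}
        return sid

def register_from_interaction(store, image, question, candidate, user_reaction, user_answer=None):
    # Positive reaction: the output correct answer candidate is the correct answer (S103).
    # Negative reaction: the correct answer provided by the user is used instead.
    answer = candidate if user_reaction == "positive" else user_answer
    return store.register(image, question, answer)   # S104
```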
  • FIG. 11 is a flowchart showing a procedure of correct answer registration processing by a dialog with a user according to the first embodiment of the present disclosure.
  • the robot apparatus 100 receives an input (step S201).
  • the robot apparatus 100 receives a voice uttered by the user as an input.
  • the process of the interaction part with the user will be mainly described. Therefore, in the example of FIG. 11, although not shown, it is assumed that the robot apparatus 100 has acquired the image (input image) before step S201.
  • the robot apparatus 100 captures and acquires an input image with a camera.
  • the robot apparatus 100 determines whether the input is a question (step S202). For example, the robot apparatus 100 determines whether the input is a question by analyzing the character information obtained by converting the voice information of the user. When determining that the input is not a question (step S202; No), the robot apparatus 100 returns to step S201 and repeats the processing.
  • when the robot device 100 determines that the input is a question (step S202; Yes), it generates an episode (step S203). For example, the robot device 100 generates an episode in which the input image and question that have been input are used as query information. For example, the robot apparatus 100 generates an episode including query information that is a combination of the input image and the question that have been input and the support information stored in the support information storage unit 141.
  • the robot apparatus 100 performs identification (step S204). For example, the robot apparatus 100 uses the episode to identify the support information including the correct answer corresponding to the query information. For example, the robot apparatus 100 determines the support information used to determine the response to the query information by identifying the support information including the correct answer corresponding to the query information. For example, the robot apparatus 100 determines the support information used for determining the response to the query information, and selects the support information (determined support information) as the information used for determining the response to the query information.
  • the robot device 100 outputs a response (step S205).
  • the robot apparatus 100 outputs a response using the decision support information.
  • the robot apparatus 100 outputs the correct answer included in the decision support information.
  • the robot apparatus 100 determines whether or not there is a user reaction (step S206). For example, the robot apparatus 100 determines whether there is a reaction of the user based on whether or not the input by the user is accepted. For example, the robot apparatus 100 determines that there is a reaction of the user when the utterance by the user is detected by the microphone. When it is not determined that the user has reacted (step S206; No), the robot apparatus 100 returns to step S206 and repeats the processing.
  • the robot apparatus 100 determines whether the user's reaction is affirmative (step S207). For example, the robot device 100 determines whether the user's reaction is positive or negative by analyzing the character information in which the user's reaction (voice information) is converted.
  • when the robot apparatus 100 determines that the user's reaction is affirmative (step S207; Yes), the output response is registered as a correct answer (step S208).
  • the output response (correct answer candidate) is set as a correct answer, and is registered as support information combined with the input image and the question included in the query information. That is, the robot apparatus 100 registers, as the support information, the combination of the input image, the question, and the correct answer that is the output response included in the query information.
  • when the robot device 100 determines that the user's reaction is negative (step S207; No), it requests the user for the correct answer (step S209). For example, when the robot device 100 determines that the reaction of the user is negative, the robot device 100 outputs a voice requesting the correct answer from the user, such as "Please tell me the correct answer". If the robot device 100 determines that the user's reaction is negative, the robot device 100 may wait until the user's next reaction is detected.
  • the robot apparatus 100 acquires the correct answer (step S210). For example, the robot apparatus 100 acquires the input by the user as the correct answer. For example, the robot apparatus 100 acquires the character information obtained by converting the user's utterance (voice information) as the correct answer.
  • the robot apparatus 100 registers the acquired correct answer (step S211).
  • the robot apparatus 100 registers support information that is a combination of the acquired correct answer, the input image and the question included in the query information. That is, the robot apparatus 100 registers the combination of the input image, the question, and the acquired correct answer included in the query information as support information.
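The dialog-driven registration flow of FIG. 11 (steps S201 to S211) could be sketched as the loop below; the speech I/O and the `is_question` / `is_positive` analyses stand in for the recognition components, which are not detailed in the text:

```python
# Conversational loop corresponding to FIG. 11 (S201-S211). The robot and store
# interfaces are illustrative placeholders.
def dialog_loop(robot, store):
    image = robot.capture_image()                      # input image acquired before S201
    while True:
        utterance = robot.listen()                     # S201: receive input
        if not robot.is_question(utterance):           # S202
            continue
        episode = store.sample_episode()               # S203: generate episode
        answer = robot.identify((image, utterance), episode)   # S204: identification
        robot.say(answer)                              # S205: output response
        reaction = robot.listen()                      # S206: wait for user reaction
        if robot.is_positive(reaction):                # S207
            store.register(image, utterance, answer)   # S208: register output response
        else:
            robot.say("Please tell me the correct answer.")     # S209
            correct = robot.listen()                   # S210: acquire correct answer
            store.register(image, utterance, correct)  # S211: register correct answer
```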
  • FIG. 12 shows a case where the robot apparatus 100A, which is an agent, asks a question to the user U1 and registers a combination of images, a question, and a correct answer (support information) through a dialogue with the user U1.
  • the robot device 100A acquires an image (step S31).
  • the robot apparatus 100A detects the image IM2 by capturing images of two ice creams with the camera. For example, an image is input to the robot device 100A through a camera attached to the robot device 100A.
  • the robot device 100A detects the presence or absence of a user (step S32).
  • the robot apparatus 100A detects whether a person exists around the robot apparatus 100A in order to ask the user a question. For example, the robot apparatus 100A recognizes (determines) whether or not there is a user who is a partner for the question and answer session. For example, the robot device 100A detects whether or not there is a user around by using a camera.
  • the robot device 100A detects the presence / absence of a user based on the image captured by the camera.
  • the robot device 100A may individually include a camera that captures an image used for a query and a camera that detects the presence or absence of a user.
  • the robot apparatus 100A is not limited to the camera, and may detect or recognize the user by various sensors as long as the presence or absence of the user can be detected.
  • the robot device 100A determines that the user is present in the surroundings because the user U1 is included in the image captured by the camera. In this case, for example, the robot apparatus 100A changes the mode to the question mode (see FIG. 14) because the user U1 is detected.
  • the robot device 100A generates a question (step S33). For example, the robot device 100A generates a question based on the image captured by the camera. In the example of FIG. 12, the robot device 100A generates a question related to the image IM2 based on the acquired image IM2. For example, the robot device 100A may estimate a target object included in the image IM2 and generate a question using a technique related to object recognition. The robot device 100A may generate a question according to the estimated target object. For example, the robot device 100A may generate a question based on the information stored in the question information storage unit 144 (see FIG. 13). In the example of FIG. 12, the robot apparatus 100A estimates that the target object included in the image IM2 is ice cream and generates the question QS2 “What if I touch it?”.
  • the robot device 100A may generate the question based on the question candidate information in which the name of the object stored in the question information storage unit 144 and the question candidates are associated with each other. For example, the robot device 100A may generate the question based on question candidate information in which the name of the object "ice cream" stored in the question information storage unit 144 is associated with questions such as "What if I touch it?" and "Is the price expensive?". For example, the robot device 100A may determine the question to be output based on a predetermined criterion among the question candidates associated with the name "ice cream" of the object stored in the question information storage unit 144. For example, the robot device 100A may determine a question candidate selected at random from among the question candidates as the question to be output.
  • the robot device 100A may count the number of times each question candidate has been output, and determine a question candidate with a small output count as the question to be output; a minimal sketch of such question selection follows this item. Note that the above is an example, and the robot apparatus 100A may generate the question by any technique as long as it can ask the user U1 a question.
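A minimal sketch of such question selection, assuming the question candidate information is a mapping from object names to candidate questions and that an output count is kept per candidate (both assumptions; the storage format is not specified):

```python
# Sketch of question generation from question candidate information.
# QUESTION_CANDIDATES mirrors the "ice cream" example in the text; the
# selection policies are the two mentioned above (random, least-output-count).
import random

QUESTION_CANDIDATES = {
    "ice cream": ["What if I touch it?", "Is the price expensive?", "What is the taste?"],
}
output_counts = {}   # (object_name, question) -> number of times output

def generate_question(object_name, policy="least_used"):
    candidates = QUESTION_CANDIDATES.get(object_name, [])
    if not candidates:
        return None
    if policy == "random":
        question = random.choice(candidates)
    else:  # pick the candidate output the fewest times so far
        question = min(candidates, key=lambda q: output_counts.get((object_name, q), 0))
    output_counts[(object_name, question)] = output_counts.get((object_name, question), 0) + 1
    return question
```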
  • the robot device 100A outputs the generated question (step S34).
  • the robot apparatus 100A outputs the question QS2, "What if I touch it?"
  • the user U1 provides the robot apparatus 100A with the correct answer to the question QS2 (step S35).
  • the user U1 inputs the correct answer to the question output by the robot apparatus 100A into the robot apparatus 100A.
  • the user U1 inputs the correct answer AS2 “cold” to the robot apparatus 100A for the question QS2 “What if I touch it?” Output by the robot apparatus 100A.
  • the robot apparatus 100A registers, as support information, a combination of the image, the question related to the image, and the correct answer corresponding to the question (step S36). For example, the robot apparatus 100A transitions to the support information registration mode, and registers a combination of an image, a question related to the image, and a correct answer corresponding to the question as support information.
  • the robot apparatus 100A registers the combination of the image IM2, the question QS2, and the correct answer AS2, as shown in the additional registration information RINF21, as the support information (support information SP2) identified by the support information ID "SP2". In this way, the robot apparatus 100A registers the three elements of the input image, the input question, and the correct answer as one set. For example, the robot device 100A stores the support information SP2 in the support information storage unit 141 (see FIG. 13).
  • the robot apparatus 100A itself outputs the question QS2 regarding the image IM2, and registers the correct answer AS2 provided by the user U1 as the correct answer corresponding to the image IM2 and the question QS2. In this way, the robot apparatus 100A asks a question to the user and asks for a response. As a result, the robot apparatus 100A can spontaneously acquire a new concept by outputting a question by itself without waiting for an input from the user. Therefore, the robot device 100A can enable an appropriate response to the question regarding the image.
  • FIG. 13 is a diagram illustrating a configuration example of a robot device 100A according to the second embodiment of the present disclosure.
  • the robot device 100A includes a communication unit 11, an input unit 12, an output unit 13, a storage unit 14A, a control unit 15A, a sensor unit 16A, and a drive unit 17.
  • the input unit 12 receives the input of the correct answer corresponding to the question from the user.
  • the output unit 13 outputs the question.
  • the output unit 13 outputs a question when the user is detected by the sensor unit 16A.
  • the output unit 13 outputs the question generated by the generation unit 153A.
  • the storage unit 14A is realized by, for example, a semiconductor memory device such as a RAM or a flash memory, or a storage device such as a hard disk or an optical disk.
  • the storage unit 14A includes a support information storage unit 141, a model information storage unit 142, a mode information storage unit 143A, and a question information storage unit 144.
  • the mode information storage unit 143A stores information regarding the mode of the robot device 100A.
  • FIG. 14 is a diagram illustrating an example of the mode information storage unit according to the second embodiment of the present disclosure.
  • FIG. 14 shows an example of the mode information storage unit 143A according to the second embodiment.
  • the mode information storage unit 143A has items such as “mode ID”, “mode”, and “flag”.
  • the mode identified by the mode ID "MO21" is the question mode.
  • the mode MO21 indicates that the robot apparatus 100A itself outputs a question.
  • the question information storage unit 144 stores various information regarding the question.
  • the question information storage unit 144 stores information used by the robot apparatus 100A itself to output a question.
  • the question information storage unit 144 stores question candidate information in which the name of the object and the question candidates are associated with each other. For example, the question information storage unit 144 stores question candidate information in which the name of the object "ice cream" is associated with questions such as "What if I touch it?", "Is the price expensive?", and "What is the taste?".
  • the control unit 15A is realized by, for example, a CPU, an MPU, or the like executing a program (for example, an information processing program according to the present disclosure) stored inside the robot apparatus 100A using a RAM or the like as a work area.
  • the control unit 15A is a controller, and may be realized by an integrated circuit such as an ASIC or FPGA.
  • the control unit 15A includes an acquisition unit 151, a determination unit 152, a generation unit 153A, a registration unit 154, a learning unit 155, and a determination unit 156, and realizes or executes the functions and actions of the information processing described below.
  • the internal configuration of the control unit 15A is not limited to the configuration shown in FIG. 13, and may be any other configuration as long as it is a configuration for performing information processing described later.
  • the acquisition unit 151 acquires the question output by the output unit 13.
  • the determination unit 152 recognizes (determines) whether or not a user who is a partner of the question and answer is in the vicinity.
  • the generation unit 153A generates a question. For example, the generation unit 153A generates a question related to the image based on the acquired image.
  • the generation unit 153A generates a question by estimating the target object included in the image by using a technique related to object recognition.
  • the generation unit 153A generates a question based on the information stored in the question information storage unit 144. In the example of FIG. 12, the generation unit 153A estimates that the target object included in the image IM2 is ice cream, and generates the question QS2 “What if I touch it?”.
  • the registration unit 154 registers the combination including the correct answer input by the user as support information.
  • the sensor unit 16A has a function as a detection unit that detects the presence or absence of a user.
  • the sensor unit 16A detects the presence or absence of a user. For example, the sensor unit 16A detects whether or not there is a user around by using the camera.
  • the sensor unit 16A detects the presence or absence of the user based on the image captured by the camera.
  • FIG. 15 is a flowchart showing the procedure of a correct answer registration process by a dialog with a user according to the second embodiment of the present disclosure.
  • the robot apparatus 100 detects a person (step S301). For example, the robot apparatus 100 detects the presence of the user by appropriately using various sensors such as a camera. For example, the robot apparatus 100 analyzes the image captured by the camera, and determines whether the user exists based on the analysis result. When the robot device 100 does not detect a person (step S301; No), the process returns to step S301 and repeats the processing.
  • when the robot device 100 detects a person (step S301; Yes), it outputs a question (step S302).
  • the robot apparatus 100 has already acquired an image (input image) to be a query before step S302.
  • the robot apparatus 100 captures and acquires an input image with a camera.
  • the robot apparatus 100 generates and outputs a question based on the input image that is input.
  • the robot apparatus 100 generates and outputs a question based on the question information stored in the question information storage unit 144 (see FIG. 13).
  • the robot apparatus 100 determines whether or not there is a user reaction (step S303). For example, the robot apparatus 100 determines whether there is a reaction of the user based on whether or not the input by the user is accepted. For example, the robot apparatus 100 determines that there is a reaction of the user when the utterance by the user is detected by the microphone. When it is not determined that the user has reacted (step S303; No), the robot apparatus 100 returns to step S303 and repeats the processing.
  • the robot apparatus 100 registers the input response as the correct answer (step S304).
  • the robot apparatus 100 registers support information that is a combination of an input image, an output question, and a correct answer input by the user.
  • FIG. 16 is a diagram showing a configuration example of an information processing system according to a modification of the present disclosure.
  • FIG. 17 is a diagram illustrating a configuration example of the information processing device according to the modification of the present disclosure.
  • the information processing system 1 includes a robot device 10 and an information processing device 100B.
  • the robot device 10 and the information processing device 100B are communicably connected to each other via a network N in a wired or wireless manner.
  • the information processing system 1 illustrated in FIG. 16 may include a plurality of robot devices 10 and a plurality of information processing devices 100B.
  • the information processing apparatus 100B may communicate with the robot apparatus 10 via the network N, and may perform model learning or instruct the robot apparatus 10 to respond, based on the information collected by the robot apparatus 10.
  • the robot device 10 detects an image with a camera and a user's utterance (question) with a microphone. Then, the robot apparatus 10 outputs, through the speaker, the response corresponding to the detected image and the user's question.
  • the robot device 10 may be any device as long as it can send and receive information to and from the information processing device 100B. For example, the robot device 10 may be a robot, such as an entertainment robot or a household robot, that interacts with a human (user). For example, the robot device 10 transmits the captured image and information collected by a dialogue with the user to the information processing device 100B.
  • the information processing apparatus 100B is an information processing apparatus that registers an image, a question related to the image, and a correct answer corresponding to the question as support information.
  • the information processing apparatus 100B may transmit information to the robot apparatus 10 and remotely control the robot apparatus 10 to realize the dialogue with the user by the robot apparatus 10. Then, the information processing apparatus 100B acquires various information by receiving the various information acquired by the robot apparatus 10 from the robot apparatus 10. In this way, the information processing apparatus 100B registers the support information based on the information collected through the dialogue with the user of the robot apparatus 10.
  • the information processing device 100B has a communication unit 11B, a storage unit 14B, and a control unit 15.
  • the communication unit 11B is connected to a network N (Internet or the like) by wire or wirelessly, and transmits / receives information to / from the robot apparatus 10 via the network N.
  • the storage unit 14B includes a support information storage unit 141 and a model information storage unit 142.
  • the information processing apparatus 100B does not have a sensor unit, a driving unit, or the like, and may not have a configuration for realizing the function as the robot device.
  • the information processing apparatus 100B may include an input unit (for example, a keyboard and a mouse) that receives various operations from an administrator who manages the information processing apparatus 100B, and a display unit (for example, a liquid crystal display) for displaying various information.
  • the robot devices 100 and 100A and the information processing device 100B may adjust the camera viewpoint in order to make the question and answer more efficient. This point will be described with reference to FIGS. 18 to 20. Although the case where the robot device 100 adjusts the camera viewpoint will be described below as an example, any device may be used as long as it can adjust the camera viewpoint.
  • for example, the information processing apparatus 100B used as a wearable terminal worn by the user may perform the camera viewpoint adjustment described later.
  • the information processing apparatus 100B may output a voice designating the moving direction or the direction to the user by using an output unit such as a speaker.
  • the wearable terminal worn by the user and the information processing apparatus 100B may be separate bodies.
  • the information processing apparatus 100B may receive information from the wearable terminal worn by the user, and may instruct the wearable terminal to adjust the camera viewpoint based on the received information.
  • the information processing apparatus 100B may transmit voice information designating a moving direction and a direction to the user to the wearable terminal and cause the wearable terminal to output the voice.
  • the information processing system 1 may include the information processing device 100B and the wearable terminal, and may not include the robot device 10.
  • the information processing apparatus 100B may instruct the user, by output through a speaker, a display, or the like, to take a camera position (camera viewpoint) that makes the question and answer more efficient.
  • FIG. 18 is a diagram illustrating an example of a camera viewpoint adjustment process according to the present disclosure.
  • the robot apparatus 100 estimates the peripheral space from the image captured by the camera, drives the drive unit 17, and takes a camera position more appropriate for communication.
  • User U1 inputs a question to robot device 100 (step S51).
  • the user U1 inputs the question QS51 “How many ice creams are there?” To the robot apparatus 100 by speaking “How many ice creams are there?”.
  • the robot apparatus 100 also acquires an image (step S52).
  • the robot device 100 detects the image IM50.
  • the robot apparatus 100 images the image IM50 including only a part (upper part) of the two ice creams before adjusting the camera viewpoint.
  • the robot apparatus 100 estimates the surrounding space (step S53).
  • the robot apparatus 100 estimates the peripheral space in the range included in the image IM50 using various techniques. For example, the robot apparatus 100 estimates the surrounding space in the vertical and horizontal directions of the range included in the image IM50. For example, the robot apparatus 100 estimates the surrounding space by appropriately using various conventional techniques.
  • the robot apparatus 100 uses various conventional techniques as appropriate to generate an image (hereinafter also referred to as an “estimation image”) corresponding to the space in order to estimate what is in the space above, below, left, and right.
  • the robot apparatus 100 uses the model for restoring the image (restoring model) to generate the estimation image.
  • the robot apparatus 100 generates an estimation image using a restoration model (network) that is learned to restore the entire object from an image showing a part of the object.
  • the robot apparatus 100 acquires the restoration model from the external device that has generated the restoration model.
  • the robot device 100 may train the restoration model (network) by intentionally cutting out a part of an image in which an object is clearly captured and learning to restore the cut-out part.
  • the robot apparatus 100 specifies a range in which an object is captured in the image IM50 by appropriately using various conventional techniques such as a technique related to object recognition.
  • the robot apparatus 100 specifies the range (lower range) in which the object is seen in the lower direction of the image IM50, and cuts out the lower range (lower image) of the image IM50.
  • the robot apparatus 100 uses the lower image of the image IM50 and the restoration model to generate a lower restored image ES51 that is an estimation image.
  • the robot apparatus 100 similarly processes the upward, rightward, and leftward directions of the image IM50 to generate an upper restored image, a right restored image, and a left restored image.
  • the robot apparatus 100 may generate the estimation image by any method as long as the estimation image can be generated.
  • the robot device 100 may generate the estimation images of the spaces above, below, to the left and to the right of the range included in the image IM50 based on the method disclosed in Non-Patent Document 3 described above.
  • the robot device 100 may generate an estimation image of the space above, below, to the left, and to the right of the range included in the image IM50 by using a technology related to a generative adversarial network (GAN).
  • the robot device 100 may generate estimation images of the spaces above, below, to the left, and right of the range included in the image IM50 based on the method disclosed in Non-Patent Document 4 described above. For example, the robot device 100 may generate an image for estimation of the space above, below, to the left, and to the right of the range included in the image IM50 by using a technology such as PixelRNN (Recurrent Neural Network).
  • the robot apparatus 100 uses the lower restored image ES51, the upper restored image, the right restored image, and the left restored image to estimate the surrounding space in the left, right, up, and down directions.
  • the robot apparatus 100 estimates the surrounding space using the information of the question QS51.
  • the robot apparatus 100 projects the estimation image and the given question in the common space, and sets the direction having the highest confidence in the answer identification result as the target of the viewpoint policy. In this way, the robot apparatus 100 can adjust the camera viewpoint in the direction in consideration of the suitability for answering the question by adding the question information.
  • the robot apparatus 100 projects the lower restored image ES51, the upper restored image, the right restored image, the left restored image, and the question QS51 into the common space, and determines, as the direction in which the camera viewpoint is directed, the direction corresponding to the estimation image having the largest confidence (reliability; hereinafter "score") of the identification result of the answer.
  • for example, the robot apparatus 100 may use the identification model to compare the integrated feature amount obtained by integrating the image feature amount of the lower restored image ES51 and the question feature amount of the question QS51 with the integrated feature amount of each piece of support information, and may calculate the score based on the comparison.
  • for example, the robot apparatus 100 may calculate the score based on the shortest distance between the integrated feature amount of the lower restored image ES51 and the question QS51 and the integrated feature amounts of the pieces of support information. For example, the score may have a larger value as the distance is shorter, and may be a value output by the identification model.
  • the robot apparatus 100 may set the direction in which the confidence of the object identification in the estimation image is highest as the target of the viewpoint policy. For example, the robot apparatus 100 may use only the lower restored image ES51, the upper restored image, the right restored image, and the left restored image to set the direction in which the score of the object identification of the image IM50 is the highest as the target of the viewpoint policy.
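One way to realize this scoring, sketched under the assumption that the score is derived from distances to the support set in the common space (here the maximum of a softmax over negative distances; the text only requires that a shorter distance give a larger score):

```python
# Illustrative sketch of scoring candidate camera directions: for each direction,
# project the restored estimation image and the question into the common space
# and score it by the identification confidence against the support set.
import torch

def direction_score(restored_image, question_tokens, support_feats,
                    image_net, question_net, projection_net):
    # restored_image: (1, 3, H, W); question_tokens: (1, T); support_feats: (S, D)
    feat = projection_net(image_net(restored_image), question_net(question_tokens))  # (1, D)
    dists = torch.cdist(feat, support_feats).squeeze(0)   # distance to each support item
    probs = torch.softmax(-dists, dim=0)                  # confidence over support items
    return probs.max().item()                             # identification confidence ("score")

def choose_direction(restored_images, question_tokens, support_feats, nets):
    # restored_images: e.g. {"down": ES51_tensor, "up": ..., "left": ..., "right": ...}
    scores = {d: direction_score(img, question_tokens, support_feats, *nets)
              for d, img in restored_images.items()}
    best = max(scores, key=scores.get)                    # direction to point the camera
    return best, scores
```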
  • the robot apparatus 100 determines the viewpoint policy (step S54).
  • the robot apparatus 100 determines to adjust the camera viewpoint in the downward direction corresponding to the lower restored image ES51.
  • the robot apparatus 100 generates the viewpoint adjustment information IS51, which is an instruction to change the camera direction to "down".
  • the robot apparatus 100 performs an operation of adjusting the camera viewpoint (step S55).
  • the robot apparatus 100 drives an actuator or the like so that the camera faces downward based on the viewpoint adjustment information IS51 that instructs the camera to face downward.
  • the robot apparatus 100 drives an actuator or the like so that the head faces downward.
  • the robot apparatus 100 captures the image IM51 including the two ice creams.
  • the robot apparatus 100 outputs the correct answer of the determined support information based on the image IM51 and the question QS51 (step S56).
  • the robot apparatus 100 outputs the correct answer AS51 "two".
  • the robot apparatus 100 realizes the adjustment of the camera viewpoint by inputting an image, estimating the surrounding space, determining the viewpoint policy, and performing an operation based on the determination.
  • even if an appropriate image cannot be acquired at first, the robot apparatus 100 can acquire an appropriate image by adjusting the camera viewpoint according to the input image. Thereby, the robot apparatus 100 can make an appropriate response to the question about the image.
  • FIG. 19 is a block diagram showing an example of the configuration from the input to the output related to the adjustment of the camera viewpoint of the present disclosure.
  • the robot apparatus 100 extracts the feature amount (image feature amount) of the image detected by the camera from the query. Further, the robot apparatus 100 extracts the feature amount (question feature amount) of the voice of the question input through the microphone from the query. Then, the robot apparatus 100 estimates the peripheral space based on the extracted image feature amount. Thereby, the robot apparatus 100 generates an estimation image based on the extracted image feature amount. For example, the robot apparatus 100 estimates the peripheral space in the downward direction based on the extracted image feature amount, and generates the estimation image in the downward direction. Then, the robot apparatus 100 projects the feature amount of the estimation image and the question feature amount into the common space.
  • the robot device 100 generates an episode.
  • the robot apparatus 100 generates an episode by adding the support information.
  • the robot apparatus 100 uses the support information stored in the support information storage unit 141 to generate an episode.
  • the robot apparatus 100 examines the identification confidence based on the episode. For example, the robot apparatus 100 determines the direction having the largest score among the examined directions as a candidate for the moving direction. Then, if there is an unexamined direction, the robot apparatus 100 estimates the surrounding space for the unexamined direction, and repeats the examination of the identification confidence until the unexamined direction disappears. For example, the robot apparatus 100 estimates the peripheral space in the upward direction, the rightward direction, and the leftward direction in the same manner as the estimated downward direction, and repeats the examination of the identification confidence.
  • the robot apparatus 100 determines the camera movement policy based on the direction having the highest score among all the examined directions. For example, the robot apparatus 100 determines a moving direction candidate direction as a moving direction after examining all directions. The robot apparatus 100 determines, as the moving direction, the direction having the highest score among all the examined directions.
  • the robot apparatus 100 drives the actuator to adjust the camera viewpoint in the determined moving direction.
  • FIG. 20 is a flowchart showing the procedure of the camera viewpoint adjustment process of the present disclosure.
  • the robot device 100 receives an input (step S501).
  • the robot device 100 receives an input of an image or a question.
  • the robot apparatus 100 estimates the peripheral space in the specific direction (step S502). For example, the robot apparatus 100 estimates a peripheral space in an unexamined direction among the directions to be estimated. Then, the robot apparatus 100 examines the identification confidence in the specific direction (step S503). For example, the robot apparatus 100 considers whether the score in the specific direction is the maximum, and if the score in the specific direction is the maximum among the examined directions, the specific direction is determined as a candidate for the moving direction.
  • the robot apparatus 100 determines whether or not all directions have been considered (step S504). If all directions have not been considered (step S504; No), the robot apparatus 100 returns to step S502 and repeats the process until there are no unexamined directions.
  • when the robot device 100 has considered all the directions (step S504; Yes), it performs the process of step S505.
  • the robot apparatus 100 determines a moving direction candidate as a moving direction after examining all the directions. For example, the robot apparatus 100 may determine not to adjust the camera viewpoint when the scores in all directions are less than the predetermined threshold.
  • the robot apparatus 100 determines whether it operates (step S505). When it is determined that the robot device 100 does not operate (step S505; No), the robot device 100 performs identification without changing the camera viewpoint (step S508). For example, the robot apparatus 100 performs the identification without changing the camera viewpoint when the scores in all directions are less than the predetermined threshold value.
  • when the robot device 100 determines that it operates (step S505; Yes), it determines an operation policy (step S506). For example, the robot apparatus 100 determines a moving direction candidate as the moving direction. The robot apparatus 100 determines to point the camera viewpoint in the direction having the maximum score.
  • the robot apparatus 100 operates according to the determination (step S507).
  • the robot apparatus 100 drives the actuator and adjusts the camera viewpoint in the determined moving direction.
  • the robot apparatus 100 performs identification (step S508). For example, the robot apparatus 100 determines the support information used for the response based on the query including the image and the question and the support information.
  • the robot apparatus 100 outputs based on the identification result (step S509). For example, the robot apparatus 100 outputs the correct answer of the determined support information based on the query including the image and the question.
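The overall procedure of FIG. 20 (steps S501 to S509) could be sketched as follows; the actuator and sensing calls are placeholders, and the threshold for deciding whether to operate is an assumption:

```python
# Sketch of the FIG. 20 procedure: examine every direction, move the camera
# only if the best score clears a threshold, then identify and output.
def viewpoint_adjust_and_answer(robot, image, question, directions, threshold=0.5):
    best_dir, best_score = None, float("-inf")
    for d in directions:                                   # S502-S504: consider all directions
        restored = robot.estimate_surrounding(image, d)    # estimation image for direction d
        score = robot.identification_confidence(restored, question)   # S503
        if score > best_score:
            best_dir, best_score = d, score
    if best_score >= threshold:                            # S505: decide whether to operate
        robot.turn_camera(best_dir)                        # S506-S507: act on the policy
        image = robot.capture_image()
    answer = robot.identify_answer(image, question)        # S508: identification
    robot.say(answer)                                      # S509: output
    return answer
```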
  • each component of each device shown in the drawings is functionally conceptual, and does not necessarily have to be physically configured as shown. That is, the specific form of distribution and integration of each device is not limited to that shown in the drawings, and all or part of each device may be functionally or physically distributed or integrated in arbitrary units according to various loads or usage conditions.
  • FIG. 21 is a hardware configuration diagram showing an example of a computer 1000 that realizes the functions of an information processing apparatus such as the robot apparatuses 100 and 100A and the information processing apparatus 100B.
  • the computer 1000 has a CPU 1100, a RAM 1200, a ROM (Read Only Memory) 1300, an HDD (Hard Disk Drive) 1400, a communication interface 1500, and an input / output interface 1600.
  • Each unit of the computer 1000 is connected by a bus 1050.
  • the CPU 1100 operates based on a program stored in the ROM 1300 or the HDD 1400, and controls each part. For example, the CPU 1100 expands a program stored in the ROM 1300 or the HDD 1400 into the RAM 1200 and executes processing corresponding to various programs.
  • the ROM 1300 stores a boot program such as a BIOS (Basic Input Output System) executed by the CPU 1100 when the computer 1000 starts up, a program dependent on the hardware of the computer 1000, and the like.
  • the HDD 1400 is a computer-readable recording medium that non-temporarily records a program executed by the CPU 1100, data used by the program, and the like. Specifically, the HDD 1400 is a recording medium that records the information processing program according to the present disclosure, which is an example of the program data 1450.
  • the communication interface 1500 is an interface for connecting the computer 1000 to an external network 1550 (for example, the Internet).
  • the CPU 1100 receives data from another device or transmits the data generated by the CPU 1100 to another device via the communication interface 1500.
  • the input / output interface 1600 is an interface for connecting the input / output device 1650 and the computer 1000.
  • the CPU 1100 receives data from an input device such as a keyboard or a mouse via the input / output interface 1600.
  • the CPU 1100 also transmits data to an output device such as a display, a speaker, a printer, etc. via the input / output interface 1600.
  • the input / output interface 1600 may function as a media interface for reading a program or the like recorded in a predetermined recording medium (medium).
  • Examples of media include optical recording media such as DVD (Digital Versatile Disc) and PD (Phase change rewritable Disk), magneto-optical recording media such as MO (Magneto-Optical disk), tape media, magnetic recording media, and semiconductor memory.
  • the present technology may also be configured as below.
  • (1) An information processing apparatus including: an acquisition unit that acquires an image, a question related to the image, and a correct answer corresponding to the question; and a registration unit that registers the combination of the image, the question, and the correct answer acquired by the acquisition unit as support information used to determine a response to query information including one image and one question related to the one image.
  • (2) The information processing apparatus according to (1), wherein the acquisition unit acquires the image, the question about a concept corresponding to the image, and the correct answer about the concept, and the registration unit registers the combination of the image, the question regarding the concept, and the correct answer regarding the concept as the support information.
  • (3) The information processing apparatus according to (2), wherein the acquisition unit acquires the image, the question about an object contained in the image, and the correct answer about the object, and the registration unit registers the combination of the image, the question regarding the object, and the correct answer regarding the object as the support information.
  • (4) The information processing apparatus according to (3), wherein the acquisition unit acquires the image, the question regarding a property or state of an object included in the image, and the correct answer regarding the property or state, and the registration unit registers the combination of the image, the question regarding the property or state, and the correct answer regarding the property or state as the support information.
  • (5) The information processing apparatus according to (4), wherein the acquisition unit acquires the image, the question regarding the user's impression of the image, and the correct answer regarding the impression, and the registration unit registers the combination of the image, the question regarding the impression, and the correct answer regarding the impression as the support information.
  • (6) The information processing apparatus according to (5), wherein the acquisition unit acquires the image, the question regarding the amount, color, temperature, or hardness of an object contained in the image, and the correct answer regarding the amount, color, temperature, or hardness, and the registration unit registers the combination of the image, the question regarding the amount, color, temperature, or hardness, and the correct answer regarding the amount, color, temperature, or hardness as the support information.
  • (7) The information processing apparatus according to any one of (1) to (6), further including an input unit that accepts user input, wherein the acquisition unit acquires the question input by the user.
  • (8) The information processing apparatus according to (7), further including an output unit that outputs a correct answer candidate corresponding to the question, wherein the input unit accepts a reaction to the correct answer candidate by the user, and the registration unit registers the combination including the correct answer determined according to the reaction to the correct answer candidate as the support information.
  • (9) The information processing apparatus according to (8), wherein, when the reaction to the correct answer candidate is affirmative, the registration unit registers the combination including the correct answer candidate as the correct answer as the support information.
  • (10) The information processing apparatus according to (8), wherein, when the reaction to the correct answer candidate is negative, the input unit accepts another correct answer candidate different from the correct answer candidate from the user, and the registration unit registers the combination including the other correct answer candidate input by the user as the correct answer as the support information.
  • (11) The information processing apparatus according to any one of (1) to (6), further including an output unit that outputs the question, wherein the acquisition unit acquires the question output by the output unit.
  • (12) The information processing apparatus according to (11), further including an input unit that receives an input of the correct answer corresponding to the question by the user, wherein the registration unit registers the combination including the correct answer input by the user as the support information.
  • (13) The information processing apparatus according to (11) or (12), further including a detection unit that detects the presence or absence of a user, wherein the output unit outputs the question when the user is detected by the detection unit.
  • (14) The information processing apparatus according to any one of (1) to (13), further including an imaging unit that captures the image, wherein the acquisition unit acquires the image detected by the imaging unit.
  • (15) The information processing apparatus according to any one of (1) to (14), further including a determination unit that determines one correct answer corresponding to the one question of the query information based on the query information and the support information.
  • (16) The information processing apparatus according to (15), wherein the determination unit determines the one correct answer based on the one image and the one question included in the query information, and the image and the question included in the support information.
  • (17) The information processing apparatus according to (16), wherein the determination unit determines the one correct answer based on a comparison between the one image and the one question included in the query information and the image and the question included in the support information.
  • (18) The information processing apparatus according to any one of (15) to (17), wherein the acquisition unit acquires the query information, and the determination unit determines, based on the query information acquired by the acquisition unit and a plurality of pieces of support information, one of the plurality of pieces of support information to be used for the one correct answer.
  • (19) An information processing method including: acquiring an image, a question related to the image, and a correct answer corresponding to the question; and registering the acquired combination of the image, the question, and the correct answer as support information used to determine a response to query information including one image and one question related to the one image.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

An information processing device according to the present disclosure comprises: an acquisition unit (151) for acquiring an image, a question pertaining to the image, and a correct answer corresponding to the question; and a recording unit (154) for recording a combination of the image, question, and correct answer acquired by the acquisition unit (151) as support information used to determine a response to query information including one image and one question pertaining to the one image.

Description

Information processing apparatus, information processing method, and information processing program
 The present disclosure relates to an information processing device, an information processing method, and an information processing program.
 Information processing using machine learning is used in various technical fields, and with the development of deep learning, household agents and robots are able to learn and identify many types of objects. For example, systems have been implemented that answer questions about images.
 According to the conventional technology, a question about an image is answered using information about the image and the question.
 However, the conventional technology is not always able to make an appropriate response to a question about an image. For example, in the conventional technology, the number of identifiable answers (classes) is limited, and answers outside that set cannot be identified. As described above, in the related art, it is difficult to make an appropriate response to a question regarding an image unless the answer is within the range of correct answers prepared in advance.
 Therefore, the present disclosure proposes an information processing device, an information processing method, and an information processing program capable of appropriately responding to a question regarding an image.
 In order to solve the above problems, an information processing device according to an aspect of the present disclosure includes: an acquisition unit that acquires an image, a question related to the image, and a correct answer corresponding to the question; and a registration unit that registers the combination of the image, the question, and the correct answer acquired by the acquisition unit as support information used to determine a response to query information including one image and one question related to the one image.
A diagram illustrating an example of information processing according to the first embodiment of the present disclosure.
A diagram illustrating another example of information processing according to the first embodiment of the present disclosure.
A diagram illustrating a configuration example of the robot apparatus according to the first embodiment of the present disclosure.
A diagram illustrating an example of the support information storage unit according to the first embodiment of the present disclosure.
A diagram illustrating an example of the model information storage unit according to the first embodiment of the present disclosure.
A diagram illustrating an example of the mode information storage unit according to the first embodiment of the present disclosure.
A diagram illustrating an example of identification according to the present disclosure.
A diagram illustrating an example of a network structure for learning according to the present disclosure.
A block diagram illustrating an example of a configuration from input to output according to the present disclosure.
A flowchart illustrating a procedure of information processing according to the first embodiment of the present disclosure.
A flowchart illustrating a procedure of correct answer registration processing through a dialog with a user according to the first embodiment of the present disclosure.
A diagram illustrating an example of information processing according to the second embodiment of the present disclosure.
A diagram illustrating a configuration example of the robot apparatus according to the second embodiment of the present disclosure.
A diagram illustrating an example of the mode information storage unit according to the second embodiment of the present disclosure.
A flowchart illustrating a procedure of correct answer registration processing through a dialog with a user according to the second embodiment of the present disclosure.
A diagram illustrating a configuration example of an information processing system according to a modification of the present disclosure.
A diagram illustrating a configuration example of an information processing apparatus according to a modification of the present disclosure.
A diagram illustrating an example of camera viewpoint adjustment processing according to the present disclosure.
A block diagram illustrating an example of a configuration from input to output relating to camera viewpoint adjustment according to the present disclosure.
A flowchart illustrating a procedure of camera viewpoint adjustment processing according to the present disclosure.
A hardware configuration diagram illustrating an example of a computer that realizes the functions of the robot apparatus and the information processing apparatus.
 Hereinafter, embodiments of the present disclosure will be described in detail with reference to the drawings. The information processing apparatus, the information processing method, and the information processing program according to the present application are not limited to these embodiments. In each of the following embodiments, the same parts are denoted by the same reference numerals, and redundant description will be omitted.
 The present disclosure will be described in the following order.
  1. First Embodiment
   1-1. Overview of information processing according to the first embodiment of the present disclosure
   1-2. Configuration of the robot apparatus according to the first embodiment
   1-3. Information processing procedure according to the first embodiment
  2. Second Embodiment
   2-1. Overview of information processing according to the second embodiment of the present disclosure
   2-2. Configuration of the robot apparatus according to the second embodiment
   2-3. Information processing procedure according to the second embodiment
  3. Other Embodiments
   3-1. Other configuration examples
   3-2. Adjustment of the camera viewpoint
  4. Hardware Configuration
(1. First embodiment)
[1-1. Overview of information processing according to the first embodiment of the present disclosure]
 FIG. 1 is a diagram illustrating an example of information processing according to the first embodiment of the present disclosure. The information processing according to the first embodiment of the present disclosure is realized by the robot apparatus 100 shown in FIG. 1.
 The robot apparatus 100 is an information processing apparatus that executes the information processing according to the first embodiment. The robot apparatus 100 is an information processing apparatus that registers a combination of an image, a question related to the image, and a correct answer corresponding to the question through a dialog with a user. In the first embodiment, the robot apparatus 100 detects an image with a camera (corresponding to the sensor unit 16 in FIG. 3) and detects the user's utterance (question) with a microphone (corresponding to the input unit 12 in FIG. 3). The robot apparatus 100 then outputs, from a speaker (corresponding to the output unit 13 in FIG. 3), a response corresponding to the detected image and the user's question. The robot apparatus 100 may be any device capable of realizing the processing of the first embodiment; for example, it may be a robot that interacts with a human (user), such as what is called an entertainment robot or a household robot.
 FIG. 1 illustrates a case where the robot apparatus 100, acting as an agent, registers a combination of an image, a question, and a correct answer (hereinafter also referred to as "support information") through a dialog with the user U1. First, the robot apparatus 100 acquires an image (step S11). In the example of FIG. 1, the robot apparatus 100 detects the image IM1 by capturing two ice creams (also simply referred to as "ices") with a camera. For example, an image is input to the robot apparatus 100 through a camera attached to the robot apparatus 100. The image (visual information) acquired through the camera may take various forms; for example, the camera may detect RGB information as image information. The robot apparatus 100 may also acquire an image IM1 detected by an external device from that external device.
 Then, the user U1 designates the question registration mode as the mode of the robot apparatus 100 in order to register a question in the robot apparatus 100 (step S12). The user U1 makes an input to the robot apparatus 100 indicating that the question registration mode is designated. In the example of FIG. 1, the user U1 inputs the command MD1, registered in advance in the robot apparatus 100, that designates the question registration mode. Specifically, the user U1 designates the question registration mode by uttering "question", which indicates that a question is to be asked. The command MD1 is not limited to "question" and may be set as appropriate, such as "attention", "listen", or "hey XXX (robot name)". A microphone provided in the robot apparatus 100 receives the input by detecting the user's utterance. As a result, the robot apparatus 100 receives the input designating the question registration mode.
 For example, the robot apparatus 100 converts voice information into text (character information) using various voice recognition techniques. The robot apparatus 100 may also be able to acquire information from a voice recognition server that provides a voice recognition service. In this case, the robot apparatus 100 may transmit the voice information to the voice recognition server and acquire, from the server, the character information converted from the voice information. In the example of FIG. 1, the robot apparatus 100 is assumed to have a voice recognition function and to recognize the user's utterance and estimate the user who uttered it by appropriately using various conventional techniques; detailed description thereof is omitted as appropriate.
 For example, the robot apparatus 100 may store, in the storage unit 120 (see FIG. 3), association information in which a mode ID is associated with character information (keywords) for shifting to the mode corresponding to that mode ID. The robot apparatus 100 may then compare a character string corresponding to the user's utterance with the character information in the association information and shift to the mode of the matching character information. The mode designation is not limited to voice and may take various forms. For example, it may be performed by the user operating a button (which may be implemented in hardware or software) provided on the robot apparatus 100 itself for shifting to the question registration mode.
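 A minimal sketch of such keyword-to-mode association is shown below. The mode IDs, trigger keywords, and function names are illustrative assumptions and are not taken from the disclosure.

```python
# Hypothetical sketch: associating trigger keywords with mode IDs and
# switching modes based on a recognized utterance.
MODE_KEYWORDS = {
    "MO2": ["question", "attention", "listen"],  # question registration mode
    "MO3": ["wrong", "incorrect", "mistake"],    # correct answer registration mode
}

def select_mode(utterance_text: str, current_mode: str = "MO1") -> str:
    """Return the mode ID whose keyword appears in the utterance; otherwise keep the current mode."""
    text = utterance_text.lower()
    for mode_id, keywords in MODE_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return mode_id
    return current_mode
```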
 Then, the robot apparatus 100 changes its mode in response to the user's input (step S13). In the example of FIG. 1, the robot apparatus 100 changes the mode to the question registration mode (see FIG. 6) corresponding to the command MD1 "question" of the user U1.
 Then, the user U1 inputs a question to the robot apparatus 100 (step S14). In the example of FIG. 1, the user U1 inputs the question QS1 "What color is the right one?" to the robot apparatus 100 by uttering "What color is the right one?". The robot apparatus 100 may acquire the image IM1 and the question QS1 at any timing as long as it can acquire them and respond; for example, the robot apparatus 100 may acquire the image after the question is input, that is, step S11 may be performed after step S14.
 A microphone provided in the robot apparatus 100 receives the input by detecting the user's utterance. The robot apparatus 100 determines whether the registration of the question is completed by appropriately using various techniques. For example, after entering the question registration mode and after the voice question input from the user U1 has started, the robot apparatus 100 ends the question registration mode if no voice is input for an interval equal to or longer than a certain threshold. The robot apparatus 100 then converts the voice question input in the question registration mode into character information, and natural language processing is performed based on that character information. For example, the robot apparatus 100 may determine (estimate) the meaning and content of the question by analyzing the character information using natural language processing techniques such as morphological analysis as appropriate.
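 The interval-based end-of-input rule described above could be sketched as follows. The threshold value and the `listen_once` helper are assumptions for illustration; the disclosure only states that input ends after "a certain threshold" of silence.

```python
import time

SILENCE_THRESHOLD_SEC = 2.0  # assumed value; the disclosure only refers to "a certain threshold"

def collect_question(listen_once) -> str:
    """Accumulate recognized speech fragments until no input arrives for the threshold interval.

    `listen_once` is a hypothetical callable that listens for a short window and
    returns a recognized text fragment, or None if nothing was heard.
    """
    fragments = []
    last_input_time = time.monotonic()
    while time.monotonic() - last_input_time < SILENCE_THRESHOLD_SEC:
        fragment = listen_once()
        if fragment:
            fragments.append(fragment)
            last_input_time = time.monotonic()
    return " ".join(fragments)
```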
 Through the processing described above, the robot apparatus 100 accepts the question QS1 "What color is the right one?". In this way, the robot apparatus 100 accepts questions in various formats, not only questions answered by choosing among alternatives such as "yes", "no", or "A, B, or C" (closed questions), but also questions that can be answered in a free format (open questions). The color of the package of the ice cream on the right side in the image IM1 shown in FIG. 1 is assumed to be light purple.
 The robot apparatus 100 then responds to the question QS1 using various techniques. The robot apparatus 100 identifies the correct answer based on the input image IM1 and question QS1. The robot apparatus 100 uses the input image IM1 and question QS1 as query information (hereinafter also simply referred to as a "query") and identifies the correct answer corresponding to the query. The robot apparatus 100 then outputs the result to the user through a speaker, a monitor, or the like. In the example of FIG. 1, the robot apparatus 100 performs this using the support information determined by the identification processing as shown in FIG. 8. In this case, the robot apparatus 100 uses the correct answer of the support information determined by the identification processing as the response.
 The robot apparatus 100 outputs the correct answer of the support information determined by the identification processing as a correct answer candidate (step S15). Here, in the example of FIG. 1, it is assumed that no support information whose correct answer is "light purple" has been registered in the robot apparatus 100. Therefore, the robot apparatus 100 outputs "red", which differs from the correct answer "light purple", as the correct answer candidate AA1.
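 One way to realize this kind of identification processing is to embed the query (image, question) pair and each registered support entry into a common feature space and return the correct answer of the most similar entry. The sketch below assumes pre-computed embedding vectors and is not the specific network of the disclosure.

```python
import numpy as np

def answer_candidate(query_image_vec, query_question_vec, support_set):
    """Return the correct answer of the support entry most similar to the query.

    `support_set` is a list of dicts holding pre-computed "image_vec" and
    "question_vec" embeddings together with the registered "answer"; the
    embedding functions themselves are assumed to exist elsewhere.
    """
    query = np.concatenate([query_image_vec, query_question_vec])
    best_answer, best_score = None, -np.inf
    for entry in support_set:
        key = np.concatenate([entry["image_vec"], entry["question_vec"]])
        score = float(np.dot(query, key) / (np.linalg.norm(query) * np.linalg.norm(key) + 1e-8))
        if score > best_score:
            best_answer, best_score = entry["answer"], score
    return best_answer
```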
 Note that the robot apparatus 100 may respond to the question QS1 of the user U1 using any appropriate technique as long as it can respond to the question. For example, the robot apparatus 100 may recognize an object and respond using techniques related to object recognition.
 Since the robot apparatus 100 cannot identify (determine) by itself whether the correct answer candidate it responded with should be registered as the correct answer, the user U1 provides the robot apparatus 100 with information regarding the correct answer. The user U1 inputs, to the robot apparatus 100, a reaction to the correct answer candidate with which the robot apparatus 100 responded (step S16). Because the correct answer candidate AA1 "red" output by the robot apparatus 100 does not correspond to the color of the package of the ice cream on the right side in the image IM1, the user U1 inputs a negative reaction to the robot apparatus 100.
 In the example of FIG. 1, the user U1 inputs a negative command NG1, registered in advance in the robot apparatus 100, that denies the correct answer candidate. Specifically, the user U1 designates the correct answer registration mode by uttering "wrong", which indicates that the correct answer candidate is not correct. The negative command NG1 is not limited to "wrong" and may be set as appropriate, such as "incorrect" or "mistake". A microphone provided in the robot apparatus 100 receives the input by detecting the user's utterance. The robot apparatus 100 thereby accepts the negative reaction of the user U1 as an input designating the correct answer registration mode. The robot apparatus 100 may also shift to the correct answer registration mode when the user operates a button provided on the robot apparatus 100 itself for shifting to the correct answer registration mode. In addition, when the robot apparatus 100 cannot determine whether the reaction of the user U1 is affirmative or negative, it may request a response from the user U1. In this case, the robot apparatus 100 may output a voice prompting the user U1 to choose affirmative or negative, for example, "Was that answer correct?". The robot apparatus 100 may then determine that the reaction is negative when the user U1 responds with "no", "wrong", or the like, and affirmative when the user U1 responds with "yes", "that's fine", or the like. The robot apparatus 100 may make this determination by comparing the user's response with negative list information, which is a list of negative reactions stored in the storage unit 120 (see FIG. 3), and affirmative list information, which is a list of affirmative reactions.
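 The comparison against the affirmative and negative reaction lists could look like the following sketch; the word lists and the fallback behavior are illustrative assumptions.

```python
# Hypothetical reaction lists corresponding to the affirmative/negative list information.
POSITIVE_REACTIONS = {"yes", "correct", "that's right", "good"}
NEGATIVE_REACTIONS = {"no", "wrong", "incorrect", "mistake"}

def classify_reaction(reaction_text: str) -> str:
    """Classify a recognized reaction as 'positive', 'negative', or 'unknown'."""
    text = reaction_text.strip().lower()
    if any(word in text for word in NEGATIVE_REACTIONS):
        return "negative"
    if any(word in text for word in POSITIVE_REACTIONS):
        return "positive"
    return "unknown"  # the apparatus may then ask the user to confirm
```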
 Then, the robot apparatus 100 changes the mode in response to the user's input (step S17). In the example of FIG. 1, the robot apparatus 100 changes the mode to the correct answer registration mode (see FIG. 6) corresponding to the negative command NG1 "wrong" of the user U1. In this way, when the negative command NG1 is input, the agent enters the correct answer registration mode and waits until an input of the correct answer from the user U1 is received.
 Then, the user U1 provides the correct answer to the robot apparatus 100 (step S18). In the example of FIG. 1, the user U1 inputs "light purple", which is the correct answer AS1 to the question QS1 "What color is the right one?", by uttering "light purple". When the correct answer is provided in this way, as in the question registration, the robot apparatus 100 accepts the input up to that point as the "correct answer" if there is no further input for an interval equal to or longer than the threshold after the user's utterance starts.
 Then, the robot apparatus 100 registers the combination of the image, the question related to the image, and the correct answer corresponding to the question as support information (step S19). For example, the robot apparatus 100 shifts to the support information registration mode and registers the combination of the image, the question related to the image, and the correct answer corresponding to the question as support information. In the example of FIG. 1, the robot apparatus 100 registers the combination of the image IM1, the question QS1, and the correct answer AS1, as shown in the additional registration information RINF1, as the support information identified by the support information ID "SP1" (support information SP1). In this way, the robot apparatus 100 registers the three elements of the input image, the input question, and the correct answer as one set. For example, the robot apparatus 100 stores the support information SP1 in the support information storage unit 141 (see FIG. 4).
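 The overall flow from step S15 to step S19, deciding which answer ends up in the registered set, might be sketched as follows. The function names are assumptions, and `classify_reaction` refers to the hedged reaction classifier sketched earlier.

```python
def register_from_dialog(image, question, candidate_answer, user_reaction,
                         ask_user_for_answer, register):
    """Register (image, question, correct answer) as one support-information set.

    If the user's reaction to the candidate is positive, the candidate itself is
    registered as the correct answer; if it is negative, the answer newly provided
    by the user is registered instead. `ask_user_for_answer` and `register` are
    hypothetical callables standing in for the dialog and the storage unit.
    """
    if classify_reaction(user_reaction) == "positive":
        correct_answer = candidate_answer           # e.g. FIG. 2, step S27
    else:
        correct_answer = ask_user_for_answer()      # e.g. "light purple" in FIG. 1
    register(image, question, correct_answer)       # the three elements as one set
```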
 As described above, the robot apparatus 100 registers, as the correct answer corresponding to the image IM1 and the question QS1, the correct answer AS1 provided by the user U1 rather than the correct answer candidate AA1 output by the robot apparatus 100 itself. In the example of FIG. 1, at the time when the correct answer candidate AA1 was output for the image IM1 and the question QS1, no support information including the correct answer "light purple", which is the correct answer AS1, had been registered in the robot apparatus 100. That is, at the time of outputting the correct answer candidate AA1 for the image IM1 and the question QS1, the robot apparatus 100 had not yet acquired the color-related concept of "light purple". Therefore, the robot apparatus 100 could not respond to the image IM1 and the question QS1 with the appropriate correct answer "light purple".
 Such a situation cannot be resolved by processing that defines a fixed set of identifiable answers (classes). For example, such a situation can also occur in VQA (Visual Question Answering), in which the feature amount of the image and the feature amount of the question are projected into a common space and the correct answer is identified from a limited pool of answers based on those features. That is, in VQA, the number of identifiable answers is limited (generally within the range of 1000 to 1500), so an answer that is not included in that set cannot be identified.
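 For contrast, a conventional fixed-pool VQA head of the kind described above can be sketched as follows: image and question features are fused and classified over a closed answer vocabulary, so any answer outside that vocabulary can never be produced. The feature dimensions and fusion scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FixedPoolVQAHead(nn.Module):
    """Toy VQA classifier over a closed answer vocabulary (e.g. roughly 1000-1500 answers)."""

    def __init__(self, image_dim=2048, question_dim=1024, hidden_dim=512, num_answers=1000):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, hidden_dim)        # project image features
        self.question_proj = nn.Linear(question_dim, hidden_dim)  # project question features
        self.classifier = nn.Linear(hidden_dim, num_answers)      # fixed answer pool

    def forward(self, image_feat, question_feat):
        fused = torch.tanh(self.image_proj(image_feat)) * torch.tanh(self.question_proj(question_feat))
        return self.classifier(fused)  # logits only over the predefined answers
```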
 On the other hand, by using the newly registered support information SP1 in the responses after step S19 in FIG. 1, the robot apparatus 100 becomes able to respond "light purple" to the input image and question. In this way, compared with VQA, the robot apparatus 100 has the advantage that it can also learn new answers, not only the 1000 to 1500 or so answers that typically appear with high frequency in the data set.
 More specifically, after step S19 in FIG. 1, the robot apparatus 100 can output the correct candidate "light purple" for the image IM1 and the question QS1. That is, by additionally registering the support information SP1, the robot apparatus 100 acquires the color-related concept of "light purple". In other words, the robot apparatus 100 can acquire a concept related to the property of the object, namely the ice cream on the right side of the image IM1. By continuing to acquire new concepts one after another through dialogs with the user in this way, the robot apparatus 100 can also acquire concepts that were unknown at the time the robot apparatus 100 was provided, for example concepts corresponding to new words. Therefore, the robot apparatus 100 can make an appropriate response to a question regarding an image.
 There are also learning models such as one-shot learning that, like the robot apparatus 100, can identify classes from a small number of samples. However, one-shot learning centers on identification based only on visual elements, that is, on images alone, and it is difficult for it to appropriately learn new concepts, particularly concepts such as the properties of an object, as the robot apparatus 100 does. In contrast, the robot apparatus 100 can acquire any concept corresponding to an image by using the combination of an image, a question related to the image, and a correct answer corresponding to the question as support information, and can therefore make appropriate responses to questions regarding images. Thus, compared with one-shot learning, the robot apparatus 100 has the advantage that it can learn different concepts even for the same object or the same image depending on the context given by the question, rather than relying on visual similarity alone.
 More specifically, in one-shot learning, learning is performed so that an image having the same label as the query image (the image to be identified) can be identified from among the identifiable candidate images. In the robot apparatus 100, on the other hand, a question is added to the image, and the apparatus learns to identify the image and question having the same "answer", taking into account not only the visual similarity but also the context given by the question. As a result, even for images of the same object (target object), the robot apparatus 100 can acquire attributes including various properties and relationships such as color, size, number, and purpose, depending on the context of the question. In this way, the robot apparatus 100 can acquire attributes including the properties of the target object and the relationships between the target object and other objects. That is, the robot apparatus 100 can acquire various concepts including the attributes of the target object in addition to the name of the target object itself. For example, the robot apparatus 100 can acquire concepts regarding the property or state of an object included in an image. By repeating questions and answers in this way, the robot apparatus 100 can register support information corresponding to concepts such as various attributes from the same image or the same object. As a result, the robot apparatus 100 can acquire knowledge corresponding to the concepts of the support information.
 Further, since the user can designate the correct answer, the robot apparatus 100 can acquire concepts regarding the user's impression of the object included in the image. The robot apparatus 100 can also acquire concepts regarding the amount, color, temperature, hardness, and the like of the object included in the image. The robot apparatus 100 can also acquire concepts (relative concepts) regarding the relationship between the object included in the image and other objects. For example, when a concept based on a relationship with another object, such as "large" or "small", is used as the correct answer, the robot apparatus 100 can also acquire the relational concept. In this way, even for images of the same object (target object), the robot apparatus 100 can acquire as many concepts as there are questions. The robot apparatus 100 can also learn new concepts regarding an image through the context obtained from questions and answers. That is, the robot apparatus 100 can learn new concepts through images and questions.
 It has also become possible to learn new objects from a small number of samples through the one-shot learning described above. However, as described above, an object (target object) in the real world has not only a name (label) but also various attributes. For example, the target object has various attributes such as properties and relationships with other objects. Furthermore, the attributes of an object are often affected by the situation or context. For example, every object (target object) has the attribute of size, and size is a relative concept. Such a relative concept can also be said to be a concept based on the relationship with other objects. Therefore, with existing models it was difficult to do anything other than learning to identify a single object. In addition, when a user directly makes an agent or robot learn a new object, it is assumed that the agent or robot is made to recognize samples of that object through a camera or a file. At that time, the user may also input information about the object verbally or in text, but in most cases that information is limited to a simple label or name of the object.
 In contrast, the robot apparatus 100 can acquire, through questions and answers with the user and not only from images, various concepts that differ depending on the context even for a single object, and can therefore make appropriate responses to questions regarding images. The robot apparatus 100 makes it possible to appropriately add information for making an entity other than a human that performs information processing (for example, a computer) learn new concepts. For example, the robot apparatus 100 can enhance its practicality by applying the methodology of one-shot learning to the VQA setting so that a wider range of concepts can be learned.
 The robot apparatus 100 may also register a correct answer candidate as the correct answer when the correct answer candidate with which the robot apparatus 100 itself responded is correct. This point will be described with reference to FIG. 2. FIG. 2 is a diagram showing another example of the information processing according to the first embodiment of the present disclosure. Steps S21 to S24 in FIG. 2 are the same as steps S11 to S14 in FIG. 1, and their description will be omitted.
 In the example of FIG. 2, the robot apparatus 100 that has received the question QS1 "What color is the right one?" responds to the question QS1 using various techniques. The robot apparatus 100 uses the input image IM1 and question QS1 as a query and identifies the correct answer corresponding to the query. The robot apparatus 100 outputs the correct answer of the support information determined by the identification processing as a correct answer candidate (step S25). Here, in the example of FIG. 2, it is assumed that support information whose correct answer is "light purple" has already been registered in the robot apparatus 100. Therefore, by determining the support information through the identification processing as shown in FIG. 8, the robot apparatus 100 uses the support information whose correct answer is "light purple" for the response to the question QS1. As a result, the robot apparatus 100 outputs the correct answer "light purple" as the correct answer candidate AA2.
 Then, the user U1 inputs, to the robot apparatus 100, a reaction to the correct answer candidate with which the robot apparatus 100 responded (step S26). Because the correct answer candidate AA2 "light purple" output by the robot apparatus 100 corresponds to the color of the package of the ice cream on the right side in the image IM1, the user U1 inputs a positive reaction to the robot apparatus 100.
 In the example of FIG. 2, the user U1 inputs an affirmative command OK1, registered in advance in the robot apparatus 100, that affirms the correct answer candidate. Specifically, the user U1 utters "correct", which indicates that the correct answer candidate is correct, thereby making an input that designates registration of the correct answer candidate with which the robot apparatus 100 responded. The affirmative command OK1 is not limited to "correct" and may be set as appropriate, such as "that's right" or "yes". For example, when the robot apparatus 100 receives the affirmative command OK1, it shifts to the support information registration mode.
 Then, the robot apparatus 100 registers the combination of the image, the question related to the image, and the correct answer corresponding to the question as support information (step S27). In the example of FIG. 2, the robot apparatus 100 registers the combination of the image IM1, the question QS1, and the correct answer candidate AA2, which is the correct answer, as shown in the additional registration information RINF2, as the support information identified by the support information ID "SP1" (support information SP1). In this way, the robot apparatus 100 registers the three elements of the input image, the input question, and the correct answer as one set. For example, the robot apparatus 100 stores the support information SP1 in the support information storage unit 141 (see FIG. 4).
 As described above, the robot apparatus 100 registers the correct answer candidate AA2 output by the robot apparatus 100 itself as the correct answer corresponding to the image IM1 and the question QS1. As a result, the robot apparatus 100 can make a more appropriate response to questions regarding images.
[1-2. Configuration of the robot apparatus according to the first embodiment]
 Next, the configuration of the robot apparatus 100, which is an example of the information processing apparatus that executes the information processing according to the first embodiment, will be described. FIG. 3 is a diagram showing a configuration example of the robot apparatus 100 according to the first embodiment of the present disclosure.
 As shown in FIG. 3, the robot apparatus 100 includes a communication unit 11, an input unit 12, an output unit 13, a storage unit 14, a control unit 15, a sensor unit 16, and a drive unit 17.
 The communication unit 11 is realized by, for example, a NIC (Network Interface Card) or a communication circuit. The communication unit 11 is connected to a network N (the Internet or the like) by wire or wirelessly and transmits and receives information to and from other devices and the like via the network N.
 The input unit 12 receives various operations input by the user. The input unit 12 accepts input by the user. The input unit 12 accepts the user's reaction to the correct answer candidate. When the user's reaction to the correct answer candidate is negative, the input unit 12 accepts another correct answer candidate, different from the correct answer candidate, input by the user. The input unit 12 has a function of detecting voice; for example, the input unit 12 has a microphone that detects voice. The input unit 12 accepts an utterance by the user as an input. The input unit 12 may also accept various operations from the user via a button or a touch panel provided on the robot apparatus 100.
 The output unit 13 outputs various kinds of information. The output unit 13 has a function of outputting voice; for example, the output unit 13 has a speaker that outputs voice. The output unit 13 outputs a correct answer candidate corresponding to the question. The output unit 13 outputs the question. The output unit 13 outputs the question when the user is detected by the sensor unit 16. The output unit 13 outputs a response using the support information (determined support information) determined by the determination unit 156. The output unit 13 performs voice output requesting the correct answer from the user. For example, the output unit 13 outputs the correct answer included in the determined support information. The output unit 13 may also output various kinds of information by displaying them on a display unit such as a display provided on the robot apparatus 100.
 The storage unit 14 is realized by, for example, a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory, or a storage device such as a hard disk or an optical disk. The storage unit 14 includes a support information storage unit 141, a model information storage unit 142, and a mode information storage unit 143.
 The support information storage unit 141 stores various kinds of information regarding support. FIG. 4 is a diagram illustrating an example of the support information storage unit according to the first embodiment of the present disclosure. FIG. 4 shows an example of the support information storage unit 141 according to the first embodiment. In the example of FIG. 4, the support information storage unit 141 has items such as "support information ID", "image", "question", and "correct answer".
 "Support information ID" indicates identification information for identifying the support information. "Image" indicates an image registered as support information. Although FIG. 4 shows an example in which conceptual information such as "IM11" and "IM12" is stored in "image", in practice, image information or moving image information, or a file path name indicating its storage location, is stored. "Question" indicates a question registered as support information. Although FIG. 4 shows an example in which conceptual information such as "QS11" and "QS12" is stored in "question", in practice, character information or voice information indicating the question, or a file path name indicating its storage location, is stored. "Correct answer" indicates a correct answer registered as support information. Although FIG. 4 shows an example in which conceptual information such as "AS11" and "AS12" is stored in "correct answer", in practice, character information or voice information indicating the correct answer, or a file path name indicating its storage location, is stored.
 In the example of FIG. 4, the support information identified by the support information ID "SP11" (support information SP11) is a combination of the image IM11, the question QS11, and the correct answer AS11. That is, the support information SP11 includes the information of the image IM11, the information of the question QS11, and the information of the correct answer AS11.
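 The table of FIG. 4 could be held in memory as a simple mapping from the support information ID to the registered triple; this layout, including the file paths, is an assumption for illustration.

```python
# Hypothetical in-memory layout mirroring the items of FIG. 4.
support_information_storage = {
    "SP11": {"image": "images/IM11.png", "question": "QS11", "answer": "AS11"},
    "SP12": {"image": "images/IM12.png", "question": "QS12", "answer": "AS12"},
}

def lookup(support_id: str) -> dict:
    """Return the registered image/question/answer combination for a support information ID."""
    return support_information_storage[support_id]
```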
 The model information storage unit 142 stores information regarding models. For example, the model information storage unit 142 stores model information (model data) learned (generated) by the learning processing. FIG. 5 is a diagram illustrating an example of the model information storage unit according to the first embodiment of the present disclosure. FIG. 5 shows an example of the model information storage unit 142 according to the first embodiment. In the example shown in FIG. 5, the model information storage unit 142 includes items such as "model ID" and "model data".
 "Model ID" indicates identification information for identifying the model. For example, the model identified by the model ID "M1" corresponds to the model M1 that identifies (determines) the support information corresponding to the query, as shown in FIG. 8. "Model data" indicates the data of the model. Although FIG. 5 shows an example in which conceptual information such as "MDT1" is stored in "model data", in practice, it includes various kinds of information constituting the model, such as information regarding the networks included in the model and functions.
 The mode information storage unit 143 stores information regarding the modes of the robot apparatus 100. FIG. 6 is a diagram illustrating an example of the mode information storage unit according to the first embodiment of the present disclosure. FIG. 6 shows an example of the mode information storage unit 143 according to the first embodiment. As shown in FIG. 6, the mode information storage unit 143 has items such as "mode ID", "mode", and "flag".
 "Mode ID" indicates information for identifying a mode. "Mode" indicates the content of the mode identified by the mode ID. "Flag" is a flag indicating which of the settable modes is selected, that is, the mode of the current state. In FIG. 6, it is assumed that the operation mode whose "flag" value is "1" is selected. That is, in FIG. 6, the mode "normal" identified by the mode ID "MO1" is selected, which indicates that the mode of the robot apparatus 100 in the current state is the normal mode. In the example of FIG. 6, the mode identified by the mode ID "MO2" (mode MO2) is the question registration mode. The flag of the mode MO2 is "0", which indicates that it is not the selected mode.
 Returning to FIG. 3, the description will be continued. The control unit 15 is realized by, for example, a CPU (Central Processing Unit), an MPU (Micro Processing Unit), or the like executing a program stored inside the robot apparatus 100 (for example, an information processing program according to the present disclosure) using a RAM (Random Access Memory) or the like as a work area. The control unit 15 is a controller, and may also be realized by an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).
 As illustrated in FIG. 3, the control unit 15 includes an acquisition unit 151, a determination unit 152, a generation unit 153, a registration unit 154, a learning unit 155, and a determination unit 156, and realizes or executes the functions and operations of the information processing described below. The internal configuration of the control unit 15 is not limited to the configuration shown in FIG. 3, and may be another configuration as long as it performs the information processing described later.
 The acquisition unit 151 acquires various kinds of information. The acquisition unit 151 acquires various kinds of information from an external information processing apparatus. The acquisition unit 151 acquires various kinds of information from the storage unit 14. The acquisition unit 151 acquires the input information accepted by the input unit 12. The acquisition unit 151 acquires the sensor information detected by the sensor unit 16. The acquisition unit 151 acquires an image, a question related to the image, and a correct answer corresponding to the question.
 The determination unit 152 makes various determinations. The determination unit 152 determines various kinds of information based on the information acquired by the acquisition unit 151. The determination unit 152 determines various kinds of information based on the information stored in the storage unit 14.
 The determination unit 152 determines whether the registration of the question by the user is completed by appropriately using various techniques. The determination unit 152 determines (estimates) the meaning and content of the question by analyzing the question (character information) using various techniques related to natural language processing.
 The determination unit 152 determines whether the input by the user is a question. The determination unit 152 determines whether the input by the user is a question by analyzing the character information converted from the user's voice information. The determination unit 152 determines whether or not there has been a reaction from the user. The determination unit 152 determines whether there has been a reaction from the user in response to the acceptance of the input by the user. When an utterance by the user is detected by the microphone, the determination unit 152 determines that there has been a reaction from the user.
 The determination unit 152 determines whether the user's reaction is affirmative or negative. For example, the determination unit 152 determines whether the user's reaction is affirmative or negative by analyzing the character information converted from the user's reaction (voice information).
 The generation unit 153 performs various kinds of generation. The generation unit 153 generates various types of information based on the information acquired by the acquisition unit 151. The generation unit 153 generates various types of information based on the information stored in the storage unit 14.
 The generation unit 153 generates an episode whose query information is the input image detected by the sensor unit 16 and the question input by the user. For example, the generation unit 153 generates an episode including query information, which is the combination of the input image and the input question, and the support information stored in the support information storage unit 141. For example, the generation unit 153 generates an episode that includes the group of support information stored in the support information storage unit 141 as information for determining a response to the query information.
 The registration unit 154 performs various kinds of registration. The registration unit 154 registers various types of information based on the information acquired by the acquisition unit 151. The registration unit 154 registers the information acquired by the acquisition unit 151 in the storage unit 14. The registration unit 154 functions as an image registration unit that registers images. The registration unit 154 functions as a question registration unit that registers questions. The registration unit 154 functions as a correct answer registration unit that registers correct answers.
 The registration unit 154 registers the combination of the image, the question, and the correct answer acquired by the acquisition unit 151 as support information used for determining a response to query information including one image and one question related to that image. The registration unit 154 registers the three elements of the input image, the input question, and the correct answer as one set. In the example of FIG. 1, the registration unit 154 registers the combination of the image IM1, the question QS1, and the correct answer AS1 as the support information SP1. In the example of FIG. 2, the registration unit 154 registers the combination of the image IM1, the question QS1, and the correct answer candidate AA2, which is the correct answer, as the support information SP1.
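 The following is a minimal Python sketch of how such an image/question/correct-answer set might be held in memory; the names SupportInfo and SupportStore, and the use of NumPy arrays for images, are illustrative assumptions and do not appear in the present disclosure.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np


@dataclass
class SupportInfo:
    """One registered set: input image, input question, and correct answer."""
    support_id: str
    image: np.ndarray   # pixels captured by the camera
    question: str       # character information converted from the user's speech
    answer: str         # correct answer confirmed by the user


@dataclass
class SupportStore:
    """Stands in for the support information storage unit 141."""
    items: List[SupportInfo] = field(default_factory=list)

    def register(self, image: np.ndarray, question: str, answer: str) -> SupportInfo:
        info = SupportInfo(f"SP{len(self.items) + 1}", image, question, answer)
        self.items.append(info)
        return info
```

 For example, `store.register(image_im2, "What if I touch it?", "cold")` would add one such set in the manner of the support information SP2 described later.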
 The learning unit 155 performs various kinds of learning. The learning unit 155 learns various types of information based on the information acquired by the acquisition unit 151. The learning unit 155 learns various types of information based on the information stored in the storage unit 14. The learning unit 155 learns (generates) a model. The learning unit 155 learns (generates) a model based on the information acquired by the acquisition unit 151. The learning unit 155 learns (generates) a model based on the information stored in the storage unit 14.
 The learning unit 155 learns the model using various techniques related to machine learning. For example, the learning unit 155 learns a model having a network structure as shown in FIG. 8. The learning unit 155 learns a model that identifies the support information corresponding to query information including an image and a question. For example, the learning unit 155 learns the model M1, which identifies the support information corresponding to query information including an image and a question. The learning unit 155 may generate the model by performing learning processing with episodes as learning sets. The learning unit 155 may generate the model by performing learning processing with each of the episodes EP1 to EP3 shown in FIG. 7 as a learning set.
 The learning unit 155 may generate the model by performing learning processing based on various learning methods. The learning unit 155 may generate the model by performing learning processing based on a method related to one-shot learning. The learning unit 155 may generate the model M1 by performing learning processing based on a method related to one-shot learning. Note that the above is merely an example, and the learning unit 155 may generate the model by any learning method as long as it can generate a model that identifies the support information corresponding to query information including an image and a question.
 The decision unit 156 makes various decisions. The decision unit 156 decides various types of information based on the information acquired by the acquisition unit 151. The decision unit 156 decides various types of information based on the information stored in the storage unit 14. The decision unit 156 may also perform various estimations. The decision unit 156 may estimate the surrounding space as shown in FIG. 18.
 The decision unit 156 decides one correct answer corresponding to the one question of the query information based on the query information and the support information. The decision unit 156 decides the one correct answer based on the one image and the one question included in the query information and the images and questions included in the support information.
 The decision unit 156 identifies the support information including the correct answer corresponding to the query information by using an episode that includes the query information and the support information. The decision unit 156 identifies the support information including the correct answer corresponding to the query information, and decides the identified support information as correct answer support information including the correct answer corresponding to the query information. In the example of FIG. 1, the decision unit 156 uses the input image IM1 and the question QS1 as query information and identifies the correct answer corresponding to that query information.
 The decision unit 156 decides the mode. The decision unit 156 changes the mode based on the decided mode. The decision unit 156 compares the character string corresponding to the user's utterance with the character information of the association information and transitions to the mode of the corresponding character information. The decision unit 156 changes the mode based on the input by the user. In the example of FIG. 1, the decision unit 156 changes the mode to the question registration mode corresponding to the command MD1 "question" of the user U1. The decision unit 156 changes the mode to the correct answer registration mode corresponding to the negative command NG1 "no" of the user U1. For example, when the affirmative command OK1 is received, the decision unit 156 transitions to the support information registration mode.
 The sensor unit 16 detects predetermined information. The sensor unit 16 has a function as an image capturing unit that captures images. The sensor unit 16 has the function of an image sensor and detects image information. The sensor unit 16 functions as an image input unit that receives an image as an input. Note that the sensor unit 16 is not limited to the above and may include various sensors. The sensor unit 16 may include various sensors such as a position sensor, an acceleration sensor, a gyro sensor, a temperature sensor, a humidity sensor, an illuminance sensor, a pressure sensor, a proximity sensor, and sensors for acquiring biological information such as odor, sweat, heartbeat, pulse, and brain waves. The sensors that detect the above various types of information in the sensor unit 16 may be a common sensor or may be realized by different sensors.
 The drive unit 17 has a function of driving the physical configuration of the robot apparatus 100. The drive unit 17 has a function of driving the joints of the robot apparatus 100, such as the neck, hands, and feet. The drive unit 17 is, for example, an actuator. The drive unit 17 may have any configuration as long as the robot apparatus 100 can realize a desired operation. The drive unit 17 may have any configuration as long as it can realize driving of the joints of the robot apparatus 100, movement of its position, and the like. When the robot apparatus 100 has a moving mechanism such as caterpillar tracks or tires, the drive unit 17 drives the tracks, tires, or the like. The drive unit 17 changes the viewpoint of the camera provided on the head of the robot apparatus 100 by driving the neck joint of the robot apparatus 100. For example, the drive unit 17 changes the viewpoint of the camera provided on the head of the robot apparatus 100 by driving the neck joint of the robot apparatus 100 so that an image in the direction decided by the decision unit 156 is captured. The drive unit 17 may also change only the orientation or the imaging range of the camera. The drive unit 17 may change the viewpoint of the camera.
 Here, the identification processing and the learning processing in the robot apparatus 100 will be described with reference to FIGS. 7 and 8. FIG. 7 is a diagram showing an example of identification according to the present disclosure. In the example of FIG. 7, three episodes EP1 to EP3 are illustrated, but the episodes EP1 to EP3 are merely examples for explaining the identification processing and the learning processing, and the robot apparatus 100 may use various episodes. FIG. 8 is a diagram showing an example of a network structure for learning according to the present disclosure.
 In the example of FIG. 7, the robot apparatus 100 performs the identification processing on the episode EP1 including the query QINF1 and the support information SP11 to SP13. The query QINF1 includes an image IM15 showing a balloon and a question QS15 "What color is the balloon?". The support information SP11 includes an image IM11 showing two ice creams, a question QS11 "What color is the ice cream on the right?", and a correct answer AS11 "light purple". The support information SP12 includes an image IM12 showing a balloon, a question QS12 "What is this?", and a correct answer AS12 "balloon". The support information SP13 includes an image IM13 showing the sky and three balloons, a question QS13 "What color is the sky?", and a correct answer AS13 "blue".
 In this case, the robot apparatus 100 identifies, among the support information SP11 to SP13, the support information corresponding to the query QINF1. For example, the robot apparatus 100 identifies the support information corresponding to the query QINF1 using a model having the network structure shown in FIG. 8.
 As shown in FIG. 8, the robot apparatus 100 projects the feature amount of the image and the feature amount of the question included in the query and in each piece of support information onto a common space, and identifies the support information corresponding to the query by comparing their distances. The processing group PS1 in FIG. 8 corresponds to this identification processing. The robot apparatus 100 performs the processing corresponding to the processing group PS1, which includes the partial processing PT1, the partial processing PT2, the distance comparison, and the identification. For example, the robot apparatus 100 learns a model (identification model) that performs the processing group PS1 in FIG. 8. For example, the identification model is the model M1.
 The robot apparatus 100 performs the processing related to the query by the partial processing PT1. The robot apparatus 100 extracts the feature amount (for example, a vector) of the image (input image) included in the query. The robot apparatus 100 extracts the feature amount of the input image (hereinafter also referred to as the "image feature amount") by inputting the input image into a network that extracts image feature amounts (hereinafter also referred to as the "image feature extraction network"). For example, the robot apparatus 100 inputs the input image into the image feature extraction network and causes the image feature extraction network to output the image feature amount of the input image. For example, the robot apparatus 100 causes the image feature extraction network to output a vector indicating the image feature amount. In the example of the episode EP1 in FIG. 7, the robot apparatus 100 extracts the image feature amount of the image IM15 by inputting the image IM15 into the image feature extraction network.
 The robot apparatus 100 also extracts the feature amount of the question (input question) included in the query. The robot apparatus 100 extracts the feature amount of the input question (hereinafter also referred to as the "question feature amount") by inputting the input question into a network that extracts question feature amounts (hereinafter also referred to as the "question feature extraction network"). For example, the robot apparatus 100 inputs the input question into the question feature extraction network and causes the question feature extraction network to output the question feature amount of the input question. For example, the robot apparatus 100 causes the question feature extraction network to output a vector indicating the question feature amount. In the example of the episode EP1 in FIG. 7, the robot apparatus 100 extracts the question feature amount of the question QS15 by inputting the question QS15 into the question feature extraction network.
 Then, the robot apparatus 100 projects the image feature amount extracted from the image included in the query and the question feature amount extracted from the question included in the query onto a common space (for example, an N-dimensional space). The robot apparatus 100 inputs the image feature amount and the question feature amount into a network that projects the image feature amount and the question feature amount onto the common space (hereinafter also referred to as the "projection network"). For example, the robot apparatus 100 integrates the image feature amount and the question feature amount and projects them onto the common space. For example, by projecting onto the common space, the robot apparatus 100 causes the projection network to output a feature amount in which the image feature amount and the question feature amount are integrated (hereinafter also referred to as the "integrated feature amount"). For example, the projection network may output an integrated feature amount (vector) obtained by simply concatenating the image feature amount (vector) and the question feature amount (vector). In the example of the episode EP1 in FIG. 7, the robot apparatus 100 inputs the image feature amount of the image IM15 and the question feature amount of the question QS15 into the projection network and causes the projection network to output the integrated feature amount of the image IM15 and the question QS15 (hereinafter also referred to as the "integrated feature amount FT15").
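 The following PyTorch sketch illustrates one possible reading of this pipeline, with simple fully connected layers standing in for the image feature extraction network, the question feature extraction network, and the projection network; the layer sizes and the assumption that pre-extracted image and question vectors are supplied are illustrative and are not fixed by FIG. 8.

```python
import torch
import torch.nn as nn


class CommonSpaceEncoder(nn.Module):
    """Projects an (image feature, question feature) pair onto the common space."""

    def __init__(self, image_dim: int = 2048, question_dim: int = 300, common_dim: int = 128):
        super().__init__()
        # Image feature extraction network (a CNN backbone could be used instead).
        self.image_net = nn.Sequential(nn.Linear(image_dim, 512), nn.ReLU())
        # Question feature extraction network (an RNN or transformer could be used instead).
        self.question_net = nn.Sequential(nn.Linear(question_dim, 512), nn.ReLU())
        # Projection network: concatenates the two feature amounts and maps them
        # onto the common N-dimensional space, yielding the integrated feature amount.
        self.projection = nn.Linear(512 + 512, common_dim)

    def forward(self, image_feat: torch.Tensor, question_feat: torch.Tensor) -> torch.Tensor:
        img = self.image_net(image_feat)
        qst = self.question_net(question_feat)
        return self.projection(torch.cat([img, qst], dim=-1))
```

 Because the partial processing PT1 and the partial processing PT2 are described as sharing common networks, the same encoder instance can be applied to the query and to each piece of support information.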
 The robot apparatus 100 performs the processing related to the support information by the partial processing PT2. For example, the robot apparatus 100 performs the partial processing PT2 for each piece of support information. In the example of the episode EP1 in FIG. 7, the robot apparatus 100 performs the partial processing PT2 for each of the support information SP11 to SP13.
 The robot apparatus 100 extracts the feature amount (for example, a vector) of the image (input image) included in the support information. The robot apparatus 100 extracts the feature amount of the input image (image feature amount) by inputting the input image into the network that extracts image feature amounts (image feature extraction network). For example, the robot apparatus 100 inputs the input image into the image feature extraction network and causes the image feature extraction network to output the image feature amount of the input image. For example, the robot apparatus 100 causes the image feature extraction network to output a vector indicating the image feature amount. In the example of the episode EP1 in FIG. 7, the robot apparatus 100 extracts the image feature amount of the image IM11 by inputting the image IM11 of the support information SP11 into the image feature extraction network.
 The robot apparatus 100 also extracts the feature amount of the question (input question) included in the support information. The robot apparatus 100 extracts the feature amount of the input question (question feature amount) by inputting the input question into the network that extracts question feature amounts (question feature extraction network). For example, the robot apparatus 100 inputs the input question into the question feature extraction network and causes the question feature extraction network to output the question feature amount of the input question. For example, the robot apparatus 100 causes the question feature extraction network to output a vector indicating the question feature amount. In the example of the episode EP1 in FIG. 7, the robot apparatus 100 extracts the question feature amount of the question QS11 by inputting the question QS11 of the support information SP11 into the question feature extraction network.
 Then, the robot apparatus 100 projects the image feature amount extracted from the image included in the support information and the question feature amount extracted from the question included in the support information onto the common space (for example, an N-dimensional space). The robot apparatus 100 inputs the image feature amount and the question feature amount into the network that projects the image feature amount and the question feature amount onto the common space (projection network). For example, the robot apparatus 100 integrates the image feature amount and the question feature amount and projects them onto the common space. For example, by projecting onto the common space, the robot apparatus 100 causes the projection network to output an integrated feature amount in which the image feature amount and the question feature amount are integrated. In the example of the episode EP1 in FIG. 7, the robot apparatus 100 inputs the image feature amount of the image IM11 of the support information SP11 and the question feature amount of the question QS11 into the projection network and causes the projection network to output the integrated feature amount of the image IM11 and the question QS11 (hereinafter also referred to as the "integrated feature amount FT11").
 Similarly, the robot apparatus 100 inputs the image feature amount of the image IM12 of the support information SP12 and the question feature amount of the question QS12 into the projection network and causes the projection network to output the integrated feature amount of the image IM12 and the question QS12 (hereinafter also referred to as the "integrated feature amount FT12"). Likewise, the robot apparatus 100 inputs the image feature amount of the image IM13 of the support information SP13 and the question feature amount of the question QS13 into the projection network and causes the projection network to output the integrated feature amount of the image IM13 and the question QS13 (hereinafter also referred to as the "integrated feature amount FT13"). For example, common networks (the image feature extraction network, the question feature extraction network, and the projection network) are used for the partial processing PT1 and the partial processing PT2.
 Then, the robot apparatus 100 compares the distance between the query and each piece of support information based on the information projected onto the common space. In the example of the episode EP1 in FIG. 7, the robot apparatus 100 compares the distances between the query QINF1 and each of the support information SP11 to SP13 based on the information projected onto the common space. For example, the robot apparatus 100 compares the distances between the integrated feature amount FT15 of the query QINF1 and the integrated feature amounts FT11 to FT13 of the support information SP11 to SP13.
 Then, the robot apparatus 100 identifies the support information corresponding to the query based on the result of comparing the distances between the query and each piece of support information. For example, the robot apparatus 100 identifies the support information that is closest to the query as the support information corresponding to the query. For example, the robot apparatus 100 identifies the support information whose distance from the query is the smallest as the support information corresponding to the query. In the example of the episode EP1 in FIG. 7, the robot apparatus 100 identifies the support information whose integrated feature amount is closest to the integrated feature amount FT15 of the query QINF1 as the support information corresponding to the query QINF1. The robot apparatus 100 identifies the support information SP11, whose integrated feature amount FT11 is the closest to the integrated feature amount FT15 of the query QINF1, as the support information corresponding to the query QINF1.
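 A short sketch of this distance comparison, assuming the integrated feature amounts are already available as vectors; Euclidean distance is used here, although the disclosure does not fix a particular distance measure.

```python
from typing import List
import numpy as np


def identify_support(query_feat: np.ndarray, support_feats: List[np.ndarray]) -> int:
    """Returns the index of the support information closest to the query in the common space."""
    distances = [np.linalg.norm(query_feat - s) for s in support_feats]
    return int(np.argmin(distances))


# Episode EP1: the query integrated feature amount FT15 is compared with FT11 to FT13.
# identify_support(ft15, [ft11, ft12, ft13]) returning 0 corresponds to choosing SP11.
```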
 Thus, in the example of the episode EP1 in FIG. 7, the robot apparatus 100 identifies, among the support information SP11 to SP13, the support information SP11 as the support information corresponding to the query QINF1. The robot apparatus 100 may then decide that the correct answer corresponding to the query QINF1 is the correct answer AS11 of the support information SP11. Specifically, the robot apparatus 100 may decide that the correct answer corresponding to the query QINF1 is the correct answer AS15, which is the same "light purple" as the correct answer AS11. The robot apparatus 100 may then register the combination of the image IM15 and the question QS15 included in the query QINF1 and the correct answer AS15 as support information.
 When the correct answer AS15 of the query QINF1 has already been acquired, the robot apparatus 100 may perform learning processing based on the identification result. For example, the robot apparatus 100 may learn an identification model including the image feature extraction network, the question feature extraction network, and the projection network described above. For example, if the identification model shown in FIG. 8 identifies the support information corresponding to the query QINF1 as the support information SP12, the robot apparatus 100 may train the identification model so that the support information corresponding to the query QINF1 is identified as the support information SP11. Note that any learning method, such as backpropagation or stochastic gradient descent, can be adopted for such learning.
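 One way to realize such training is a matching-network-style episode loss: a softmax over negative distances in the common space with cross-entropy, optimized by stochastic gradient descent so that the query is pulled toward the support information holding the correct answer. The formulation below is an assumption consistent with the backpropagation/SGD remark above, not the only possible choice.

```python
import torch
import torch.nn.functional as F


def episode_loss(query_feat: torch.Tensor,
                 support_feats: torch.Tensor,
                 correct_index: int) -> torch.Tensor:
    """Cross-entropy over similarities so that the correct support becomes the nearest one.

    query_feat:    (D,)   integrated feature amount of the query
    support_feats: (K, D) integrated feature amounts of the K support items
    correct_index: index of the support information holding the correct answer
    """
    # Negative Euclidean distance serves as the similarity score.
    logits = -torch.cdist(query_feat.unsqueeze(0), support_feats).squeeze(0)
    target = torch.tensor([correct_index])
    return F.cross_entropy(logits.unsqueeze(0), target)


# Hypothetical training step with the encoder sketched earlier:
# optimizer = torch.optim.SGD(encoder.parameters(), lr=0.01)
# loss = episode_loss(ft_query, ft_supports, correct_index=0)
# loss.backward(); optimizer.step()
```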
 In the example of FIG. 7, the robot apparatus 100 also performs the identification processing on the episode EP2 including the query QINF2 and the support information SP21 to SP23. The query QINF2 includes an image IM25 showing two balloons and a question QS25 "How many balloons are there?". The support information SP21 includes an image IM21 showing two ice creams, a question QS21 "How many ice creams are there?", and a correct answer AS21 "two". The support information SP22 includes an image IM22 showing a balloon, a question QS22 "What color is this?", and a correct answer AS22 "blue". The support information SP23 includes an image IM23 showing the sky and three balloons, a question QS23 "How many balloons are there?", and a correct answer AS23 "three".
 In this case, the robot apparatus 100 identifies, among the support information SP21 to SP23, the support information corresponding to the query QINF2. For example, the robot apparatus 100 identifies the support information corresponding to the query QINF2 using the model having the network structure shown in FIG. 8. The robot apparatus 100 identifies the support information SP21, whose integrated feature amount is the closest to the integrated feature amount of the query QINF2, as the support information corresponding to the query QINF2.
 Thus, in the example of the episode EP2 in FIG. 7, the robot apparatus 100 identifies, among the support information SP21 to SP23, the support information SP21 as the support information corresponding to the query QINF2. The robot apparatus 100 may then decide that the correct answer corresponding to the query QINF2 is the correct answer AS21 of the support information SP21. Specifically, the robot apparatus 100 may decide that the correct answer corresponding to the query QINF2 is the correct answer AS25, which is the same "two" as the correct answer AS21. The robot apparatus 100 may then register the combination of the image IM25 and the question QS25 included in the query QINF2 and the correct answer AS25 as support information. In this way, the robot apparatus 100 can acquire the concept of numbers.
 When the correct answer AS25 of the query QINF2 has already been acquired, the robot apparatus 100 may perform learning processing based on the identification result. For example, if the identification model shown in FIG. 8 identifies the support information corresponding to the query QINF2 as the support information SP23, the robot apparatus 100 may train the identification model so that the support information corresponding to the query QINF2 is identified as the support information SP21.
 In the example of FIG. 7, the robot apparatus 100 also performs the identification processing on the episode EP3 including the query QINF3 and the support information SP31 to SP33. The query QINF3 includes an image IM35 showing ice and a question QS35 "What does it feel like to touch?". The support information SP31 includes an image IM31 showing two ice creams, a question QS31 "What does it feel like to touch?", and a correct answer AS31 "cold". The support information SP32 includes an image IM32 showing a balloon, a question QS32 "What shape is this?", and a correct answer AS32 "round". The support information SP33 includes an image IM33 showing the sky and three balloons, a question QS33 "What kind of atmosphere?", and a correct answer AS33 "fluffy".
 In this case, the robot apparatus 100 identifies, among the support information SP31 to SP33, the support information corresponding to the query QINF3. For example, the robot apparatus 100 identifies the support information corresponding to the query QINF3 using the model having the network structure shown in FIG. 8. The robot apparatus 100 identifies the support information SP31, whose integrated feature amount is the closest to the integrated feature amount of the query QINF3, as the support information corresponding to the query QINF3.
 Thus, in the example of the episode EP3 in FIG. 7, the robot apparatus 100 identifies, among the support information SP31 to SP33, the support information SP31 as the support information corresponding to the query QINF3. The robot apparatus 100 may then decide that the correct answer corresponding to the query QINF3 is the correct answer AS31 of the support information SP31. Specifically, the robot apparatus 100 may decide that the correct answer corresponding to the query QINF3 is the correct answer AS35, which is the same "cold" as the correct answer AS31. The robot apparatus 100 may then register the combination of the image IM35 and the question QS35 included in the query QINF3 and the correct answer AS35 as support information. In this way, the robot apparatus 100 can acquire concepts related to the user's impressions. Specifically, the robot apparatus 100 can acquire the concept of the temperature of an object. Note that the robot apparatus 100 can acquire not only temperature but also various other concepts related to the user's impressions, such as hardness. The robot apparatus 100 can acquire the concept of hardness by registering support information including correct answers such as "hard" and "soft".
 When the correct answer AS35 of the query QINF3 has already been acquired, the robot apparatus 100 may perform learning processing based on the identification result. For example, if the identification model shown in FIG. 8 identifies the support information corresponding to the query QINF3 as the support information SP33, the robot apparatus 100 may train the identification model so that the support information corresponding to the query QINF3 is identified as the support information SP31.
 When performing the learning processing, the robot apparatus 100 may select one piece of support information from the group of support information and perform learning using the selected support information as query information.
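 As a sketch of that idea, a training episode could be drawn from the registered support information group by holding one item out as the query and treating a remaining item with the same answer as the target; this leave-one-out sampling policy is an assumption added for illustration.

```python
import random


def sample_training_episode(support_group: list):
    """Holds one registered support item out as the query; the rest remain as supports.

    Returns (query_item, supports, correct_index), where correct_index points at a
    remaining support whose correct answer matches that of the held-out query, if any.
    """
    query = random.choice(support_group)
    supports = [s for s in support_group if s is not query]
    correct_index = next(
        (i for i, s in enumerate(supports) if s.answer == query.answer), None)
    return query, supports, correct_index
```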
 Here, the flow of processing from when the robot apparatus 100 receives the input of a query until it outputs a correct answer candidate will be described with reference to FIG. 9. FIG. 9 is a block diagram showing an example of the configuration from input to output according to the present disclosure.
 First, the robot apparatus 100 extracts the feature amount (image feature amount) of the image of the query detected by the camera. The robot apparatus 100 also extracts the feature amount (question feature amount) of the voice of the question of the query input through the microphone. The robot apparatus 100 then projects the image feature amount and the question feature amount onto the common space. With this, the robot apparatus 100 completes the preparation of the information corresponding to the query.
 Then, the robot apparatus 100 generates an episode. The robot apparatus 100 generates the episode by adding the support information. The robot apparatus 100 generates the episode using the support information stored in the support information storage unit 141. For example, the robot apparatus 100 may generate the episode using part of the support information stored in the support information storage unit 141, or may generate the episode using all of the support information stored in the support information storage unit 141.
 Then, the robot apparatus 100 performs the identification processing based on the episode. For example, the robot apparatus 100 identifies, among the support information of the episode, the support information corresponding to the query. The robot apparatus 100 decides the identified support information as the support information used for determining the response to the query information.
 Then, the robot apparatus 100 outputs, through the speaker, the correct answer of the decided support information as a correct answer candidate corresponding to the query.
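 Putting the block diagram of FIG. 9 together, the query-to-answer-candidate flow could be condensed as follows; `encoder`, the contents of `support_store`, and the `speak` callback are placeholders for the components described above, and the tuple layout of the stored items is an assumption.

```python
def answer_query(image_feat, question_feat, support_store, encoder, speak):
    """From an (image, question) query to a spoken correct answer candidate."""
    # 1. Project the query onto the common space (partial processing PT1).
    query_vec = encoder(image_feat, question_feat)

    # 2. Generate an episode by adding the registered support information (PT2).
    episode = [(encoder(img_feat, qst_feat), answer)
               for (img_feat, qst_feat, answer) in support_store]

    # 3. Identify the support information closest to the query in the common space.
    _, best_answer = min(
        episode, key=lambda pair: float(((query_vec - pair[0]) ** 2).sum()))

    # 4. Output the correct answer of the decided support information as the candidate.
    speak(best_answer)
    return best_answer
```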
[1-3. Information Processing Procedure According to First Embodiment]
 Next, the procedure of the information processing according to the first embodiment will be described with reference to FIGS. 10 and 11. First, the flow of the learning processing according to the first embodiment of the present disclosure will be described with reference to FIG. 10. FIG. 10 is a flowchart showing the procedure of the information processing according to the first embodiment of the present disclosure.
 As shown in FIG. 10, the robot apparatus 100 acquires an image (step S101). For example, the robot apparatus 100 acquires an image captured by the camera.
 The robot apparatus 100 acquires a question related to the image (step S102). For example, the robot apparatus 100 acquires the user's question input through the microphone.
 Then, the robot apparatus 100 acquires the correct answer corresponding to the question (step S103). For example, the robot apparatus 100 outputs a correct answer candidate corresponding to the question and acquires the correct answer according to the user's reaction to the correct answer candidate. For example, when the user's reaction to the correct answer candidate is affirmative, the robot apparatus 100 acquires the correct answer candidate as the correct answer. For example, when the user's reaction to the correct answer candidate is negative, the robot apparatus 100 acquires the correct answer provided by the user.
 Then, the robot apparatus 100 registers the combination of the acquired image, question, and correct answer as support information (step S104). For example, when the user's reaction to the correct answer candidate is affirmative, the robot apparatus 100 registers the combination of the image, the question, and the correct answer candidate, which is the correct answer, as support information. For example, when the user's reaction to the correct answer candidate is negative, the robot apparatus 100 registers the combination of the image, the question, and the correct answer provided by the user as support information. For example, the robot apparatus 100 stores the combination of the acquired image, question, and correct answer in the support information storage unit 141 in association with an unassigned support information ID.
 Next, the detailed flow of the registration processing based on a dialogue with the user according to the first embodiment of the present disclosure will be described with reference to FIG. 11. FIG. 11 is a flowchart showing the procedure of the correct answer registration processing through a dialogue with the user according to the first embodiment of the present disclosure.
 As shown in FIG. 11, the robot apparatus 100 receives an input (step S201). For example, the robot apparatus 100 receives a voice uttered by the user as an input. Note that the example of FIG. 11 mainly describes the processing of the dialogue with the user. Therefore, although not shown in FIG. 11, it is assumed that the robot apparatus 100 has already acquired an image (input image) before step S201. For example, the robot apparatus 100 captures and acquires the input image with the camera.
 Then, the robot apparatus 100 determines whether the input is a question (step S202). For example, the robot apparatus 100 determines whether the input is a question by analyzing character information obtained by converting the user's voice information. When determining that the input is not a question (step S202; No), the robot apparatus 100 returns to step S201 and repeats the processing.
 On the other hand, when determining that the input is a question (step S202; Yes), the robot apparatus 100 generates an episode (step S203). For example, the robot apparatus 100 generates an episode whose query information is the input image and the input question. For example, the robot apparatus 100 generates an episode including query information, which is the combination of the input image and the input question, and the support information stored in the support information storage unit 141.
 Then, the robot apparatus 100 performs identification (step S204). For example, the robot apparatus 100 uses the episode to identify the support information including the correct answer corresponding to the query information. For example, the robot apparatus 100 decides the support information used for determining the response to the query information by identifying the support information including the correct answer corresponding to the query information. For example, the robot apparatus 100 decides the support information used for determining the response to the query information and selects that support information (decided support information) as the information used for determining the response to the query information.
 Then, the robot apparatus 100 outputs a response (step S205). For example, the robot apparatus 100 outputs a response using the decided support information. For example, the robot apparatus 100 outputs the correct answer included in the decided support information.
 Then, the robot apparatus 100 determines whether or not there has been a reaction from the user (step S206). For example, the robot apparatus 100 determines whether there has been a reaction from the user based on whether an input by the user has been received. For example, the robot apparatus 100 determines that there has been a reaction from the user when an utterance by the user is detected by the microphone. When it is not determined that there has been a reaction from the user (step S206; No), the robot apparatus 100 returns to step S206 and repeats the processing.
 On the other hand, when it is determined that there has been a reaction from the user (step S206; Yes), the robot apparatus 100 determines whether the user's reaction is affirmative (step S207). For example, the robot apparatus 100 determines whether the user's reaction is affirmative or negative by analyzing character information obtained by converting the user's reaction (voice information).
 When determining that the user's reaction is affirmative (step S207; Yes), the robot apparatus 100 registers the output response as the correct answer (step S208). For example, when determining that the user's reaction is affirmative, the robot apparatus 100 takes the output response (correct answer candidate) as the correct answer and registers it as support information combined with the input image and the question included in the query information. That is, the robot apparatus 100 registers the combination of the input image and the question included in the query information and the correct answer, which is the output response, as support information.
 On the other hand, when determining that the user's reaction is negative (step S207; No), the robot apparatus 100 requests the correct answer from the user (step S209). For example, when determining that the user's reaction is negative, the robot apparatus 100 outputs a voice requesting the correct answer from the user, such as "Please tell me the correct answer". Note that, when determining that the user's reaction is negative, the robot apparatus 100 may wait until the user's next reaction is detected.
 Then, the robot apparatus 100 acquires the correct answer (step S210). For example, the robot apparatus 100 acquires the input by the user as the correct answer. For example, the robot apparatus 100 acquires character information obtained by converting the user's utterance (voice information) as the correct answer.
 Then, the robot apparatus 100 registers the acquired correct answer (step S211). For example, the robot apparatus 100 registers support information in which the acquired correct answer is combined with the input image and the question included in the query information. That is, the robot apparatus 100 registers the combination of the input image and the question included in the query information and the acquired correct answer as support information.
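 The dialogue flow of steps S201 to S211 could be summarized as the loop below; the helper callables (`listen`, `is_question`, `identify`, `speak`, `is_affirmative`) and the episode layout are hypothetical stand-ins for the processing described in this section.

```python
def dialogue_registration_loop(input_image, support_store, listen, is_question,
                               identify, speak, is_affirmative):
    """Steps S201 to S211: answer a question, then register the confirmed correct answer."""
    while True:
        utterance = listen()                        # S201: receive input
        if is_question(utterance):                  # S202: is the input a question?
            break

    episode = ((input_image, utterance), support_store.items)   # S203: generate episode
    decided = identify(episode)                     # S204: identify the decided support info
    speak(decided.answer)                           # S205: output the response

    reaction = listen()                             # S206: wait for the user's reaction
    if is_affirmative(reaction):                    # S207: affirmative or negative?
        answer = decided.answer                     # S208: output response is the correct answer
    else:
        speak("Please tell me the correct answer")  # S209: request the correct answer
        answer = listen()                           # S210: acquire the correct answer

    # S208 / S211: register image, question, and correct answer as one support set.
    support_store.register(input_image, utterance, answer)
    return answer
```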
(2. Second Embodiment)
[2-1. Outline of Information Processing According to Second Embodiment of the Present Disclosure]
 In the first embodiment described above, the case where the robot apparatus 100 acquires a question and a reaction to its correct answer from the user was shown, but an information processing device such as a robot apparatus may acquire questions and correct answers in various manners. For example, the robot apparatus itself may ask a question. Therefore, in the second embodiment, an example in which the robot apparatus 100A asks the user a question and acquires an answer from the user will be described. Note that descriptions of points similar to those of the robot apparatus 100 according to the first embodiment will be omitted as appropriate.
 A case where the robot apparatus 100A, which is an agent, asks the user U1 a question and registers a combination of an image, a question, and a correct answer (support information) through a dialogue with the user U1 will be described with reference to FIG. 12. First, the robot apparatus 100A acquires an image (step S31). In the example of FIG. 12, the robot apparatus 100A detects the image IM2 by capturing two ice creams with the camera. For example, the image is input to the robot apparatus 100A through the camera attached to the robot apparatus 100A.
 Then, the robot apparatus 100A detects the presence or absence of a user (step S32). The robot apparatus 100A detects whether a person is present around the robot apparatus 100A so that the robot apparatus 100A itself can ask the user a question. For example, the robot apparatus 100A recognizes (determines) whether a user who can be the partner of the question and answer is nearby. For example, the robot apparatus 100A detects through the camera whether there is a user nearby. The robot apparatus 100A detects the presence or absence of a user based on the image captured by the camera. Note that the robot apparatus 100A may separately include a camera that captures the image used for the query and a camera that detects the presence or absence of a user. The robot apparatus 100A is not limited to the camera and may detect or recognize the user with various sensors as long as the presence or absence of a user can be detected. In the example of FIG. 12, the robot apparatus 100A determines that a user is present nearby because the user U1 is included in the image captured by the camera. In this case, for example, because it has detected the user U1, the robot apparatus 100A changes the mode to the question mode (see FIG. 14).
 Then, the robot apparatus 100A generates a question (step S33). For example, the robot apparatus 100A generates a question based on the image captured by the camera. In the example of FIG. 12, the robot apparatus 100A generates a question related to the image IM2 based on the acquired image IM2. For example, the robot apparatus 100A may estimate the object included in the image IM2 using a technique related to object recognition and generate the question. The robot apparatus 100A may generate the question according to the estimated object. For example, the robot apparatus 100A may generate the question based on the information stored in the question information storage unit 144 (see FIG. 13). In the example of FIG. 12, the robot apparatus 100A estimates that the object included in the image IM2 is ice cream and generates the question QS2 "What if I touch it?".
 For example, the robot apparatus 100A may generate the question based on question candidate information, stored in the question information storage unit 144, in which names of objects are associated with question candidates. For example, the robot apparatus 100A may generate the question based on question candidate information in which the object name "ice cream" stored in the question information storage unit 144 is associated with questions such as "What if I touch it?", "Is it expensive?", and "How does it taste?". For example, the robot apparatus 100A may decide the question to be output from among the question candidates associated with the object name "ice cream" stored in the question information storage unit 144 based on a predetermined criterion. For example, the robot apparatus 100A may decide a randomly selected candidate among the question candidates as the question to be output. For example, the robot apparatus 100A may count the number of times each question candidate has been output and decide a question candidate with a small output count as the question to be output. Note that the above is merely an example, and the robot apparatus 100A may ask the user U1 a question using any technique as appropriate, as long as it can ask the user U1 a question.
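 A small sketch of picking a question from such question candidate information using the least-output-count criterion mentioned above; the contents of the candidate table and the in-memory counter are illustrative assumptions.

```python
from collections import Counter

# Illustrative question candidate information: object name -> question candidates.
QUESTION_CANDIDATES = {
    "ice cream": ["What if I touch it?", "Is it expensive?", "How does it taste?"],
}

output_counts: Counter = Counter()


def generate_question(object_name: str) -> str:
    """Picks the candidate that has been output the fewest times so far."""
    candidates = QUESTION_CANDIDATES.get(object_name, ["What is this?"])
    question = min(candidates, key=lambda q: output_counts[q])
    output_counts[question] += 1
    return question
```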
 そして、ロボット装置100Aは、生成した質問を出力する(ステップS34)。図12の例では、ロボット装置100Aは、「触ったらどう?」という質問QS2を出力する。 Then, the robot device 100A outputs the generated question (step S34). In the example of FIG. 12, the robot apparatus 100A outputs the question QS2, "What if I touch it?"
 そして、ユーザU1は、ロボット装置100Aに対して質問QS2に対する正解を提供する(ステップS35)。ユーザU1は、ロボット装置100Aが出力した質問に対する正解をロボット装置100Aに入力する。ユーザU1は、ロボット装置100Aが出力した質問QS2である「触ったらどう?」について、「冷たい」という正解AS2をロボット装置100Aに入力する。 Then, the user U1 provides the robot apparatus 100A with the correct answer to the question QS2 (step S35). The user U1 inputs, into the robot apparatus 100A, the correct answer to the question output by the robot apparatus 100A. The user U1 inputs the correct answer AS2 “cold” into the robot apparatus 100A for the question QS2 “What if I touch it?” output by the robot apparatus 100A.
 そして、ロボット装置100Aは、画像、画像に関連する質問、及び質問に対応する正解の組合せを、サポート情報として登録する(ステップS36)。例えば、ロボット装置100Aは、サポート情報登録モードに移行して、画像、画像に関連する質問、及び質問に対応する正解の組合せを、サポート情報として登録する。図12の例では、ロボット装置100Aは、追加登録用情報RINF21に示すような画像IM2、質問QS2及び正解AS2の組合せを、サポート情報ID「SP2」により識別されるサポート情報(サポート情報SP2)として登録する。このように、ロボット装置100Aは、入力画像、入力質問、正解の三つの要素を一つのセットとして登録する。例えば、ロボット装置100Aは、サポート情報SP2をサポート情報記憶部141(図13参照)に格納する。 Then, the robot apparatus 100A registers, as support information, the combination of the image, the question related to the image, and the correct answer corresponding to the question (step S36). For example, the robot apparatus 100A transitions to the support information registration mode and registers the combination of the image, the question related to the image, and the correct answer corresponding to the question as support information. In the example of FIG. 12, the robot apparatus 100A registers the combination of the image IM2, the question QS2, and the correct answer AS2 shown in the additional registration information RINF21 as the support information identified by the support information ID “SP2” (support information SP2). In this way, the robot apparatus 100A registers the three elements of the input image, the input question, and the correct answer as one set. For example, the robot apparatus 100A stores the support information SP2 in the support information storage unit 141 (see FIG. 13).
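The registration of one (input image, input question, correct answer) set as a single support record could be sketched as follows; this is an assumption-based illustration, and the SupportInfo / SupportStore names do not appear in the disclosure.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class SupportInfo:
        support_id: str  # e.g. "SP2"
        image: bytes     # raw image data such as IM2
        question: str    # e.g. "What if I touch it?" (QS2)
        answer: str      # e.g. "cold" (AS2)

    @dataclass
    class SupportStore:
        records: List[SupportInfo] = field(default_factory=list)

        def register(self, image: bytes, question: str, answer: str) -> SupportInfo:
            # One set of (input image, input question, correct answer) becomes one record.
            info = SupportInfo(f"SP{len(self.records) + 1}", image, question, answer)
            self.records.append(info)
            return info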
 上述したように、ロボット装置100Aは、ロボット装置100A自らが、画像IM2について質問QS2を出力し、画像IM2及び質問QS2に対応する正解として、ユーザU1により提供された正解AS2を登録する。このように、ロボット装置100Aは、自らユーザに質問を聞いて応答を求める。これにより、ロボット装置100Aは、ユーザからの入力を待つことなく、自ら質問を出力することにより、自発的に新たな概念を獲得することができる。したがって、ロボット装置100Aは、画像に関する質問に対して適切な応答を可能にすることができる。 As described above, the robot apparatus 100A itself outputs the question QS2 regarding the image IM2, and registers the correct answer AS2 provided by the user U1 as the correct answer corresponding to the image IM2 and the question QS2. In this way, the robot apparatus 100A itself asks the user a question and requests a response. As a result, the robot apparatus 100A can spontaneously acquire a new concept by outputting a question on its own without waiting for an input from the user. Therefore, the robot apparatus 100A can respond appropriately to a question regarding an image.
[2-2.第2の実施形態に係るロボット装置の構成]
 次に、第2の実施形態に係る情報処理を実行する情報処理装置の一例であるロボット装置100Aの構成について説明する。図13は、本開示の第2の実施形態に係るロボット装置100Aの構成例を示す図である。
[2-2. Configuration of Robot Device According to Second Embodiment]
Next, the configuration of the robot device 100A, which is an example of an information processing device that executes information processing according to the second embodiment, will be described. FIG. 13 is a diagram illustrating a configuration example of a robot device 100A according to the second embodiment of the present disclosure.
 図13に示すように、ロボット装置100Aは、通信部11と、入力部12と、出力部13と、記憶部14Aと、制御部15Aと、センサ部16Aと、駆動部17とを有する。 As shown in FIG. 13, the robot device 100A includes a communication unit 11, an input unit 12, an output unit 13, a storage unit 14A, a control unit 15A, a sensor unit 16A, and a drive unit 17.
 入力部12は、ユーザによる質問に対応する正解の入力を受け付ける。出力部13は、質問を出力する。出力部13は、センサ部16Aによりユーザが検知された場合、質問を出力する。出力部13は、生成部153Aにより生成された質問を出力する。 The input unit 12 receives the input of the correct answer corresponding to the question from the user. The output unit 13 outputs the question. The output unit 13 outputs a question when the user is detected by the sensor unit 16A. The output unit 13 outputs the question generated by the generation unit 153A.
 記憶部14Aは、例えば、RAM、フラッシュメモリ等の半導体メモリ素子、または、ハードディスク、光ディスク等の記憶装置によって実現される。記憶部14Aは、サポート情報記憶部141と、モデル情報記憶部142と、モード情報記憶部143Aと、質問情報記憶部144とを有する。 The storage unit 14A is realized by, for example, a semiconductor memory device such as a RAM or a flash memory, or a storage device such as a hard disk or an optical disk. The storage unit 14A includes a support information storage unit 141, a model information storage unit 142, a mode information storage unit 143A, and a question information storage unit 144.
 モード情報記憶部143Aは、ロボット装置100Aのモードに関する情報を記憶する。図14は、本開示の第2の実施形態に係るモード情報記憶部の一例を示す図である。図14に、第2の実施形態に係るモード情報記憶部143Aの一例を示す。図14に示すように、モード情報記憶部143Aは、「モードID」、「モード」、「フラグ」といった項目を有する。図14の例では、モードID「MO21」により識別されるモード(モードMO21)は、質問モードであることを示す。例えば、モードMO21は、ロボット装置100A自身が質問を出力するモードであることを示す。 The mode information storage unit 143A stores information regarding the mode of the robot apparatus 100A. FIG. 14 is a diagram illustrating an example of the mode information storage unit according to the second embodiment of the present disclosure. FIG. 14 shows an example of the mode information storage unit 143A according to the second embodiment. As shown in FIG. 14, the mode information storage unit 143A has items such as “mode ID”, “mode”, and “flag”. The example of FIG. 14 indicates that the mode identified by the mode ID “MO21” (mode MO21) is the question mode. For example, the mode MO21 indicates a mode in which the robot apparatus 100A itself outputs a question.
 質問情報記憶部144は、質問に関する各種情報を記憶する。質問情報記憶部144は、ロボット装置100A自身が質問を出力するために用いる情報が記憶される。質問情報記憶部144は、対象物の名称と質問候補とが対応付けられた質問候補情報を記憶する。例えば、質問情報記憶部144は、対象物の名称「アイスクリーム」と、「触ったらどう?」や「値段は高い?」や「味は?」等の質問とが対応付けられた質問候補情報を記憶する。 The question information storage unit 144 stores various kinds of information regarding questions. The question information storage unit 144 stores information used by the robot apparatus 100A itself to output a question. The question information storage unit 144 stores question candidate information in which the name of an object is associated with question candidates. For example, the question information storage unit 144 stores question candidate information in which the object name “ice cream” is associated with questions such as “What if I touch it?”, “Is the price expensive?”, and “What is the taste?”.
 図13に戻り、説明を続ける。制御部15Aは、例えば、CPUやMPU等によって、ロボット装置100A内部に記憶されたプログラム(例えば、本開示に係る情報処理プログラム)がRAM等を作業領域として実行されることにより実現される。また、制御部15Aは、コントローラであり、例えば、ASICやFPGA等の集積回路により実現されてもよい。 Returning to FIG. 13, the description will be continued. The control unit 15A is realized by, for example, a CPU, an MPU, or the like executing a program stored inside the robot apparatus 100A (for example, the information processing program according to the present disclosure) using a RAM or the like as a work area. The control unit 15A is a controller and may be realized by an integrated circuit such as an ASIC or an FPGA.
 図13に示すように、制御部15Aは、取得部151と、判定部152と、生成部153Aと、登録部154と、学習部155と、決定部156とを有し、以下に説明する情報処理の機能や作用を実現または実行する。なお、制御部15Aの内部構成は、図13に示した構成に限られず、後述する情報処理を行う構成であれば他の構成であってもよい。 As illustrated in FIG. 13, the control unit 15A includes an acquisition unit 151, a determination unit 152, a generation unit 153A, a registration unit 154, a learning unit 155, and a decision unit 156, and realizes or executes the functions and actions of the information processing described below. The internal configuration of the control unit 15A is not limited to the configuration shown in FIG. 13, and may be another configuration as long as it performs the information processing described later.
 取得部151は、出力部13により出力された質問を取得する。判定部152は、質疑応答の相手となるユーザが周囲にいるかどうかを認識(判定)する。生成部153Aは、質問を生成する。例えば、生成部153Aは、取得した画像に基づいて画像に関連する質問を生成する。生成部153Aは、物体認識に関する技術を用いて、画像に含まれる対象物を推定して、質問を生成する。生成部153Aは、質問情報記憶部144に記憶された情報に基づいて、質問を生成する。図12の例では、生成部153Aは、画像IM2に含まれる対象物をアイスクリームであると推定し「触ったらどう?」という質問QS2を生成する。登録部154は、ユーザにより入力された正解を含む組合せを、サポート情報として登録する。 The acquisition unit 151 acquires the question output by the output unit 13. The determination unit 152 recognizes (determines) whether or not a user who is to be the partner of the question and answer is in the vicinity. The generation unit 153A generates a question. For example, the generation unit 153A generates a question related to the image based on the acquired image. The generation unit 153A generates a question by estimating a target object included in the image using a technique related to object recognition. The generation unit 153A generates a question based on the information stored in the question information storage unit 144. In the example of FIG. 12, the generation unit 153A estimates that the target object included in the image IM2 is ice cream and generates the question QS2 “What if I touch it?”. The registration unit 154 registers the combination including the correct answer input by the user as support information.
 センサ部16Aは、ユーザの有無を検知する検知部としての機能を有する。センサ部16Aは、ユーザの有無を検知する。例えば、センサ部16Aは、カメラを通じて周りにユーザがいるかどうかを検知する。センサ部16Aは、カメラにより撮像された画像に基づいて、ユーザの有無を検知する。 The sensor unit 16A has a function as a detection unit that detects the presence or absence of a user. The sensor unit 16A detects the presence or absence of a user. For example, the sensor unit 16A detects whether or not there is a user around by using the camera. The sensor unit 16A detects the presence or absence of the user based on the image captured by the camera.
[2-3.第2の実施形態に係る情報処理の手順]
 次に、図15を用いて、第2の実施形態に係る情報処理の手順について説明する。図15を用いて、本開示の第2の実施形態に係るユーザとの対話に基づく登録処理の詳細な流れについて説明する。図15は、本開示の第2の実施形態に係るユーザとの対話による正解の登録処理の手順を示すフローチャートである。
[2-3. Information Processing Procedure According to Second Embodiment]
Next, a procedure of information processing according to the second embodiment will be described with reference to FIG. The detailed flow of the registration process based on the dialog with the user according to the second embodiment of the present disclosure will be described using FIG. 15. FIG. 15 is a flowchart showing the procedure of a correct answer registration process by a dialog with a user according to the second embodiment of the present disclosure.
 図15に示すように、ロボット装置100は、人を検知する(ステップS301)。例えば、ロボット装置100は、カメラ等の種々のセンサを適宜用いて、ユーザの存在を検知する。例えば、ロボット装置100は、カメラが撮像した画像を解析することにより、解析結果に基づいて、ユーザが存在するかを判定する。ロボット装置100は、人を検知していない場合(ステップS301;No)、ステップS301に戻って処理を繰り返す。 As shown in FIG. 15, the robot apparatus 100 detects a person (step S301). For example, the robot apparatus 100 detects the presence of the user by appropriately using various sensors such as a camera. For example, the robot apparatus 100 analyzes the image captured by the camera, and determines whether the user exists based on the analysis result. When the robot device 100 does not detect a person (step S301; No), the process returns to step S301 and repeats the processing.
 一方、ロボット装置100は、人を検知した場合(ステップS301;Yes)、質問を出力する(ステップS302)。なお、図15の例では、ユーザとの対話部分の処理を主として説明する。そのため、図15の例では、図示を省略するが、ロボット装置100は、ステップS302よりも前にクエリとなる画像(入力画像)を取得済みであるものとする。例えば、ロボット装置100は、カメラにより入力画像を撮像し、取得する。例えば、ロボット装置100は、入力された入力画像に基づいて質問を生成し、出力する。例えば、ロボット装置100は、質問情報記憶部144(図13参照)に記憶された質問情報に基づいて質問を生成し、出力する。 On the other hand, when the robot device 100 detects a person (step S301; Yes), it outputs a question (step S302). Note that in the example of FIG. 15, the processing of the dialog portion with the user will be mainly described. Therefore, in the example of FIG. 15, although not shown, it is assumed that the robot apparatus 100 has already acquired an image (input image) to be a query before step S302. For example, the robot apparatus 100 captures and acquires an input image with a camera. For example, the robot apparatus 100 generates and outputs a question based on the input image that is input. For example, the robot apparatus 100 generates and outputs a question based on the question information stored in the question information storage unit 144 (see FIG. 13).
 そして、ロボット装置100は、ユーザの反応が有ったかを否かを判定する(ステップS303)。例えば、ロボット装置100は、ユーザによる入力を受け付けたかどうかに基づいて、ユーザの反応が有ったかを判定する。例えば、ロボット装置100は、マイクによりユーザによる発話を検知した場合、ユーザの反応が有ったと判定する。ユーザの反応が有ったと判定していない場合(ステップS303;No)、ロボット装置100は、ステップS303に戻って処理を繰り返す。 Then, the robot apparatus 100 determines whether or not there is a user reaction (step S303). For example, the robot apparatus 100 determines whether there is a reaction of the user based on whether or not the input by the user is accepted. For example, the robot apparatus 100 determines that there is a reaction of the user when the utterance by the user is detected by the microphone. When it is not determined that the user has reacted (step S303; No), the robot apparatus 100 returns to step S303 and repeats the processing.
 一方、ユーザの反応が有ったと判定した場合(ステップS303;Yes)、ロボット装置100は、入力された応答を正解として登録する(ステップS304)。例えば、ロボット装置100は、入力画像と、出力した質問と、ユーザにより入力された正解とを組み合わせたサポート情報を登録する。 On the other hand, when it is determined that the user's reaction has been received (step S303; Yes), the robot apparatus 100 registers the input response as the correct answer (step S304). For example, the robot apparatus 100 registers support information that is a combination of an input image, an output question, and a correct answer input by the user.
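The flow of FIG. 15 (detect a person, output a question, wait for a reaction, register the answer) could be written, purely as a sketch over a hypothetical robot API, as follows.

    import time

    def registration_loop(robot) -> None:
        # "robot" is assumed to expose detect_person(), capture_image(),
        # generate_question(), say(), listen() and register(); these are
        # hypothetical methods used only for illustration.
        while not robot.detect_person():            # step S301
            time.sleep(0.5)
        image = robot.capture_image()               # input image acquired before step S302
        question = robot.generate_question(image)   # generate a question from the image
        robot.say(question)                         # step S302
        answer = None
        while answer is None:                       # step S303: wait for the user's reaction
            answer = robot.listen(timeout=1.0)
        robot.register(image, question, answer)     # step S304: register as support information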
(3.その他の実施形態)
 上述した各実施形態に係る処理は、上記各実施形態以外にも種々の異なる形態(変形例)にて実施されてよい。
(3. Other embodiments)
The processing according to each of the above-described embodiments may be implemented in various different modes (modifications) other than each of the above-described embodiments.
[3-1.その他の構成例]
 例えば、上述した例では、情報処理を行う情報処理装置がロボット装置100、100Aである例を示したが、情報処理装置とロボット装置とは別体であってもよい。この点について、図16及び図17を用いて説明する。図16は、本開示の変形例に係る情報処理システムの構成例を示す図である。図17は、本開示の変形例に係る情報処理装置の構成例を示す図である。
[3-1. Other configuration examples]
For example, in the above-described example, the information processing apparatus that performs information processing is the robot apparatuses 100 and 100A, but the information processing apparatus and the robot apparatus may be separate bodies. This point will be described with reference to FIGS. 16 and 17. FIG. 16 is a diagram showing a configuration example of an information processing system according to a modification of the present disclosure. FIG. 17 is a diagram illustrating a configuration example of the information processing device according to the modification of the present disclosure.
 図16に示すように、情報処理システム1は、ロボット装置10と、情報処理装置100Bとが含まれる。ロボット装置10及び情報処理装置100BはネットワークNを介して、有線又は無線により通信可能に接続される。なお、図16に示した情報処理システム1には、複数台のロボット装置10や、複数台の情報処理装置100Bが含まれてもよい。この場合、情報処理装置100Bは、ネットワークNを介してロボット装置10と通信し、ロボット装置10が収集した情報を基に、モデルの学習やロボット装置10の応答の指示を行なったりしてもよい。 As shown in FIG. 16, the information processing system 1 includes a robot device 10 and an information processing device 100B. The robot device 10 and the information processing device 100B are communicably connected via a network N in a wired or wireless manner. Note that the information processing system 1 illustrated in FIG. 16 may include a plurality of robot devices 10 and a plurality of information processing devices 100B. In this case, the information processing device 100B may communicate with the robot device 10 via the network N and, based on the information collected by the robot device 10, perform learning of the model and instruct the robot device 10 on responses.
 ロボット装置10は、カメラにより画像を検知し、マイクによりユーザの発話(質問)を検知する。そして、ロボット装置10は、検知した画像及びユーザの質問に対応する応答をスピーカーにより出力する。ロボット装置10は、情報処理装置100Bとの間で情報の送受信が可能であれば、どのような装置であってもよく、例えば、エンタテインメントロボットや家庭用ロボットと称されるような、人間(ユーザ)と対話するロボットであってもよい。例えば、ロボット装置10は、撮像した画像やユーザとの対話により収集した情報を情報処理装置100Bへ送信する。 The robot device 10 detects an image with a camera and detects the user's utterance (question) with a microphone. Then, the robot device 10 outputs, through a speaker, a response corresponding to the detected image and the user's question. The robot device 10 may be any device as long as it can transmit and receive information to and from the information processing device 100B, and may be, for example, a robot that interacts with a human (user), such as what is called an entertainment robot or a household robot. For example, the robot device 10 transmits captured images and information collected through dialogue with the user to the information processing device 100B.
 情報処理装置100Bは、画像、画像に関連する質問、及び質問に対応する正解をサポート情報として登録する情報処理装置である。例えば、情報処理装置100Bは、ロボット装置10に情報を送信し、ロボット装置10を遠隔操作することにより、ロボット装置10によるユーザとの対話を実現してもよい。そして、情報処理装置100Bは、ロボット装置10により取得された各種情報を、ロボット装置10から受信することにより、各種情報を取得する。このように、情報処理装置100Bは、ロボット装置10のユーザとの対話により収集した情報に基づいて、サポート情報を登録する。 The information processing device 100B is an information processing device that registers an image, a question related to the image, and a correct answer corresponding to the question as support information. For example, the information processing device 100B may transmit information to the robot device 10 and remotely operate the robot device 10, thereby realizing the dialogue between the robot device 10 and the user. Then, the information processing device 100B acquires various kinds of information acquired by the robot device 10 by receiving them from the robot device 10. In this way, the information processing device 100B registers support information based on the information collected through the dialogue between the robot device 10 and the user.
 図17に示すように、情報処理装置100Bは、通信部11Bと、記憶部14Bと、制御部15とを有する。通信部11Bは、ネットワークN(インターネット等)と有線又は無線で接続され、ネットワークNを介して、ロボット装置10との間で情報の送受信を行う。記憶部14Bは、サポート情報記憶部141と、モデル情報記憶部142とを有する。このように、情報処理装置100Bは、センサ部や駆動部等を有さず、ロボット装置としての機能を実現するための構成を有しなくてもよい。なお、情報処理装置100Bは、情報処理装置100Bを管理する管理者等から各種操作を受け付ける入力部(例えば、キーボードやマウス等)や、各種情報を表示するための表示部(例えば、液晶ディスプレイ等)を有してもよい。 As shown in FIG. 17, the information processing device 100B has a communication unit 11B, a storage unit 14B, and a control unit 15. The communication unit 11B is connected to a network N (the Internet or the like) in a wired or wireless manner, and transmits and receives information to and from the robot device 10 via the network N. The storage unit 14B includes a support information storage unit 141 and a model information storage unit 142. In this way, the information processing device 100B does not have a sensor unit, a drive unit, or the like, and does not need to have a configuration for realizing the functions of a robot device. Note that the information processing device 100B may have an input unit (for example, a keyboard or a mouse) that receives various operations from an administrator or the like who manages the information processing device 100B, and a display unit (for example, a liquid crystal display) for displaying various kinds of information.
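One conceivable message exchange between the robot device 10 and the information processing device 100B is sketched below; the endpoint URL, payload format, and encoding are assumptions made only for illustration and are not specified by the disclosure.

    import json
    import urllib.request

    # Hypothetical endpoint on the information processing device 100B.
    REGISTER_URL = "http://100b.example.local/support"

    def send_support_record(image_path: str, question: str, answer: str) -> int:
        # Robot device 10 side: forward one collected (image, question, answer)
        # set so that 100B can register it as support information.
        with open(image_path, "rb") as f:
            payload = {
                "image": f.read().hex(),  # naive encoding, for illustration only
                "question": question,
                "answer": answer,
            }
        req = urllib.request.Request(
            REGISTER_URL,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return resp.status  # e.g. 200 when 100B accepted the record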
[3-2.カメラ視点の調整]
 また、ロボット装置100、100Aや情報処理装置100Bは、質疑応答をより効率的にするために、カメラ視点を調整してもよい。この点について、図18~図20を用いて説明する。なお、以下ではロボット装置100を一例として、ロボット装置100がカメラ視点を調整する場合を一例として説明するが、カメラの視点を調節可能であれば、いずれの装置により行われてもよい。
[3-2. Adjustment of camera viewpoint]
In addition, the robot devices 100 and 100A and the information processing device 100B may adjust the camera viewpoint in order to make the question and answer more efficient. This point will be described with reference to FIGS. 18 to 20. Although the robot device 100 will be described below as an example and the case where the robot device 100 adjusts the camera viewpoint will be described as an example, any device may be used as long as the camera viewpoint can be adjusted.
 例えば、上述した実施形態においては、ロボット装置等のエージェントのカメラに物体(対象物)が鮮明に写っていることを前提としている。しかし、実環境では、カメラの視点が物体からずれたり、フォーカスが合わなかったりするなどの問題で、その物体に関した質問が与えられていても、答えに必要な判断自体ができない場合がある。例えば、非特許文献2に開示された視覚障害者等により撮像された画像と質問で構成されているデータセットでは、全体データの約3割が答えに必要な判断(識別)が不可能な画像になっている。非特許文献2に開示されたデータでは、約3割の画像に適切な正解を対応付けることが難しい。そのようなカメラ視点の問題は、新しい概念の学習のみならず、ユーザの認識やコミュニケーションなど、本来のエージェントの目的にも大きな支障となる。そのため、カメラに写っている画像に基づいて周辺空間を推定することによって、どのようにカメラ視点を動かせば必要な情報量が得られるか判定することが有用となる。 For example, the above-described embodiments are premised on an object (target object) being clearly captured by the camera of an agent such as a robot device. However, in a real environment, problems such as the camera viewpoint being shifted away from the object or the object being out of focus may make it impossible to perform the judgment necessary for an answer even when a question about that object is given. For example, in the data set disclosed in Non-Patent Document 2, which is composed of images captured by visually impaired persons and the like together with questions, about 30% of the entire data are images for which the judgment (identification) necessary for an answer is impossible. With the data disclosed in Non-Patent Document 2, it is difficult to associate appropriate correct answers with about 30% of the images. Such camera-viewpoint problems greatly hinder not only the learning of new concepts but also the agent's original purposes such as user recognition and communication. Therefore, it is useful to estimate the surrounding space based on the image captured by the camera and thereby determine how the camera viewpoint should be moved to obtain the necessary amount of information.
 例えば、ユーザが身に付けるウェアラブル端末としての情報処理装置100Bにより、後述するカメラ視点の調整が行われてもよい。この場合、情報処理装置100Bは、スピーカー等の出力部により、ユーザに移動方向や向きを指定する音声を出力してもよい。また、ユーザが身に付けるウェアラブル端末と情報処理装置100Bとは別体であってもよい。この場合、情報処理装置100Bは、ユーザが身に付けるウェアラブル端末から情報を受信し、受信した情報に基づいて、ウェアラブル端末に対してカメラ視点を調整する指示を行ってもよい。例えば、情報処理装置100Bは、ウェアラブル端末にユーザに移動方向や向きを指定する音声情報を送信し、ウェアラブル端末に音声を出力させてもよい。この場合、情報処理システム1には、情報処理装置100Bとウェアラブル端末とが含まれ、ロボット装置10は含まれなくてもよい。このように、情報処理装置100Bは、スピーカーやディスプレイ等による出力により、ユーザに指示することにより、質疑応答をより効率的にするためのカメラポジション(カメラ視点)をユーザに取らせてもよい。 For example, the camera viewpoint adjustment described later may be performed by the information processing device 100B serving as a wearable terminal worn by the user. In this case, the information processing device 100B may output, through an output unit such as a speaker, a voice that specifies a moving direction or an orientation to the user. Alternatively, the wearable terminal worn by the user and the information processing device 100B may be separate bodies. In this case, the information processing device 100B may receive information from the wearable terminal worn by the user and, based on the received information, instruct the wearable terminal to adjust the camera viewpoint. For example, the information processing device 100B may transmit to the wearable terminal voice information specifying a moving direction or an orientation for the user and cause the wearable terminal to output the voice. In this case, the information processing system 1 includes the information processing device 100B and the wearable terminal, and does not have to include the robot device 10. In this way, the information processing device 100B may instruct the user through output from a speaker, a display, or the like, thereby causing the user to take a camera position (camera viewpoint) that makes the question and answer more efficient.
 ここから、図18を用いて、ロボット装置100によるカメラ視点の調整を説明する。図18は、本開示のカメラ視点の調整処理の一例を示す図である。図18の例では、ロボット装置100は、質疑応答をより効率的にするために、カメラに写っている画像から周辺空間を推定し、駆動部17を駆動させ、よりコミュニケーションに適切なカメラポジションを取る。 From here, the adjustment of the camera viewpoint by the robot apparatus 100 will be described with reference to FIG. 18. FIG. 18 is a diagram illustrating an example of the camera viewpoint adjustment process of the present disclosure. In the example of FIG. 18, in order to make the question and answer more efficient, the robot apparatus 100 estimates the surrounding space from the image captured by the camera, drives the drive unit 17, and takes a camera position more appropriate for communication.
 ユーザU1は、ロボット装置100に質問を入力する(ステップS51)。図18の例では、ユーザU1は、「アイス何個ある?」と発話することにより、ロボット装置100に「アイス何個ある?」という質問QS51を入力する。 User U1 inputs a question to robot device 100 (step S51). In the example of FIG. 18, the user U1 inputs the question QS51 “How many ice creams are there?” To the robot apparatus 100 by speaking “How many ice creams are there?”.
 また、ロボット装置100は、画像を取得する(ステップS52)。図18の例では、ロボット装置100は、画像IM50を検知する。このように、ロボット装置100は、カメラ視点の調整前には、2つのアイスクリームの一部(上部)のみが含まれる画像IM50を撮像する。 The robot apparatus 100 also acquires an image (step S52). In the example of FIG. 18, the robot apparatus 100 detects the image IM50. In this way, before the camera viewpoint is adjusted, the robot apparatus 100 captures the image IM50 in which only a part (the upper part) of the two ice creams is included.
 そして、ロボット装置100は、周辺空間の推定を行う(ステップS53)。ロボット装置100は、種々の技術を用いて画像IM50に含まれる範囲の周辺空間の推定を行う。例えば、ロボット装置100は、画像IM50に含まれる範囲の上下左右方向の周辺空間の推定を行う。例えば、ロボット装置100は、種々の従来技術を適宜用いて、周辺空間の推定を行う。ロボット装置100は、種々の従来技術を適宜用いて、上下左右の空間に何があるかの推定を行うために、その空間に該当する画像(以下「推定用画像」ともいう)を生成する。 Then, the robot apparatus 100 estimates the surrounding space (step S53). The robot apparatus 100 estimates the peripheral space in the range included in the image IM50 using various techniques. For example, the robot apparatus 100 estimates the surrounding space in the vertical and horizontal directions of the range included in the image IM50. For example, the robot apparatus 100 estimates the surrounding space by appropriately using various conventional techniques. The robot apparatus 100 uses various conventional techniques as appropriate to generate an image (hereinafter also referred to as an “estimation image”) corresponding to the space in order to estimate what is in the space above, below, left, and right.
 例えば、ロボット装置100は、画像を復元するモデル(復元用モデル)を用いて、推定用画像を生成する。例えば、ロボット装置100は、物体の一部が写っている画像から物体全体を復元するように学習された復元用モデル(ネットワーク)を用いて推定用画像を生成する。例えば、ロボット装置100は、復元用モデルを生成した外部装置から復元用モデルを取得する。なお、ロボット装置100は、物体が鮮明に写っている画像の一部を意図的に切り出して、その切り出されたところを復元するような復元用モデル(ネットワーク)を学習してもよい。 For example, the robot apparatus 100 uses the model for restoring the image (restoring model) to generate the estimation image. For example, the robot apparatus 100 generates an estimation image using a restoration model (network) that is learned to restore the entire object from an image showing a part of the object. For example, the robot apparatus 100 acquires the restoration model from the external device that has generated the restoration model. The robot device 100 may learn a restoration model (network) that intentionally cuts out a part of an image in which an object is clearly captured and restores the cut out part.
 まず、ロボット装置100は、物体認識に関する技術等の種々の従来技術を適宜用いて、画像IM50において物体が写っている範囲を特定する。図18の例では、ロボット装置100は、画像IM50の下方向において物体が写っている範囲(下部範囲)を特定し、画像IM50の下部範囲(下部画像)を切り出す。そして、ロボット装置100は、画像IM50の下部画像と復元用モデルを用いて推定用画像である下部復元画像ES51を生成する。また、ロボット装置100は、画像IM50の上方向や右方向や左方向についても同様に処理し、上部復元画像や右部復元画像や左部復元画像を生成する。 First, the robot apparatus 100 specifies a range in which an object is captured in the image IM50 by appropriately using various conventional techniques such as a technique related to object recognition. In the example of FIG. 18, the robot apparatus 100 specifies the range (lower range) in which the object is seen in the lower direction of the image IM50, and cuts out the lower range (lower image) of the image IM50. Then, the robot apparatus 100 uses the lower image of the image IM50 and the restoration model to generate a lower restored image ES51 that is an estimation image. Further, the robot apparatus 100 similarly processes the upward, rightward, and leftward directions of the image IM50 to generate an upper restored image, a right restored image, and a left restored image.
 なお、ロボット装置100は、推定用画像が生成可能であれば、どのような手法により推定用画像を生成してもよい。例えば、ロボット装置100は、上述した非特許文献3に開示された手法に基づいて、画像IM50に含まれる範囲の上下左右の空間の推定用画像を生成してもよい。例えば、ロボット装置100は、敵対的生成ネットワーク(GAN:Generative Adversarial Networks)に関する技術を用いて、画像IM50に含まれる範囲の上下左右の空間の推定用画像を生成してもよい。 Note that the robot apparatus 100 may generate the estimation image by any method as long as the estimation image can be generated. For example, the robot device 100 may generate the estimation images of the spaces above, below, to the left and to the right of the range included in the image IM50 based on the method disclosed in Non-Patent Document 3 described above. For example, the robot device 100 may generate an estimation image of the space above, below, to the left, and to the right of the range included in the image IM50 by using a technology related to a hostile generation network (GAN: Generative Adversarial Networks).
 ロボット装置100は、上述した非特許文献4に開示された手法に基づいて、画像IM50に含まれる範囲の上下左右の空間の推定用画像を生成してもよい。例えば、ロボット装置100は、PixelRNN(Recurrent Neural Network)等の技術を用いて、画像IM50に含まれる範囲の上下左右の空間の推定用画像を生成してもよい。 The robot device 100 may generate estimation images of the spaces above, below, to the left, and right of the range included in the image IM50 based on the method disclosed in Non-Patent Document 4 described above. For example, the robot device 100 may generate an image for estimation of the space above, below, to the left, and to the right of the range included in the image IM50 by using a technology such as PixelRNN (Recurrent Neural Network).
 ロボット装置100は、下部復元画像ES51や上部復元画像や右部復元画像や左部復元画像を用いて、左右上下の方向で周辺空間の推定を行う。 The robot apparatus 100 uses the lower restored image ES51, the upper restored image, the right restored image, and the left restored image to estimate the surrounding space in the left, right, up, and down directions.
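A minimal sketch of the crop-and-restore step described above is shown below; restore_model stands for any learned restoration network (the restoration model, a GAN-based generator, PixelRNN, or similar) and is a placeholder rather than an API defined by this disclosure.

    import numpy as np

    def estimate_periphery(image: np.ndarray, restore_model, direction: str) -> np.ndarray:
        # Cut out the strip of the captured image nearest to the given direction and
        # let the restoration network guess what lies beyond the frame in that direction.
        h, w, _ = image.shape
        if direction == "down":
            strip = image[h // 2:, :, :]
        elif direction == "up":
            strip = image[: h // 2, :, :]
        elif direction == "left":
            strip = image[:, : w // 2, :]
        else:  # "right"
            strip = image[:, w // 2:, :]
        return restore_model(strip)  # estimation image (e.g. the lower restored image ES51)

    # estimation_images = {d: estimate_periphery(im50, restore_model, d)
    #                      for d in ("up", "down", "left", "right")}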
 図18の例では、質問QS51が取得済みであるため、ロボット装置100は、質問QS51の情報を用いて、周辺空間の推定を行う。ロボット装置100は、推定用画像と与えられた質問を共通空間に射影し、答えの識別結果が最も高いコンフィデンスを持つ方向を視点方針のターゲットとする。このように、ロボット装置100は、質問の情報を加味することにより、質問に答えるための適合性も考慮した方向へのカメラ視点の調整が可能になる。 In the example of FIG. 18, since the question QS51 has been acquired, the robot apparatus 100 estimates the surrounding space using the information of the question QS51. The robot apparatus 100 projects the estimation image and the given question in the common space, and sets the direction having the highest confidence in the answer identification result as the target of the viewpoint policy. In this way, the robot apparatus 100 can adjust the camera viewpoint in the direction in consideration of the suitability for answering the question by adding the question information.
 例えば、ロボット装置100は、下部復元画像ES51や上部復元画像や右部復元画像や左部復元画像と質問QS51を共通空間に射影し、答えの識別結果のコンフィデンス(信頼度)を示す値(以下「スコア」とする)が最も大きい推定用画像に対応する方向を、カメラ視点を向ける方向に決定する。例えば、ロボット装置100は、識別モデルを用いて、下部復元画像ES51の画像特徴量と質問QS51の質問特徴量とを統合した統合特徴量と、各サポート情報の統合特徴量とを比較した結果に基づくスコアを算出してもよい。例えば、ロボット装置100は、下部復元画像ES51の画像特徴量と質問QS51の組合せと、最も距離が近いサポート情報との間の距離に基づくスコアを算出してもよい。例えば、スコアは、距離が近い程大きな値となり、識別モデルが出力する値であってもよい。 For example, the robot apparatus 100 projects the lower restored image ES51, the upper restored image, the right restored image, the left restored image, and the question QS51 into a common space, and determines, as the direction in which to point the camera viewpoint, the direction corresponding to the estimation image whose value indicating the confidence (reliability) of the answer identification result (hereinafter referred to as the "score") is the largest. For example, the robot apparatus 100 may use the identification model to calculate a score based on the result of comparing an integrated feature amount, obtained by integrating the image feature amount of the lower restored image ES51 and the question feature amount of the question QS51, with the integrated feature amount of each piece of support information. For example, the robot apparatus 100 may calculate a score based on the distance between the combination of the image feature amount of the lower restored image ES51 and the question QS51 and the closest support information. For example, the score becomes larger as the distance becomes shorter, and may be a value output by the identification model.
 なお、質問が未取得である場合、ロボット装置100は、推定された画像の物体識別のコンフィデンスが最も高くなる方向を視点方針のターゲットとしてもよい。例えば、ロボット装置100は、下部復元画像ES51や上部復元画像や右部復元画像や左部復元画像のみを用いて、画像IM50の物体識別のスコアが最も高くなる方向を視点方針のターゲットとしてもよい。 If the question has not been acquired, the robot apparatus 100 may set, as the target of the viewpoint policy, the direction in which the object identification confidence of the estimated image is the highest. For example, the robot apparatus 100 may use only the lower restored image ES51, the upper restored image, the right restored image, and the left restored image, and set the direction in which the object identification score for the image IM50 is the highest as the target of the viewpoint policy.
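The scoring of each direction could be sketched as follows, under the assumption that images and questions are encoded into feature vectors of matching dimensions; the encoder and the distance-based score are illustrative stand-ins for the identification model and are not part of the disclosure.

    from typing import Optional
    import numpy as np

    def direction_scores(estimation_images: dict, question_feature: np.ndarray,
                         support_features: list, image_encoder) -> dict:
        # For each direction, fuse the estimation image's feature with the question
        # feature and score it by its nearest support record in the common space
        # (smaller distance -> larger score -> higher confidence).
        scores = {}
        for direction, est_image in estimation_images.items():
            query_feat = np.concatenate([image_encoder(est_image), question_feature])
            dists = [np.linalg.norm(query_feat - s) for s in support_features]
            scores[direction] = -min(dists)
        return scores

    def choose_direction(scores: dict, threshold: float) -> Optional[str]:
        # Return the best direction, or None when no direction clears the threshold
        # (in that case the camera viewpoint is left unchanged).
        best = max(scores, key=scores.get)
        return best if scores[best] >= threshold else None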
 ロボット装置100は、視点方針を決定する(ステップS54)。図18の例では、下部復元画像ES51に質問QS51に含まれる「アイス」に対応する物体が含まれるため、下部復元画像ES51、すなわち下方向に対応するスコアが最大であるものとする。そのため、ロボット装置100は、下部復元画像ES51に対応する下方向にカメラ視点を調整すると決定する。この場合、ロボット装置100は、「カメラを下に」というカメラの向きを下向きに変更することを指示する視点調整情報IS51を生成する。 The robot apparatus 100 determines the viewpoint policy (step S54). In the example of FIG. 18, since the lower restored image ES51 includes an object corresponding to the "ice cream" included in the question QS51, it is assumed that the score corresponding to the lower restored image ES51, that is, to the downward direction, is the largest. Therefore, the robot apparatus 100 determines to adjust the camera viewpoint in the downward direction corresponding to the lower restored image ES51. In this case, the robot apparatus 100 generates viewpoint adjustment information IS51, "Point the camera down," which instructs that the camera orientation be changed to face downward.
 そして、ロボット装置100は、カメラ視点を調整する動作を行う(ステップS55)。図18の例では、ロボット装置100は、カメラを下に向けることを指示する視点調整情報IS51に基づいて、カメラを下に向けるようにアクチュエータ等を駆動する。例えば、ロボット装置100は、カメラが頭部に設けられている場合、頭部を下に向けるようにアクチュエータ等を駆動する。これにより、ロボット装置100は、2つのアイスクリームが写った画像IM51を撮像する。 Then, the robot apparatus 100 performs an operation of adjusting the camera viewpoint (step S55). In the example of FIG. 18, the robot apparatus 100 drives an actuator or the like so that the camera faces downward based on the viewpoint adjustment information IS51 that instructs the camera to face downward. For example, when the camera is provided on the head, the robot apparatus 100 drives an actuator or the like so that the head faces downward. Thereby, the robot apparatus 100 captures the image IM51 including the two ice creams.
 そして、ロボット装置100は、画像IM51及び質問QS51に基づいて、決定したサポート情報の正解を出力する(ステップS56)。図18の例では、ロボット装置100は、「2個」という正解AS51を出力する。 Then, the robot apparatus 100 outputs the correct answer of the determined support information based on the image IM51 and the question QS51 (step S56). In the example of FIG. 18, the robot apparatus 100 outputs the correct answer AS51 of "two".
 上述したように、ロボット装置100は、画像の入力、周辺空間の推定、視点方針の決定、及び決定に基づいた動作により、カメラ視点の調整を実現する。このように、ロボット装置100は、入力された画像に応じて、カメラ視点を調整することにより、適切な画像が取得できていない場合であっても、適切な画像を取得することが可能となる。これにより、ロボット装置100は、画像に関する質問に対して適切な応答を行うことが可能になる。 As described above, the robot apparatus 100 realizes the adjustment of the camera viewpoint through the input of an image, the estimation of the surrounding space, the determination of the viewpoint policy, and an operation based on the determination. In this way, by adjusting the camera viewpoint according to the input image, the robot apparatus 100 can acquire an appropriate image even when an appropriate image has not yet been acquired. As a result, the robot apparatus 100 can make an appropriate response to a question regarding an image.
 ここで、図19を用いて、ロボット装置100が入力を受け付けてから、カメラ位置を調整するまでの処理の流れを説明する。図19は、本開示のカメラ視点の調整に係る入力から出力までの構成の一例を示すブロック図である。 Here, the flow of processing from when the robot device 100 receives an input to when the camera position is adjusted will be described with reference to FIG. FIG. 19 is a block diagram showing an example of the configuration from the input to the output related to the adjustment of the camera viewpoint of the present disclosure.
 まず、ロボット装置100は、クエリのうち、カメラにより検知した画像の特徴量(画像特徴量)を抽出する。また、ロボット装置100は、クエリのマイクにより入力された質問の音声の特徴量(質問特徴量)を抽出する。そして、ロボット装置100は、抽出した画像特徴量に基づいて、周辺空間を推定する。これにより、ロボット装置100は、抽出した画像特徴量に基づいて、推定用画像を生成する。例えば、ロボット装置100は、抽出した画像特徴量に基づいて、下方向の周辺空間を推定し、下方向の推定用画像を生成する。そして、ロボット装置100は、推定用画像に関する特徴量及び質問特徴量を共通空間に射影する。 First, the robot apparatus 100 extracts the feature amount (image feature amount) of the image detected by the camera from the query. Further, the robot apparatus 100 extracts the feature amount (question feature amount) of the voice of the question input by the query microphone. Then, the robot apparatus 100 estimates the peripheral space based on the extracted image feature amount. Thereby, the robot apparatus 100 generates an estimation image based on the extracted image feature amount. For example, the robot apparatus 100 estimates the peripheral space in the downward direction based on the extracted image feature amount, and generates the estimation image in the downward direction. Then, the robot apparatus 100 projects the feature amount and the question feature amount regarding the estimation image in the common space.
 そして、ロボット装置100は、エピソードを生成する。ロボット装置100は、サポート情報を追加することにより、エピソードを生成する。ロボット装置100は、サポート情報記憶部141に記憶されたサポート情報を用いてエピソードを生成する。 Then, the robot device 100 generates an episode. The robot apparatus 100 generates an episode by adding the support information. The robot apparatus 100 uses the support information stored in the support information storage unit 141 to generate an episode.
 そして、ロボット装置100は、エピソードに基づいて識別コンフィデンスの検討を行う。例えば、ロボット装置100は、検討した方向のうち最もスコアが大きい方向を移動方向の候補に決定する。そして、ロボット装置100は、未検討の方向が有る場合、未検討の方向について周辺空間の推定を行い、未検討の方向が無くなるまで、識別コンフィデンスの検討を繰り返す。例えば、ロボット装置100は、上方向や右方向や左方向についても推定した下方向と同様に、その方向の周辺空間を推定し、識別コンフィデンスの検討を繰り返す。 Then, the robot apparatus 100 examines the identification confidence based on the episode. For example, the robot apparatus 100 determines the direction having the largest score among the examined directions as a candidate for the moving direction. Then, if there is an unexamined direction, the robot apparatus 100 estimates the surrounding space for the unexamined direction, and repeats the examination of the identification confidence until the unexamined direction disappears. For example, the robot apparatus 100 estimates the peripheral space in the upward direction, the rightward direction, and the leftward direction in the same manner as the estimated downward direction, and repeats the examination of the identification confidence.
 そして、ロボット装置100は、検討した全方向のうち最もスコアが大きい方向に基づいてカメラ移動方針を決定する。例えば、ロボット装置100は、全方向の検討後に、移動方向の候補となっている方向を移動方向に決定する。ロボット装置100は、検討した全方向のうち最もスコアが大きい方向を移動方向に決定する。 Then, the robot apparatus 100 determines the camera movement policy based on the direction having the highest score among all the examined directions. For example, the robot apparatus 100 determines a moving direction candidate direction as a moving direction after examining all directions. The robot apparatus 100 determines, as the moving direction, the direction having the highest score among all the examined directions.
 そして、ロボット装置100は、アクチュエータを駆動し、決定した移動方向にカメラ視点を調整する。 Then, the robot apparatus 100 drives the actuator to adjust the camera viewpoint in the determined moving direction.
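Putting the pieces together, the loop of FIG. 19 (estimate each direction, keep the best score, then drive the actuator) might look like the following; the robot methods used here are hypothetical wrappers around the sketches above.

    def adjust_camera_viewpoint(robot, image, question_feature,
                                directions=("up", "down", "left", "right")):
        # Examine every candidate direction, remember the one with the highest
        # identification confidence, and move the camera only if it clears the threshold.
        best_direction, best_score = None, float("-inf")
        for direction in directions:
            est_image = robot.estimate_periphery(image, direction)
            score = robot.identification_confidence(est_image, question_feature)
            if score > best_score:
                best_direction, best_score = direction, score
        if best_direction is not None and best_score >= robot.move_threshold:
            robot.drive_actuator(best_direction)  # e.g. tilt the head downward
        return best_direction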
 次に、図20を用いて、カメラ視点の調整処理の流れについて説明する。図20は、本開示のカメラ視点の調整処理の手順を示すフローチャートである。 Next, the flow of adjustment processing of the camera viewpoint will be described with reference to FIG. FIG. 20 is a flowchart showing the procedure of the camera viewpoint adjustment process of the present disclosure.
 図20に示すように、ロボット装置100は、入力を受け付ける(ステップS501)。例えば、ロボット装置100は、画像や質問の入力を受け付ける。 As shown in FIG. 20, the robot device 100 receives an input (step S501). For example, the robot device 100 receives an input of an image or a question.
 そして、ロボット装置100は、特定方向の周辺空間を推定する(ステップS502)。例えば、ロボット装置100は、推定対象となる方向のうち、未検討の方向の周辺空間を推定する。そして、ロボット装置100は、特定方向の識別コンフィデンスを検討する(ステップS503)。例えば、ロボット装置100は、特定方向のスコアが最大かを検討し、検討済みの方向のうち、特定方向のスコアが最大である場合、その特定方向を移動方向の候補に決定する。 Then, the robot apparatus 100 estimates the peripheral space in the specific direction (step S502). For example, the robot apparatus 100 estimates a peripheral space in an unexamined direction among the directions to be estimated. Then, the robot apparatus 100 examines the identification confidence in the specific direction (step S503). For example, the robot apparatus 100 considers whether the score in the specific direction is the maximum, and if the score in the specific direction is the maximum among the examined directions, the specific direction is determined as a candidate for the moving direction.
 そして、ロボット装置100は、全ての方向を検討したかどうかを判定する(ステップS504)。ロボット装置100は、全ての方向を検討していない場合(ステップS504;No)、ステップS502に戻って、未検討の方向が無くなるまで処理を繰り返す。 Then, the robot apparatus 100 determines whether or not all directions have been considered (step S504). If all directions have not been considered (step S504; No), the robot apparatus 100 returns to step S502 and repeats the process until there are no unexamined directions.
 一方、ロボット装置100は、全ての方向を検討した場合(ステップS504;Yes)、ステップS505の処理を行う。この場合、ロボット装置100は、全方向の検討後に、移動方向の候補となっている方向を移動方向に決定する。例えば、ロボット装置100は、全ての方向のスコアが所定の閾値未満でる場合、カメラ視点を調整しないと決定してもよい。 On the other hand, when the robot device 100 considers all the directions (step S504; Yes), the robot device 100 performs the process of step S505. In this case, the robot apparatus 100 determines a moving direction candidate as a moving direction after examining all the directions. For example, the robot apparatus 100 may determine not to adjust the camera viewpoint when the scores in all directions are less than the predetermined threshold.
 ロボット装置100は、動作するかどうかを判定する(ステップS505)。ロボット装置100は、動作しないと判定した場合(ステップS505;No)、カメラ視点を変更せずに、識別を行う(ステップS508)。例えば、ロボット装置100は、全ての方向のスコアが所定の閾値未満である場合、カメラ視点を変更せずに、識別を行う。 The robot apparatus 100 determines whether it operates (step S505). When it is determined that the robot device 100 does not operate (step S505; No), the robot device 100 performs identification without changing the camera viewpoint (step S508). For example, the robot apparatus 100 performs the identification without changing the camera viewpoint when the scores in all directions are less than the predetermined threshold value.
 一方、ロボット装置100は、動作すると判定した場合(ステップS505;Yes)、動作方針を決定する(ステップS506)。例えば、ロボット装置100は、移動方向の候補を移動方向に決定する。ロボット装置100は、スコアが最大である方向にカメラ視点を向けると決定する。 On the other hand, when it is determined that the robot device 100 operates (step S505; Yes), the robot device 100 determines an operation policy (step S506). For example, the robot apparatus 100 determines a moving direction candidate as the moving direction. The robot apparatus 100 determines to point the camera viewpoint in the direction having the maximum score.
 ロボット装置100は、決定に応じて動作する(ステップS507)。例えば、ロボット装置100は、アクチュエータを駆動し、決定した移動方向にカメラ視点を調整する。 The robot apparatus 100 operates according to the determination (step S507). For example, the robot apparatus 100 drives the actuator and adjusts the camera viewpoint in the determined moving direction.
 そして、ロボット装置100は、識別を行う(ステップS508)。例えば、ロボット装置100は、画像及び質問を含むクエリとサポート情報とに基づいて、応答に用いるサポート情報を決定する。 Then, the robot apparatus 100 performs identification (step S508). For example, the robot apparatus 100 determines the support information used for the response based on the query including the image and the question and the support information.
 そして、ロボット装置100は、識別結果に基づく出力を行う(ステップS509)。例えば、ロボット装置100は、画像及び質問を含むクエリに基づいて、決定したサポート情報の正解を出力する。 Then, the robot apparatus 100 outputs based on the identification result (step S509). For example, the robot apparatus 100 outputs the correct answer of the determined support information based on the query including the image and the question.
 また、上記各実施形態において説明した各処理のうち、自動的に行われるものとして説明した処理の全部または一部を手動的に行うこともでき、あるいは、手動的に行われるものとして説明した処理の全部または一部を公知の方法で自動的に行うこともできる。この他、上記文書中や図面中で示した処理手順、具体的名称、各種のデータやパラメータを含む情報については、特記する場合を除いて任意に変更することができる。例えば、各図に示した各種情報は、図示した情報に限られない。 Further, among the processes described in each of the above embodiments, all or part of the processes described as being performed automatically can also be performed manually, or all or part of the processes described as being performed manually can also be performed automatically by a known method. In addition, the processing procedures, specific names, and information including various data and parameters shown in the above documents and drawings can be arbitrarily changed unless otherwise specified. For example, the various kinds of information shown in each drawing are not limited to the illustrated information.
 また、図示した各装置の各構成要素は機能概念的なものであり、必ずしも物理的に図示の如く構成されていることを要しない。すなわち、各装置の分散・統合の具体的形態は図示のものに限られず、その全部または一部を、各種の負荷や使用状況などに応じて、任意の単位で機能的または物理的に分散・統合して構成することができる。 Further, each component of each illustrated device is functionally conceptual and does not necessarily have to be physically configured as illustrated. That is, the specific form of distribution and integration of each device is not limited to the illustrated form, and all or part of each device can be configured by being functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like.
 また、上述してきた各実施形態及び変形例は、処理内容を矛盾させない範囲で適宜組み合わせることが可能である。 Also, the above-described respective embodiments and modified examples can be appropriately combined within a range in which the processing content is not inconsistent.
 また、本明細書に記載された効果はあくまで例示であって限定されるものでは無く、他の効果があってもよい。 Also, the effects described in this specification are merely examples and are not limited, and there may be other effects.
(4.ハードウェア構成)
 上述してきた各実施形態に係るロボット装置100、100Aや情報処理装置100B等の情報機器は、例えば図21に示すような構成のコンピュータ1000によって実現される。図21は、ロボット装置100、100Aや情報処理装置100B等の情報処理装置の機能を実現するコンピュータ1000の一例を示すハードウェア構成図である。以下、第1の実施形態に係るロボット装置100を例に挙げて説明する。コンピュータ1000は、CPU1100、RAM1200、ROM(Read Only Memory)1300、HDD(Hard Disk Drive)1400、通信インターフェイス1500、及び入出力インターフェイス1600を有する。コンピュータ1000の各部は、バス1050によって接続される。
(4. Hardware configuration)
The information devices such as the robot devices 100 and 100A and the information processing device 100B according to the above-described embodiments are realized by, for example, a computer 1000 configured as shown in FIG. FIG. 21 is a hardware configuration diagram showing an example of a computer 1000 that realizes the functions of an information processing apparatus such as the robot apparatuses 100 and 100A and the information processing apparatus 100B. Hereinafter, the robot device 100 according to the first embodiment will be described as an example. The computer 1000 has a CPU 1100, a RAM 1200, a ROM (Read Only Memory) 1300, an HDD (Hard Disk Drive) 1400, a communication interface 1500, and an input / output interface 1600. Each unit of the computer 1000 is connected by a bus 1050.
 CPU1100は、ROM1300又はHDD1400に格納されたプログラムに基づいて動作し、各部の制御を行う。例えば、CPU1100は、ROM1300又はHDD1400に格納されたプログラムをRAM1200に展開し、各種プログラムに対応した処理を実行する。 The CPU 1100 operates based on a program stored in the ROM 1300 or the HDD 1400, and controls each part. For example, the CPU 1100 expands a program stored in the ROM 1300 or the HDD 1400 into the RAM 1200 and executes processing corresponding to various programs.
 ROM1300は、コンピュータ1000の起動時にCPU1100によって実行されるBIOS(Basic Input Output System)等のブートプログラムや、コンピュータ1000のハードウェアに依存するプログラム等を格納する。 The ROM 1300 stores a boot program such as a BIOS (Basic Input Output System) executed by the CPU 1100 when the computer 1000 starts up, a program dependent on the hardware of the computer 1000, and the like.
 HDD1400は、CPU1100によって実行されるプログラム、及び、かかるプログラムによって使用されるデータ等を非一時的に記録する、コンピュータが読み取り可能な記録媒体である。具体的には、HDD1400は、プログラムデータ1450の一例である本開示に係る情報処理プログラムを記録する記録媒体である。 The HDD 1400 is a computer-readable recording medium that non-temporarily records a program executed by the CPU 1100, data used by the program, and the like. Specifically, the HDD 1400 is a recording medium that records the information processing program according to the present disclosure, which is an example of the program data 1450.
 通信インターフェイス1500は、コンピュータ1000が外部ネットワーク1550(例えばインターネット)と接続するためのインターフェイスである。例えば、CPU1100は、通信インターフェイス1500を介して、他の機器からデータを受信したり、CPU1100が生成したデータを他の機器へ送信したりする。 The communication interface 1500 is an interface for connecting the computer 1000 to an external network 1550 (for example, the Internet). For example, the CPU 1100 receives data from another device or transmits the data generated by the CPU 1100 to another device via the communication interface 1500.
 入出力インターフェイス1600は、入出力デバイス1650とコンピュータ1000とを接続するためのインターフェイスである。例えば、CPU1100は、入出力インターフェイス1600を介して、キーボードやマウス等の入力デバイスからデータを受信する。また、CPU1100は、入出力インターフェイス1600を介して、ディスプレイやスピーカーやプリンタ等の出力デバイスにデータを送信する。また、入出力インターフェイス1600は、所定の記録媒体(メディア)に記録されたプログラム等を読み取るメディアインターフェイスとして機能してもよい。メディアとは、例えばDVD(Digital Versatile Disc)、PD(Phase change rewritable Disk)等の光学記録媒体、MO(Magneto-Optical disk)等の光磁気記録媒体、テープ媒体、磁気記録媒体、または半導体メモリ等である。例えば、コンピュータ1000が実施形態に係る情報処理装置100として機能する場合、コンピュータ1000のCPU1100は、RAM1200上にロードされた情報処理プログラムを実行することにより、制御部15等の機能を実現する。また、HDD1400には、本開示に係る情報処理プログラムや、記憶部14内のデータが格納される。なお、CPU1100は、プログラムデータ1450をHDD1400から読み取って実行するが、他の例として、外部ネットワーク1550を介して、他の装置からこれらのプログラムを取得してもよい。 The input/output interface 1600 is an interface for connecting the input/output device 1650 and the computer 1000. For example, the CPU 1100 receives data from an input device such as a keyboard or a mouse via the input/output interface 1600. The CPU 1100 also transmits data to an output device such as a display, a speaker, or a printer via the input/output interface 1600. Further, the input/output interface 1600 may function as a media interface for reading a program or the like recorded on a predetermined recording medium (media). Examples of the media include optical recording media such as a DVD (Digital Versatile Disc) and a PD (Phase change rewritable Disk), magneto-optical recording media such as an MO (Magneto-Optical disk), tape media, magnetic recording media, and semiconductor memories. For example, when the computer 1000 functions as the information processing apparatus 100 according to the embodiment, the CPU 1100 of the computer 1000 realizes the functions of the control unit 15 and the like by executing the information processing program loaded on the RAM 1200. Further, the HDD 1400 stores the information processing program according to the present disclosure and the data in the storage unit 14. Note that the CPU 1100 reads the program data 1450 from the HDD 1400 and executes it, but as another example, these programs may be acquired from another device via the external network 1550.
 なお、本技術は以下のような構成も取ることができる。
(1)
 画像、前記画像に関連する質問、及び前記質問に対応する正解を取得する取得部と、
 前記取得部により取得された前記画像、前記質問、及び前記正解の組合せを、一の画像及び前記一の画像に関連する一の質問を含むクエリ情報への応答の決定に用いるサポート情報として登録する登録部と、
 を備える情報処理装置。
(2)
 前記取得部は、
 前記画像、前記画像に対応する概念に関する前記質問、及び前記概念に関する前記正解を取得し、
 前記登録部は、
 前記画像、前記概念に関する前記質問、及び前記概念に関する前記正解の前記組合せを、前記サポート情報として登録する
 前記(1)に記載の情報処理装置。
(3)
 前記取得部は、
 前記画像、前記画像に含まれる対象物に関する前記質問、及び前記対象物に関する前記正解を取得し、
 前記登録部は、
 前記画像、前記対象物に関する前記質問、及び前記対象物に関する前記正解の前記組合せを、前記サポート情報として登録する
 前記(2)に記載の情報処理装置。
(4)
 前記取得部は、
 前記画像、前記画像に含まれる対象物の性質または状態に関する前記質問、及び前記性質または前記状態に関する前記正解を取得し、
 前記登録部は、
 前記画像、前記性質または前記状態に関する前記質問、及び前記性質または前記状態に関する前記正解の前記組合せを、前記サポート情報として登録する
 前記(3)に記載の情報処理装置。
(5)
 前記取得部は、
 前記画像、前記画像に対するユーザの印象に関する前記質問、及び前記印象に関する前記正解を取得し、
 前記登録部は、
 前記画像、前記印象に関する前記質問、及び前記印象に関する前記正解の前記組合せを、前記サポート情報として登録する
 前記(4)に記載の情報処理装置。
(6)
 前記取得部は、
 前記画像、前記画像に含まれる対象物の量、色、温度または硬さに関する前記質問、及び前記量、前記色、前記温度または前記硬さに関する前記正解を取得し、
 前記登録部は、
 前記画像、及び前記量、前記色、前記温度または前記硬さに関する前記質問、及び前記質問、及び前記量、前記色、前記温度または前記硬さに関する前記正解の前記組合せを、前記サポート情報として登録する
 前記(5)に記載の情報処理装置。
(7)
 ユーザによる入力を受け付ける入力部、
 を備え、
 前記取得部は、
 前記ユーザにより入力された前記質問を取得する
 前記(1)~(6)のいずれかに記載の情報処理装置。
(8)
 前記質問に対応する正解候補を出力する出力部、
 を備え、
 前記入力部は、
 前記ユーザによる前記正解候補に対する反応を受け付け、
 前記登録部は、
 前記正解候補に対する前記反応に応じて決定される前記正解を含む前記組合せを、前記サポート情報として登録する
 前記(7)に記載の情報処理装置。
(9)
 前記登録部は、
 前記正解候補に対する前記反応が肯定的である場合、前記正解候補を前記正解として含む前記組合せを、前記サポート情報として登録する
 前記(8)に記載の情報処理装置。
(10)
 前記入力部は、
 前記正解候補に対する前記反応が否定的である場合、前記ユーザによる前記正解候補とは別の正解候補を受け付け、
 前記登録部は、
 前記ユーザにより入力された前記別の正解候補を前記正解として含む前記組合せを、前記サポート情報として登録する
 前記(8)に記載の情報処理装置。
(11)
 前記質問を出力する出力部、
 を備え、
 前記取得部は、
 前記出力部により出力された前記質問を取得する
 前記(1)~(6)のいずれかに記載の情報処理装置。
(12)
 ユーザによる前記質問に対応する前記正解の入力を受け付ける入力部、
 を備え、
 前記登録部は、
 前記ユーザにより入力された前記正解を含む前記組合せを、前記サポート情報として登録する
 前記(11)に記載の情報処理装置。
(13)
 ユーザの有無を検知する検知部、
 を備え、
 前記出力部は、
 前記検知部によりユーザが検知された場合、前記質問を出力する
 前記(11)または(12)に記載の情報処理装置。
(14)
 前記画像を撮像する撮像部、
 を備え、
 前記取得部は、
 前記撮像部により検知された前記画像を取得する
 前記(1)~(13)のいずれかに記載の情報処理装置。
(15)
 前記クエリ情報と、前記サポート情報とに基づいて、前記クエリ情報の前記一の質問に対応する一の正解を決定する決定部、
 を備える前記(1)~(14)のいずれかに記載の情報処理装置。
(16)
 前記決定部は、
 前記クエリ情報に含まれる前記一の画像及び前記一の質問と、前記サポート情報に含まれる前記画像及び前記質問とに基づいて、前記一の正解を決定する
 前記(15)に記載の情報処理装置。
(17)
 前記決定部は、
 前記クエリ情報に含まれる前記一の画像及び前記一の質問と、前記サポート情報に含まれる前記画像及び前記質問との比較に基づいて、前記一の正解を決定する
 前記(16)に記載の情報処理装置。
(18)
 前記取得部は、
 前記クエリ情報を取得し、
 前記決定部は、
 前記取得部により取得された前記クエリ情報と、複数のサポート情報とに基づいて、前記複数のサポート情報のうち、前記一の正解に用いる一のサポート情報を決定する
 前記(15)~(17)のいずれかに記載の情報処理装置。
(19)
 画像、前記画像に関連する質問、及び前記質問に対応する正解を取得し、
 取得した前記画像、前記質問、及び前記正解の組合せを、一の画像及び前記一の画像に関連する一の質問を含むクエリ情報への応答の決定に用いるサポート情報として登録する
 処理を実行する情報処理方法。
(20)
 画像、前記画像に関連する質問、及び前記質問に対応する正解を取得し、
 取得した前記画像、前記質問、及び前記正解の組合せを、一の画像及び前記一の画像に関連する一の質問を含むクエリ情報への応答の決定に用いるサポート情報として登録する
 処理を実行させる情報処理プログラム。
Note that the present technology may also be configured as below.
(1)
An image, a question related to the image, and an acquisition unit that acquires a correct answer corresponding to the question;
The combination of the image, the question, and the correct answer acquired by the acquisition unit is registered as support information used to determine a response to query information including one image and one question related to the one image. Registration department,
An information processing apparatus including.
(2)
The acquisition unit is
Obtaining the image, the question about the concept corresponding to the image, and the correct answer about the concept,
The registration unit is
The information processing apparatus according to (1), wherein the combination of the image, the question regarding the concept, and the correct answer regarding the concept is registered as the support information.
(3)
The acquisition unit is
Acquiring the image, the question about the object contained in the image, and the correct answer about the object,
The registration unit is
The information processing apparatus according to (2), wherein the combination of the image, the question regarding the object, and the correct answer regarding the object is registered as the support information.
(4)
The acquisition unit is
Acquiring the image, the question regarding the property or state of an object included in the image, and the correct answer regarding the property or state,
The registration unit is
The information processing apparatus according to (3), wherein the combination of the image, the question regarding the property or the state, and the correct answer regarding the property or the state is registered as the support information.
(5)
The acquisition unit is
Obtaining the image, the question regarding the user's impression of the image, and the correct answer regarding the impression,
The registration unit is
The information processing apparatus according to (4), wherein the combination of the image, the question regarding the impression, and the correct answer regarding the impression is registered as the support information.
(6)
The acquisition unit is
Acquiring the image, the amount of the object contained in the image, the color, the question about the temperature or hardness, and the amount, the color, the temperature or the correct answer about the hardness,
The registration unit is
The image and the question regarding the amount, the color, the temperature, or the hardness, and the question and the combination of the correct answers regarding the amount, the color, the temperature, or the hardness are registered as the support information. The information processing apparatus according to (5) above.
(7)
An input unit that accepts user input,
Equipped with
The acquisition unit is
The information processing apparatus according to any one of (1) to (6), which acquires the question input by the user.
(8)
An output unit that outputs a correct answer candidate corresponding to the question,
Equipped with
The input unit is
Accepting a response to the correct candidate by the user,
The registration unit is
The information processing device according to (7), wherein the combination including the correct answer determined according to the reaction to the correct answer candidate is registered as the support information.
(9)
The registration unit is
The information processing device according to (8), wherein when the reaction to the correct answer candidate is affirmative, the combination including the correct answer candidate as the correct answer is registered as the support information.
(10)
The input unit is
When the reaction to the correct answer candidate is negative, a correct answer candidate different from the correct answer candidate by the user is accepted,
The registration unit is
The information processing apparatus according to (8), wherein the combination including the another correct answer candidate input by the user as the correct answer is registered as the support information.
(11)
The information processing apparatus according to any one of (1) to (6), further comprising:
an output unit that outputs the question,
wherein the acquisition unit acquires the question output by the output unit.
(12)
The information processing apparatus according to (11), further comprising:
an input unit that accepts input by a user of the correct answer corresponding to the question,
wherein the registration unit registers the combination including the correct answer input by the user as the support information.
(13)
The information processing apparatus according to (11) or (12), further comprising:
a detection unit that detects the presence or absence of a user,
wherein the output unit outputs the question when a user is detected by the detection unit.
(14)
The information processing apparatus according to any one of (1) to (13), further comprising:
an imaging unit that captures the image,
wherein the acquisition unit acquires the image detected by the imaging unit.
(15)
The information processing apparatus according to any one of (1) to (14), further comprising:
a determination unit that determines one correct answer corresponding to the one question of the query information based on the query information and the support information.
(16)
The information processing apparatus according to (15), wherein the determination unit determines the one correct answer based on the one image and the one question included in the query information and the image and the question included in the support information.
(17)
The information processing apparatus according to (16), wherein the determination unit determines the one correct answer based on a comparison between the one image and the one question included in the query information and the image and the question included in the support information.
(18)
The information processing apparatus according to any one of (15) to (17), wherein
the acquisition unit acquires the query information, and
the determination unit determines, based on the query information acquired by the acquisition unit and a plurality of pieces of support information, one piece of support information to be used for the one correct answer from among the plurality of pieces of support information.
(19)
An information processing method comprising executing processing of:
acquiring an image, a question related to the image, and a correct answer corresponding to the question; and
registering the acquired combination of the image, the question, and the correct answer as support information used to determine a response to query information including one image and one question related to the one image.
(20)
An information processing program for causing a computer to execute processing of:
acquiring an image, a question related to the image, and a correct answer corresponding to the question; and
registering the acquired combination of the image, the question, and the correct answer as support information used to determine a response to query information including one image and one question related to the one image.
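For orientation only, the following Python sketch illustrates one way the registration and determination described in (1) and (15) to (18) above could be realized: combinations of an image, a question, and a correct answer are stored as support information, and a query is answered by comparing its image and question embeddings with the stored combinations and reusing the closest entry's answer. The embedding functions and class names are assumptions made for this sketch and are not part of the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, List

import numpy as np

# Hypothetical embedding functions; any image/text encoders could stand in here.
EmbedImage = Callable[[np.ndarray], np.ndarray]
EmbedText = Callable[[str], np.ndarray]


@dataclass
class SupportEntry:
    """One registered combination of an image, a question, and its correct answer."""
    image_vec: np.ndarray
    question_vec: np.ndarray
    answer: str


class SupportStore:
    """Registers support information and decides an answer for query information."""

    def __init__(self, embed_image: EmbedImage, embed_text: EmbedText) -> None:
        self.embed_image = embed_image
        self.embed_text = embed_text
        self.entries: List[SupportEntry] = []

    def register(self, image: np.ndarray, question: str, correct_answer: str) -> None:
        # Registration unit: store the combination as support information.
        self.entries.append(
            SupportEntry(self.embed_image(image), self.embed_text(question), correct_answer)
        )

    def decide(self, query_image: np.ndarray, query_question: str) -> str:
        # Determination unit: compare the query image and question with each
        # registered image and question, and reuse the closest entry's answer.
        if not self.entries:
            raise ValueError("no support information registered")
        qi = self.embed_image(query_image)
        qq = self.embed_text(query_question)
        scores = [
            _cosine(qi, e.image_vec) + _cosine(qq, e.question_vec) for e in self.entries
        ]
        return self.entries[int(np.argmax(scores))].answer


def _cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```

A metric-learning or matching-network style encoder could replace the simple cosine comparison; the point of the sketch is only that the answer is drawn from the registered support information rather than from a fixed set of answer classes.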
100, 100A  Robot apparatus
100B  Information processing apparatus
11, 11B  Communication unit
12  Input unit (microphone)
13  Output unit (speaker)
14, 14A, 14B  Storage unit
141  Support information storage unit
142  Model information storage unit
143, 143A  Mode information storage unit
144  Question information storage unit
15, 15A  Control unit
151  Acquisition unit
152  Determination unit
153, 153A  Generation unit
154  Registration unit
155  Learning unit
156  Decision unit
16, 16A  Sensor unit (imaging unit, detection unit, camera)
17  Drive unit (actuator)

Claims (20)

1. An information processing apparatus comprising:
an acquisition unit that acquires an image, a question related to the image, and a correct answer corresponding to the question; and
a registration unit that registers the combination of the image, the question, and the correct answer acquired by the acquisition unit as support information used to determine a response to query information including one image and one question related to the one image.
2. The information processing apparatus according to claim 1, wherein
the acquisition unit acquires the image, the question regarding a concept corresponding to the image, and the correct answer regarding the concept, and
the registration unit registers the combination of the image, the question regarding the concept, and the correct answer regarding the concept as the support information.
3. The information processing apparatus according to claim 2, wherein
the acquisition unit acquires the image, the question regarding an object included in the image, and the correct answer regarding the object, and
the registration unit registers the combination of the image, the question regarding the object, and the correct answer regarding the object as the support information.
4. The information processing apparatus according to claim 3, wherein
the acquisition unit acquires the image, the question regarding a property or a state of an object included in the image, and the correct answer regarding the property or the state, and
the registration unit registers the combination of the image, the question regarding the property or the state, and the correct answer regarding the property or the state as the support information.
5. The information processing apparatus according to claim 4, wherein
the acquisition unit acquires the image, the question regarding a user's impression of the image, and the correct answer regarding the impression, and
the registration unit registers the combination of the image, the question regarding the impression, and the correct answer regarding the impression as the support information.
6. The information processing apparatus according to claim 5, wherein
the acquisition unit acquires the image, the question regarding an amount, a color, a temperature, or a hardness of an object included in the image, and the correct answer regarding the amount, the color, the temperature, or the hardness, and
the registration unit registers the combination of the image, the question regarding the amount, the color, the temperature, or the hardness, and the correct answer regarding the amount, the color, the temperature, or the hardness as the support information.
7. The information processing apparatus according to claim 1, further comprising:
an input unit that accepts input by a user,
wherein the acquisition unit acquires the question input by the user.
8. The information processing apparatus according to claim 7, further comprising:
an output unit that outputs a correct answer candidate corresponding to the question,
wherein the input unit accepts a reaction by the user to the correct answer candidate, and
the registration unit registers, as the support information, the combination including the correct answer determined according to the reaction to the correct answer candidate.
9. The information processing apparatus according to claim 8, wherein, when the reaction to the correct answer candidate is affirmative, the registration unit registers, as the support information, the combination including the correct answer candidate as the correct answer.
10. The information processing apparatus according to claim 8, wherein,
when the reaction to the correct answer candidate is negative, the input unit accepts another correct answer candidate input by the user, and
the registration unit registers, as the support information, the combination including the other correct answer candidate input by the user as the correct answer.
11. The information processing apparatus according to claim 1, further comprising:
an output unit that outputs the question,
wherein the acquisition unit acquires the question output by the output unit.
12. The information processing apparatus according to claim 11, further comprising:
an input unit that accepts input by a user of the correct answer corresponding to the question,
wherein the registration unit registers the combination including the correct answer input by the user as the support information.
13. The information processing apparatus according to claim 11, further comprising:
a detection unit that detects the presence or absence of a user,
wherein the output unit outputs the question when a user is detected by the detection unit.
14. The information processing apparatus according to claim 1, further comprising:
an imaging unit that captures the image,
wherein the acquisition unit acquires the image detected by the imaging unit.
15. The information processing apparatus according to claim 1, further comprising:
a determination unit that determines one correct answer corresponding to the one question of the query information based on the query information and the support information.
16. The information processing apparatus according to claim 15, wherein the determination unit determines the one correct answer based on the one image and the one question included in the query information and the image and the question included in the support information.
17. The information processing apparatus according to claim 16, wherein the determination unit determines the one correct answer based on a comparison between the one image and the one question included in the query information and the image and the question included in the support information.
18. The information processing apparatus according to claim 15, wherein
the acquisition unit acquires the query information, and
the determination unit determines, based on the query information acquired by the acquisition unit and a plurality of pieces of support information, one piece of support information to be used for the one correct answer from among the plurality of pieces of support information.
19. An information processing method comprising executing processing of:
acquiring an image, a question related to the image, and a correct answer corresponding to the question; and
registering the acquired combination of the image, the question, and the correct answer as support information used to determine a response to query information including one image and one question related to the one image.
20. An information processing program for causing a computer to execute processing of:
acquiring an image, a question related to the image, and a correct answer corresponding to the question; and
registering the acquired combination of the image, the question, and the correct answer as support information used to determine a response to query information including one image and one question related to the one image.
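As a companion to the sketch above, the following minimal loop illustrates the interactive registration described in claims 7 to 13: a correct answer candidate is output for a question about an image, the user's reaction is accepted, and either the confirmed candidate or the user's correction is registered as support information. The helper callables (propose_candidate, ask_user) and the reuse of SupportStore.register are assumptions for illustration, not part of the claims.

```python
def interactive_register(store, image, question, propose_candidate, ask_user):
    """Sketch of the claim 8-10 flow: confirm or correct a candidate, then register it.

    propose_candidate(image, question) -> str  # hypothetical answer generator
    ask_user(prompt) -> str                    # hypothetical user I/O (speaker/microphone)
    """
    candidate = propose_candidate(image, question)
    reaction = ask_user(f"Is the answer to '{question}' '{candidate}'? (yes/no)")
    if reaction.strip().lower().startswith("y"):
        # Affirmative reaction: register the candidate itself as the correct answer (claim 9).
        store.register(image, question, candidate)
        return candidate
    # Negative reaction: accept a different correct answer from the user (claim 10).
    corrected = ask_user(f"What is the correct answer to '{question}'?")
    store.register(image, question, corrected)
    return corrected
```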
PCT/JP2019/041218 2018-11-14 2019-10-18 Information processing device, information processing method, and information processing program WO2020100532A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2018214174 2018-11-14
JP2018-214174 2018-11-14

Publications (1)

Publication Number Publication Date
WO2020100532A1 true WO2020100532A1 (en) 2020-05-22

Family

ID=70730922

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/041218 WO2020100532A1 (en) 2018-11-14 2019-10-18 Information processing device, information processing method, and information processing program

Country Status (1)

Country Link
WO (1) WO2020100532A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004013344A (en) * 2002-06-04 2004-01-15 Hitachi Information Systems Ltd Question/answer support system for screen operation
JP2005302000A (en) * 2004-03-18 2005-10-27 Yafoo Japan Corp Device, method and program for retrieving knowledge
JP2008117324A (en) * 2006-11-08 2008-05-22 Nec Corp Verification system, method therefor, data processing terminal using the same, and program
JP2017182646A * 2016-03-31 2017-10-05 Dai Nippon Printing Co., Ltd. Information processing device, program and information processing method

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7481995B2 2020-10-28 2024-05-13 Toshiba Corporation State determination device, method, and program

Similar Documents

Publication Publication Date Title
US11241789B2 (en) Data processing method for care-giving robot and apparatus
CN109176535B (en) Interaction method and system based on intelligent robot
CN107097234B (en) Robot control system
CN110069707A (en) Artificial intelligence self-adaptation interactive teaching system
JP2019056970A (en) Information processing device, artificial intelligence selection method and artificial intelligence selection program
CN108781273A (en) Action based on automatic participant mark
WO2016136104A1 (en) Information processing device, information processing method, and program
CN109770918A (en) Mood analytical equipment, method and the machine readable storage medium for recording this method program
WO2020100532A1 (en) Information processing device, information processing method, and information processing program
US11238846B2 (en) Information processing device and information processing method
WO2019207875A1 (en) Information processing device, information processing method, and program
US20200234187A1 (en) Information processing apparatus, information processing method, and program
JP7371770B2 (en) Avatar control program, avatar control method, and information processing device
CN114428879A (en) Multimode English teaching system based on multi-scene interaction
WO2017149848A1 (en) Information processing device, information processing method and program
US11935449B2 (en) Information processing apparatus and information processing method
JP7087804B2 (en) Communication support device, communication support system and communication method
Vidrin Partnering as rhetoric
WO2019146199A1 (en) Information processing device and information processing method
JP6866731B2 (en) Speech recognition device, speech recognition method, and program
JP4685712B2 (en) Speaker face image determination method, apparatus and program
US20220261201A1 (en) Computer-readable recording medium storing display control program, display control device, and display control method
JP2020187714A (en) Information processor, information processing system, learning device, and information processing program
CN112585662A (en) Method and system for automatically sharing process knowledge
KR102439446B1 (en) Learning management system based on artificial intelligence

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19885156

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19885156

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP