WO2019039352A1

WO2019039352A1 - Information processing device, control method, and program

Info

Publication number: WO2019039352A1
Application number: PCT/JP2018/030272
Authority: WO
Inventors: 伸明川瀬
Original assignee: 日本電気株式会社
Priority date: 2017-08-25
Filing date: 2018-08-14
Publication date: 2019-02-28

Abstract

An information processing device (2000) determines whether or not the utterance of a person (10) included in speech data is directed from the person (10) to an object (20). To perform the determination, the information processing device (2000) estimates the line of sight of the person (10). The line of sight of the person (10) is estimated by analyzing a captured image including the person (10). The determination is performed using the estimated line of sight of the person (10).

Description

INFORMATION PROCESSING APPARATUS, CONTROL METHOD, AND PROGRAM

The present invention relates to a technology for processing human speech.

There are technologies in which a computer processes voice data representing human speech. For example, Patent Document 1 discloses a robot that responds to a sentence input by a user's voice.

Patent Document 1: JP 2014-240864 A

Even when an object such as a robot is installed near a person, the person may speak to something other than the object. For example, the person may talk with other people or talk to himself or herself. If an utterance not directed to the robot or the like is processed in the same manner as an utterance directed to it, the robot or the like performs an unexpected operation, which reduces its convenience.

In this regard, Patent Document 1 discloses a technology that treats a voice obtained while a camera detects the frontal face of the user as highly likely to represent a voice instruction by the user. However, the fact that the front of the user's face is directed toward the robot or the like does not necessarily mean that the user is giving a voice instruction to it.

The present invention has been made in view of the above problems. One of the objects of the present invention is to provide a technology for processing human speech accurately.

An information processing apparatus according to the present invention includes: 1) image analysis means for estimating the line of sight of a person included in a captured image; and 2) voice determination means for determining, using the estimated line of sight, whether or not voice data represents an utterance directed from the person to an object.

The control method of the present invention is executed by a computer. The control method includes: 1) an image analysis step of estimating the line of sight of a person included in a captured image; and 2) a voice determination step of determining, using the estimated line of sight, whether or not voice data represents an utterance directed from the person to an object.

The program of the present invention causes a computer to execute each step of the control method of the present invention.

According to the present invention, a technique for processing human speech with high accuracy is provided.

The objects described above, and other objects, features and advantages will become more apparent from the preferred embodiments described below and the following drawings associated therewith.

FIG. 1 is a diagram for describing an overview of the operation of the information processing apparatus of the first embodiment (the information processing apparatus illustrated in FIG. 2).
FIG. 2 is a diagram illustrating the configuration of the information processing apparatus of the first embodiment.
FIG. 3 is a diagram illustrating a computer for realizing the information processing apparatus.
FIG. 4 is a flowchart illustrating the flow of processing executed by the information processing apparatus of the first embodiment.
FIG. 5 is a diagram illustrating the relationship between the length of the period during which the utterance represented by utterance data is performed and the length of the period during which a person's line of sight is directed to the object.
FIG. 6 is a diagram illustrating, in table form, the information stored in the storage unit.
FIG. 7 is a block diagram illustrating a functional configuration of the information processing apparatus of the second embodiment.
FIG. 8 is a diagram illustrating feature data in table form.
FIG. 9 is a diagram illustrating utterance data stored after an utterance partner other than the object is specified.
FIG. 10 is a block diagram illustrating a functional configuration of the information processing apparatus of the third embodiment.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. In all the drawings, the same components are denoted by the same reference numerals, and the description thereof will be appropriately omitted. Further, in each block diagram, each block represents a configuration of a function unit, not a configuration of a hardware unit, unless otherwise described.

Embodiment 1
<Overview>
FIG. 1 is a diagram for explaining an outline of the operation of the information processing apparatus (the information processing apparatus 2000 illustrated in FIG. 2) of the first embodiment. The operation of the information processing apparatus 2000 described below is an example for facilitating the understanding of the information processing apparatus 2000, and the operation of the information processing apparatus 2000 is not limited to the following example. Details and variations of the operation of the information processing apparatus 2000 will be described later.

The information processing apparatus 2000 determines whether an utterance of the person 10 included in voice data is directed from the person 10 to the object 20. The object 20 is any object to which the person 10 can direct an utterance. For example, the object 20 is an arbitrary computer, such as a robot, that operates by processing utterances of the person 10. The object 20 in the example of FIG. 1 is an interactive robot that responds to utterances of the person 10.

In order to make the above determination, the information processing apparatus 2000 estimates the line of sight of the person 10. The line of sight of the person 10 is estimated by image analysis of a captured image that includes the person 10. The above determination is performed using the estimated line of sight of the person 10. Voice data determined to represent an utterance directed from the person 10 to the object 20 is stored in a storage unit.

A specific example will be described using FIG. 1. As mentioned above, in FIG. 1, the object 20 is an interactive robot. In both the example in the left column and the example in the right column of FIG. 1, the front of the face of the person 10-1, who is the speaker, is directed to the object 20. However, the direction of the line of sight of the person 10-1 differs between these examples.

In the example in the left column, the person 10-1 looks at the robot and utters, "Is the weather tomorrow fine?" Since the line of sight is directed to the robot, this utterance can be regarded as asking the robot to check the next day's weather. The information processing apparatus 2000 uses the line of sight of the person 10-1 to determine whether the utterance included in the voice data is directed from the person 10-1 to the robot. As a result, it is determined that the utterance included in the voice data is directed from the person 10-1 to the robot. Then, in response to the utterance, the robot checks the next day's weather using the Internet or the like and outputs a response, for example, "Rain is forecast for tomorrow."

On the other hand, in the example in the right column, the person 10-1 utters, "Is the weather tomorrow fine?" without looking at the robot. As described above, the front of the face of the person 10-1 is directed to the robot in this example as well. Therefore, if whether the utterance of the person 10-1 is directed to the robot were determined based on the direction of the face of the person 10-1, the utterance would be determined to be directed to the robot in this example too. However, in this example, the person 10-1 is speaking while looking at the person 10-2, so the person 10-1 is clearly speaking to the person 10-2. If the robot responded to the utterance of the person 10-1, it would interrupt the conversation between the person 10-1 and the person 10-2, which would reduce the convenience of the robot.

In this regard, as described above, the information processing apparatus 2000 uses the line of sight of the person 10-1 to determine whether or not the utterance included in the voice data is directed from the person 10-1 to the robot. Therefore, in the example in the right column of FIG. 1, it is determined that the utterance included in the voice data is not directed from the person 10-1 to the robot. Consequently, the robot does not respond to this utterance.

As described above, the information processing apparatus 2000 of the present embodiment was devised by focusing on the fact that, when a person speaks to an object, the person's line of sight is often directed to that object. Specifically, the information processing apparatus 2000 according to the present embodiment estimates the line of sight of the person 10 and uses the estimated line of sight to determine whether the utterance of the person 10 is directed to the object 20. In this way, whether or not the utterance is directed from the person 10 to the object 20 can be determined accurately. For example, in the example in the right column of FIG. 1 described above, the object 20 can be prevented from erroneously responding to the utterance that the person 10-1 directs to the person 10-2.

Hereinafter, the information processing apparatus 2000 according to the present embodiment will be described in more detail.

<Example of Functional Configuration of Information Processing Apparatus 2000>
FIG. 2 is a diagram illustrating the configuration of the information processing apparatus 2000 according to the first embodiment. In FIG. 2, the information processing apparatus 2000 includes an image analysis unit 2020 and a voice determination unit 2040. The image analysis unit 2020 acquires a captured image and estimates the line of sight of the person 10 included in the acquired captured image. The voice determination unit 2040 acquires voice data including an utterance of the person 10 (hereinafter, utterance data). Using the line of sight estimated by the image analysis unit 2020, the voice determination unit 2040 determines whether the utterance data represents an utterance directed from the person 10 to the object 20. If it is determined that the utterance data represents an utterance directed from the person 10 to the object 20, the voice determination unit 2040 stores the utterance data in the storage unit 50. The storage unit 50 is an arbitrary storage device capable of storing utterance data. The storage unit 50 may be provided inside or outside the information processing apparatus 2000.

<Hardware Configuration of Information Processing Apparatus 2000>
Each functional component of the information processing apparatus 2000 may be realized by hardware that realizes that functional component (for example, a hard-wired electronic circuit), or by a combination of hardware and software (for example, a combination of an electronic circuit and a program that controls it). Hereinafter, the case where each functional component of the information processing apparatus 2000 is realized by a combination of hardware and software will be further described.

FIG. 3 is a diagram illustrating a computer 1000 for realizing the information processing apparatus 2000. The computer 1000 is an arbitrary computer. For example, the computer 1000 is a chip such as a System on Chip (SoC), a personal computer (PC), a server machine, a tablet terminal, a smartphone, or the like. The computer 1000 may be a dedicated computer designed to realize the information processing apparatus 2000, or may be a general-purpose computer.

The computer 1000 may be installed inside the object 20 or may be installed outside the object 20. For example, assume that the object 20 is a robot. In this case, the computer 1000 installed inside the object 20 is, for example, a control chip incorporated in a robot. On the other hand, the computer 1000 installed outside the object 20 is a server device that controls the robot from the outside via, for example, a network.

The computer 1000 includes a bus 1020, a processor 1040, a memory 1060, a storage device 1080, an input/output interface 1100, and a network interface 1120. The bus 1020 is a data transmission path through which the processor 1040, the memory 1060, the storage device 1080, the input/output interface 1100, and the network interface 1120 transmit and receive data to and from one another. However, the method of connecting the processor 1040 and the other components to each other is not limited to a bus connection. The processor 1040 is any of various processors such as a central processing unit (CPU) or a graphics processing unit (GPU). The memory 1060 is a main storage device implemented using a random access memory (RAM) or the like. The storage device 1080 is an auxiliary storage device implemented using a hard disk, a solid state drive (SSD), a memory card, a read only memory (ROM), or the like.

The input/output interface 1100 is an interface for connecting the computer 1000 to input/output devices. For example, the camera 30 and the microphone 40 are connected to the input/output interface 1100. The camera 30 and the microphone 40 are described later.

The network interface 1120 is an interface for connecting the computer 1000 to a communication network. This communication network is, for example, a LAN (Local Area Network) or a WAN (Wide Area Network). The method of connecting the network interface 1120 to the communication network may be wireless connection or wired connection. For example, when the information processing apparatus 2000 is installed outside the object 20, the information processing apparatus 2000 communicates with another computer incorporated in the object 20 via the network.

The storage device 1080 stores program modules for realizing the respective functional components of the information processing apparatus 2000. The processor 1040 implements the functions corresponding to each program module by reading the program modules into the memory 1060 and executing them. Further, when the storage unit 50 is provided inside the information processing apparatus 2000, for example, the storage unit 50 is realized using the storage device 1080.

The computer 1000 may be realized using a plurality of computers. For example, the image analysis unit 2020 and the voice determination unit 2040 can be realized by different computers. In this case, the program modules stored in the storage device of each computer may be only the program modules corresponding to the functional components realized by the computer.

<< About the camera 30 >>
The camera 30 is an arbitrary camera that captures images of the person 10 and generates moving image data. A captured image is a moving image frame constituting this moving image data. The camera 30 may be installed on the object 20 or at a location other than the object 20. For example, assume that the object 20 is a robot. In this case, the camera 30 installed on the object 20 is, for example, the camera that the robot uses to visually recognize its surroundings (that is, the camera treated as the robot's eye).

<< About the microphone 40 >>
The microphone 40 is any microphone that converts the sound around the object 20 into an electrical signal. The utterance data is a portion of the voice data generated from the electrical signal produced by the microphone 40, specifically a portion presumed to represent a person's utterance. Existing technologies can be used both to convert the electrical signal generated by the microphone 40 into voice data and to cut out a portion representing a person's utterance from the voice data. For example, voice data can be divided into utterance units by removing the data of each period of silence, that is, each period in which the sound pressure remains equal to or less than a predetermined value continuously for a predetermined time or more.
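
The silence-based division described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the method of the embodiment: the function name `split_into_utterances`, the representation of the voice data as a list of per-frame sound-pressure levels, and the parameters `threshold` and `min_silence_frames` are all hypothetical names introduced here.

```python
from typing import List, Tuple


def split_into_utterances(
    levels: List[float], threshold: float, min_silence_frames: int
) -> List[Tuple[int, int]]:
    """Return (start, end) frame index pairs (end exclusive), one per utterance unit.

    A run of frames whose level is <= threshold for at least
    min_silence_frames consecutive frames is treated as silence and removed.
    Shorter quiet runs stay inside the surrounding utterance.
    """
    segments: List[Tuple[int, int]] = []
    start = None    # first frame of the utterance currently being built
    quiet_run = 0   # length of the current run of quiet frames
    for i, level in enumerate(levels):
        if level > threshold:
            if start is None:
                start = i
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= min_silence_frames:
                # Close the utterance at the first frame of the silent run.
                segments.append((start, i - quiet_run + 1))
                start = None
                quiet_run = 0
    if start is not None:
        # Trailing quiet frames that did not reach the silence length are trimmed.
        segments.append((start, len(levels) - quiet_run))
    return segments
```

Each returned index pair marks one candidate piece of utterance data; how the frame levels are computed from the electrical signal is left open here.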

The microphone 40 may be installed on the object 20 or at a location other than the object 20. For example, assume that the object 20 is a robot. In this case, the microphone 40 installed on the object 20 is, for example, the microphone that the robot uses to aurally recognize its surroundings (that is, the microphone treated as the robot's ear).

<Flow of processing>
FIG. 4 is a flowchart illustrating the flow of processing executed by the information processing apparatus 2000 of the first embodiment. The image analysis unit 2020 acquires a captured image (S102). The image analysis unit 2020 analyzes the captured image to estimate the line of sight of the person 10 (S104). The voice determination unit 2040 acquires utterance data (S106). The voice determination unit 2040 uses the estimated line of sight of the person 10 to determine whether the utterance data represents an utterance directed from the person 10 to the object 20 (S108). If the utterance data represents an utterance directed from the person 10 to the object 20 (S108: YES), the voice determination unit 2040 stores the utterance data in the storage unit 50 (S110).

When the utterance data does not represent an utterance directed from the person 10 to the object 20 (S108: NO), the process of FIG. 4 ends. However, the voice determination unit 2040 may also store in the storage unit 50 utterance data that does not represent an utterance directed from the person 10 to the object 20. In this case, utterance data representing an utterance directed from the person 10 to the object 20 and utterance data not representing such an utterance are stored in the storage unit 50 in a mutually distinguishable state. The specific method will be described later.
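
The flow of steps S102 to S110 can be sketched as a single function. This is a hedged outline rather than the apparatus itself: each step is passed in as a callable, and the names `process_utterance`, `acquire_images`, `estimate_gaze`, `acquire_utterance`, `is_directed`, and `store` are placeholders assumed for illustration.

```python
from typing import Callable, List, TypeVar

Image = TypeVar("Image")
Gaze = TypeVar("Gaze")
Utterance = TypeVar("Utterance")


def process_utterance(
    acquire_images: Callable[[], List[Image]],
    estimate_gaze: Callable[[Image], Gaze],
    acquire_utterance: Callable[[], Utterance],
    is_directed: Callable[[Utterance, List[Gaze]], bool],
    store: Callable[[Utterance], None],
) -> bool:
    """Run one pass of the flow; return True if the utterance was stored."""
    images = acquire_images()                             # S102
    gazes = [estimate_gaze(image) for image in images]    # S104
    utterance = acquire_utterance()                       # S106
    if is_directed(utterance, gazes):                     # S108
        store(utterance)                                  # S110
        return True
    return False                                          # S108: NO, the flow ends
```

Passing the steps as callables keeps the sketch independent of any particular camera, microphone, or gaze estimator.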

<Acquisition of captured image: S102>
The image analysis unit 2020 acquires a captured image (S102). The method by which the image analysis unit 2020 acquires a captured image is arbitrary. For example, the image analysis unit 2020 receives a captured image transmitted from the camera 30. Further, for example, the image analysis unit 2020 accesses the camera 30 and acquires a captured image stored in the camera 30.

The camera 30 may store the captured image in a storage device provided outside the camera 30. In this case, the image analysis unit 2020 accesses this storage device to acquire a captured image.

The timing at which the image analysis unit 2020 acquires a captured image is arbitrary. For example, each time a captured image is generated by the camera 30, the image analysis unit 2020 acquires the newly generated captured image. Alternatively, for example, the image analysis unit 2020 may periodically acquire captured images that it has not yet acquired. For example, when the image analysis unit 2020 acquires captured images once per second, it collectively acquires the plurality of captured images generated by the camera 30 during that second (for example, 30 captured images if the frame rate of the moving image data generated by the camera 30 is 30 frames per second (fps)).

The image analysis unit 2020 may acquire all captured images generated by the camera 30, or may acquire only some of them. In the latter case, for example, the image analysis unit 2020 acquires one out of every predetermined number of captured images generated by the camera 30.
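
Acquiring only one captured image out of every predetermined number can be sketched as below. The function name `sample_frames` and the parameter `step` are assumptions made for this illustration, and starting with the first frame of the sequence is an arbitrary choice.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_frames(frames: Sequence[T], step: int) -> List[T]:
    """Keep one captured image out of every `step` images, starting with the first."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return list(frames[::step])
```

For instance, with 30 frames generated in one second and `step` set to 10, three frames would be passed on to gaze estimation.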

<Gaze estimation: S104>
The image analysis unit 2020 estimates the line of sight of the person 10 included in the captured image (S104). Various existing technologies can be used as a technology for estimating the line of sight of a person included in a captured image. For example, the image analysis unit 2020 estimates the line of sight of the person 10 based on the direction of the face of the person 10, the position of the eyeball, and the like included in the captured image. The orientation of the face can be estimated based on the position in the face area of the characteristic part such as the eyes, nose, and mouth, the relative positional relationship between these, and the like.

When a plurality of people are included in the captured image, the image analysis unit 2020 estimates the line of sight for each of the plurality of people.

<Acquisition of utterance data: S106>
The voice determination unit 2040 acquires utterance data (S106). The utterance data is generated by dividing voice data obtained using the microphone 40 into utterance units. Here, the process of dividing the voice data obtained using the microphone 40 into utterance units may be performed by the voice determination unit 2040 or by a device other than the voice determination unit 2040.

For example, it is assumed that the object 20 is a robot and the information processing apparatus 2000 is a server apparatus that remotely controls the robot. In this case, for example, the microphone 40 is attached to the robot. In one approach, the robot transmits the entire voice data obtained using the microphone 40 to the information processing apparatus 2000. The voice determination unit 2040 then receives the voice data transmitted by the robot and divides it into utterance units to obtain one or more pieces of utterance data. In another approach, the robot generates one or more pieces of utterance data by dividing the voice data obtained using the microphone 40 into utterance units, and transmits each piece of utterance data to the information processing apparatus 2000. The voice determination unit 2040 then acquires the one or more pieces of utterance data by receiving them from the robot.

The method of acquiring utterance data is not limited to transmission of voice data from the object 20 to the information processing apparatus 2000. For example, the voice determination unit 2040 may acquire voice data stored inside the object 20 by accessing the object 20, or may acquire voice data stored in a storage unit communicably connected to the object 20. In the latter case, the object 20 stores, in that storage unit, the voice data obtained from the microphone 40 or the utterance data cut out from the voice data.

The timing at which the voice determination unit 2040 acquires utterance data is arbitrary. For example, every time the object 20 newly generates utterance data, the newly generated utterance data is transmitted to the information processing apparatus 2000. In this case, the voice determination unit 2040 acquires the utterance data at the timing when it is newly generated. Alternatively, for example, the information processing apparatus 2000 may periodically access the object 20, or a storage unit communicably connected to the object 20, to acquire data that it has not yet acquired.

Here, it is preferable that the information processing apparatus 2000 can identify the speaker of the utterance represented by the utterance data. There are various methods of doing so. For example, the speaker can be identified based on the mouth movements of the persons included in the captured images during the period in which the utterance is performed. Specifically, the information processing apparatus 2000 performs image analysis on each captured image generated in the period in which the utterance represented by the utterance data is performed, thereby finding a person who moves his or her mouth in that period, and identifies that person as the speaker. When a plurality of persons move their mouths during this period, the information processing apparatus 2000 identifies, for example, the person whose mouth moves for the longest time during the period as the speaker. Other existing techniques can also be used to identify the speaker of an utterance contained in voice data.
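
Selecting the person whose mouth moves longest during the utterance period can be sketched as follows. The sketch assumes that image analysis has already produced, for each person, a list of time intervals in which the mouth was moving; that input format and the names `overlap` and `identify_speaker` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def overlap(a: Interval, b: Interval) -> float:
    """Length of the overlap between two intervals (0.0 if they are disjoint)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def identify_speaker(
    utterance: Interval, mouth_motion: Dict[str, List[Interval]]
) -> Optional[str]:
    """Return the person whose mouth moved longest during the utterance period.

    Returns None when nobody moved their mouth in that period.
    """
    best_person: Optional[str] = None
    best_time = 0.0
    for person, intervals in mouth_motion.items():
        moving = sum(overlap(utterance, interval) for interval in intervals)
        if moving > best_time:
            best_person, best_time = person, moving
    return best_person
```

Ties are resolved in favor of the person listed first, which is an arbitrary choice of this sketch.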

In addition, for example, characteristics of each of a plurality of speaker candidates may be determined in advance, and this information may be used to identify the speaker. For example, when the object 20 is a robot used at home, information that associates the identification information (for example, an identification number) of each person who lives in the house or often visits the house with words that are highly likely to be included in that person's utterances is determined in advance (stored in a storage unit).

For example, it is assumed that three people, a father A, a mother B, and a daughter C, live in a certain household. In this case, the identification information of the father A is associated with words frequently uttered by the father A (for example, "golf" or "stock price"), the identification information of the mother B is associated with words frequently uttered by the mother B (for example, "shopping" or "cleaning"), and the identification information of the daughter C is associated with words frequently uttered by the daughter C (for example, "college" or "part-time job").

When a plurality of persons move their mouths during the period in which an utterance is performed, the information processing apparatus 2000 determines whether a word included in the utterance is associated with the identification information of each of those persons. For example, in the household example above, assume that image analysis of the captured images generated during the period in which the utterance is performed shows that the father A and the mother B are moving their mouths. In this case, the information processing apparatus 2000 determines whether a word included in the utterance is associated with the identification information of the father A or the mother B. For example, if the utterance includes the word "stock price", the utterance includes a word associated with the identification information of the father A. Therefore, the information processing apparatus 2000 identifies the father A as the speaker. In this way, using words associated in advance with each person makes it possible to identify the speaker with higher accuracy.

Note that the information processing apparatus 2000 does not necessarily have to uniquely identify the speaker. For example, the information processing apparatus 2000 calculates the likelihood that the person is a speaker for each person who is a candidate for the speaker (for example, for each member of a family). For example, if the likelihood calculated for the father is the largest, it means that "the speaker is probably the father but may be another person". For example, the information processing apparatus 2000 calculates the likelihood that each person is a speaker based on the degree of coincidence between the words included in the above-described utterance and the words associated with the identification information of each person.
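
A likelihood based on word matching can be sketched as below. The text does not specify how the degree of coincidence is computed, so this sketch assumes one simple choice: count the distinct utterance words associated with each person and normalize the counts so that they sum to 1. The name `speaker_likelihoods` and the profile dictionary are illustrative only.

```python
from typing import Dict, Iterable, Set


def speaker_likelihoods(
    utterance_words: Iterable[str], profiles: Dict[str, Set[str]]
) -> Dict[str, float]:
    """Score each candidate speaker by matched words, normalized to sum to 1."""
    if not profiles:
        return {}
    words = set(utterance_words)
    matches = {person: len(words & vocab) for person, vocab in profiles.items()}
    total = sum(matches.values())
    if total == 0:
        # No associated word appears: fall back to equal likelihoods (an assumption).
        return {person: 1.0 / len(profiles) for person in profiles}
    return {person: count / total for person, count in matches.items()}


# Word associations taken from the household example in the text.
household = {
    "father A": {"golf", "stock price"},
    "mother B": {"shopping", "cleaning"},
    "daughter C": {"college", "part-time job"},
}
```

With these profiles, an utterance containing only "golf" gives the father A the largest likelihood, while an utterance containing both "golf" and "shopping" splits the likelihood between the father A and the mother B.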

<Determination by voice determination unit 2040: S108>
The voice determination unit 2040 determines whether the utterance data represents an utterance directed from the person 10 to the object 20 (S108). Whether a given utterance is directed from the person 10 to the object 20 is determined based on the line of sight of the person 10 in the period in which the utterance is performed. When two or more persons 10 are included in a captured image, the determination is based on the line of sight of the speaker of the utterance represented by the utterance data.

For example, when the period in which the utterance represented by the utterance data is performed includes a time point at which the line of sight of the person 10 is directed to the object 20, the voice determination unit 2040 determines that the utterance represented by the utterance data is directed from the person 10 to the object 20 (S108: YES). On the other hand, when that period includes no time point at which the line of sight of the person 10 is directed to the object 20 (that is, the line of sight of the person 10 is never directed to the object 20 during the period), the voice determination unit 2040 determines that the utterance represented by the utterance data is not directed from the person 10 to the object 20 (S108: NO).
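
This time-point criterion can be sketched as follows, assuming that gaze estimation yields the timestamps of the captured images in which the line of sight was found to be directed to the object 20. The name `gaze_met_object_during` and the timestamp representation are assumptions of this sketch.

```python
from typing import Iterable, Tuple


def gaze_met_object_during(
    utterance: Tuple[float, float], gaze_on_object_times: Iterable[float]
) -> bool:
    """True if any gaze-on-object timestamp falls within the utterance period."""
    start, end = utterance
    return any(start <= t <= end for t in gaze_on_object_times)
```

A result of True corresponds to S108: YES, and False to S108: NO.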

Alternatively, for example, the voice determination unit 2040 takes into account the length of time for which the line of sight of the person 10 is directed to the object 20 during the period in which the utterance represented by the utterance data is performed, and determines whether the utterance is directed from the person 10 to the object 20. For example, the voice determination unit 2040 calculates the ratio of the length of the period in which the line of sight of the person 10 is directed to the object 20 to the length of the period in which the utterance represented by the utterance data is performed. Then, the voice determination unit 2040 determines that the utterance represented by the utterance data is directed from the person 10 to the object 20 when the ratio is equal to or greater than a predetermined value (S108: YES). On the other hand, when the ratio is less than the predetermined value, the voice determination unit 2040 determines that the utterance represented by the utterance data is not directed from the person 10 to the object 20 (S108: NO).

FIG. 5 is a diagram illustrating the relationship between the length of the period during which the utterance represented by the utterance data is performed and the length of the period during which the line of sight of the person 10 is directed to the object 20. In FIG. 5, the period during which the utterance represented by the utterance data is performed is from time t1 to time t6, and the length of this period is p1. During this period, the line of sight of the person 10 is directed to the object 20 from time t2 to time t3 and from time t4 to time t5, and the lengths of these periods are p2 and p3, respectively. Accordingly, the ratio r of the length of the period in which the line of sight of the person 10 is directed to the object 20 to the length of the period in which the utterance represented by the utterance data is performed is (p2 + p3) / p1. The voice determination unit 2040 determines whether the ratio r is equal to or greater than a predetermined value.
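
The ratio r = (p2 + p3) / p1 can be computed as in the sketch below. The function names `gaze_ratio` and `is_directed`, the interval representation, and the parameter `min_ratio` (standing in for the predetermined value) are assumptions; the gaze periods are assumed not to overlap one another.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def gaze_ratio(utterance: Interval, gaze_periods: List[Interval]) -> float:
    """Ratio r of gaze-on-object time to the length of the utterance period."""
    start, end = utterance
    if end <= start:
        raise ValueError("utterance period must have positive length")
    on_object = 0.0
    for g_start, g_end in gaze_periods:
        # Clip each gaze period to the utterance period before summing.
        on_object += max(0.0, min(end, g_end) - max(start, g_start))
    return on_object / (end - start)


def is_directed(
    utterance: Interval, gaze_periods: List[Interval], min_ratio: float
) -> bool:
    """S108: YES when the ratio is equal to or greater than the predetermined value."""
    return gaze_ratio(utterance, gaze_periods) >= min_ratio
```

For example, with t1 = 0, t2 = 1, t3 = 3, t4 = 4, t5 = 5, and t6 = 10, the lengths are p1 = 10, p2 = 2, and p3 = 1, so r is 3 / 10.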

<< Method to determine whether the line of sight of the person 10 is directed to the object 20 >>
There are various methods of determining whether the line of sight of the person 10 is directed to the object 20. For example, the voice determination unit 2040 uses the line-of-sight direction of the person 10 estimated by the image analysis unit 2020 and the start point of the line of sight of the person 10 in the captured image (for example, the center of the pupil) to determine whether or not the line of sight of the person 10 intersects the object 20. When the line of sight of the person 10 intersects the object 20, the voice determination unit 2040 determines that the line of sight of the person 10 is directed to the object 20. On the other hand, when the line of sight of the person 10 does not intersect the object 20, the voice determination unit 2040 determines that the line of sight of the person 10 is not directed to the object 20.

Here, the voice determination unit 2040 may determine whether or not the line of sight intersects the object 20 for only one of the eyes of the person 10, or for both eyes. In the latter case, the voice determination unit 2040 may determine that the line of sight of the person 10 is directed to the object 20 only when the lines of sight of both eyes intersect the object 20, or may determine so when the line of sight of at least one eye intersects the object 20.

In addition, the image analysis unit 2020 may specify a single line of sight of the person 10 based on the lines of sight of both eyes of the person 10. For example, the image analysis unit 2020 sets the midpoint between the center of the pupil of the left eye of the person 10 and the center of the pupil of the right eye of the person 10 as the start point of the line of sight of the person 10. Further, the image analysis unit 2020 sets the vector obtained by adding the vector representing the line-of-sight direction of the left eye of the person 10 and the vector representing the line-of-sight direction of the right eye of the person 10 as the line-of-sight direction of the person 10. Then, the voice determination unit 2040 determines whether the line of sight of the person 10 is directed to the object 20 by determining whether the single line of sight specified in this manner intersects the object 20.
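
The combination of the two lines of sight into one can be sketched as below. The function name and the use of coordinate tuples are assumptions made for illustration; the midpoint and the vector sum follow the description above.

```python
def combine_gaze(left_origin, left_dir, right_origin, right_dir):
    """Merge the lines of sight of both eyes into a single line of sight.

    Start point: midpoint between the two eye centers.
    Direction: sum of the two line-of-sight direction vectors.
    """
    origin = tuple((l + r) / 2 for l, r in zip(left_origin, right_origin))
    direction = tuple(l + r for l, r in zip(left_dir, right_dir))
    return origin, direction
```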

Note that, instead of determining whether the line of sight of the person 10 intersects the object 20, the voice determination unit 2040 may determine whether the line of sight of the person 10 intersects a range of a predetermined size that includes the object 20. This is because, when a person talks while looking at an object, the line of sight does not necessarily intersect the object itself and may instead fall on the surroundings of the object. The predetermined range is, for example, a range obtained by enlarging the size of the object 20 at a predetermined rate (for example, 10%).
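
One possible way to carry out the intersection determination, including the enlarged range, is sketched below. The sketch assumes that the object 20 is approximated by an axis-aligned rectangle in image coordinates and that the line of sight is a two-dimensional ray; both are assumptions made here, as are the function names. The enlargement is interpreted as scaling the width and height about the center of the rectangle.

```python
def enlarge_box(box, rate=0.10):
    """Enlarge (x_min, y_min, x_max, y_max) about its center by the given rate."""
    x_min, y_min, x_max, y_max = box
    dx = (x_max - x_min) * rate / 2
    dy = (y_max - y_min) * rate / 2
    return (x_min - dx, y_min - dy, x_max + dx, y_max + dy)


def ray_hits_box(origin, direction, box):
    """Slab test: does the ray origin + t * direction (t >= 0) cross the box?"""
    x_min, y_min, x_max, y_max = box
    t_near, t_far = 0.0, float("inf")
    axes = ((origin[0], direction[0], x_min, x_max),
            (origin[1], direction[1], y_min, y_max))
    for o, d, lo, hi in axes:
        if d == 0:
            # Ray is parallel to this axis: it must already lie within the slab.
            if o < lo or o > hi:
                return False
            continue
        t1, t2 = (lo - o) / d, (hi - o) / d
        if t1 > t2:
            t1, t2 = t2, t1
        t_near = max(t_near, t1)
        t_far = min(t_far, t2)
        if t_near > t_far:
            return False
    return True
```

For a rectangle (10, 10, 20, 20), a horizontal ray starting at (0, 9.8) misses the rectangle itself but crosses the rectangle enlarged by 10%.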

<Process of storing speech data: S110>
If the speech data represents an utterance directed from the person 10 to the object 20 (S108: YES), the voice determination unit 2040 stores the speech data in the storage unit 50 (S110). Here, as described above, the storage unit 50 may store not only speech data representing an utterance directed from the person 10 to the object 20 but also speech data representing an utterance that is not directed from the person 10 to the object 20. In this case, the storage unit 50 stores the speech data in association with information indicating whether or not the utterance is directed from the person 10 to the object 20.

FIG. 6 is a diagram illustrating information stored in the storage unit 50 in the form of a table. The table of FIG. 6 is referred to as a table 200. The table 200 has three columns: speech data 202, an identification flag 204, and a speaker 206. The speech data 202 indicates speech data. The identification flag 204 indicates whether the associated speech data 202 is directed from the person 10 to the object 20 or not. In the table 200, a record in which the identification flag 204 indicates “Y” indicates that the utterance data 202 represents an utterance directed from the person 10 to the object 20. On the other hand, a record in which the identification flag 204 indicates “N” indicates that the utterance data 202 represents an utterance that is not directed from the person 10 to the object 20.
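
A record of the table 200 could be modeled, for example, as follows. The class and field names are hypothetical and chosen only to mirror the three columns described above; the file names and speaker identifiers in the usage are likewise illustrative.

```python
from dataclasses import dataclass


@dataclass
class UtteranceRecord:
    speech_data: str          # e.g. a file name of the speech data (column 202)
    directed_to_object: bool  # identification flag (column 204)
    speaker: str              # identification information of the speaker (column 206)

    @property
    def flag(self):
        # "Y": directed from the person to the object, "N": not directed.
        return "Y" if self.directed_to_object else "N"


def directed_only(records):
    """Return only the records whose utterance is directed to the object."""
    return [r for r in records if r.directed_to_object]
```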

The speaker 206 indicates identification information for identifying the person 10 who has made the utterance represented by the utterance data 202. For example, the identification information is a feature amount of the face of the speaker obtained from the captured image. The method of identifying the speaker is as described above. Alternatively, for example, the identification information may be an identifier (such as a unique number) assigned to the person 10. In this case, for example, the information processing apparatus 2000 repeatedly detects the faces of persons in the captured image and, when a new person is detected from the captured image, stores a new identifier in the storage unit in association with the feature amount of the face of that person. Then, the voice determination unit 2040 sets, in the speaker 206, the identifier associated with the feature amount of the face of the person 10 who has made the utterance represented by the utterance data 202.

Second Embodiment
FIG. 7 is a block diagram illustrating the functional configuration of the information processing apparatus 2000 of the second embodiment. The information processing apparatus 2000 of the second embodiment has the same function as the information processing apparatus 2000 of the first embodiment except for the points described below.

The information processing apparatus 2000 of the second embodiment has a feature data generation unit 2060. The feature data generation unit 2060 generates feature data representing the feature of the person 10 using the speech data representing the utterance directed from the person 10 to the target 20. The generated feature data is stored in the storage unit. The storage unit may be the storage unit 50, or may be a storage unit other than the storage unit 50.

For example, the utterances of a person may include various kinds of information, such as information on the company or school to which the person belongs, information on the person's relationships with others, information on things in which the person is interested, information on the person's schedule, and information on the person's character. Therefore, for example, the feature data generation unit 2060 generates the feature data of the person 10 by extracting keywords representing the various kinds of information described above from the speech data that the voice determination unit 2040 has determined to represent an utterance directed from the person 10 to the object 20. Note that existing technology can be used to extract keywords from an utterance.

For example, the feature data is the set of all keywords extracted from the speech data. Alternatively, for example, the feature data is the set of keywords having high importance among the keywords extracted from the speech data. For example, the importance of a keyword is represented by the frequency with which the keyword is included in the speech data. That is, the more frequently a keyword is included in the utterances, the higher its importance.
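
Frequency-based importance can be sketched as below, assuming that keyword extraction has already been performed by some existing technology and that each utterance is given as a list of its keywords. The function names and the cut-off `min_count` are assumptions introduced here.

```python
from collections import Counter


def keyword_importance(utterance_keywords):
    """Count how often each keyword appears across the utterances."""
    counts = Counter()
    for keywords in utterance_keywords:
        counts.update(keywords)
    return counts


def high_importance_keywords(utterance_keywords, min_count=2):
    # min_count is a hypothetical cut-off for "high importance".
    counts = keyword_importance(utterance_keywords)
    return {k for k, c in counts.items() if c >= min_count}
```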

In addition, it is preferable that each extracted keyword be further associated with an attribute of the keyword. For example, in the case of a keyword related to a schedule, the attribute "schedule" is associated with the keyword. In addition, for example, in the case of a keyword related to an interest (for example, the name of a product in which the person is interested), the attribute "interest" is associated with the keyword. A plurality of attributes may be associated with one keyword. Here, existing technology can be used to specify the attribute of a keyword from the utterance.

FIG. 8 is a diagram illustrating feature data in the form of a table. The table of FIG. 8 is called a table 300. The table 300 has three columns: keyword 302, attribute 304, and importance 306. Here, a table 300 representing feature data of a person is associated with identification information of the person. For example, FIG. 8 shows feature data of a person specified by “ID 0001”.

Here, the feature data generation unit 2060 may generate the feature data of the person 10 by using not only utterance data representing an utterance directed from the person 10 to the object 20 but also utterance data not representing an utterance directed from the person 10 to the object 20. In this case, the feature data generation unit 2060 generates the feature data while distinguishing between keywords extracted from utterance data representing an utterance directed from the person 10 to the object 20 and keywords extracted from utterance data not representing such an utterance.

For example, the feature data generation unit 2060 uses keywords extracted from utterance data representing an utterance directed from the person 10 to the object 20 in preference to keywords extracted from utterance data not representing such an utterance. For example, when a keyword A extracted from utterance data representing an utterance directed from the person 10 to the object 20 contradicts a keyword B extracted from utterance data not representing such an utterance, the feature data generation unit 2060 includes the keyword A in the feature data and does not include the keyword B in the feature data.

In addition, for example, the feature data generation unit 2060 calculates the above-mentioned importance by giving a larger weight to keywords extracted from utterance data representing an utterance directed from the person 10 to the object 20 than to keywords extracted from utterance data not representing such an utterance. For example, there is a method of calculating the importance of a keyword as an integrated value of frequency and weight.
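
The weighting can be sketched as follows. The sketch interprets the integrated value as the sum of frequency multiplied by weight over the two kinds of utterance data; this interpretation, the function name, and the weight values 2.0 and 1.0 are assumptions, not values given in the description.

```python
def weighted_importance(directed_counts, other_counts, w_directed=2.0, w_other=1.0):
    """Importance = w_directed * (frequency in utterances directed to the object)
                  + w_other    * (frequency in other utterances)."""
    keywords = set(directed_counts) | set(other_counts)
    return {
        k: w_directed * directed_counts.get(k, 0) + w_other * other_counts.get(k, 0)
        for k in keywords
    }
```

With these hypothetical weights, a keyword appearing three times in utterances directed to the object and once elsewhere receives importance 7.0, while a keyword appearing twice only in other utterances receives 2.0.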

Here, for an utterance other than an utterance directed from the person 10 to the object 20, the voice determination unit 2040 may identify the party to whom the utterance is directed (hereinafter referred to as the speaking partner) using the line of sight of the person 10. For example, the information processing apparatus 2000 grasps the positions of the persons present around the object 20 by causing the camera 30 to capture the surroundings of the object 20 while changing the imaging range of the camera 30 at an arbitrary timing (for example, periodically). Then, based on the positional relationship of the persons present around the object 20 and the line of sight of the person 10, the voice determination unit 2040 identifies the person to whom the line of sight of the person 10 is directed when the person 10 speaks, and identifies that person as the speaking partner. The voice determination unit 2040 stores the speech data representing the utterance in the storage unit 50 in association with the identification information of the speaking partner.
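
Identification of the speaking partner from positions and the line of sight might look like the following sketch. It assumes two-dimensional positions on a common plane and selects the person whose bearing from the speaker is closest to the line-of-sight direction within a tolerance; the function name, the data layout, and the 15 degree tolerance are all assumptions made for illustration.

```python
import math


def find_speaking_partner(speaker_pos, gaze_dir, others, max_angle_deg=15.0):
    """Return the id of the person the gaze points at, or None if nobody is close enough.

    others maps a person id to that person's (x, y) position.
    """
    gaze_angle = math.atan2(gaze_dir[1], gaze_dir[0])
    best_id, best_diff = None, math.radians(max_angle_deg)
    for person_id, (x, y) in others.items():
        angle = math.atan2(y - speaker_pos[1], x - speaker_pos[0])
        # Wrap the angular difference into [0, pi].
        diff = abs(math.atan2(math.sin(angle - gaze_angle), math.cos(angle - gaze_angle)))
        if diff <= best_diff:
            best_id, best_diff = person_id, diff
    return best_id
```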

FIG. 9 is a diagram illustrating speech data stored after a speaking partner other than the object 20 has been identified. The table 200 of FIG. 9 differs from the table 200 of FIG. 6 in that it includes identification information 208 instead of the identification flag 204. The identification information 208 indicates the identification information of the speaking partner. Here, in the example of FIG. 9, the utterance data associated with the identification information "target" represents an utterance directed to the object 20. The utterance data associated with other identification information represents an utterance directed to the person specified by that identification information. For example, the second record of the table 200 of FIG. 9 indicates that the utterance represented by the voice data file 002.wav is directed from the person specified by the ID 002 to the person specified by the ID 001.

When the speaking partner is a person, identifying the speaking partner makes it possible to generate more detailed feature data of the person 10. Specifically, when a certain keyword is included in the feature data of the person 10, the speaking partner of the utterances related to that keyword is also included in the feature data as a related person for the keyword.

For example, it is assumed that the feature data generation unit 2060 estimates from the utterances of a person A that the person A is going to travel. In this case, the feature data generation unit 2060 includes the information "keyword: travel, attribute: schedule" in the feature data of the person A. Furthermore, the feature data generation unit 2060 estimates from the utterances of the person A whether the person A travels alone or with other people (that is, whether there is another person related to the travel), and also includes the estimation result in the feature data of the person A.

For example, when it is detected that the person A frequently makes utterances related to travel (for example, an utterance such as "Where are we going next?") to a person B, the feature data generation unit 2060 determines that the person A is likely to go on the trip with the person B. Therefore, the feature data generation unit 2060 includes the person B as a related person in the feature data on the travel of the person A.

Similarly, for example, it is assumed that the feature data generation unit 2060 estimates from the utterances of the person A that the person A is interested in a product X. In this case, the feature data generation unit 2060 includes the information "keyword: product X, attribute: interest" in the feature data of the person A. Furthermore, the feature data generation unit 2060 estimates from the utterances of the person A whether the person A uses the product X alone or jointly with another person (that is, whether there is another person related to the product X), and also includes the estimation result in the feature data of the person A. For example, when it is estimated that the person A is highly likely to use the product X together with a person C, the feature data generation unit 2060 includes the person C as a related person in the feature data related to the product X of the person A.
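
A frequency-based estimate of the related person could be sketched as follows. The record layout (speaker, speaking partner, keywords), the label "target" for the object 20, and the threshold `min_count` are assumptions; the description only states that utterances made "at a high frequency" indicate a related person.

```python
from collections import Counter


def related_persons(records, speaker, keyword, min_count=3):
    """Return partners to whom `speaker` often directs utterances containing `keyword`.

    records is an iterable of (speaker_id, partner_id, keywords) tuples.
    """
    counts = Counter(
        partner
        for spk, partner, keywords in records
        if spk == speaker and partner != "target" and keyword in keywords
    )
    return sorted(p for p, c in counts.items() if c >= min_count)
```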

<Example of hardware configuration>
The hardware configuration of a computer that implements the information processing apparatus 2000 of the second embodiment is represented, for example, by FIG. 3 as in the first embodiment. However, in the storage device 1080 of the computer 1000 for realizing the information processing apparatus 2000 of the present embodiment, a program module for realizing the function of the information processing apparatus 2000 of the present embodiment is further stored.

<Operation and effect>
According to the information processing apparatus 2000 of the present embodiment, the feature data of the person 10 is generated from the utterances of the person 10 and the line of sight of the person 10 at the time the utterances are made. This makes it possible to grasp in detail the features of the person 10, such as the schedule of the person 10 and the things in which the person 10 is interested. In particular, with the method of including persons related to a feature in the feature data of the person 10, the features of the person 10 can be grasped in even more detail. Grasping the features of the person 10 in detail in this manner has the advantage that, for example, when a robot operates based on the utterances of the person 10 as described later, the services and the like provided by the robot can be finely personalized according to the features of the person 10. In other words, it is possible to provide services tailored to the characteristics of each person.

Third Embodiment
FIG. 10 is a block diagram illustrating the functional configuration of the information processing apparatus 2000 of the third embodiment. The information processing apparatus 2000 of the third embodiment has the same function as the information processing apparatus 2000 of the first or second embodiment except for the points described below.

The information processing apparatus 2000 of the third embodiment has a process determination unit 2080. The process determining unit 2080 determines the process to be performed based on the content of the utterance data directed from the person 10 to the target 20. For example, it is assumed that the object 20 is a device (a robot or the like) that operates in response to an utterance from the person 10 to the object 20. In this case, the information processing apparatus 2000 determines the operation of the object 20 based on the content of the utterance data directed from the person 10 to the object 20 and controls the operation of the object 20.

It is assumed that the content of the speech of the person 10 represents some kind of request. For example, it is assumed that the content of the utterance of the person 10 is a predetermined voice command for operating the object 20. In this case, the information processing device 2000 causes the object 20 to operate in accordance with the voice command included in the utterance data representing the utterance directed from the person 10 to the object 20.

Here, because the information processing apparatus 2000 controls the operation of the object 20, the object 20 does not operate in response to the voice commands included in all speech data, but operates only in response to voice commands included in utterances directed from the person 10 to the object 20. By doing this, it is possible to operate the object 20 only when the person 10 issues a voice command to the object 20. Thus, for example, when words spoken by the person 10 to another person happen to include the same words as a voice command, the object 20 can be prevented from operating erroneously.

Here, the utterance representing any request is not limited to a predetermined voice command. For example, it is assumed that the object 20 has a function of interpreting the content of human speech and performing an operation according to the content. Specifically, in response to a request “take a cup on the table”, an operation of taking a cup on the table and giving it to a speaker may be considered.

Here, a request issued by the person 10 may be addressed to the object 20 or to another person nearby. In such a case, if the object 20 is operated in response to all requests issued by the person 10, the object 20 erroneously responds to requests that were not addressed to it. In this respect, by controlling the operation of the object 20 using the information processing apparatus 2000, the object 20 can be made to respond to requests addressed to the object 20 and not to respond to requests addressed to something other than the object 20 (for example, another person). Thus, it is possible to prevent the object 20 from erroneously responding to a request addressed to something other than the object 20.

In addition, a command or the like for operating the object 20 may be defined not only by the content of the utterance but also in combination with a motion of the person. That is, it may be detected that the person 10 has made a specific utterance toward the object 20 and has also performed a specific motion (for example, a wink), and the object 20 may perform an operation corresponding to the combination of these. By doing this, it is possible to prevent malfunction of the object 20 while allowing the object 20 to be used with a simple operation. Note that the motion of a person can be detected by image analysis of a captured image generated by the camera 30.

The object 20 may have a function of replying according to the content of the utterance of the person 10. In this case, the information processing apparatus 2000 determines whether to make the object 20 reply to the utterance of the person 10. Specifically, when it is determined that the utterance data represents an utterance directed from the person 10 to the object 20, the process determination unit 2080 determines to cause the object 20 to reply using the content of the utterance data. On the other hand, when it is determined that the utterance data does not represent an utterance directed from the person 10 to the object 20, the process determination unit 2080 determines not to make the object 20 reply. By doing this, it is possible to prevent the object 20 from erroneously replying to an utterance that the person 10 has directed to another person instead of the object 20.
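
The gating performed by the process determination unit 2080 can be sketched as follows. The function name, the command table, and the return values are hypothetical; the sketch only illustrates that commands and replies are considered solely for utterances determined to be directed to the object 20.

```python
def decide_process(utterance_text, directed_to_object, commands):
    """Return the process to perform, or None when the utterance must be ignored.

    commands maps a predetermined voice command to a (hypothetical) operation name.
    """
    if not directed_to_object:
        # Utterances directed elsewhere never trigger an operation or a reply.
        return None
    normalized = utterance_text.strip().lower()
    # A predetermined voice command takes priority; otherwise reply to the utterance.
    return commands.get(normalized, "reply")
```

For example, with a command table containing "stop", the word "stop" spoken to another person yields None, whereas the same word directed to the object yields the associated operation.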

Here, it is preferable that the operation of the object 20 in response to the utterance of the person 10 be determined using the feature data described in the second embodiment in addition to the content of the utterance. For example, when information is searched for in response to the utterance of the person 10 and the search result is presented to the person 10, it is preferable that the information processing apparatus 2000 narrow down the search result using the feature data of the person 10 before presenting it. The information processing apparatus 2000 may further use schedule data of the person 10 in addition to the feature data of the person 10. Here, various existing technologies can be used to determine the content of an operation (the content of a reply, etc.) in response to the utterance of the person 10 using data representing the features of the person, schedule data, and the like. When feature data is used, the information processing apparatus 2000 of the third embodiment includes the feature data generation unit 2060 described in the second embodiment.

For example, it is assumed that the utterance of the person A asks the object 20 to search for candidate hotels to stay at during a trip (for example, "search for a hotel to stay at on next month's trip"). In this case, for example, the information processing apparatus 2000 specifies the travel dates, the destination, and the like by referring to the feature data and schedule data of the person A, and searches for available hotels based on the specified dates and destination. Furthermore, the information processing apparatus 2000 refers to the things in which the person A is interested, as indicated in the feature data of the person A, and preferentially presents, as search results, hotels having a high degree of association with those things. For example, when "hot spring" is included in the things in which the person A is interested, the information processing apparatus 2000 preferentially presents hotels that have a hot spring facility or that have a hot spring facility nearby.

Here, as described above, it is assumed that the feature data indicates a related person (for example, a person who travels together) associated with a keyword. In this case, the operation of the object 20 is preferably determined in consideration of the related person as well. For example, in the above-described example of searching for a hotel, the information processing apparatus 2000 grasps that the person A is going on the trip with the person B by referring to the feature data of the person A. Then, the information processing apparatus 2000 searches for hotels that have a vacant room in which two people can stay. Further, the information processing apparatus 2000 refers to the things in which the person B is interested, as indicated in the feature data of the person B, and searches for hotels in consideration of the interests of the person B. For example, when "seafood" is included in the things in which the person B is interested, the information processing apparatus 2000 presents, as the search result, hotels having a high degree of association with both "hot spring" and "seafood" (for example, hotels close to both a hot spring facility and a seafood restaurant).
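
The hotel example can be sketched as a filter followed by a ranking. The hotel records, the field names `capacity` and `tags`, and the scoring by the number of matching interests are assumptions introduced purely for illustration; the description does not specify how the degree of association is computed.

```python
def rank_hotels(hotels, interests, guests):
    """Keep hotels that can accommodate all guests, ordered by matching interests."""
    candidates = [h for h in hotels if h["capacity"] >= guests]
    # More shared tags means a higher degree of association; sorted() is stable,
    # so hotels with equal scores keep their original order.
    return sorted(candidates, key=lambda h: len(interests & h["tags"]), reverse=True)
```

Here `interests` would be the union of the interests of the person A and the related person B, and `guests` would be 2.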

Similarly, for example, it is assumed that the speech of the person A is a speech for requesting the object 20 to purchase a product. In this case, it is assumed that the feature data of the person A indicates that the person A jointly uses the product with the person C. In this case, it is preferable that the information processing apparatus 2000 presents products suitable for both the person A and the person C as candidates by referring to the feature data of the person A and the person C.

<Example of hardware configuration>
The hardware configuration of a computer for realizing the information processing apparatus 2000 of the third embodiment is represented, for example, by FIG. 3 as in the first embodiment. However, in the storage device 1080 of the computer 1000 for realizing the information processing apparatus 2000 of the present embodiment, a program module for realizing the function of the information processing apparatus 2000 of the present embodiment is further stored.

Although the embodiments of the present invention have been described above with reference to the drawings, these are examples of the present invention, and configurations combining the above embodiments or various configurations other than the above can also be adopted.

This application claims priority based on Japanese Patent Application No. 2017-162058 filed on Aug. 25, 2017, the entire disclosure of which is incorporated herein.

Claims

An information processing apparatus comprising:
image analysis means for estimating the line of sight of a person included in a captured image; and
voice determination means for determining, using the estimated line of sight, whether voice data represents an utterance directed from the person to an object.
The information processing apparatus according to claim 1, wherein the voice determination unit causes the storage unit to store the voice data determined as an utterance directed from the person to the target.
The image analysis means estimates the line of sight for each person when a plurality of people are included in the captured image;
The information processing apparatus according to claim 1, wherein the voice determination unit performs the determination on each voice data representing an utterance performed by each of a plurality of persons.
The information processing apparatus according to claim 3, wherein the voice determination means stores, in storage means, the voice data determined to be an utterance directed from the person to the object in association with identification information of the person.
The information processing apparatus according to any one of claims 1 to 4, wherein the voice determination means determines that the utterance is directed from the person to the object when the time during which the estimated line of sight is directed to the object is equal to or greater than a predetermined percentage of the time during which the person is uttering.
The information processing apparatus according to any one of claims 1 to 5, wherein the voice determination means stores, in storage means, the voice data not representing an utterance directed from the person to the object in a manner distinguishable from the voice data representing an utterance directed from the person to the object.
The information processing apparatus according to any one of claims 1 to 6, further comprising feature data generation means for generating feature data representing a feature of the person using the content of the voice data representing an utterance directed from the person to the object.
7. The apparatus according to claim 1, further comprising feature data generation means for generating feature data representing a feature of the person based on the content of the voice data representing the utterance of the person and the other party to whom the utterance is directed. The information processing apparatus according to any one of the above.
9. The apparatus according to claim 1, further comprising process determination means for determining a process to be executed by the information processing apparatus or the object based on the content of the voice data representing an utterance directed to the object from the person. Information processor as described.
The object is a device that operates in response to the utterance,
10. The information processing apparatus according to claim 9, wherein the process determining means determines an operation of the object based on contents of the voice data representing an utterance directed to the object from the person.
The process determining means generates a response to an utterance directed from the person to the object based on contents of voice data representing an utterance directed to the object from the person, and the generated response is The information processing apparatus according to claim 9, wherein the output is made to an object.
The information processing apparatus according to claim 10 or 11, further comprising feature data generation means for generating feature data representing a feature of the person using the content of the voice data representing an utterance directed from the person to the object,
wherein the process determination means determines the operation of the object using the content of the voice data representing an utterance directed from the person to the object and the feature data of the person.
The voice discrimination means stores the voice data not representing an utterance directed to the object from the person in a manner distinguishable from the voice data representing an utterance directed to the object from the person. Stored in the means,
The feature data generation means uses the content of the voice data representing the utterance directed to the object from the person and the content of the voice data not representing the utterance directed to the object from the person The information processing apparatus according to claim 7, wherein the feature data of the person is generated.
And a feature data generation unit configured to generate feature data representing a feature of the person based on the content of the voice data representing the utterance of the person and the other party to which the utterance is directed;
The information processing apparatus according to claim 10, wherein the process determining unit determines an operation of the object using the content of the voice data representing the speech of the person and the feature data of the person.
The information processing apparatus according to any one of claims 1 to 14, wherein the object is an interactive robot.
A control method executed by a computer, comprising:
an image analysis step of estimating the line of sight of a person included in a captured image; and
a voice determination step of determining, using the estimated line of sight, whether voice data represents an utterance directed from the person to an object.
The control method according to claim 16, wherein the voice data determined as the speech directed from the person to the target is stored in the storage means in the voice determination step.
In the image analysis step, when a plurality of people are included in the captured image, the line of sight is estimated for each person;
The control method according to claim 16 or 17, wherein in the voice determination step, the determination is performed on each voice data representing an utterance made by each of a plurality of persons.
The control method according to claim 18, wherein, in the voice determination step, the voice data determined to be an utterance directed from the person to the object is stored in storage means in association with identification information of the person.
The control method according to any one of claims 16 to 19, wherein, in the voice determination step, it is determined that the utterance is directed from the person to the object when the time during which the estimated line of sight is directed to the object is equal to or greater than a predetermined percentage of the time during which the person is uttering.
The control method according to any one of claims 16 to 20, wherein, in the voice determination step, the voice data not representing an utterance directed from the person to the object is stored in storage means in a manner distinguishable from the voice data representing an utterance directed from the person to the object.
The control method according to any one of claims 16 to 21, further comprising a feature data generation step of generating feature data representing a feature of the person using the content of the voice data representing an utterance directed from the person to the object.
22. A feature data generation step of generating feature data representing a feature of a person based on the content of voice data representing the utterance of the person and a person to whom the utterance is directed. The control method described in the section.
The control method according to any one of claims 16 to 23, further comprising a process determination step of determining a process to be performed by the computer or the object based on the content of the voice data representing an utterance directed from the person to the object.
The control method according to claim 24, wherein the object is a device that operates in response to the utterance, and
in the process determination step, the operation of the object is determined based on the content of the voice data representing an utterance directed from the person to the object.
The control method according to claim 24 or 25, wherein, in the process determination step, a response to the utterance directed from the person to the object is generated based on the content of the voice data representing that utterance, and the generated response is output to the object.
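The response generation described in this claim might look like the following sketch. The keyword rules, the reply strings, and the function name `determine_response` are invented for illustration, and the utterance is assumed to be available as a transcript; the claims do not specify how the response is derived from the content.

```python
from typing import Optional


def determine_response(transcript: str, directed_to_object: bool) -> Optional[str]:
    """Return a response for an utterance directed to the object, or None
    when the utterance was not directed to the object.

    Assumption: a small keyword table stands in for whatever dialogue
    logic an actual device would use.
    """
    if not directed_to_object:
        # Utterances aimed at other people or at no one are not answered.
        return None
    text = transcript.lower()
    if "hello" in text:
        return "Hello! How can I help?"
    if "weather" in text:
        return "Let me check the weather for you."
    return "Sorry, I did not catch that."
```

Returning `None` for non-directed utterances keeps the object silent during conversations between people.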
27. The control method according to claim 25 or 26, further comprising a feature data generation step of generating feature data representing a feature of the person using the content of the voice data representing an utterance directed from the person to the object,
wherein, in the process determination step, the operation of the object is determined using the content of the voice data representing an utterance directed from the person to the object and the feature data of the person.
The control method according to claim 22 or 27, wherein, in the voice determination step, the voice data not representing an utterance directed from the person to the object is stored in the storage means in a manner distinguishable from the voice data representing an utterance directed from the person to the object, and
in the feature data generation step, the feature data of the person is generated using both the content of the voice data representing an utterance directed from the person to the object and the content of the voice data not representing an utterance directed from the person to the object.
The control method according to claim 25 or 26, further comprising a feature data generation step of generating feature data representing a feature of the person based on the content of the voice data representing an utterance of the person and on the party to whom the utterance is directed,
wherein, in the process determination step, the operation of the object is determined using the content of the voice data representing the utterance of the person and the feature data of the person.
The control method according to any one of claims 16 to 29, wherein the object is an interactive robot.
A program that causes a computer to execute each step of the control method according to any one of claims 16 to 30.