WO2019171780A1 - Individual identification device and characteristic collection device - Google Patents

Individual identification device and characteristic collection device Download PDF

Info

Publication number
WO2019171780A1
WO2019171780A1 (PCT/JP2019/001488)
Authority
WO
WIPO (PCT)
Prior art keywords
identification
person
voice
identifying
output
Prior art date
Application number
PCT/JP2019/001488
Other languages
French (fr)
Japanese (ja)
Inventor
純平 松永
Original Assignee
オムロン株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by オムロン株式会社 (OMRON Corporation)
Publication of WO2019171780A1

Links

Images

Classifications

    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/117 Identification of persons
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/117 Identification of persons
    • A61B5/1171 Identification of persons based on the shapes or appearances of their bodies or parts thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques

Definitions

  • the present invention relates to a personal identification technique for identifying a person shown in a captured image.
  • Patent Document 1 discloses outputting a call voice to a person and selecting the optimum face to be registered based on the change in the face before and after the call voice is output. For example, it discloses selecting the face that turned around in response to the call.
  • however, Patent Document 1 merely discloses a method for acquiring face images for learning, and cannot address the problem that accurate personal identification cannot be performed unless an image showing the face is obtained at the time of identification.
  • the present invention has been made in view of the above circumstances, and an object thereof is to provide a personal identification technique capable of accurately identifying a person in an image.
  • the present invention employs a technique of identifying a person based on identification based on an image and identification based on a response voice to the output voice.
  • the first aspect of the present invention includes voice output means, voice input means, image input means, detection means, and person identification means.
  • the sound output means outputs sound.
  • the voice input means acquires a voice that responds to the output voice.
  • the image input means acquires a moving image.
  • the detection means detects a human body from the input moving image. Any method can be adopted as the human body detection method.
  • the person identifying means includes a first identifying means for performing identification based on an image and a second identifying means for performing identification based on sound.
  • the first identification means can employ, for example, an identification method that uses facial features, human body features (posture, silhouette), or behavioral features (flow line patterns, places of stay), but other identification methods may be employed as long as they are based on features obtained from the image.
  • the second identification means can employ identification based on acoustic features obtained by waveform analysis of the input voice, or on features of words or sentences obtained by natural language analysis, but other techniques may be adopted as long as they are based on the input voice.
  • the person identifying means identifies the detected person based on the identification result by the first identifying means (hereinafter referred to as the first identification result) and the identification result by the second identifying means (hereinafter referred to as the second identification result).
  • the person identifying means may be configured to also take the second identifying means into account when the first identifying means cannot perform identification with high reliability. That is, the person identifying means may perform identification by the second identifying means when the reliability of the first identifying means is less than a first threshold, and identify the person based on the first identification result and the second identification result. In the identification based on the first and second identification results, for example, the person identifying means may confirm the identification result when the two results match, and leave the result unconfirmed when they differ.
  • when the reliability of the first identification means is equal to or higher than the first threshold, the first identification result may be used as the person identification result.
  • with such a configuration, when identification can be performed with high reliability from the image alone, the processing amount is reduced and the voice inquiry is omitted, so the user is not forced to respond.
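  • As a non-limiting illustration of this decision logic, the following Python sketch (with assumed names and an assumed threshold value, not the patent's implementation) shows how an image-based result might be combined with a voice-based one only when the image-based reliability is below the first threshold.

```python
# A minimal sketch of the first aspect: combine the image-based result with a
# voice-based one only when the image-based reliability falls below the first
# threshold. Names, types and the threshold value are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Identification:
    person_id: Optional[str]   # e.g. "mother", or None if unknown
    reliability: float         # 0.0 .. 1.0

FIRST_THRESHOLD = 0.8          # the "first threshold" (assumed value)

def identify_person(first: Identification,
                    identify_by_voice: Callable[[], Identification]) -> Optional[str]:
    """Return a person id, or None if the result remains unconfirmed."""
    # Image alone is reliable enough: adopt it and skip the voice question.
    if first.reliability >= FIRST_THRESHOLD:
        return first.person_id

    # Otherwise output a question, obtain the response voice, identify from it.
    second = identify_by_voice()

    # Confirm when both methods agree; otherwise leave the result unconfirmed.
    if first.person_id is not None and first.person_id == second.person_id:
        return first.person_id
    return None
```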
  • the sound output from the sound output means may be performed at a predetermined timing.
  • the predetermined timing can be at least one of: the timing at which the identification reliability of the person becomes less than a threshold, the timing at which a first predetermined time has elapsed since the person was detected, the timing at which a state with substantially no change in the person has continued for a second predetermined time, the timing at which the person is about to move out of the imaging range, and the timing at which the distance between the information processing apparatus and the person becomes equal to or less than a predetermined distance.
  • the identification reliability of the person may be the reliability of identification by the first identification means, or may be the reliability obtained by integrating the identification results of the first identification means and the second identification means.
  • the first predetermined time may be a fixed time, or may be a time determined according to the number of feature acquisitions or the data amount of features when acquiring features from an image.
  • the person identifying unit preferably determines the content of the output voice according to the reliability of the first identifying unit. For example, when the reliability is less than a second threshold, the person identifying means can determine, as the content of the output voice, content that includes the name of the person of the first identification result or content that inquires who the person is.
  • the reliability may be divided into three or more levels, and the content of the output voice may be determined according to each level. For example, when the reliability is divided into three levels, natural content that does not include the person's name may be used at the high level, content that includes the person's name at the medium level, and content that directly asks who the person is at the low level. As an example, "What are you doing there?" can be used at the high level, "Mom, what are you doing there?" at the medium level, and "Who is there?" at the low level.
  • when the first identification result and the second identification result do not match, the person identification means may output a new voice and perform identification by the second identification means based on the input voice responding to that new output voice.
  • in this case, the content of the new output voice is preferably content that confirms the person more directly than the content of the previous output voice. When the first and second identification results do not match, it cannot be determined who the person is, so it is preferable to output a voice that confirms the person more directly and to identify who the person is from the response.
  • a second aspect of the present invention is a feature collection device comprising: the personal identification device described above; a feature acquisition unit that acquires, from the moving image input to the image input unit, at least one of the features related to the human body or behavior of the detected person; and a feature registration unit that registers the features acquired by the feature acquisition unit in association with the person identified by the person identification unit. The features collected in this way can be used for learning a classifier.
  • the present invention can be understood as an information processing system having at least a part of the above configuration or function.
  • the present invention can also be understood as an information processing method or a control method for an information processing system including at least a part of the above processing, as a program for causing a computer to execute these methods, or as a computer-readable recording medium on which such a program is non-transitorily recorded.
  • according to the present invention, a person in an image can be accurately identified.
  • FIG. 1 is a block diagram showing a configuration example of a personal identification device to which the present invention is applied.
  • FIG. 2 is a block diagram illustrating a configuration example of the feature collection device according to the first embodiment.
  • FIG. 3 is a flowchart illustrating an example of feature collection processing according to the first embodiment.
  • FIG. 4 is a flowchart illustrating an example of utterance content determination processing according to the first embodiment.
  • FIGS. 5A and 5B are diagrams illustrating human body skeleton information detected from an input image and posture detection based on the skeleton.
  • FIG. 6 is a flowchart illustrating an example of utterance content determination processing in the second embodiment.
  • FIG. 1 is a block diagram showing a configuration example of a personal identification device 10 to which the present invention is applied.
  • the personal identification device 10 identifies a person in a moving image from the input moving image and voice.
  • the personal identification device 10 includes an image input unit 11, a human body detection unit 12, a person identification unit 13, a voice output unit 14, and a voice input unit 15.
  • the image input unit 11 acquires a captured moving image (moving image data).
  • the image input unit 11 is an input terminal to which image data is input.
  • the image input unit 11 is an example of an image acquisition unit of the present invention.
  • the human body detection unit 12 detects a human body region from the processing target frame image of the moving image data. Human body detection may be performed using a detector trained by any existing algorithm.
  • the human body detection unit 12 is an example of the detection means of the present invention.
  • the voice output unit 14 outputs the speech content obtained from the person identification unit 13 as voice data.
  • the audio output unit 14 is an example of the audio output means of the present invention.
  • the voice input unit 15 acquires voice data.
  • the voice input unit 15 is an example of a voice input unit of the present invention.
  • the person identifying unit 13 identifies a person in the moving image based on the moving image acquired by the image input unit 11 and the sound acquired by the audio input unit 15. The person identification unit 13 also determines the content of the utterance for extracting the response voice used for voice identification and the timing of the utterance.
  • the person identification unit 13 is an example of a person identification unit of the present invention.
  • the first identification unit 131 identifies a person detected by the human body detection unit 12 from the moving image acquired by the image input unit 11.
  • the first identification unit 131 identifies a person based on the input image. Specifically, the detected person is identified based on at least one of a facial feature, a human body feature, and a behavior feature in the image.
  • the first identification unit 131 is an example of the first identification means of the present invention.
  • the second identification unit 133 identifies the person detected by the human body detection unit 12 based on the voice acquired by the voice input unit 15.
  • the second identification unit 133 may perform identification based on acoustic features obtained by waveform analysis of the voice, or based on language features obtained by natural language analysis.
  • the identification based on the acoustic feature amount can be performed based on the degree of coincidence with the acoustic feature amount registered in advance.
  • the identification based on the language feature amount can be performed based on the meaning of the content of the output speech and the content of the response speech.
  • the second identification unit 133 is an example of the second identification means of the present invention.
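  • As an illustration only, the sketch below shows one way such voice-based identification could be realised, assuming a speaker embedding is extracted elsewhere: an acoustic match against enrolled speakers plus a very naive language check of the response. The names, dimensions and values are assumptions, not the patent's implementation.

```python
# Illustrative sketch of the voice-based "second identification": matching of a
# speaker embedding against enrolled speakers, plus a naive language check.
from typing import Dict, Optional, Tuple
import numpy as np

REGISTERED_EMBEDDINGS: Dict[str, np.ndarray] = {
    "mother": np.random.rand(128),   # placeholders; real enrolments in practice
    "father": np.random.rand(128),
}

def identify_by_acoustics(embedding: np.ndarray) -> Tuple[str, float]:
    """Match a response-voice embedding against enrolled speakers (cosine similarity)."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {name: cosine(embedding, ref) for name, ref in REGISTERED_EMBEDDINGS.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]        # identified person and a reliability-like score

def identify_by_language(addressed_name: str, response_text: str) -> Optional[str]:
    """Naive semantic check: did the response deny the name used in the question?"""
    denials = ("no", "not", "wrong")
    if addressed_name and any(word in response_text.lower() for word in denials):
        return None                  # the person denied being `addressed_name`
    return addressed_name or None
```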
  • the person identification unit 13 determines the timing of speech from the voice output unit 14 and the content of the speech sentence.
  • the reason why the voice output unit 14 utters (outputs voice) is to obtain a voice response from the detected person and to perform identification based on that voice. Therefore, the person identifying unit 13 determines, as the utterance timing, a timing at which identification based on voice is considered necessary.
  • the person identifying unit 13 may typically determine a timing at which the reliability of identification by the first identifying unit 131 is low (less than the threshold value TH1) as the utterance timing.
  • the person identifying unit 13 may also determine the utterance timing so that it utters regularly, utters when the person shows no movement, or utters when the identification results of the first identifying unit 131 and the second identifying unit 133 differ. The utterance content may be determined according to the reliability of person identification. For example, a natural communication utterance may be used when the reliability is high, while an utterance including the name of the estimated person, or an utterance directly asking who the detected person is, may be used when the reliability is low.
  • the utterance including the name of the person and the utterance for inquiring who the person is are examples of the utterance for confirming who the person is.
  • the utterance and the response may be performed a plurality of times, and the second identifying unit 133 may perform the identification for the plurality of response sounds.
  • for example, the first utterance may be natural communication, and the second identification unit 133 performs identification based on the acoustic features of the response voice. When the identification results of the first identification unit 131 and the second identification unit 133 match, or when the reliability of the identification result of the second identification unit 133 is high, the person is considered to have been identified correctly, so the second utterance need not be made and natural communication may simply continue.
  • when the identification results do not match, the content of the second utterance is preferably changed so as to confirm the person more directly than the first utterance. Since the person can then be identified from the content of the response (semantic interpretation by natural language analysis), a highly reliable identification result can be expected.
  • in this way, since the personal identification device 10 identifies a person in an image based on two identification methods, image and voice, the person can be identified accurately even when identification cannot be performed with high reliability from the image alone. In addition, since voice questions for personal identification can be minimized, the user is less likely to find them bothersome. Furthermore, by determining the utterance content as described above, natural communication is possible even when asking the user a question, which also keeps the user from being annoyed.
  • the first embodiment of the present invention is a feature collection device that collects human body features and behavioral features obtained by photographing a person, and is mounted on a home communication robot 1 (hereinafter also simply referred to as the robot 1).
  • the feature collection device collects these features for personal identification learning.
  • FIG. 2 is a diagram illustrating a configuration of the robot 1.
  • the robot 1 includes a feature collection device 100, a camera 200, a speaker 300, and a microphone 400.
  • the robot 1 includes a processor (arithmetic unit) such as a CPU, a main storage device, an auxiliary storage device, a communication device, and the like, and each process of the feature collection device 100 is realized by the processor executing a program.
  • the camera 200 performs continuous photographing with visible light or invisible light (for example, infrared light), and inputs the photographed image data to the feature collection device 100.
  • the imaging range of the camera 200 may be fixed or variable.
  • the change of the imaging range of the camera 200 may be performed by changing the direction of the camera 200 or may be performed by the robot 1 moving autonomously.
  • the speaker 300 converts the sound data input from the feature collection device 100 into an acoustic wave and outputs the sound wave.
  • the microphone 400 converts an acoustic wave such as voice uttered by the user into voice data and outputs the voice data to the feature collection apparatus 100.
  • the microphone 400 may be configured as a microphone array so that sound sources can be separated.
  • the feature collection device 100 acquires, from the image captured by the camera 200, at least one of the posture, gesture, silhouette, flow line, and stay location of the person shown in the image as the feature of the person.
  • the feature collection apparatus 100 identifies a person in the image and stores the feature information in association with the individual.
  • the feature collection device 100 includes a personal identification device 110, a feature acquisition unit 120, and a feature registration unit 130.
  • the personal identification device 110 identifies a person in the moving image from the moving image input from the camera 200 and the sound input from the microphone 400.
  • the personal identification device 110 includes an image input unit 111, a human body detection unit 112, a person identification unit 113, an audio output unit 114, and an audio input unit 115.
  • the person identification unit 113 includes a first identification unit 1131, an utterance control unit 1132, a second identification unit 1133, and a person identification unit 1134.
  • the image input unit 111 is an interface that receives image data from the camera 200. In the present embodiment, the image input unit 111 receives image data directly from the camera 200, but it may receive image data via a communication device or via a recording medium.
  • the human body detection unit 112 detects a human body region from the input image. Any existing human detection algorithm can be used. The human body detection unit 112 may also continue to detect a once-detected human body by tracking processing. During tracking, the human body detection unit 112 may store the type, shape, color, and the like of the clothes and accessories of a person it has detected once, and use these features for detection during a predetermined time (for example, about one hour to several hours). The type, shape, and color of clothes and accessories may also be used for identification by the first identification unit 1131. Because clothes are unlikely to change within a short time, these features can be used effectively for tracking.
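  • The following hypothetical sketch illustrates this appearance-based tracking aid using a coarse clothing colour histogram that is trusted only for a limited time window; the window length and matching threshold are assumptions, not values from the patent.

```python
# Hypothetical tracking aid: remember a coarse RGB colour histogram of the
# detected person's clothing region and reuse it for a limited time window,
# since clothes rarely change within hours.
import time
import numpy as np

TRACK_WINDOW_SEC = 3 * 3600          # "about one hour to several hours" (assumed 3 h)

class AppearanceTrack:
    def __init__(self, clothing_pixels: np.ndarray):
        # clothing_pixels: (N, 3) uint8 RGB values cropped from the human body region
        self.histogram = self._histogram(clothing_pixels)
        self.created = time.time()

    @staticmethod
    def _histogram(pixels: np.ndarray) -> np.ndarray:
        hist, _ = np.histogramdd(pixels, bins=(8, 8, 8), range=((0, 256),) * 3)
        return hist.ravel() / max(hist.sum(), 1)

    def matches(self, clothing_pixels: np.ndarray, threshold: float = 0.7) -> bool:
        if time.time() - self.created > TRACK_WINDOW_SEC:
            return False             # appearance is no longer trusted for tracking
        other = self._histogram(clothing_pixels)
        overlap = float(np.minimum(self.histogram, other).sum())  # histogram intersection
        return overlap >= threshold
```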
  • the person identifying unit 113 identifies (identifies) who the person (detected person) detected by the human body detecting unit 112 is.
  • the first identification unit 1131 identifies the detected person based on the image acquired by the image input unit 111. Specifically, the detected person is identified based on at least one of facial features, human body features, and behavioral features in the image. To perform these identifications, the facial features, human body features, and behavioral features of each person to be identified are registered in advance. As described above, when the human body detection unit 112 acquires the type, shape, and color of the detected person's clothes and accessories, identification may also be performed using these features. The first identification unit 1131 also calculates the reliability of the identification result (an index indicating how likely the identification result is).
  • the utterance control unit 1132 determines the timing of uttering from the voice output unit 114 and the content of the utterance sentence.
  • the reason why utterance (voice output) is performed from the voice output unit 114 is to obtain a voice response from the detected person and to perform identification based on that voice. Therefore, the person identifying unit 113 determines, as the utterance timing, a timing at which identification based on voice is considered necessary.
  • the utterance content should be content that the user does not find bothersome and that allows identification based on the response voice. A specific method for determining the utterance timing and utterance content will be described later.
  • the second identification unit 1133 identifies the person detected by the human body detection unit 112 based on the voice acquired by the voice input unit 115.
  • the second identification unit 1133 may perform identification based on acoustic features obtained by waveform analysis of the voice, or based on language features obtained by natural language analysis.
  • the identification based on the acoustic feature amount can be performed based on the degree of coincidence with the acoustic feature amount registered in advance.
  • the identification based on the language feature amount can be performed based on the meaning of the content of the output speech and the content of the response speech.
  • the second identification unit 1133 also calculates the reliability of the identification result.
  • based on the identification result of the first identification unit 1131 (first identification result) and the identification result of the second identification unit 1133 (second identification result), the person specifying unit 1134 finally identifies who the detected person is. For example, when the first identification result and the second identification result match, the person specifying unit 1134 may use that identification result as the final result, or may adopt either the first identification result or the second identification result.
  • the person specifying unit 1134 may specify who the detected person is in consideration of the identification reliability of the first identification unit 1131 and the identification reliability of the second identification unit 1133. For example, when the identification reliability of the first identification unit 1131 is sufficiently high, the identification result of the first identification unit 1131 may be the final result without performing identification by the second identification unit 1133.
  • the person specifying unit 1134 also calculates the reliability of the identification result. This reliability is used, for example, for utterance timing and utterance content determination by the utterance control unit 1132.
  • the voice output unit 114 acquires the text data of the uttered voice from the utterance control unit 1132, converts it into voice data by voice synthesis processing, and outputs it from the speaker 300.
  • the voice input unit 115 receives a voice signal from the microphone 400, converts it into voice data, and outputs the voice data to the second identification unit 1133.
  • the voice input unit 115 may perform preprocessing such as noise removal and speaker separation.
  • the feature acquisition unit 120 acquires the feature of the person detected by the human body detection unit 112 from the input image.
  • the features acquired by the feature acquisition unit 120 include, for example, at least one of features related to the human body (for example, features related to body parts, skeleton, posture, and silhouette) and features related to behavior (features related to gestures, flow lines, and places of stay).
  • the feature acquired by the feature acquisition unit 120 is also referred to as a human body / behavior feature.
  • the feature acquisition unit 120 may also acquire features that have already been calculated elsewhere, and the features calculated by the feature acquisition unit 120 may be used by the first identification unit 1131.
  • the feature acquisition unit 120 acquires skeleton information indicating the skeleton of the detected person 51 from the input image 50.
  • the skeleton information is acquired using, for example, OpenPose.
  • the skeletal information is also characteristic information indicating a human body, and is also characteristic information indicating a part of the human body (head, neck, shoulder, elbow, hand, waist, knee, ankle, eye, ear, fingertip, etc.).
  • the skeleton (skeleton information) 52 of the detected person 51 is detected from the image 50.
  • the feature acquisition unit 120 can acquire the posture information of the person from the positional relationship (skeleton information) of the parts of the human body.
  • the posture information may be information listing the relative positional relationships of each part of the human body, or may be a result of classifying the positional relationships of the parts of the human body. Examples of the posture classification include upright, stoop, O-leg, and X-leg.
  • the feature acquisition unit 120 may acquire the silhouette information of the detected person, or may obtain the posture from the silhouette information.
  • the skeleton information, posture information, and silhouette information all correspond to the human body features in the present invention.
  • the feature acquisition unit 120 can detect motion information (a gesture) from a change in posture information of each frame.
  • as gestures, for example, information indicating walking, bending over, lying down, and the like can be obtained.
  • the positional relationship between the upper arm and the forearm differs depending on whether the arm is folded or not.
  • the positional relationship of each part depends on the gesture. Therefore, a gesture can be detected from the positional relationship of each part based on the skeleton information.
  • a gesture involving movement such as walking or bending may be detected using a plurality of pieces of skeleton information respectively corresponding to a plurality of images captured at different times. For walking, information indicating the ratio between the stride width and the shoulder width may be obtained.
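  • The sketch below illustrates, under assumed keypoint names, how a coarse posture label and the stride-to-shoulder walking feature mentioned above might be derived from 2-D skeleton keypoints such as those produced by OpenPose; an actual system would typically learn such classifications rather than use hand-written rules.

```python
# A rough sketch of rule-based posture classification and a walking feature
# derived from 2-D skeleton keypoints. Keypoint names and thresholds are
# illustrative assumptions.
from typing import Dict, Tuple

Point = Tuple[float, float]          # (x, y) in image coordinates, y grows downward

def classify_posture(keypoints: Dict[str, Point]) -> str:
    """Very coarse upright / stooping / lying classification from keypoint extents."""
    xs = [p[0] for p in keypoints.values()]
    ys = [p[1] for p in keypoints.values()]
    width, height = max(xs) - min(xs), max(ys) - min(ys)
    if height < 0.6 * width:
        return "lying"               # body extends mostly horizontally
    if abs(keypoints["head"][0] - keypoints["hip"][0]) > 0.25 * height:
        return "stooping"            # head displaced far forward of the hip
    return "upright"

def stride_to_shoulder_ratio(l_ankle: Point, r_ankle: Point,
                             l_shoulder: Point, r_shoulder: Point) -> float:
    """Walking feature from the text: ratio of stride width to shoulder width."""
    stride = abs(l_ankle[0] - r_ankle[0])
    shoulder = abs(l_shoulder[0] - r_shoulder[0]) + 1e-6
    return stride / shoulder
```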
  • the feature acquisition unit 120 may acquire the detected person's stay location and flow line (movement route) as features.
  • the features related to the place of stay can be obtained by calculating, for each stay determination area, the ratio of stay time to the length of a predetermined period (stay rate), based on the time change of the person's position during that predetermined period, such as the past several minutes.
  • the detection result of the stay location is obtained in the form of a stay map (heat map) indicating the stay rate of each stay determination area.
  • the feature related to the flow line is information indicating the stay position in time series, and is obtained by the same method as the stay map.
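  • As a hedged illustration of the stay map, the following sketch accumulates, over a sliding window of recent frames, the fraction of time the person spent in each grid cell; the grid size and window length are assumptions for illustration.

```python
# Illustrative stay map: divide the observed area into grid cells and compute,
# over a sliding window of recent frames, the fraction of time the person spent
# in each cell. The output corresponds to the "stay rate" heat map above.
from collections import deque
import numpy as np

class StayMap:
    def __init__(self, grid=(8, 8), window_frames=300):
        self.grid = grid
        self.history = deque(maxlen=window_frames)    # recent (row, col) cells

    def update(self, x_norm: float, y_norm: float) -> np.ndarray:
        """x_norm, y_norm: person position normalised to [0, 1) on the floor plan."""
        row = min(int(y_norm * self.grid[0]), self.grid[0] - 1)
        col = min(int(x_norm * self.grid[1]), self.grid[1] - 1)
        self.history.append((row, col))
        heat = np.zeros(self.grid)
        for r, c in self.history:
            heat[r, c] += 1
        return heat / len(self.history)                # stay rate per cell
```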
  • the feature registration unit 130 registers the human body / behavior feature acquired by the feature acquisition unit 120 in a storage unit such as a matching database in association with the person specified by the personal identification device 110 (person identification unit 113).
  • the timing at which the feature registration unit 130 performs registration may be arbitrary, but may be, for example, the timing at which the identification by the person identification unit 113 can be performed with high reliability, or the timing at which the person tracking is completed.
  • FIG. 3 is a flowchart showing the overall flow of feature collection processing performed by the feature collection device 100.
  • the feature collection processing in the present embodiment will be described with reference to FIG. 3. Note that this flowchart conceptually describes the feature collection processing in the present embodiment, and the processing need not be implemented exactly as shown in this flowchart.
  • in step S10, the human body detection unit 112 detects a human body from the input image. If no human body is detected (S11-NO), the process returns to step S10 and human body detection is performed on the next processing target frame image. If a human body is detected (S11-YES), the process proceeds to step S12.
  • in step S12, the feature acquisition unit 120 acquires the human body/behavior features of the detected person.
  • in step S13, the first identification unit 1131 performs personal identification based on the human body/behavior features. Facial features may also be acquired in step S12, and the first identification unit 1131 may then perform personal identification based on the facial features in step S13.
  • although identification based on facial features can be expected to have relatively high accuracy (high reliability), it can be performed only when the user is facing the robot 1.
  • identification based on human body features and behavioral features can be performed as long as the user's body appears in the image, but its accuracy is not always good. In particular, highly reliable identification is not expected at the stage of collecting features for initial learning of the human body/behavior features.
  • the first identification unit 1131 continues to identify the person included in the moving image, and calculates the identification reliability at the present time by combining the identification reliability in each frame. At this time, the first identification unit 1131 may use a weighted average obtained by assigning a large weight to the reliability of recently performed identification as the identification reliability at the present time.
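  • A minimal sketch of such a weighted combination is shown below; the exponential decay is an assumption, since the text only requires that recent identifications receive larger weights.

```python
# Combine per-frame identification reliabilities into a current reliability,
# weighting recent frames more heavily (exponential weights are an assumption).
def current_reliability(per_frame_reliabilities, decay: float = 0.9) -> float:
    """per_frame_reliabilities: list of floats, oldest first, newest last."""
    if not per_frame_reliabilities:
        return 0.0
    n = len(per_frame_reliabilities)
    weights = [decay ** (n - 1 - i) for i in range(n)]     # newest frame gets weight 1.0
    weighted = sum(w * r for w, r in zip(weights, per_frame_reliabilities))
    return weighted / sum(weights)

# Example: older frames 0.4 and 0.5, newest frame 0.9 -> result pulled towards 0.9.
print(current_reliability([0.4, 0.5, 0.9]))
```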
  • in step S14, the utterance control unit 1132 determines whether it is time to utter.
  • the conditions for the utterance timing may be set in advance, and in step S14 the utterance control unit 1132 may determine whether the set conditions are met.
  • the following conditions can be adopted as conditions for utterance.
  • Time: for example, the first utterance is performed after a predetermined time (for example, 10 minutes) has elapsed since the human body was detected, and thereafter utterances are performed at predetermined intervals.
  • Data amount: for example, an utterance is performed when features of a predetermined data amount (for example, 100 samples) have been acquired.
  • Action stop: when the detected person's action has not changed for a certain period of time; for example, when a certain time has elapsed after the person sat on the sofa and started watching television.
  • Movement outside the imaging range: when the detected person is predicted to move outside the imaging range; for example, when the detected person moves from the current room to another room.
  • Situations where speaking is easy: when the robot reaches a situation suitable for interacting with the detected person; for example, when the detected person and the robot face each other (the robot can detect the detected person's face) and the distance between them is within a predetermined distance (for example, 3 meters).
  • further, when the identification reliability of the first identification unit 1131 is less than the threshold TH1, an utterance may be performed for identification by the second identification unit 1133. Alternatively, considering the identification results of both the first identification unit 1131 and the second identification unit 1133, an utterance may be performed for further identification by the second identification unit 1133 when the combined identification reliability is less than the threshold TH1.
  • a plurality of the above conditions may be combined. For example, an utterance may be made when any one of the plurality of conditions is satisfied, or when several of the conditions are satisfied simultaneously. Furthermore, in situations where it is not appropriate to speak, such as when the detected person is asleep or concentrating, the utterance may be suppressed even if the above conditions are satisfied.
  • note that the utterances here are made for the purpose of obtaining a response voice for personal identification and are therefore performed at the timings described above, but utterances at other timings are not prohibited. For example, the robot may talk to the user for communication at timings that do not satisfy the above conditions.
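  • Combining the example conditions above, the decision of step S14 might look like the following sketch; the constants are taken from the examples in the text, while the structure and names are assumptions.

```python
# Hedged sketch of the step S14 decision: utter when any example condition
# holds, unless the moment is inappropriate (e.g. the person is asleep).
from dataclasses import dataclass

@dataclass
class PersonState:
    seconds_since_detection: float
    feature_samples: int              # number of acquired human body/behaviour features
    seconds_without_motion: float
    predicted_to_leave_frame: bool
    facing_robot: bool
    distance_m: float
    identification_reliability: float
    asleep_or_concentrating: bool

def should_utter(s: PersonState) -> bool:
    if s.asleep_or_concentrating:
        return False                                # not an appropriate moment to speak
    conditions = (
        s.seconds_since_detection >= 10 * 60,       # time condition (10 minutes)
        s.feature_samples >= 100,                   # data amount condition (100 samples)
        s.seconds_without_motion >= 60,             # action stop condition (assumed 60 s)
        s.predicted_to_leave_frame,                 # about to leave the imaging range
        s.facing_robot and s.distance_m <= 3.0,     # situation where speaking is easy
        s.identification_reliability < 0.8,         # reliability below TH1 (assumed 0.8)
    )
    return any(conditions)
```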
  • if it is determined in step S14 that the utterance timing condition is satisfied, the process proceeds to step S15; otherwise, the process proceeds to step S19.
  • in step S15, the utterance control unit 1132 determines the content of the utterance (utterance text). In the present embodiment, the utterance control unit 1132 determines the utterance content based on the current identification reliability.
  • FIG. 4 is a flowchart illustrating the utterance content determination process in step S15.
  • in step S151, the utterance control unit 1132 determines the utterance level according to the current identification reliability.
  • for example, an identification reliability of 0.8 or more is treated as high reliability, 0.5 or more and less than 0.8 as medium reliability, and less than 0.5 as low reliability.
  • these threshold values are merely examples and may be determined appropriately according to the requirements of the system. The threshold values may also change according to the situation. Furthermore, although the level is divided into three stages in the present embodiment, it may be divided into two stages, or into four or more stages.
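  • The following sketch illustrates the decision of FIG. 4 (steps S151 to S154) with the example thresholds and sentences quoted above; the function shape and the place name "the kitchen" are illustrative assumptions.

```python
# Sketch of the utterance-content decision using the example thresholds 0.8 and
# 0.5 and the example sentences from the text.
def decide_utterance(reliability: float, estimated_name: str,
                     place: str = "the kitchen") -> str:
    if reliability >= 0.8:
        # High level (step S152): natural communication, no name.
        return f"What are you doing in {place}?"
    if reliability >= 0.5:
        # Medium level (step S153): include the estimated name to confirm it softly.
        return f"{estimated_name}, what are you doing in {place}?"
    # Low level (step S154): ask directly who the person is.
    return f"Who is in {place}?"

print(decide_utterance(0.6, "Mom"))   # -> "Mom, what are you doing in the kitchen?"
```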
  • if the reliability level is high, the process proceeds to step S152, and the utterance control unit 1132 determines the utterance content with emphasis on the naturalness of communication. For example, the utterance control unit 1132 determines utterance content that does not include the name of the detected person. When the place where the person was detected is the kitchen, the utterance control unit 1132 determines, for example, "What are you doing in the kitchen?" as the utterance content.
  • if the reliability level is medium, the process proceeds to step S153, and the utterance control unit 1132 determines utterance content that confirms who the detected person is without sounding unnatural. For example, the utterance control unit 1132 uses content including the name of the person given by the identification result of the first identification unit 1131 as the utterance content. When the identification result of the first identification unit 1131 is "mother", the utterance control unit 1132 determines, for example, "Mom, what are you doing in the kitchen?" as the utterance content.
  • if the reliability level is low, in step S154 the utterance control unit 1132 sets utterance content that asks more directly who the detected person is. For example, the utterance control unit 1132 determines "Who is in the kitchen?" as the utterance content.
  • the text data of the utterance content determined in step S15 is passed from the utterance control unit 1132 to the voice output unit 114.
  • in step S16, the voice output unit 114 converts the utterance text into voice data by voice synthesis and outputs the voice data from the speaker 300.
  • in step S17, the voice input unit 115 acquires the response to the system utterance from the microphone 400.
  • in step S18, the second identification unit 1133 performs personal identification based on the input voice.
  • the second identification unit 1133 performs identification based on acoustic features (acoustic analysis) and identification based on language features (semantic analysis).
  • an identification result based on the acoustic features can be obtained whenever a response voice is obtained, but whether identification based on the language features is possible depends on the contents of the question and the response.
  • for example, when the response to an inquiry including the name can be interpreted as having the meaning "(Yes,) I am the mother", the identification result is considered reliable.
  • in step S19, the person specifying unit 1134 specifies the detected person in consideration of the image-based identification result of the first identification unit 1131 (S13) and the voice-based identification result of the second identification unit 1133 (S18).
  • in some cases, the second identification unit 1133 cannot perform identification based on the language features but can still obtain an identification result based on the acoustic features.
  • when the first identification result and the second identification result match, the person specifying unit 1134 can confirm that the identification result of the first identification unit 1131 is correct and use it as the specifying result.
  • when the two identification results differ, the person specifying unit 1134 may adopt either identification result, or may leave the detected person unidentified. In this case, the person specifying unit 1134 may set the identification reliability low and confirm more directly who the detected person is in the next utterance.
  • when the utterance includes the name of the person, the second identification unit 1133 can perform both identification based on acoustic features and identification based on language features. In identification based on language features, whether or not the other party is the "mother" can be determined by whether or not the response sentence contains a phrase denying the inquiry that used the name "mother". If the response sentence includes a phrase indicating who the person is, the detected person can be identified based on that phrase. At this time, the identification result based on the acoustic features and the identification result based on the language features may differ.
  • when the utterance content directly asks who the detected person is, the second identification unit 1133 may identify the detected person by performing identification based on the language features. Since utterance content that directly inquires who the detected person is is employed, the detected person can be identified more reliably from the semantic content of the response. At this time, as described above, the second identification unit 1133 may also perform identification based on the acoustic features.
  • in step S20, it is determined whether to continue feature collection. If feature collection is to be continued, the process returns to step S10 to process the next frame.
  • the feature registration unit 130 registers the acquired human body/behavior features in a storage unit (not shown) in association with the identification result of the detected person.
  • in the present embodiment, feature registration is performed at the timing when feature collection ends, but feature registration may instead be performed at the timing when tracking of the detected person ends.
  • in the above description, the feature registration unit 130 registers all the features obtained from the start to the end of tracking in association with one person. However, the feature registration unit 130 may divide one tracking period into a plurality of periods and register the features obtained within each period in association with the person of the identification result for that period. This is effective when the human body detection unit 112 failed to recognize that the person changed partway through and continued detecting them as the same person, while the person identification unit 113 identified them as different people.
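  • As an illustration of this registration scheme, the sketch below splits one tracking period whenever the identification result changes partway through; a plain dictionary stands in for the matching database, and all names are assumptions.

```python
# Features gathered in one tracking period are stored per identified person,
# and the period is split whenever the identification result changes.
from collections import defaultdict
from typing import Any, Dict, List, Tuple

feature_db: Dict[str, List[Any]] = defaultdict(list)   # person id -> registered features

def register_tracking_period(samples: List[Tuple[str, Any]]) -> None:
    """samples: (identified_person, feature) pairs in temporal order."""
    if not samples:
        return
    run_person, run_features = samples[0][0], []
    for person, feature in samples:
        if person != run_person:
            # Identification result changed: close the current run and start a
            # new one, so features from two different people are not mixed.
            feature_db[run_person].extend(run_features)
            run_person, run_features = person, []
        run_features.append(feature)
    feature_db[run_person].extend(run_features)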
  • in the first embodiment, the utterance content is determined according to the identification reliability. In the present embodiment, the utterance content is determined taking into account the facts that identification based at least on acoustic features can be performed whenever a voice response is obtained from the user, and that utterances can be made a plurality of times within one dialogue.
  • the processing is changed from that of the first embodiment, and the following description focuses mainly on the differences from the first embodiment.
  • FIG. 6 is a flowchart showing the flow of the utterance content determination processing in the present embodiment. Note that, since the present embodiment also assumes that utterances are made a plurality of times, the processing shown in FIG. 6 does not simply correspond to the processing of step S15 in FIG. 3.
  • in the present embodiment, when it is time to utter, the utterance control unit 1132 first determines utterance content that constitutes natural communication (step S31). Therefore, the name of the detected person need not be included in the utterance content, and content such as "What are you doing in the kitchen?" is determined as the utterance content.
  • in step S32, the content determined in step S31 is output from the speaker 300, and the second identification unit 1133 performs identification on the user's response voice acquired by the microphone 400. Here, at least identification based on acoustic features is performed, and identification based on language features is also performed where possible.
  • in step S33, it is determined whether the identification result of the first identification unit 1131 matches the identification result of the second identification unit 1133. If they match, the detected person has been identified, so no further utterance is necessary. If the two identification results differ, further utterance content is determined in step S34 in order to identify the person more reliably.
  • in step S34, the utterance control unit 1132 determines, as the content of the second utterance, content that confirms who the detected person is more directly than the first utterance. In the present case, for example, the utterance content "Isn't that Mom in the kitchen?" can be adopted.
  • alternatively, utterance content including the name of the detected person (for example, "Mom, what are you doing there?") or utterance content that directly asks who the person is (for example, "Who is in the kitchen?") may be determined as the second utterance content.
  • in the above example, the utterance is performed twice in one dialogue, but the utterance may be performed three or more times in one dialogue. In that case, natural utterances may be made for the first several times, and when the person cannot be identified based on the voice, an utterance that confirms who the detected person is may then be made.
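  • A rough sketch of this two-stage dialogue flow is shown below, assuming a hypothetical helper that outputs an utterance and returns the acoustic and semantic identification results extracted from the response.

```python
# Two-stage dialogue: a natural first utterance, then a more direct second
# question only when the voice-based result disagrees with the image-based one.
from typing import Callable, Optional, Tuple

def two_stage_identification(
        image_result: str,
        speak_and_listen: Callable[[str], Tuple[Optional[str], Optional[str]]]
) -> Optional[str]:
    # Steps S31/S32: natural utterance, identify the response by its acoustics.
    acoustic_id, _ = speak_and_listen("What are you doing in the kitchen?")
    if acoustic_id == image_result:
        return image_result            # step S33: results match, no second question

    # Step S34: results differ, confirm the person more directly.
    _, semantic_id = speak_and_listen("Who is in the kitchen?")
    return semantic_id                 # may be None if no usable answer was obtained
```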
  • the processing of the present embodiment is also applicable to the case where, as in the first embodiment, an utterance with natural content is made because the identification reliability based on the image is high, and the voice-based identification result then turns out to differ from the image-based identification result.
  • in the above description, the timing at which the feature collection device 100 (home robot) speaks to the user and the content of those utterances are described; this is processing for utterances whose purpose is to obtain a voice-based response from the user.
  • when the home robot is implemented as a communication robot, it is not necessary to apply the above processing to utterances made simply to communicate with the user.
  • in the above description, an utterance that does not include the name of the other party is cited as an example of a natural utterance. However, in a scene, or with a partner, for which an utterance including the name is natural, utterance content that includes the name constitutes a natural utterance.
  • the feature collection device 100 is mounted on a home robot, but it may be mounted on a surveillance camera or the like.
  • the personal identification device 110 does not need to be mounted on the feature collection device 100 that collects learning features, and may be used alone to identify an individual.
  • A personal identification device comprising:
  • voice output means (14, 114) for outputting voice;
  • voice input means (15, 115) for acquiring voice in response to the output voice;
  • image input means (11, 111);
  • detection means (12, 112) for detecting a person from a moving image input to the image input means; and
  • person identification means (13, 113) for identifying the person detected by the detection means,
  • wherein the person identification means includes first identification means (131, 1131) for identifying the person based on the moving image and second identification means (133, 1133) for identifying the person based on the input voice obtained as a response to the output voice, and identifies the person based on a first identification result that is the identification result by the first identification means and a second identification result that is the identification result by the second identification means.

Landscapes

  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Veterinary Medicine (AREA)
  • Animal Behavior & Ethology (AREA)
  • Public Health (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Pathology (AREA)
  • Biomedical Technology (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Surgery (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
  • Collating Specific Patterns (AREA)
  • User Interface Of Digital Computer (AREA)
  • Image Analysis (AREA)

Abstract

Provided is an individual identification device characterized by comprising a voice output means for outputting a voice, a voice input means for acquiring a voice in response to the outputted voice, an image input means, a detection means for detecting a person from a motion image inputted to the image input means, and a person identification means for identifying the person detected by the detection means, wherein the person identification means comprises a first identification means for identifying the person on the basis of the motion image and a second identification means for identifying the person on the basis of the inputted voice obtained as the response to the outputted voice, and identifies the person on the basis of a first identification result being the result of the identification performed by the first identification means and a second identification result being the result of the identification performed by the second identification means.

Description

Personal identification device and feature collection device
The present invention relates to a personal identification technique for identifying a person shown in a captured image.
There are technologies that use face recognition for personal identification, but a person cannot be identified when the subject does not face the camera. Within a household, there is also the problem that family members have genetically similar faces, which makes discrimination difficult. It is therefore conceivable to perform identification using human body features (posture, silhouette) and behavioral features (flow line patterns, places of stay) so that identification is possible even in scenes where the face cannot be captured.
In order to perform identification using human body features and behavioral features, it is necessary to collect the human body features and behavioral features of each target person and perform learning. In addition, since behavioral features may change with time and environment, it is desirable to continuously collect these features and respond to such changes by updating the registered information used for identification.
To collect human body features and behavioral features, it is necessary to identify who the detected person is, but there are situations in which accurate personal identification cannot be performed by image-based identification alone. As described above, face recognition cannot be used unless the face can be captured, and if learning of human body features and behavioral features is insufficient, identification based on these features is also inaccurate.
Patent Document 1 discloses outputting a call voice to a person and selecting the optimum face to be registered based on the change in the face before and after the call voice is output; for example, it discloses selecting the face that turned around in response to the call. However, Patent Document 1 merely discloses a method for acquiring face images for learning, and cannot address the problem that accurate personal identification cannot be performed unless an image showing the face is obtained at the time of identification.
Patent Document 1: JP 2013-182325 A
The present invention has been made in view of the above circumstances, and an object thereof is to provide a personal identification technique capable of accurately identifying a person in an image.
In order to achieve the above object, the present invention employs a technique of identifying a person based on both identification based on an image and identification based on a voice response to an output voice.
Specifically, a first aspect of the present invention includes voice output means, voice input means, image input means, detection means, and person identification means. The voice output means outputs voice. The voice input means acquires voice responding to the output voice. The image input means acquires a moving image. The detection means detects a human body from the input moving image; any human body detection method may be adopted. The person identification means includes first identification means that performs identification based on the image and second identification means that performs identification based on the voice. The first identification means can employ, for example, an identification method using facial features, human body features (posture, silhouette), or behavioral features (flow line patterns, places of stay), but other identification methods may be employed as long as they are based on features obtained from the image. The second identification means can employ identification based on acoustic features obtained by waveform analysis of the input voice or on features of words or sentences obtained by natural language analysis, but other techniques may be adopted as long as they are based on the input voice. The person identification means identifies the detected person based on the identification result by the first identification means (hereinafter, the first identification result) and the identification result by the second identification means (hereinafter, the second identification result).
According to such a configuration, even when a person cannot be identified accurately based only on the image, the person can be identified accurately by making the determination together with the voice-based identification result.
 本態様において、人物識別手段は、第1識別手段が信頼度高く識別できない場合に第2識別手段も考慮した識別を行うように構成してもよい。すなわち、前記人物識別手段は、前記第1識別手段による信頼度が第1閾値未満の場合に、前記第2識別手段による識別を行って、前記第1識別結果と前記第2識別結果とに基づいて前記人物を識別するように構成してもよい。人物識別手段は、第1識別結果と第2識別結果とに基づく識別では、例えば、第1識別結果と第2識別結果が一致した場合に識別結果を確定し、2つの識別結果が異なる場合には識別結果を未確定としてもよい。 In this aspect, the person identifying means may be configured to perform identification considering the second identifying means when the first identifying means cannot be identified with high reliability. That is, the person identifying means performs identification by the second identifying means when the reliability by the first identifying means is less than a first threshold, and based on the first identification result and the second identification result. The person may be identified. In the identification based on the first identification result and the second identification result, for example, the person identification means determines the identification result when the first identification result and the second identification result match, and the two identification results are different. The identification result may be unconfirmed.
 また本態様において、前記第1識別手段による信頼度が前記第1閾値以上の場合は、前記第1識別結果を、前記人物の識別結果としてもよい。 Further, in this aspect, when the reliability by the first identification means is equal to or higher than the first threshold, the first identification result may be the person identification result.
 このような構成によれば、画像のみから信頼度高く識別が行える場合に、処理量を削減できるとともに、音声での問いかけも省略されてユーザに対応を強いる必要がなくなる。 According to such a configuration, when the identification can be performed with high reliability only from the image, the processing amount can be reduced, and the question by voice is also omitted, so that it is not necessary to force the user to deal with it.
 本態様において、前記音声出力手段からの音声出力は所定のタイミングで行わればよい。この所定のタイミングは、人物の識別信頼度が閾値未満となったタイミング、人物が検出されたタイミングから第1の所定時間が経過したタイミング、人物の時間変化が略無い状態が第2の所定時間継続したタイミング、人物が撮像範囲外へ出るタイミング、情報処理装置と前記人物の間の距離が所定距離以下となったタイミング、の少なくとも何れかとすることができる。人物の識別信頼度は、第1識別手段による識別の信頼度であってもよいし、第1識別手段と第2識別手段の識別結果を統合した信頼度であってもよい。第1の所定時間は、固定の時間であってもよいし、画像から特徴を取得する場合には特徴の取得回数や特徴のデータ量に応じて決定される時間であってもよい。 In this aspect, the sound output from the sound output means may be performed at a predetermined timing. The predetermined timing includes the timing at which the person's identification reliability becomes less than the threshold, the timing at which the first predetermined time has elapsed from the timing at which the person is detected, and the state in which there is substantially no change in the time of the person at the second predetermined time. It can be at least one of the continued timing, the timing when the person goes out of the imaging range, and the timing when the distance between the information processing apparatus and the person becomes equal to or less than a predetermined distance. The identification reliability of the person may be the reliability of identification by the first identification means, or may be the reliability obtained by integrating the identification results of the first identification means and the second identification means. The first predetermined time may be a fixed time, or may be a time determined according to the number of feature acquisitions or the data amount of features when acquiring features from an image.
 In this aspect, the person identifying means preferably determines the content of the output voice according to the reliability of the first identifying means. For example, when the reliability is below a second threshold, the person identifying means may determine, as the content of the output voice, content that includes the name of the person given by the first identification result, or content that asks who the person is. The reliability may also be divided into three or more levels, with the content of the output voice determined for each level. For example, with three levels, the content may be natural conversation that does not include the person's name at the high level, content that includes the person's name at the middle level, and a direct question about who the person is at the low level. As one example: "What are you doing there?" at the high level, "Mom, what are you doing there?" at the middle level, and "Who is there?" at the low level.
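 As a rough illustration of the level-based selection of output-voice content, the sketch below hard-codes phrases like the examples above and uses the 0.8 and 0.5 level boundaries that appear as examples in the first embodiment; the function name, parameters, and English wording are illustrative assumptions.

```python
def choose_utterance(reliability: float, estimated_name: str, place: str) -> str:
    """Select output-voice content from the image-based identification reliability."""
    HIGH, LOW = 0.8, 0.5                       # example level boundaries from the text
    if reliability >= HIGH:                    # high: natural content, no name
        return f"What are you doing in the {place}?"
    if reliability >= LOW:                     # medium: include the estimated name
        return f"{estimated_name}, what are you doing in the {place}?"
    return f"Who is in the {place}?"           # low: ask directly who the person is

print(choose_utterance(0.9, "Mom", "kitchen"))   # natural small talk
print(choose_utterance(0.6, "Mom", "kitchen"))   # includes the name
print(choose_utterance(0.3, "Mom", "kitchen"))   # direct question
```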
 Also in this aspect, when the first identification result and the second identification result do not match, the person identifying means may output a new voice and perform identification with the second identifying means based on the input voice that responds to the new output voice. In this case, the content of the new output voice is preferably content that confirms the person more directly than the previous output voice. When the first and second identification results do not match, it is not possible to determine who the person is, so a voice that more directly confirms the person's identity should be output, and the person should be identified from the response.
 A second aspect of the present invention is a characteristic collection device comprising: the individual identification device described above; feature acquisition means for acquiring, from the moving image input to the image input means, at least one of features relating to the human body or behavior of the detected person; and feature registration means for registering the features acquired by the feature acquisition means in association with the person identified by the person identifying means. The features collected in this way can be used for training a classifier.
 The present invention can also be understood as an information processing system having at least part of the above configuration or functions. It can further be understood as an information processing method or a control method of an information processing system including at least part of the above processing, as a program for causing a computer to execute these methods, or as a non-transitory computer-readable recording medium on which such a program is recorded. The above configurations and processes can be combined with one another to constitute the present invention as long as no technical contradiction arises.
 According to the present invention, a person in an image can be identified accurately.
FIG. 1 is a block diagram showing a configuration example of an individual identification device to which the present invention is applied.
FIG. 2 is a block diagram showing a configuration example of the characteristic collection device according to the first embodiment.
FIG. 3 is a flowchart showing an example of the feature collection process in the first embodiment.
FIG. 4 is a flowchart showing an example of the utterance content determination process in the first embodiment.
FIGS. 5A and 5B are diagrams illustrating skeleton information of a human body detected from an input image and posture detection based on the skeleton.
FIG. 6 is a flowchart showing an example of the utterance content determination process in the second embodiment.
<Application Example>
 An application example of the present invention will be described. When collecting the human body features and behavioral features of an individual, it is necessary to identify who the person being observed is. Face recognition is a well-known image-based individual identification method, but a face cannot always be captured, so accurate identification is not always possible. Identification based on human body features or behavioral features is also possible; however, while features are still being collected for such identification, it is not realistic to rely entirely on identification based on those same features.
 FIG. 1 is a block diagram showing a configuration example of an individual identification device 10 to which the present invention is applied. The individual identification device 10 identifies a person in a moving image from the input moving image and voice. The individual identification device 10 includes an image input unit 11, a human body detection unit 12, a person identification unit 13, a voice output unit 14, and a voice input unit 15.
 The image input unit 11 acquires a captured moving image (moving image data). For example, the image input unit 11 is an input terminal to which image data is input. The image input unit 11 is an example of the image input means of the present invention.
 The human body detection unit 12 detects a region that appears to be a human body (human body region) from the frame image being processed in the moving image data. Human body detection may be performed using a detector trained by any existing algorithm. The human body detection unit 12 is an example of the detection means of the present invention.
 The voice output unit 14 converts the utterance content obtained from the person identification unit 13 into voice data and outputs it. The voice output unit 14 is an example of the voice output means of the present invention.
 The voice input unit 15 acquires voice data. The voice input unit 15 is an example of the voice input means of the present invention.
 The person identification unit 13 identifies a person in the moving image based on the moving image acquired by the image input unit 11 and the voice acquired by the voice input unit 15. The person identification unit 13 also determines the content of an utterance for eliciting a response voice used for voice-based identification, and the timing of that utterance. The person identification unit 13 is an example of the person identifying means of the present invention.
 The first identification unit 131 identifies the person detected by the human body detection unit 12 from the moving image acquired by the image input unit 11. The first identification unit 131 identifies the person based on the input image; specifically, the detected person is identified based on at least one of facial features, human body features, and behavioral features in the image. The first identification unit 131 is an example of the first identifying means of the present invention.
 The second identification unit 133 identifies the person detected by the human body detection unit 12 based on the voice acquired by the voice input unit 15. The second identification unit 133 may perform identification based on acoustic features obtained by waveform analysis of the voice, or based on linguistic features obtained by natural language analysis. Identification based on acoustic features can be performed according to the degree of match with pre-registered acoustic features. Identification based on linguistic features can be performed from the meaning of the content of the output voice and the content of the response voice. The second identification unit 133 is an example of the second identifying means of the present invention.
 The person identification unit 13 determines the timing at which the voice output unit 14 speaks and the content of the spoken sentence. In this application example, the voice output unit 14 speaks (outputs voice) in order to obtain a voice response from the detected person and perform identification based on that voice. Accordingly, the person identification unit 13 determines, as the utterance timing, a timing at which voice-based identification is assumed to be necessary. Typically, the person identification unit 13 may choose as the utterance timing a moment when the reliability of identification by the first identification unit 131 is low (below a threshold TH1). The person identification unit 13 may also determine the utterance timing so that utterances are made periodically, when the person shows no movement, or when the identification results of the first identification unit 131 and the second identification unit 133 differ. The utterance content may be determined according to the reliability of the person identification: for example, a natural conversational utterance when the reliability is high, and, when the reliability is low, an utterance including the estimated person's name or one that directly asks who the detected person is. Here, an utterance including the person's name and an utterance asking who the person is are both examples of utterances that confirm who the person is.
 The utterance and response may be repeated multiple times, and the second identification unit 133 may perform identification on the multiple response voices. In that case, for example, a natural conversational utterance is made the first time, and the second identification unit 133 performs identification based on the acoustic features of the response voice. If the identification results of the first identification unit 131 and the second identification unit 133 agree, or if the reliability of the result of the second identification unit 133 is high, the person is considered to have been identified correctly; a second utterance is then unnecessary, and natural conversation may simply continue. On the other hand, if the result of the second identification unit 133 based on the first response differs from that of the first identification unit 131, or if the identification reliability is low, the content of the second utterance should confirm the person more directly than the first utterance. Since the person can then be identified from the content of the response (semantic interpretation by natural language analysis), a highly reliable identification result can be expected.
 Thus, with the individual identification device 10 according to this application example, a person in an image is identified based on two identification methods, image and voice, so that accurate identification is possible even when the image alone does not allow identification with high reliability. In addition, since voice inquiries for individual identification are kept to a minimum, the user is less likely to feel bothered. Furthermore, by determining the utterance content as described above, natural communication is possible even when asking the user a question, which also helps keep the user from feeling bothered.
(First Embodiment)
 The first embodiment of the present invention is a characteristic collection device that collects human body features and behavioral features obtained by photographing a person, and is mounted on a home communication robot 1 (hereinafter also simply referred to as the robot 1). The characteristic collection device collects these features for use in training individual identification.
[Configuration]
 FIG. 2 is a diagram showing the configuration of the robot 1. The robot 1 includes a characteristic collection device 100, a camera 200, a speaker 300, and a microphone 400. The robot 1 has a processor (arithmetic unit) such as a CPU, a main storage device, an auxiliary storage device, a communication device, and the like, and each process of the characteristic collection device 100 is executed by the processor executing a program.
 The camera 200 performs continuous imaging using visible or invisible light (for example, infrared light) and inputs the captured image data to the characteristic collection device 100. The imaging range of the camera 200 may be fixed or variable. The imaging range may be changed by changing the orientation of the camera 200 or by the robot 1 moving autonomously.
 The speaker 300 converts voice data input from the characteristic collection device 100 into acoustic waves and outputs them. The microphone 400 converts acoustic waves, such as speech uttered by the user, into voice data and outputs it to the characteristic collection device 100. The microphone 400 may be configured as a microphone array so that sound sources can be separated.
 The characteristic collection device 100 acquires, from an image captured by the camera 200, at least one of the posture, gestures, silhouette, flow line, and stay location of the person shown in the image as features of that person. The characteristic collection device 100 also identifies the person in the image and stores the feature information in association with the individual. To perform these operations, the characteristic collection device 100 includes an individual identification device 110, a feature acquisition unit 120, and a feature registration unit 130.
 The individual identification device 110 identifies a person in a moving image from the moving image input from the camera 200 and the voice input from the microphone 400. The individual identification device 110 includes an image input unit 111, a human body detection unit 112, a person identification unit 113, a voice output unit 114, and a voice input unit 115. The person identification unit 113 includes a first identification unit 1131, an utterance control unit 1132, a second identification unit 1133, and a person specifying unit 1134.
 The image input unit 111 is an interface that receives image data from the camera 200. In this embodiment, the image input unit 111 receives image data directly from the camera 200, but it may instead receive image data via a communication device or via a recording medium.
 The human body detection unit 112 detects a human body region from the input image. Any existing algorithm can be used for human body detection. The human body detection unit 112 may also detect a human body that has already been detected once by tracking it. During tracking, the human body detection unit 112 may store the type, shape, color, and the like of the clothing and accessories of a person once detected, and use these features for detection for a predetermined period (for example, one to several hours). The type, shape, and color of clothing and accessories may also be used for identification by the first identification unit 1131. Since clothing is unlikely to change within a short period, these features can be used effectively for tracking.
 The person identification unit 113 identifies (specifies) who the person detected by the human body detection unit 112 (the detected person) is.
 The first identification unit 1131 identifies the detected person based on the image acquired by the image input unit 111. Specifically, the detected person is identified based on at least one of facial features, human body features, and behavioral features in the image. For these identifications, the facial features, human body features, and behavioral features of each person to be identified are registered in advance. As described above, when the human body detection unit 112 has acquired the type, shape, and color of the detected person's clothing and accessories, these features may also be used for identification. The first identification unit 1131 also calculates the reliability of the identification result (an index indicating how likely the identification result is to be correct).
 The utterance control unit 1132 determines the timing at which the voice output unit 114 speaks and the content of the spoken sentence. The reason for speaking (outputting voice) from the voice output unit 114 is to obtain a voice response from the detected person and perform identification based on that voice. Accordingly, a timing at which voice-based identification is assumed to be necessary is determined as the utterance timing. The utterance content should be content that does not annoy the user and that allows identification based on the response voice. Specific methods for determining the utterance timing and content are described later.
 The second identification unit 1133 identifies the person detected by the human body detection unit 112 based on the voice acquired by the voice input unit 115. The second identification unit 1133 may perform identification based on acoustic features obtained by waveform analysis of the voice, or based on linguistic features obtained by natural language analysis. Identification based on acoustic features can be performed according to the degree of match with pre-registered acoustic features. Identification based on linguistic features can be performed from the meaning of the content of the output voice and the content of the response voice. The second identification unit 1133 also calculates the reliability of the identification result.
 Based on the identification result of the first identification unit 1131 (first identification result) and the identification result of the second identification unit 1133 (second identification result), the person specifying unit 1134 finally determines who the detected person is. For example, when the first and second identification results agree, the person specifying unit 1134 may take that result as the final result, or it may adopt either the first or the second identification result. The person specifying unit 1134 may also specify who the detected person is by taking into account the identification reliabilities of the first identification unit 1131 and the second identification unit 1133. For example, when the identification reliability of the first identification unit 1131 is sufficiently high, the result of the first identification unit 1131 may be taken as the final result without performing identification by the second identification unit 1133. The person specifying unit 1134 also calculates the reliability of the identification result. This reliability is used, for example, by the utterance control unit 1132 to determine the utterance timing and content.
 The voice output unit 114 acquires the text data of the utterance from the utterance control unit 1132, converts it into voice data by speech synthesis, and outputs it from the speaker 300.
 The voice input unit 115 receives a voice signal from the microphone 400, converts it into voice data, and outputs the voice data to the second identification unit 1133. The voice input unit 115 may apply preprocessing such as noise removal and speaker separation.
 The feature acquisition unit 120 acquires, from the input image, the features of the person detected by the human body detection unit 112. The features acquired by the feature acquisition unit 120 include, for example, at least one of features relating to the human body (for example, features relating to body parts, skeleton, posture, and silhouette) and features relating to behavior (features relating to gestures, flow lines, and stay locations). Hereinafter, the features acquired by the feature acquisition unit 120 are also referred to as human body/behavior features. When these features have already been computed by the human body detection unit 112 or the first identification unit 1131, the feature acquisition unit 120 may simply acquire the computed features. Conversely, features computed by the feature acquisition unit 120 may be used by the first identification unit 1131.
 The acquisition of human body/behavior features will be described with reference to FIGS. 5A and 5B. The feature acquisition unit 120 acquires, from the input image 50, skeleton information indicating the skeleton of the detected person 51. The skeleton information is obtained using, for example, OpenPose. Skeleton information is feature information indicating the human body as a whole as well as feature information indicating its parts (head, neck, shoulders, elbows, hands, hips, knees, ankles, eyes, ears, fingertips, and so on). In FIG. 5A, the skeleton (skeleton information) 52 of the detected person 51 has been detected from the image 50.
 As shown in FIG. 5B, the shape of the skeleton depends on the posture. The feature acquisition unit 120 can therefore acquire the person's posture information from the positional relationship of the body parts (skeleton information). The posture information may be information listing the relative positional relationships of the parts of the human body, or it may be the result of classifying those positional relationships. Examples of posture classes include upright, stooped, bow-legged, and knock-kneed. The feature acquisition unit 120 may also acquire silhouette information of the detected person and may derive the posture from the silhouette information.
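 As a sketch of deriving a coarse posture class from skeleton information, the fragment below assumes that keypoints are already available as named (x, y) coordinates (for example from a pose estimator such as OpenPose) and classifies only upright versus stooped from the torso tilt; the keypoint names, the 20-degree threshold, and the two-class output are illustrative assumptions.

```python
import math

# Hypothetical keypoint format: name -> (x, y) in image coordinates.
def classify_posture(kp: dict) -> str:
    """Very rough posture classification from two keypoints.

    Measures how far the neck-to-hip line tilts from vertical: a nearly
    vertical torso is treated as "upright", a strongly tilted one as
    "stooped". A real system would use many more joints and classes.
    """
    nx, ny = kp["neck"]
    hx, hy = kp["mid_hip"]
    tilt = abs(math.degrees(math.atan2(nx - hx, hy - ny)))  # 0 deg = vertical torso
    return "upright" if tilt < 20 else "stooped"

sample = {"neck": (100, 60), "mid_hip": (102, 160)}
print(classify_posture(sample))  # -> "upright"
```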
 The skeleton information, posture information, and silhouette information described above all correspond to human body features in the present invention.
 The feature acquisition unit 120 can also detect motion information (gestures) from changes in the posture information across frames. The gesture detection results include, for example, information indicating walking, bending and stretching, lying down, folding the arms, and so on. The positional relationship between the upper arm and the forearm differs depending on whether the arms are folded or not; in this way, the positional relationship of each part depends on the gesture, so gestures can be detected from the positional relationships of the parts based on the skeleton information. Gestures involving movement, such as walking or bending and stretching, may be detected using multiple pieces of skeleton information corresponding to images captured at different times. For walking, information indicating the ratio of stride length to shoulder width may also be obtained.
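 The stride-to-shoulder-width ratio mentioned above can be sketched as follows; the per-frame keypoint format and the way stride is approximated (maximum horizontal ankle separation over the observed frames) are assumptions for illustration.

```python
# Hypothetical per-frame keypoints: name -> (x, y) in image coordinates.
def stride_to_shoulder_ratio(frames: list[dict]) -> float:
    """Estimate stride / shoulder-width from per-frame keypoints.

    Stride is approximated as the maximum horizontal distance between the
    two ankles observed over the frames; shoulder width is averaged over
    the frames.
    """
    stride = max(abs(f["l_ankle"][0] - f["r_ankle"][0]) for f in frames)
    shoulder = sum(abs(f["l_shoulder"][0] - f["r_shoulder"][0]) for f in frames) / len(frames)
    return stride / shoulder

frames = [
    {"l_ankle": (90, 300), "r_ankle": (130, 300), "l_shoulder": (95, 120), "r_shoulder": (135, 120)},
    {"l_ankle": (80, 300), "r_ankle": (150, 300), "l_shoulder": (96, 120), "r_shoulder": (136, 120)},
]
print(round(stride_to_shoulder_ratio(frames), 2))  # -> 1.75
```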
 The feature acquisition unit 120 may also acquire the detected person's stay location and flow line (movement path) as features. Features relating to the stay location can be obtained by computing, for each stay-determination area, the ratio of the stay time to the length of a predetermined period (the stay rate), based on the change over time of the person's position during that period, for example the past several minutes. The stay location result is obtained in the form of a stay map (heat map) indicating the stay rate of each stay-determination area. The flow-line feature is information indicating the stay positions in time series, and is obtained in a similar manner to the stay map.
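 A minimal sketch of the stay-map computation follows, assuming one position sample per frame and a simple square grid; the cell size and the position format are illustrative assumptions.

```python
from collections import Counter

def stay_map(positions: list[tuple[float, float]], cell: float = 1.0) -> dict:
    """Return the stay rate per grid cell over the observed period.

    Each position sample is binned into a cell; the stay rate of a cell is
    the fraction of samples falling into it.
    """
    counts = Counter((int(x // cell), int(y // cell)) for x, y in positions)
    total = len(positions)
    return {cell_id: n / total for cell_id, n in counts.items()}

track = [(0.2, 0.3), (0.4, 0.2), (2.5, 0.1), (2.6, 0.2)]
print(stay_map(track))  # -> {(0, 0): 0.5, (2, 0): 0.5}
```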
 The information on gestures, stay locations, and flow lines described above all corresponds to behavioral features in the present invention.
 The feature registration unit 130 registers the human body/behavior features acquired by the feature acquisition unit 120 in a storage unit, such as a matching database, in association with the person specified by the individual identification device 110 (person identification unit 113). The feature registration unit 130 may perform registration at any timing, for example when the person identification unit 113 has identified the person with high reliability, or when tracking of the person has been completed.
[Processing]
 FIG. 3 is a flowchart showing the overall flow of the feature collection process performed by the characteristic collection device 100. The feature collection process in this embodiment will be described below with reference to FIG. 3. Note that this flowchart conceptually explains the feature collection process in this embodiment, and the process does not need to be implemented exactly as shown.
 In step S10, the human body detection unit 112 detects a human body from the input image. If no human body is detected (S11-NO), the process returns to step S10 and human body detection is performed on the next frame image to be processed. If a human body is detected (S11-YES), the process proceeds to step S12.
 In step S12, the feature acquisition unit 120 acquires the human body/behavior features of the detected person.
 In step S13, the first identification unit 1131 performs individual identification based on the human body/behavior features. Facial features may also be acquired in step S12, in which case the first identification unit 1131 may perform individual identification based on the facial features in step S13.
 Here, the reliability of image-based identification is briefly explained. Identification based on facial features can be expected to be relatively accurate (highly reliable), but it can only be performed when the user is facing the robot 1. Identification based on human body features or behavioral features, on the other hand, can be performed as long as the user's body is visible, but it is not always accurate. In particular, while features are still being collected for the initial training of human body/behavior features, highly reliable identification cannot be expected.
 The first identification unit 1131 continues to identify the person contained in the moving image and computes the current identification reliability by combining the identification reliabilities of the individual frames. In doing so, the first identification unit 1131 may use, as the current identification reliability, a weighted average in which recently performed identifications are given larger weights.
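 The recency-weighted combination of per-frame reliabilities might look like the following sketch; the exponential decay factor is an assumption, since the text only requires that recent identifications receive larger weights.

```python
def current_reliability(frame_scores: list[float], decay: float = 0.8) -> float:
    """Weighted average of per-frame reliabilities, newest weighted most."""
    n = len(frame_scores)
    weights = [decay ** (n - 1 - i) for i in range(n)]  # newest frame -> weight 1.0
    return sum(w * s for w, s in zip(weights, frame_scores)) / sum(weights)

print(round(current_reliability([0.2, 0.3, 0.9, 0.95]), 3))  # recent frames dominate
```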
 In step S14, the utterance control unit 1132 determines whether it is time to make an utterance. The timing for utterance may be set in advance as conditions, and in step S14 the utterance control unit 1132 only needs to determine whether the set conditions are satisfied.
 The following conditions can be adopted as conditions for making an utterance.
(1) Time: for example, the first utterance is made after a predetermined time (for example, 10 minutes) has elapsed since the human body was detected, and further utterances are made at predetermined intervals thereafter.
(2) Data amount: for example, an utterance is made once features corresponding to a predetermined amount of data (for example, 100 acquisitions' worth) have been acquired.
(3) Stopped activity: the detected person's behavior has not changed for a certain period, for example when a certain time has passed after the person sat down on a sofa and started watching television.
(4) Movement out of the imaging range: the detected person is predicted to move out of the imaging range, for example when the detected person moves from the current room to another room or goes out.
(5) Situations in which conversation is easy: the robot is in a situation suitable for talking with the detected person, for example when the detected person and the robot are facing each other (the robot can detect the detected person's face) and the distance between them is within a predetermined distance (for example, 3 meters).
(6) Low identification reliability: the person identification unit 113 determines the final identification result using the results of both the first identification unit 1131 and the second identification unit 1133. If the identification reliability of the first identification unit 1131 is equal to or higher than the threshold TH1, that result may be fixed as the identification result of the person identification unit 113; if the identification reliability is below the threshold TH1, an utterance may be made so that the second identification unit 1133 can perform identification. Alternatively, an utterance may be made for further identification by the second identification unit 1133 when the identification reliability, computed taking both identification results into account, falls below the threshold TH1.
 A plurality of the above conditions may be combined: for example, an utterance may be made when any one of several of the conditions is satisfied, or when several of the conditions are satisfied simultaneously. Furthermore, in situations where it is not appropriate to speak, such as when the detected person is asleep or concentrating, the utterance may be withheld even if the above conditions are satisfied.
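 Conditions (1) to (6), together with the suppression rule for inappropriate situations, can be combined as in the sketch below. The state fields, the use of any() (speaking when any single condition holds), and the numeric values other than the 10-minute, 100-sample, 3-meter, and TH1 examples from the text are assumptions.

```python
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class ObservationState:
    detected_at: float                     # time the person was first detected
    last_utterance_at: Optional[float] = None
    feature_count: int = 0                 # number of collected feature samples
    still_seconds: float = 0.0             # how long the person has shown no change
    leaving_range: bool = False            # predicted to leave the imaging range
    face_visible: bool = False
    distance_m: float = 10.0
    image_reliability: float = 0.0         # reliability of the first identification unit
    inappropriate: bool = False            # e.g. the person is asleep or concentrating

def should_speak(state: ObservationState, now: Optional[float] = None,
                 th1: float = 0.8) -> bool:
    now = time.time() if now is None else now
    if state.inappropriate:
        return False                       # never speak in inappropriate situations
    conditions = [
        now - state.detected_at > 600 and (state.last_utterance_at is None
                                           or now - state.last_utterance_at > 600),  # (1)
        state.feature_count >= 100,        # (2) data amount
        state.still_seconds > 300,         # (3) stopped activity
        state.leaving_range,               # (4) about to leave the imaging range
        state.face_visible and state.distance_m <= 3.0,  # (5) easy to talk
        state.image_reliability < th1,     # (6) low identification reliability
    ]
    return any(conditions)
```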
 Since the utterances here are intended to obtain a response voice for individual identification, they are made at timings such as those described above, but utterances at other timings are not prohibited. For example, the robot may talk to the user for communication at timings that do not satisfy the above conditions.
 If it is determined in step S14 that the utterance timing condition is satisfied, the process proceeds to step S15; otherwise, the process proceeds to step S19.
 In step S15, the utterance control unit 1132 determines the content of the utterance (utterance text). In this embodiment, the utterance control unit 1132 determines the utterance content based on the current identification reliability. FIG. 4 is a flowchart explaining the utterance content determination process of step S15.
 As shown in FIG. 4, in step S151, the utterance control unit 1132 determines the utterance level according to the current identification reliability. In this embodiment, the reliability is classified into three levels: for example, high reliability when the identification reliability is 0.8 or higher, medium reliability when it is 0.5 or higher but below 0.8, and low reliability when it is below 0.5. These thresholds are merely examples and may be determined as appropriate according to system requirements; the thresholds may also vary depending on the situation. In this embodiment the level is divided into three stages, but two stages, or four or more stages, may also be used.
 When the identification reliability is high, the process proceeds to step S152, and the utterance control unit 1132 determines the utterance content with emphasis on natural communication. For example, the utterance control unit 1132 determines utterance content that does not include the detected person's name. If the place where the person was detected is the kitchen, the utterance control unit 1132 determines, for example, "What are you doing in the kitchen?" as the utterance content.
 When the identification reliability is medium, the process proceeds to step S153, and the utterance control unit 1132 determines utterance content that confirms who the detected person is without sounding unnatural. For example, the utterance control unit 1132 uses content that includes the name corresponding to the identification result of the first identification unit 1131. If the identification result of the first identification unit 1131 is "mother", the utterance control unit 1132 determines, for example, "Mom, what are you doing in the kitchen?" as the utterance content.
 When the identification reliability is low, the process proceeds to step S154, and the utterance control unit 1132 uses content that asks more directly who the detected person is. The utterance control unit 1132 determines, for example, "Who is in the kitchen?" as the utterance content.
 The text data of the utterance content determined in step S15 is passed from the utterance control unit 1132 to the voice output unit 114. In step S16, the voice output unit 114 converts the utterance text into voice data by speech synthesis and outputs it from the speaker 300. In step S17, the voice input unit 115 acquires the response to the system utterance from the microphone 400.
 In step S18, the second identification unit 1133 performs individual identification based on the input voice. The second identification unit 1133 performs identification based on acoustic features (acoustic analysis) and identification based on linguistic features (semantic analysis). Identification based on acoustic features yields a result whenever a response voice is obtained, whereas identification based on linguistic features may leave the person unknown depending on the content of the question and the response. However, since identification based on linguistic features can take into account meanings such as "(I) am the mother", its result is considered reliable.
 In step S19, the person specifying unit 1134 specifies the detected person, taking into account the image-based identification result of the first identification unit 1131 (S13) and the voice-based identification result of the second identification unit 1133 (S18).
 If the utterance content is "What are you doing in the kitchen?", a response such as "I'm cooking right now" can be expected. In this case, the second identification unit 1133 cannot perform identification based on linguistic features, but an identification result can be obtained based on acoustic features. If the identification result of the second identification unit 1133 matches that of the first identification unit 1131, the person specifying unit 1134 can confirm that the result of the first identification unit 1131 is correct and takes it as the final result. If the two identification results differ, the person specifying unit 1134 may adopt either result, or may treat the detected person as unknown. In that case, the person specifying unit 1134 may set the identification reliability low so that the next utterance confirms more directly who the detected person is.
 If the utterance content is "Mom, what are you doing in the kitchen?", then when the person actually is the mother, a response such as "I'm cooking right now" can be expected; when the person is not the mother, a response such as "I'm not Mom" or "I'm her sister" can be expected. In either case, the second identification unit 1133 can perform both identification based on acoustic features and identification based on linguistic features. In identification based on linguistic features, whether the other party is the mother can be determined from whether the response to a question using the name "Mom" contains a phrase denying it. If the response contains a phrase indicating who the person is, the detected person can be identified from it. The identification result based on acoustic features and the one based on linguistic features may differ; in that case the linguistic result may be given priority, or the decision may take the reliability of each into account.
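 The linguistic-feature check described here (detecting a denial of the name used in the question, or a self-introduction) can be sketched with simple pattern matching; an actual system would use proper natural language analysis, and the English phrase patterns, function name, and name list below are illustrative assumptions.

```python
import re

def identify_from_response(asked_name: str, response: str, known_names: list[str]):
    """Return the person's name inferred from the response text, or None."""
    text = response.lower()
    # Direct self-introduction, e.g. "I'm her sister".
    for name in known_names:
        if re.search(rf"\bi('m| am) (the |her |your )?{name}\b", text):
            return name
    # Denial of the name used in the question, e.g. "I'm not Mom".
    if re.search(rf"\bnot (the )?{asked_name.lower()}\b", text):
        return None                    # definitely not the asked person, identity still open
    return asked_name                  # no denial: assume the asked name was correct

print(identify_from_response("Mom", "I'm cooking right now", ["mom", "dad", "sister"]))        # -> 'Mom'
print(identify_from_response("Mom", "I'm not Mom, I'm her sister", ["mom", "dad", "sister"]))  # -> 'sister'
```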
 If the utterance content is "Who is in the kitchen?", a response of the form "I am ..." can be expected. The second identification unit 1133 can therefore identify the detected person by performing identification based on linguistic features. Since the utterance content directly asks who the detected person is, the detected person can be identified more reliably from the semantic content of the response. As above, the second identification unit 1133 may also perform identification based on acoustic features in this case.
 In step S20, it is determined whether feature collection is to be continued. If feature collection is to continue, the process returns to step S10 to process the next frame.
 When feature collection ends, in step S21 the feature registration unit 130 registers the acquired human body/behavior features in a storage unit (not shown) in association with the identification result of the detected person. In the flowchart of FIG. 3, feature registration is performed when feature collection ends, but it may instead be performed when tracking of the detected person is completed. The feature registration unit 130 registers all features obtained from the start to the end of tracking in association with a single person. However, the feature registration unit 130 may divide one tracking period into multiple sub-periods and, for each sub-period, register the obtained features in association with the person given by the identification result within that sub-period. This handling can occur when the human body detection unit 112 failed to recognize that one person was replaced by another partway through and treated them as the same person, but the person identification unit 113 identified them as different persons.
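 Splitting one tracking period by the identification result of each sub-period, as described above, can be sketched as follows; the (timestamp, person, feature) record format is an assumption for illustration.

```python
from collections import defaultdict

def register_by_period(samples: list[tuple[float, str, dict]]) -> dict:
    """Group collected features by the person identified in each sub-period."""
    registry = defaultdict(list)
    for _t, person, feature in sorted(samples, key=lambda s: s[0]):
        registry[person].append(feature)
    return dict(registry)

samples = [
    (1.0, "mother", {"stride_ratio": 1.6}),
    (2.0, "mother", {"posture": "upright"}),
    (9.0, "sister", {"posture": "stooped"}),   # person changed partway through tracking
]
print(register_by_period(samples))
```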
[Advantageous Effects of This Embodiment]
 According to this embodiment, individual identification based on images and individual identification based on voice are both performed, and the final identification result is obtained by combining the two, so accurate identification is possible. In particular, when accurate identification cannot be performed from the image alone, the system speaks, obtains a voice response from the user, and performs voice-based identification, which makes accurate identification possible. Furthermore, by speaking only when the image-based identification result is unreliable, and by determining the utterance content according to the image-based identification reliability, the annoyance felt by the user can be minimized.
(Second Embodiment)
 In the first embodiment, the utterance content is determined according to the identification reliability as shown in FIG. 4. In this embodiment, the utterance content determination process is changed from that of the first embodiment, taking into account that, once a voice response is obtained from the user, identification based at least on acoustic features is possible, and that multiple utterances can be made within a single dialogue. The differences from the first embodiment are mainly described below.
 FIG. 6 is a flowchart showing the flow of the utterance content determination process in this embodiment. Since this embodiment also assumes that utterances may be made multiple times, note that the process shown in FIG. 6 is not itself the process of step S15 in FIG. 3.
 When it is time to make an utterance, the utterance control unit 1132 determines utterance content that results in natural communication, as shown in step S31. The utterance content therefore does not need to include the detected person's name; content such as "What are you doing in the kitchen?" is determined as the utterance content.
 In step S32, the content determined in step S31 is output from the speaker 300, and the second identification unit 1133 performs identification on the user's response voice acquired by the microphone 400 as a result. Here, identification based at least on acoustic features is sufficient; identification based on linguistic features may of course also be performed if possible.
 In step S33, it is determined whether the identification result of the first identification unit 1131 matches that of the second identification unit 1133. If they match, the detected person can be specified and no further utterance is necessary. If the two identification results differ, further utterance content is determined in step S34 in order to identify the person more reliably. The utterance control unit 1132 determines the content of the second utterance so that, compared with the first utterance, it confirms more directly who the detected person is. In this case, for example, the utterance content "Isn't that Mom in the kitchen?" can be used. Alternatively, as in the medium- or low-reliability cases of the first embodiment, utterance content including the detected person's name (for example, "Mom, what are you doing there?") or content that directly asks who the person is (for example, "Who is in the kitchen?") may be determined as the second utterance content.
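 The two-round dialogue of this embodiment can be summarized in the following sketch; the helper callables standing in for the speaker, microphone, and second identification unit are assumptions, and only the control flow reflects steps S31 to S34.

```python
def identify_with_dialogue(image_guess: str, speak, listen, identify_by_voice) -> str:
    """Two-round identification dialogue (steps S31 to S34)."""
    # Round 1: natural utterance without the person's name (S31/S32).
    speak("What are you doing in the kitchen?")
    voice_guess = identify_by_voice(listen())        # acoustic features are enough here
    if voice_guess == image_guess:                   # S33: results agree, done
        return image_guess
    # Round 2: confirm the person more directly (S34).
    speak(f"Isn't that {image_guess} in the kitchen?")
    return identify_by_voice(listen())               # semantic analysis expected here
```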
 According to this embodiment, the same effects as in the first embodiment are obtained, and in addition a more natural dialogue is possible when performing voice-based identification.
 In the above description, the utterance is made twice in one dialogue, but the utterance may be made three or more times in one dialogue. In that case, natural utterances may be made for the first several turns, and an utterance that confirms who the detected person is may be made only when the person cannot be identified from the voice.
 The processing of this embodiment is also applicable to the case in the first embodiment where, after an utterance with natural content was made because the image-based identification reliability was high, the voice-based identification result differs from the image-based identification result.
(Others)
 Each embodiment described above is merely an illustration of the present invention. The present invention is not limited to the specific forms described above, and various modifications are possible within the scope of its technical idea.
 In the above description, the timing at which the characteristic collection device 100 (home robot) speaks to the user and the content of that speech have been described, but this processing concerns utterances made to obtain a voice-based response from the user. When the home robot is implemented as a communication robot, the above processing does not need to be applied to utterances made simply to communicate with the user. In addition, an utterance that does not include the other party's name was given above as an example of natural communication; however, in situations or with partners for whom an utterance including the name is natural, utterance content including the name is also a natural utterance.
 In the above embodiments, an example in which the feature collection device 100 is mounted on a home robot was described, but it may instead be mounted on a surveillance camera or the like. The personal identification device 110 also does not need to be mounted on the feature collection device 100 that collects learning features; it may be implemented on its own and used to identify individuals.
(Appendix)
 A personal identification device comprising:
 voice output means (14, 114) for outputting voice;
 voice input means (15, 115) for acquiring voice responding to the output voice;
 image input means (11, 111);
 detection means (12, 112) for detecting a person from a moving image input to the image input means; and
 person identification means (13, 113) for identifying the person detected by the detection means,
 wherein the person identification means includes
  first identification means (131, 1131) for identifying the person based on the moving image, and
  second identification means (133, 1133) for identifying the person based on input voice obtained as a response to the output voice, and
 identifies the person based on a first identification result, which is the identification result by the first identification means, and a second identification result, which is the identification result by the second identification means.
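 A minimal sketch of how the means listed above could fit together is shown below; the classifier objects, their identify() methods, and the tie-breaking rule are assumptions made for illustration and are not specified by this disclosure.

```python
from dataclasses import dataclass

@dataclass
class IdentificationResult:
    person: str
    confidence: float

class PersonalIdentificationDevice:
    def __init__(self, image_identifier, voice_identifier, speaker, microphone):
        self.image_identifier = image_identifier  # first identification means
        self.voice_identifier = voice_identifier  # second identification means
        self.speaker = speaker                    # voice output means
        self.microphone = microphone              # voice input means

    def identify(self, frames) -> str:
        first = self.image_identifier.identify(frames)     # image-based result
        self.speaker.say("Hi, what are you up to?")        # output voice
        response = self.microphone.record()                # voice responding to it
        second = self.voice_identifier.identify(response)  # voice-based result
        if first.person == second.person:
            return first.person
        # Illustrative tie-breaking rule: prefer the result with higher confidence.
        return first.person if first.confidence >= second.confidence else second.person
```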
DESCRIPTION OF SYMBOLS
10: personal identification device  11: image input unit  12: human body detection unit
13: person identification unit  131: first identification unit  133: second identification unit
14: voice output unit  15: voice input unit
100: feature collection device
110: personal identification device  111: image input unit  112: human body detection unit
113: person identification unit  1131: first identification unit  1132: utterance control unit
1133: second identification unit  1134: human body specifying unit
114: voice output unit  115: voice input unit
120: feature acquisition unit  130: feature registration unit
200: camera  300: speaker  400: microphone

Claims (15)

  1.  A personal identification device comprising:
      voice output means for outputting voice;
      voice input means for acquiring voice responding to the output voice;
      image input means;
      detection means for detecting a person from a moving image input to the image input means; and
      person identification means for identifying the person detected by the detection means,
      wherein the person identification means includes
       first identification means for identifying the person based on the moving image, and
       second identification means for identifying the person based on input voice obtained as a response to the output voice, and
      identifies the person based on a first identification result, which is the identification result by the first identification means, and a second identification result, which is the identification result by the second identification means.
  2.  The personal identification device according to claim 1, wherein, when the reliability of identification by the first identification means is less than a first threshold, the person identification means performs voice output from the voice output means and identification by the second identification means, and identifies the person based on the first identification result and the second identification result.
  3.  The personal identification device according to claim 2, wherein, when the reliability of identification by the first identification means is equal to or greater than the first threshold, the person identification means takes the first identification result as the identification result of the person.
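 A sketch of this threshold gate follows; the threshold value and the run_voice_identification callable are illustrative assumptions, and claims 2 and 3 do not prescribe a particular combination rule.

```python
FIRST_THRESHOLD = 0.8  # illustrative value; the claims do not fix a number

def identify_person(image_result, image_confidence, run_voice_identification):
    if image_confidence >= FIRST_THRESHOLD:
        # Claim 3: the image-based result alone serves as the identification result.
        return image_result
    # Claim 2: trigger the voice output and voice-based identification, then combine
    # the two results (here: agreement wins, otherwise trust the voice-based result).
    voice_result = run_voice_identification()
    return image_result if image_result == voice_result else voice_result
```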
  4.  The personal identification device according to any one of claims 1 to 3, wherein the output voice from the voice output means is output at a predetermined timing, and the predetermined timing is at least one of:
      a timing at which the identification reliability of the person falls below a threshold;
      a timing at which a first predetermined time has elapsed since the person was detected;
      a timing at which a state with substantially no temporal change in the person has continued for a second predetermined time;
      a timing at which the person moves out of the imaging range; and
      a timing at which the distance between the personal identification device and the person becomes equal to or less than a predetermined distance.
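 The five trigger timings could be checked as in the sketch below; the attribute names on state and the default values are hypothetical placeholders, not values given in the claims.

```python
def should_output_voice(state, *, reliability_threshold=0.5, first_wait=5.0,
                        still_wait=3.0, near_distance=1.5) -> bool:
    # Claim 4: speaking is triggered when at least one of the timings applies.
    return (
        state.identification_reliability < reliability_threshold  # reliability dropped
        or state.seconds_since_detection >= first_wait            # first predetermined time elapsed
        or state.seconds_without_change >= still_wait             # person substantially unchanged
        or state.leaving_imaging_range                             # about to leave the frame
        or state.distance_to_person <= near_distance               # close enough to address
    )
```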
  5.  The personal identification device according to any one of claims 1 to 4, wherein the person identification means determines the content of the output voice according to the reliability of identification by the first identification means.
  6.  The personal identification device according to claim 5, wherein, when the reliability is less than a second threshold, the person identification means determines, as the content of the output voice, content that includes the name of the person of the first identification result or content that asks who the person is.
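 Choosing the utterance content from the reliability could look like the sketch below; the two threshold values and the example phrasings are assumptions added for illustration.

```python
def choose_utterance(candidate_name: str, reliability: float,
                     high_threshold: float = 0.8, second_threshold: float = 0.5) -> str:
    if reliability >= high_threshold:
        return "What are you up to?"                          # natural content, no name
    if reliability >= second_threshold:
        return f"{candidate_name}, what are you doing there?" # include the candidate's name
    # Claim 6: below the second threshold, ask who the person is.
    return "Who is in the kitchen?"
```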
  7.  The personal identification device according to any one of claims 1 to 6, wherein, when the first identification result and the second identification result do not match, the person identification means outputs a new output voice and performs identification by the second identification means based on input voice responding to the new output voice, and the content of the new output voice confirms the person more directly than the content of the previous output voice.
  8.  The personal identification device according to any one of claims 1 to 7, wherein the second identification means identifies the person by performing at least one of waveform analysis and language analysis on the input voice.
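 A sketch combining the two kinds of analysis is given below; the waveform_analyzer and language_analyzer callables (each returning per-person scores) are placeholders, and the additive score fusion is an assumption rather than anything specified by claim 8.

```python
from typing import Callable, Dict, Optional

def identify_by_response(audio: bytes, text: str,
                         waveform_analyzer: Optional[Callable[[bytes], Dict[str, float]]] = None,
                         language_analyzer: Optional[Callable[[str], Dict[str, float]]] = None
                         ) -> Optional[str]:
    scores: Dict[str, float] = {}
    if waveform_analyzer is not None:
        for person, s in waveform_analyzer(audio).items():  # speaker characteristics of the reply
            scores[person] = scores.get(person, 0.0) + s
    if language_analyzer is not None:
        for person, s in language_analyzer(text).items():   # what the reply says, e.g. "Yes, it's me"
            scores[person] = scores.get(person, 0.0) + s
    return max(scores, key=scores.get) if scores else None
```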
  9.  The personal identification device according to any one of claims 1 to 8, wherein the first identification means identifies the person based on at least one of a facial feature, a human body feature, and a behavioral feature obtained from the moving image.
  10.  A feature collection device comprising:
       the personal identification device according to any one of claims 1 to 9;
       feature acquisition means for acquiring, from the moving image input to the image input means, at least one of features relating to the human body or the behavior of the detected person; and
       feature registration means for registering the feature acquired by the feature acquisition means in association with the person identified by the person identification means.
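 The registration side could be sketched as follows; the feature_extractor callable and the in-memory registry are assumptions for illustration only.

```python
from collections import defaultdict

class FeatureCollector:
    def __init__(self, identifier, feature_extractor):
        self.identifier = identifier              # personal identification device (claims 1 to 9)
        self.feature_extractor = feature_extractor
        self.registry = defaultdict(list)         # feature registration means

    def collect(self, frames) -> None:
        person = self.identifier.identify(frames)  # identify who appears in the moving image
        features = self.feature_extractor(frames)  # body / behaviour features from the same image
        self.registry[person].extend(features)     # register the features keyed by the person
```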
  11.  A personal identification method executed by a computer, comprising:
       a detection step of detecting a person from a moving image;
       a first identification step of identifying the person based on the moving image;
       a voice output step of outputting voice;
       a voice input step of acquiring voice responding to the output voice;
       a second identification step of identifying the person based on input voice obtained as a response to the output voice; and
       a third identification step of identifying the person based on a first identification result, which is the identification result in the first identification step, and a second identification result, which is the identification result in the second identification step.
  12.  The personal identification method according to claim 11, wherein the voice output step, the voice input step, and the second identification step are performed when the reliability in the first identification step is less than a first threshold.
  13.  The personal identification method according to claim 12, wherein, when the reliability in the first identification step is equal to or greater than the first threshold, the first identification result is taken as the identification result of the person in the third identification step.
  14.  The personal identification method according to any one of claims 11 to 13, wherein, in the voice output step, voice whose content corresponds to the reliability in the first identification step is output.
  15.  A program for executing each step of the method according to any one of claims 11 to 14.
PCT/JP2019/001488 2018-03-08 2019-01-18 Individual identification device and characteristic collection device WO2019171780A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2018042204A JP6819633B2 (en) 2018-03-08 2018-03-08 Personal identification device and feature collection device
JP2018-042204 2018-03-08

Publications (1)

Publication Number Publication Date
WO2019171780A1 true WO2019171780A1 (en) 2019-09-12

Family

ID=67845907

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/001488 WO2019171780A1 (en) 2018-03-08 2019-01-18 Individual identification device and characteristic collection device

Country Status (2)

Country Link
JP (1) JP6819633B2 (en)
WO (1) WO2019171780A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021095094A1 (en) * 2019-11-11 2021-05-20 日本電気株式会社 Person state detection device, person state detection method, and non-transient computer-readable medium in which program is contained

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7451130B2 (en) 2019-10-07 2024-03-18 キヤノン株式会社 Control device, control system, control method, and program
JP7430087B2 (en) 2020-03-24 2024-02-09 株式会社フジタ speech control device
KR20230164240A (en) * 2020-08-20 2023-12-01 라모트 앳 텔-아비브 유니버시티 리미티드 Dynamic identity authentication

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004259255A (en) * 2003-02-05 2004-09-16 Fuji Photo Film Co Ltd Authentication apparatus
JP2005182159A (en) * 2003-12-16 2005-07-07 Nec Corp Personal identification system and method
JP2007156688A (en) * 2005-12-02 2007-06-21 Mitsubishi Heavy Ind Ltd User authentication device and its method


Also Published As

Publication number Publication date
JP6819633B2 (en) 2021-01-27
JP2019154575A (en) 2019-09-19

Similar Documents

Publication Publication Date Title
WO2019171780A1 (en) Individual identification device and characteristic collection device
CN107799126B (en) Voice endpoint detection method and device based on supervised machine learning
JP4462339B2 (en) Information processing apparatus, information processing method, and computer program
US20110224978A1 (en) Information processing device, information processing method and program
US11854550B2 (en) Determining input for speech processing engine
KR101749100B1 (en) System and method for integrating gesture and sound for controlling device
WO2016150001A1 (en) Speech recognition method, device and computer storage medium
Varghese et al. Overview on emotion recognition system
JP2009031951A (en) Information processor, information processing method, and computer program
JP2002182680A (en) Operation indication device
JP4730404B2 (en) Information processing apparatus, information processing method, and computer program
EP3956883A1 (en) Identifying input for speech recognition engine
JP2012038131A (en) Information processing unit, information processing method, and program
JP2013104938A (en) Information processing apparatus, information processing method, and program
JP2010165305A (en) Information processing apparatus, information processing method, and program
JP4730812B2 (en) Personal authentication device, personal authentication processing method, program therefor, and recording medium
KR20200085696A (en) Method of processing video for determining emotion of a person
CN111326152A (en) Voice control method and device
US11682389B2 (en) Voice conversation system, control system for voice conversation system, and control program, and control method
WO2021166811A1 (en) Information processing device and action mode setting method
JP2009042910A (en) Information processor, information processing method, and computer program
WO2023193803A1 (en) Volume control method and apparatus, storage medium, and electronic device
Ktistakis et al. A multimodal human-machine interaction scheme for an intelligent robotic nurse
JP7347511B2 (en) Audio processing device, audio processing method, and program
KR20230154380A (en) System and method for providing heath-care services fitting to emotion states of users by behavioral and speaking patterns-based emotion recognition results

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19763366

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19763366

Country of ref document: EP

Kind code of ref document: A1