WO2019171780A1 - Individual identification device and characteristic collection device - Google Patents

Individual identification device and characteristic collection device Download PDF

Info

Publication number
WO2019171780A1
WO2019171780A1 (PCT/JP2019/001488)
Authority
WO
WIPO (PCT)
Prior art keywords
identification
person
voice
identifying
output
Prior art date
Application number
PCT/JP2019/001488
Other languages
French (fr)
Japanese (ja)
Inventor
純平 松永
Original Assignee
オムロン株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by オムロン株式会社 (OMRON Corporation)
Publication of WO2019171780A1

Links

Images

Classifications

    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/117 Identification of persons
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/117 Identification of persons
    • A61B5/1171 Identification of persons based on the shapes or appearances of their bodies or parts thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques

Definitions

  • the present invention relates to a personal identification technique for identifying a person shown in a captured image.
  • Patent Document 1 discloses outputting a call voice to a person and selecting the optimum face to be registered based on the change in the face before and after the call voice is output. For example, it discloses selecting the face that turned around in response to the call.
  • however, Patent Document 1 merely discloses a method for acquiring face images for learning, and cannot address the problem that accurate personal identification cannot be performed unless an image showing the face is obtained at the time of identification.
  • the present invention has been made in view of the above circumstances, and an object thereof is to provide a personal identification technique capable of accurately identifying a person in an image.
  • the present invention employs a technique of identifying a person based on identification based on an image and identification based on a response voice to the output voice.
  • the first aspect of the present invention includes voice output means, voice input means, image input means, detection means, and person identification means.
  • the sound output means outputs sound.
  • the voice input means acquires a voice that responds to the output voice.
  • the image input means acquires a moving image.
  • the detection means detects a human body from the input moving image. Any method can be adopted as the human body detection method.
  • the person identifying means includes a first identifying means for performing identification based on an image and a second identifying means for performing identification based on sound.
  • the first identification means can employ, for example, an identification method that uses facial features, human body features (posture, silhouette), or behavioral features (flow line patterns, places of stay), but other identification methods may be employed as long as they are based on features obtained from the image.
  • the second identification means can employ identification based on acoustic features obtained by waveform analysis of the input voice, or on features of words or sentences obtained by natural language analysis, but other techniques may be adopted as long as they are based on the input voice.
  • the person identifying means identifies the detected person based on the identification result by the first identifying means (hereinafter referred to as the first identification result) and the identification result by the second identifying means (hereinafter referred to as the second identification result).
  • the person identifying means may be configured to also take the second identifying means into account when the first identifying means cannot perform identification with high reliability. That is, the person identifying means may perform identification by the second identifying means when the reliability of the first identifying means is less than a first threshold, and identify the person based on the first identification result and the second identification result. In the identification based on the first and second identification results, for example, the person identifying means may confirm the identification result when the two results match, and leave the result unconfirmed when they differ.
  • when the reliability of the first identification means is equal to or higher than the first threshold, the first identification result may be used as the person identification result.
  • with such a configuration, when identification can be performed with high reliability from the image alone, the processing amount is reduced and the voice inquiry is omitted, so the user is not forced to respond.
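  • As a non-limiting illustration of this decision logic, the following Python sketch (with assumed names and an assumed threshold value, not the patent's implementation) shows how an image-based result might be combined with a voice-based one only when the image-based reliability is below the first threshold.

```python
# A minimal sketch of the first aspect: combine the image-based result with a
# voice-based one only when the image-based reliability falls below the first
# threshold. Names, types and the threshold value are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Identification:
    person_id: Optional[str]   # e.g. "mother", or None if unknown
    reliability: float         # 0.0 .. 1.0

FIRST_THRESHOLD = 0.8          # the "first threshold" (assumed value)

def identify_person(first: Identification,
                    identify_by_voice: Callable[[], Identification]) -> Optional[str]:
    """Return a person id, or None if the result remains unconfirmed."""
    # Image alone is reliable enough: adopt it and skip the voice question.
    if first.reliability >= FIRST_THRESHOLD:
        return first.person_id

    # Otherwise output a question, obtain the response voice, identify from it.
    second = identify_by_voice()

    # Confirm when both methods agree; otherwise leave the result unconfirmed.
    if first.person_id is not None and first.person_id == second.person_id:
        return first.person_id
    return None
```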
  • the sound output from the sound output means may be performed at a predetermined timing.
  • the predetermined timing can be at least one of: the timing at which the identification reliability of the person becomes less than a threshold, the timing at which a first predetermined time has elapsed since the person was detected, the timing at which a state with substantially no change in the person has continued for a second predetermined time, the timing at which the person is about to move out of the imaging range, and the timing at which the distance between the information processing apparatus and the person becomes equal to or less than a predetermined distance.
  • the identification reliability of the person may be the reliability of identification by the first identification means, or may be the reliability obtained by integrating the identification results of the first identification means and the second identification means.
  • the first predetermined time may be a fixed time, or may be a time determined according to the number of feature acquisitions or the data amount of features when acquiring features from an image.
  • the person identifying unit preferably determines the content of the output voice according to the reliability of the first identifying unit. For example, when the reliability is less than a second threshold, the person identifying means can determine, as the content of the output voice, content that includes the name of the person of the first identification result or content that inquires who the person is.
  • the reliability may be divided into three or more levels, and the content of the output voice may be determined according to each level. For example, when the reliability is divided into three levels, natural content that does not include the person's name may be used at the high level, content that includes the person's name at the medium level, and content that directly asks who the person is at the low level. As an example, "What are you doing there?" can be used at the high level, "Mom, what are you doing there?" at the medium level, and "Who is there?" at the low level.
  • when the first identification result and the second identification result do not match, the person identification means may output a new voice and perform identification by the second identification means based on the input voice responding to that new output voice.
  • in this case, the content of the new output voice is preferably content that confirms the person more directly than the content of the previous output voice. When the first and second identification results do not match, it cannot be determined who the person is, so it is preferable to output a voice that confirms the person more directly and to identify who the person is from the response.
  • a second aspect of the present invention is a feature collection device comprising: the personal identification device described above; a feature acquisition unit that acquires, from the moving image input to the image input unit, at least one of the features related to the human body or behavior of the detected person; and a feature registration unit that registers the features acquired by the feature acquisition unit in association with the person identified by the person identification unit. The features collected in this way can be used for learning a classifier.
  • the present invention can be understood as an information processing system having at least a part of the above configuration or function.
  • the present invention can also be understood as an information processing method or a control method for an information processing system including at least a part of the above processing, as a program for causing a computer to execute these methods, or as a computer-readable recording medium on which such a program is non-transitorily recorded.
  • according to the present invention, a person in an image can be accurately identified.
  • FIG. 1 is a block diagram showing a configuration example of a personal identification device to which the present invention is applied.
  • FIG. 2 is a block diagram illustrating a configuration example of the feature collection device according to the first embodiment.
  • FIG. 3 is a flowchart illustrating an example of feature collection processing according to the first embodiment.
  • FIG. 4 is a flowchart illustrating an example of utterance content determination processing according to the first embodiment.
  • FIGS. 5A and 5B are diagrams illustrating human body skeleton information detected from an input image and posture detection based on the skeleton.
  • FIG. 6 is a flowchart illustrating an example of utterance content determination processing in the second embodiment.
  • FIG. 1 is a block diagram showing a configuration example of a personal identification device 10 to which the present invention is applied.
  • the personal identification device 10 identifies a person in a moving image from the input moving image and voice.
  • the personal identification device 10 includes an image input unit 11, a human body detection unit 12, a person identification unit 13, a voice output unit 14, and a voice input unit 15.
  • the image input unit 11 acquires a captured moving image (moving image data).
  • the image input unit 11 is an input terminal to which image data is input.
  • the image input unit 11 is an example of an image acquisition unit of the present invention.
  • the human body detection unit 12 detects a human body region from the processing target frame image of the moving image data. Human body detection may be performed using a detector trained by any existing algorithm.
  • the human body detection unit 12 is an example of the detection means of the present invention.
  • the voice output unit 14 outputs the speech content obtained from the person identification unit 13 as voice data.
  • the audio output unit 14 is an example of the audio output means of the present invention.
  • the voice input unit 15 acquires voice data.
  • the voice input unit 15 is an example of a voice input unit of the present invention.
  • the person identifying unit 13 identifies a person in the moving image based on the moving image acquired by the image input unit 11 and the sound acquired by the audio input unit 15. The person identification unit 13 also determines the content of the utterance for extracting the response voice used for voice identification and the timing of the utterance.
  • the person identification unit 13 is an example of a person identification unit of the present invention.
  • the first identification unit 131 identifies a person detected by the human body detection unit 12 from the moving image acquired by the image input unit 11.
  • the first identification unit 131 identifies a person based on the input image. Specifically, the detected person is identified based on at least one of a facial feature, a human body feature, and a behavior feature in the image.
  • the first identification unit 131 is an example of the first identification means of the present invention.
  • the second identification unit 133 identifies the person detected by the human body detection unit 12 based on the voice acquired by the voice input unit 15.
  • the second identification unit 133 may perform identification based on acoustic features obtained by waveform analysis of the voice, or based on language features obtained by natural language analysis.
  • the identification based on the acoustic feature amount can be performed based on the degree of coincidence with the acoustic feature amount registered in advance.
  • the identification based on the language feature amount can be performed based on the meaning of the content of the output speech and the content of the response speech.
  • the second identification unit 133 is an example of the second identification means of the present invention.
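  • As an illustration only, the sketch below shows one way such voice-based identification could be realised, assuming a speaker embedding is extracted elsewhere: an acoustic match against enrolled speakers plus a very naive language check of the response. The names, dimensions and values are assumptions, not the patent's implementation.

```python
# Illustrative sketch of the voice-based "second identification": matching of a
# speaker embedding against enrolled speakers, plus a naive language check.
from typing import Dict, Optional, Tuple
import numpy as np

REGISTERED_EMBEDDINGS: Dict[str, np.ndarray] = {
    "mother": np.random.rand(128),   # placeholders; real enrolments in practice
    "father": np.random.rand(128),
}

def identify_by_acoustics(embedding: np.ndarray) -> Tuple[str, float]:
    """Match a response-voice embedding against enrolled speakers (cosine similarity)."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {name: cosine(embedding, ref) for name, ref in REGISTERED_EMBEDDINGS.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]        # identified person and a reliability-like score

def identify_by_language(addressed_name: str, response_text: str) -> Optional[str]:
    """Naive semantic check: did the response deny the name used in the question?"""
    denials = ("no", "not", "wrong")
    if addressed_name and any(word in response_text.lower() for word in denials):
        return None                  # the person denied being `addressed_name`
    return addressed_name or None
```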
  • the person identification unit 13 determines the timing of speech from the voice output unit 14 and the content of the speech sentence.
  • the reason why the voice output unit 14 utters (outputs voice) is to obtain a voice response from the detected person and to perform identification based on that voice. Therefore, the person identifying unit 13 determines, as the utterance timing, a timing at which identification based on voice is considered necessary.
  • the person identifying unit 13 may typically determine a timing at which the reliability of identification by the first identifying unit 131 is low (less than the threshold value TH1) as the utterance timing.
  • the person identifying unit 13 may also determine the utterance timing so that it utters regularly, utters when the person shows no movement, or utters when the identification results of the first identifying unit 131 and the second identifying unit 133 differ. The utterance content may be determined according to the reliability of person identification. For example, a natural communication utterance may be used when the reliability is high, while an utterance including the name of the estimated person, or an utterance directly asking who the detected person is, may be used when the reliability is low.
  • the utterance including the name of the person and the utterance for inquiring who the person is are examples of the utterance for confirming who the person is.
  • the utterance and the response may be performed a plurality of times, and the second identifying unit 133 may perform the identification for the plurality of response sounds.
  • for example, the first utterance may be natural communication, and the second identification unit 133 performs identification based on the acoustic features of the response voice. When the identification results of the first identification unit 131 and the second identification unit 133 match, or when the reliability of the identification result of the second identification unit 133 is high, the person is considered to have been identified correctly, so the second utterance need not be made and natural communication may simply continue.
  • when the identification results do not match, the content of the second utterance is preferably changed so as to confirm the person more directly than the first utterance. Since the person can then be identified from the content of the response (semantic interpretation by natural language analysis), a highly reliable identification result can be expected.
  • in this way, since the personal identification device 10 identifies a person in an image based on two identification methods, image and voice, the person can be identified accurately even when identification cannot be performed with high reliability from the image alone. In addition, since voice questions for personal identification can be minimized, the user is less likely to find them bothersome. Furthermore, by determining the utterance content as described above, natural communication is possible even when asking the user a question, which also keeps the user from being annoyed.
  • the first embodiment of the present invention is a feature collection device that collects human body features and behavioral features obtained by photographing a person, and is mounted on a home communication robot 1 (hereinafter also simply referred to as the robot 1).
  • the feature collection device collects these features for personal identification learning.
  • FIG. 2 is a diagram illustrating a configuration of the robot 1.
  • the robot 1 includes a feature collection device 100, a camera 200, a speaker 300, and a microphone 400.
  • the robot 1 includes a processor (arithmetic unit) such as a CPU, a main storage device, an auxiliary storage device, a communication device, and the like, and each process of the feature collection device 100 is realized by the processor executing a program.
  • the camera 200 performs continuous photographing with visible light or invisible light (for example, infrared light), and inputs the photographed image data to the feature collection device 100.
  • the imaging range of the camera 200 may be fixed or variable.
  • the change of the imaging range of the camera 200 may be performed by changing the direction of the camera 200 or may be performed by the robot 1 moving autonomously.
  • the speaker 300 converts the sound data input from the feature collection device 100 into an acoustic wave and outputs the sound wave.
  • the microphone 400 converts an acoustic wave such as voice uttered by the user into voice data and outputs the voice data to the feature collection apparatus 100.
  • the microphone 400 may be configured as a microphone array so that sound sources can be separated.
  • the feature collection device 100 acquires, from the image captured by the camera 200, at least one of the posture, gesture, silhouette, flow line, and stay location of the person shown in the image as the feature of the person.
  • the feature collection apparatus 100 identifies a person in the image and stores the feature information in association with the individual.
  • the feature collection device 100 includes a personal identification device 110, a feature acquisition unit 120, and a feature registration unit 130.
  • the personal identification device 110 identifies a person in the moving image from the moving image input from the camera 200 and the sound input from the microphone 400.
  • the personal identification device 110 includes an image input unit 111, a human body detection unit 112, a person identification unit 113, an audio output unit 114, and an audio input unit 115.
  • the person identification unit 113 includes a first identification unit 1131, an utterance control unit 1132, a second identification unit 1133, and a person identification unit 1134.
  • the image input unit 111 is an interface that receives image data from the camera 200. In the present embodiment, the image input unit 111 receives image data directly from the camera 200, but it may receive image data via a communication device or via a recording medium.
  • the human body detection unit 112 detects a human body region from the input image. Any existing human detection algorithm can be used. The human body detection unit 112 may also continue to detect a once-detected human body by tracking processing. During tracking, the human body detection unit 112 may store the type, shape, color, and the like of the clothes and accessories of a person it has detected once, and use these features for detection during a predetermined time (for example, about one hour to several hours). The type, shape, and color of clothes and accessories may also be used for identification by the first identification unit 1131. Because clothes are unlikely to change within a short time, these features can be used effectively for tracking.
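  • The following hypothetical sketch illustrates this appearance-based tracking aid using a coarse clothing colour histogram that is trusted only for a limited time window; the window length and matching threshold are assumptions, not values from the patent.

```python
# Hypothetical tracking aid: remember a coarse RGB colour histogram of the
# detected person's clothing region and reuse it for a limited time window,
# since clothes rarely change within hours.
import time
import numpy as np

TRACK_WINDOW_SEC = 3 * 3600          # "about one hour to several hours" (assumed 3 h)

class AppearanceTrack:
    def __init__(self, clothing_pixels: np.ndarray):
        # clothing_pixels: (N, 3) uint8 RGB values cropped from the human body region
        self.histogram = self._histogram(clothing_pixels)
        self.created = time.time()

    @staticmethod
    def _histogram(pixels: np.ndarray) -> np.ndarray:
        hist, _ = np.histogramdd(pixels, bins=(8, 8, 8), range=((0, 256),) * 3)
        return hist.ravel() / max(hist.sum(), 1)

    def matches(self, clothing_pixels: np.ndarray, threshold: float = 0.7) -> bool:
        if time.time() - self.created > TRACK_WINDOW_SEC:
            return False             # appearance is no longer trusted for tracking
        other = self._histogram(clothing_pixels)
        overlap = float(np.minimum(self.histogram, other).sum())  # histogram intersection
        return overlap >= threshold
```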
  • the person identifying unit 113 identifies (identifies) who the person (detected person) detected by the human body detecting unit 112 is.
  • the first identification unit 1131 identifies the detected person based on the image acquired by the image input unit 111. Specifically, the detected person is identified based on at least one of facial features, human body features, and behavioral features in the image. To perform these identifications, the facial features, human body features, and behavioral features of each person to be identified are registered in advance. As described above, when the human body detection unit 112 acquires the type, shape, and color of the detected person's clothes and accessories, identification may also be performed using these features. The first identification unit 1131 also calculates the reliability of the identification result (an index indicating how likely the identification result is).
  • the utterance control unit 1132 determines the timing of uttering from the voice output unit 114 and the content of the utterance sentence.
  • the reason why utterance (voice output) is performed from the voice output unit 114 is to obtain a voice response from the detected person and to perform identification based on that voice. Therefore, the person identifying unit 113 determines, as the utterance timing, a timing at which identification based on voice is considered necessary.
  • the utterance content should be content that the user does not find bothersome and that allows identification based on the response voice. A specific method for determining the utterance timing and utterance content will be described later.
  • the second identification unit 1133 identifies the person detected by the human body detection unit 112 based on the voice acquired by the voice input unit 115.
  • the second identification unit 1133 may perform identification based on acoustic features obtained by waveform analysis of the voice, or based on language features obtained by natural language analysis.
  • the identification based on the acoustic feature amount can be performed based on the degree of coincidence with the acoustic feature amount registered in advance.
  • the identification based on the language feature amount can be performed based on the meaning of the content of the output speech and the content of the response speech.
  • the second identification unit 1133 also calculates the reliability of the identification result.
  • based on the identification result of the first identification unit 1131 (first identification result) and the identification result of the second identification unit 1133 (second identification result), the person specifying unit 1134 finally identifies who the detected person is. For example, when the first identification result and the second identification result match, the person specifying unit 1134 may use that identification result as the final result, or may adopt either the first identification result or the second identification result.
  • the person specifying unit 1134 may specify who the detected person is in consideration of the identification reliability of the first identification unit 1131 and the identification reliability of the second identification unit 1133. For example, when the identification reliability of the first identification unit 1131 is sufficiently high, the identification result of the first identification unit 1131 may be the final result without performing identification by the second identification unit 1133.
  • the person specifying unit 1134 also calculates the reliability of the identification result. This reliability is used, for example, for utterance timing and utterance content determination by the utterance control unit 1132.
  • the voice output unit 114 acquires the text data of the uttered voice from the utterance control unit 1132, converts it into voice data by voice synthesis processing, and outputs it from the speaker 300.
  • the voice input unit 115 receives a voice signal from the microphone 400, converts it into voice data, and outputs the voice data to the second identification unit 1133.
  • the voice input unit 115 may perform preprocessing such as noise removal and speaker separation.
  • the feature acquisition unit 120 acquires the feature of the person detected by the human body detection unit 112 from the input image.
  • the features acquired by the feature acquisition unit 120 include, for example, at least one of features related to the human body (for example, features related to body parts, skeleton, posture, and silhouette) and features related to behavior (features related to gestures, flow lines, and places of stay).
  • the feature acquired by the feature acquisition unit 120 is also referred to as a human body / behavior feature.
  • the feature acquisition unit 120 may also acquire features that have already been calculated elsewhere, and the features calculated by the feature acquisition unit 120 may be used by the first identification unit 1131.
  • the feature acquisition unit 120 acquires skeleton information indicating the skeleton of the detected person 51 from the input image 50.
  • the skeleton information is acquired using, for example, OpenPose.
  • the skeletal information is also characteristic information indicating a human body, and is also characteristic information indicating a part of the human body (head, neck, shoulder, elbow, hand, waist, knee, ankle, eye, ear, fingertip, etc.).
  • the skeleton (skeleton information) 52 of the detected person 51 is detected from the image 50.
  • the feature acquisition unit 120 can acquire the posture information of the person from the positional relationship (skeleton information) of the parts of the human body.
  • the posture information may be information listing the relative positional relationships of each part of the human body, or may be a result of classifying the positional relationships of the parts of the human body. Examples of the posture classification include upright, stoop, O-leg, and X-leg.
  • the feature acquisition unit 120 may acquire the silhouette information of the detected person, or may obtain the posture from the silhouette information.
  • the skeleton information, posture information, and silhouette information all correspond to the human body features in the present invention.
  • the feature acquisition unit 120 can detect motion information (a gesture) from a change in posture information of each frame.
  • as gestures, for example, information indicating walking, bending over, lying down, and the like can be obtained.
  • the positional relationship between the upper arm and the forearm differs depending on whether the arm is folded or not.
  • the positional relationship of each part depends on the gesture. Therefore, a gesture can be detected from the positional relationship of each part based on the skeleton information.
  • a gesture involving movement such as walking or bending may be detected using a plurality of pieces of skeleton information respectively corresponding to a plurality of images captured at different times. For walking, information indicating the ratio between the stride width and the shoulder width may be obtained.
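  • The sketch below illustrates, under assumed keypoint names, how a coarse posture label and the stride-to-shoulder walking feature mentioned above might be derived from 2-D skeleton keypoints such as those produced by OpenPose; an actual system would typically learn such classifications rather than use hand-written rules.

```python
# A rough sketch of rule-based posture classification and a walking feature
# derived from 2-D skeleton keypoints. Keypoint names and thresholds are
# illustrative assumptions.
from typing import Dict, Tuple

Point = Tuple[float, float]          # (x, y) in image coordinates, y grows downward

def classify_posture(keypoints: Dict[str, Point]) -> str:
    """Very coarse upright / stooping / lying classification from keypoint extents."""
    xs = [p[0] for p in keypoints.values()]
    ys = [p[1] for p in keypoints.values()]
    width, height = max(xs) - min(xs), max(ys) - min(ys)
    if height < 0.6 * width:
        return "lying"               # body extends mostly horizontally
    if abs(keypoints["head"][0] - keypoints["hip"][0]) > 0.25 * height:
        return "stooping"            # head displaced far forward of the hip
    return "upright"

def stride_to_shoulder_ratio(l_ankle: Point, r_ankle: Point,
                             l_shoulder: Point, r_shoulder: Point) -> float:
    """Walking feature from the text: ratio of stride width to shoulder width."""
    stride = abs(l_ankle[0] - r_ankle[0])
    shoulder = abs(l_shoulder[0] - r_shoulder[0]) + 1e-6
    return stride / shoulder
```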
  • the feature acquisition unit 120 may acquire the detected person's stay location and flow line (movement route) as features.
  • the features related to the place of stay can be obtained by calculating, for each stay determination area, the ratio of stay time to the length of a predetermined period (stay rate), based on the time change of the person's position during that predetermined period, such as the past several minutes.
  • the detection result of the stay location is obtained in the form of a stay map (heat map) indicating the stay rate of each stay determination area.
  • the feature related to the flow line is information indicating the stay position in time series, and is obtained by the same method as the stay map.
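  • As a hedged illustration of the stay map, the following sketch accumulates, over a sliding window of recent frames, the fraction of time the person spent in each grid cell; the grid size and window length are assumptions for illustration.

```python
# Illustrative stay map: divide the observed area into grid cells and compute,
# over a sliding window of recent frames, the fraction of time the person spent
# in each cell. The output corresponds to the "stay rate" heat map above.
from collections import deque
import numpy as np

class StayMap:
    def __init__(self, grid=(8, 8), window_frames=300):
        self.grid = grid
        self.history = deque(maxlen=window_frames)    # recent (row, col) cells

    def update(self, x_norm: float, y_norm: float) -> np.ndarray:
        """x_norm, y_norm: person position normalised to [0, 1) on the floor plan."""
        row = min(int(y_norm * self.grid[0]), self.grid[0] - 1)
        col = min(int(x_norm * self.grid[1]), self.grid[1] - 1)
        self.history.append((row, col))
        heat = np.zeros(self.grid)
        for r, c in self.history:
            heat[r, c] += 1
        return heat / len(self.history)                # stay rate per cell
```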
  • the feature registration unit 130 registers the human body / behavior feature acquired by the feature acquisition unit 120 in a storage unit such as a matching database in association with the person specified by the personal identification device 110 (person identification unit 113).
  • the timing at which the feature registration unit 130 performs registration may be arbitrary, but may be, for example, the timing at which the identification by the person identification unit 113 can be performed with high reliability, or the timing at which the person tracking is completed.
  • FIG. 3 is a flowchart showing the overall flow of feature collection processing performed by the feature collection device 100.
  • the feature collection processing in the present embodiment will be described with reference to FIG. 3. Note that this flowchart conceptually describes the feature collection processing in the present embodiment, and the processing need not be implemented exactly as shown in this flowchart.
  • in step S10, the human body detection unit 112 detects a human body from the input image. If no human body is detected (S11-NO), the process returns to step S10 and human body detection is performed on the next processing target frame image. If a human body is detected (S11-YES), the process proceeds to step S12.
  • in step S12, the feature acquisition unit 120 acquires the human body/behavior features of the detected person.
  • in step S13, the first identification unit 1131 performs personal identification based on the human body/behavior features. Facial features may also be acquired in step S12, and the first identification unit 1131 may then perform personal identification based on the facial features in step S13.
  • although identification based on facial features can be expected to have relatively high accuracy (high reliability), it can be performed only when the user is facing the robot 1.
  • identification based on human body features and behavioral features can be performed as long as the user's body appears in the image, but its accuracy is not always good. In particular, highly reliable identification is not expected at the stage of collecting features for initial learning of the human body/behavior features.
  • the first identification unit 1131 continues to identify the person included in the moving image, and calculates the identification reliability at the present time by combining the identification reliability in each frame. At this time, the first identification unit 1131 may use a weighted average obtained by assigning a large weight to the reliability of recently performed identification as the identification reliability at the present time.
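  • A minimal sketch of such a weighted combination is shown below; the exponential decay is an assumption, since the text only requires that recent identifications receive larger weights.

```python
# Combine per-frame identification reliabilities into a current reliability,
# weighting recent frames more heavily (exponential weights are an assumption).
def current_reliability(per_frame_reliabilities, decay: float = 0.9) -> float:
    """per_frame_reliabilities: list of floats, oldest first, newest last."""
    if not per_frame_reliabilities:
        return 0.0
    n = len(per_frame_reliabilities)
    weights = [decay ** (n - 1 - i) for i in range(n)]     # newest frame gets weight 1.0
    weighted = sum(w * r for w, r in zip(weights, per_frame_reliabilities))
    return weighted / sum(weights)

# Example: older frames 0.4 and 0.5, newest frame 0.9 -> result pulled towards 0.9.
print(current_reliability([0.4, 0.5, 0.9]))
```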
  • in step S14, the utterance control unit 1132 determines whether it is time to utter.
  • the conditions for the utterance timing may be set in advance, and in step S14 the utterance control unit 1132 may determine whether the set conditions are met.
  • the following conditions can be adopted as conditions for utterance.
  • Time: for example, the first utterance is performed after a predetermined time (for example, 10 minutes) has elapsed since the human body was detected, and thereafter utterances are performed at predetermined intervals.
  • Data amount: for example, an utterance is performed when features of a predetermined data amount (for example, 100 samples) have been acquired.
  • Action stop: when the detected person's action has not changed for a certain period of time; for example, when a certain time has elapsed after the person sat on the sofa and started watching television.
  • Movement outside the imaging range: when the detected person is predicted to move outside the imaging range; for example, when the detected person moves from the current room to another room.
  • Situations where speaking is easy: when the robot reaches a situation suitable for interacting with the detected person; for example, when the detected person and the robot face each other (the robot can detect the detected person's face) and the distance between them is within a predetermined distance (for example, 3 meters).
  • further, when the identification reliability of the first identification unit 1131 is less than the threshold TH1, an utterance may be performed for identification by the second identification unit 1133. Alternatively, considering the identification results of both the first identification unit 1131 and the second identification unit 1133, an utterance may be performed for further identification by the second identification unit 1133 when the combined identification reliability is less than the threshold TH1.
  • a plurality of the above conditions may be combined. For example, an utterance may be made when any one of the plurality of conditions is satisfied, or when several of the conditions are satisfied simultaneously. Furthermore, in situations where it is not appropriate to speak, such as when the detected person is asleep or concentrating, the utterance may be suppressed even if the above conditions are satisfied.
  • note that the utterances here are made for the purpose of obtaining a response voice for personal identification and are therefore performed at the timings described above, but utterances at other timings are not prohibited. For example, the robot may talk to the user for communication at timings that do not satisfy the above conditions.
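  • Combining the example conditions above, the decision of step S14 might look like the following sketch; the constants are taken from the examples in the text, while the structure and names are assumptions.

```python
# Hedged sketch of the step S14 decision: utter when any example condition
# holds, unless the moment is inappropriate (e.g. the person is asleep).
from dataclasses import dataclass

@dataclass
class PersonState:
    seconds_since_detection: float
    feature_samples: int              # number of acquired human body/behaviour features
    seconds_without_motion: float
    predicted_to_leave_frame: bool
    facing_robot: bool
    distance_m: float
    identification_reliability: float
    asleep_or_concentrating: bool

def should_utter(s: PersonState) -> bool:
    if s.asleep_or_concentrating:
        return False                                # not an appropriate moment to speak
    conditions = (
        s.seconds_since_detection >= 10 * 60,       # time condition (10 minutes)
        s.feature_samples >= 100,                   # data amount condition (100 samples)
        s.seconds_without_motion >= 60,             # action stop condition (assumed 60 s)
        s.predicted_to_leave_frame,                 # about to leave the imaging range
        s.facing_robot and s.distance_m <= 3.0,     # situation where speaking is easy
        s.identification_reliability < 0.8,         # reliability below TH1 (assumed 0.8)
    )
    return any(conditions)
```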
  • if it is determined in step S14 that the utterance timing condition is satisfied, the process proceeds to step S15; otherwise, the process proceeds to step S19.
  • in step S15, the utterance control unit 1132 determines the content of the utterance (utterance text). In the present embodiment, the utterance control unit 1132 determines the utterance content based on the current identification reliability.
  • FIG. 4 is a flowchart illustrating the utterance content determination process in step S15.
  • in step S151, the utterance control unit 1132 determines the utterance level according to the current identification reliability.
  • for example, an identification reliability of 0.8 or more is treated as high reliability, 0.5 or more and less than 0.8 as medium reliability, and less than 0.5 as low reliability.
  • these threshold values are merely examples and may be determined appropriately according to the requirements of the system. The threshold values may also change according to the situation. Furthermore, although the level is divided into three stages in the present embodiment, it may be divided into two stages, or into four or more stages.
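  • The following sketch illustrates the decision of FIG. 4 (steps S151 to S154) with the example thresholds and sentences quoted above; the function shape and the place name "the kitchen" are illustrative assumptions.

```python
# Sketch of the utterance-content decision using the example thresholds 0.8 and
# 0.5 and the example sentences from the text.
def decide_utterance(reliability: float, estimated_name: str,
                     place: str = "the kitchen") -> str:
    if reliability >= 0.8:
        # High level (step S152): natural communication, no name.
        return f"What are you doing in {place}?"
    if reliability >= 0.5:
        # Medium level (step S153): include the estimated name to confirm it softly.
        return f"{estimated_name}, what are you doing in {place}?"
    # Low level (step S154): ask directly who the person is.
    return f"Who is in {place}?"

print(decide_utterance(0.6, "Mom"))   # -> "Mom, what are you doing in the kitchen?"
```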
  • if the reliability level is high, the process proceeds to step S152, and the utterance control unit 1132 determines the utterance content with emphasis on the naturalness of communication. For example, the utterance control unit 1132 determines utterance content that does not include the name of the detected person. When the place where the person was detected is the kitchen, the utterance control unit 1132 determines, for example, "What are you doing in the kitchen?" as the utterance content.
  • if the reliability level is medium, the process proceeds to step S153, and the utterance control unit 1132 determines utterance content that confirms who the detected person is without sounding unnatural. For example, the utterance control unit 1132 uses content including the name of the person given by the identification result of the first identification unit 1131 as the utterance content. When the identification result of the first identification unit 1131 is "mother", the utterance control unit 1132 determines, for example, "Mom, what are you doing in the kitchen?" as the utterance content.
  • if the reliability level is low, in step S154 the utterance control unit 1132 sets utterance content that asks more directly who the detected person is. For example, the utterance control unit 1132 determines "Who is in the kitchen?" as the utterance content.
  • the text data of the utterance content determined in step S15 is passed from the utterance control unit 1132 to the voice output unit 114.
  • in step S16, the voice output unit 114 converts the utterance text into voice data by voice synthesis and outputs the voice data from the speaker 300.
  • in step S17, the voice input unit 115 acquires the response to the system utterance from the microphone 400.
  • in step S18, the second identification unit 1133 performs personal identification based on the input voice.
  • the second identification unit 1133 performs identification based on acoustic features (acoustic analysis) and identification based on language features (semantic analysis).
  • an identification result based on the acoustic features can be obtained whenever a response voice is obtained, but whether identification based on the language features is possible depends on the contents of the question and the response.
  • for example, when the response to an inquiry including the name can be interpreted as having the meaning "(Yes,) I am the mother", the identification result is considered reliable.
  • in step S19, the person specifying unit 1134 specifies the detected person in consideration of the image-based identification result of the first identification unit 1131 (S13) and the voice-based identification result of the second identification unit 1133 (S18).
  • in some cases, the second identification unit 1133 cannot perform identification based on the language features but can still obtain an identification result based on the acoustic features.
  • when the first identification result and the second identification result match, the person specifying unit 1134 can confirm that the identification result of the first identification unit 1131 is correct and use it as the specifying result.
  • when the two identification results differ, the person specifying unit 1134 may adopt either identification result, or may leave the detected person unidentified. In this case, the person specifying unit 1134 may set the identification reliability low and confirm more directly who the detected person is in the next utterance.
  • when the utterance includes the name of the person, the second identification unit 1133 can perform both identification based on acoustic features and identification based on language features. In identification based on language features, whether or not the other party is the "mother" can be determined by whether or not the response sentence contains a phrase denying the inquiry that used the name "mother". If the response sentence includes a phrase indicating who the person is, the detected person can be identified based on that phrase. At this time, the identification result based on the acoustic features and the identification result based on the language features may differ.
  • when the utterance content directly asks who the detected person is, the second identification unit 1133 may identify the detected person by performing identification based on the language features. Since utterance content that directly inquires who the detected person is is employed, the detected person can be identified more reliably from the semantic content of the response. At this time, as described above, the second identification unit 1133 may also perform identification based on the acoustic features.
  • in step S20, it is determined whether to continue feature collection. If feature collection is to be continued, the process returns to step S10 to process the next frame.
  • the feature registration unit 130 registers the acquired human body/behavior features in a storage unit (not shown) in association with the identification result of the detected person.
  • in the present embodiment, feature registration is performed at the timing when feature collection ends, but feature registration may instead be performed at the timing when tracking of the detected person ends.
  • in the above description, the feature registration unit 130 registers all the features obtained from the start to the end of tracking in association with one person. However, the feature registration unit 130 may divide one tracking period into a plurality of periods and register the features obtained within each period in association with the person of the identification result for that period. This is effective when the human body detection unit 112 failed to recognize that the person changed partway through and continued detecting them as the same person, while the person identification unit 113 identified them as different people.
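  • As an illustration of this registration scheme, the sketch below splits one tracking period whenever the identification result changes partway through; a plain dictionary stands in for the matching database, and all names are assumptions.

```python
# Features gathered in one tracking period are stored per identified person,
# and the period is split whenever the identification result changes.
from collections import defaultdict
from typing import Any, Dict, List, Tuple

feature_db: Dict[str, List[Any]] = defaultdict(list)   # person id -> registered features

def register_tracking_period(samples: List[Tuple[str, Any]]) -> None:
    """samples: (identified_person, feature) pairs in temporal order."""
    if not samples:
        return
    run_person, run_features = samples[0][0], []
    for person, feature in samples:
        if person != run_person:
            # Identification result changed: close the current run and start a
            # new one, so features from two different people are not mixed.
            feature_db[run_person].extend(run_features)
            run_person, run_features = person, []
        run_features.append(feature)
    feature_db[run_person].extend(run_features)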
  • in the first embodiment, the utterance content is determined according to the identification reliability. In the present embodiment, the utterance content is determined taking into account the facts that identification based at least on acoustic features can be performed whenever a voice response is obtained from the user, and that utterances can be made a plurality of times within one dialogue.
  • the processing is changed from that of the first embodiment, and the following description focuses mainly on the differences from the first embodiment.
  • FIG. 6 is a flowchart showing the flow of the utterance content determination processing in the present embodiment. Note that, since the present embodiment also assumes that utterances are made a plurality of times, the processing shown in FIG. 6 does not simply correspond to the processing of step S15 in FIG. 3.
  • in the present embodiment, when it is time to utter, the utterance control unit 1132 first determines utterance content that constitutes natural communication (step S31). Therefore, the name of the detected person need not be included in the utterance content, and content such as "What are you doing in the kitchen?" is determined as the utterance content.
  • in step S32, the content determined in step S31 is output from the speaker 300, and the second identification unit 1133 performs identification on the user's response voice acquired by the microphone 400. Here, at least identification based on acoustic features is performed, and identification based on language features is also performed where possible.
  • in step S33, it is determined whether the identification result of the first identification unit 1131 matches the identification result of the second identification unit 1133. If they match, the detected person has been identified, so no further utterance is necessary. If the two identification results differ, further utterance content is determined in step S34 in order to identify the person more reliably.
  • in step S34, the utterance control unit 1132 determines, as the content of the second utterance, content that confirms who the detected person is more directly than the first utterance. In the present case, for example, the utterance content "Isn't that Mom in the kitchen?" can be adopted.
  • alternatively, utterance content including the name of the detected person (for example, "Mom, what are you doing there?") or utterance content that directly asks who the person is (for example, "Who is in the kitchen?") may be determined as the second utterance content.
  • in the above example, the utterance is performed twice in one dialogue, but the utterance may be performed three or more times in one dialogue. In that case, natural utterances may be made for the first several times, and when the person cannot be identified based on the voice, an utterance that confirms who the detected person is may then be made.
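  • A rough sketch of this two-stage dialogue flow is shown below, assuming a hypothetical helper that outputs an utterance and returns the acoustic and semantic identification results extracted from the response.

```python
# Two-stage dialogue: a natural first utterance, then a more direct second
# question only when the voice-based result disagrees with the image-based one.
from typing import Callable, Optional, Tuple

def two_stage_identification(
        image_result: str,
        speak_and_listen: Callable[[str], Tuple[Optional[str], Optional[str]]]
) -> Optional[str]:
    # Steps S31/S32: natural utterance, identify the response by its acoustics.
    acoustic_id, _ = speak_and_listen("What are you doing in the kitchen?")
    if acoustic_id == image_result:
        return image_result            # step S33: results match, no second question

    # Step S34: results differ, confirm the person more directly.
    _, semantic_id = speak_and_listen("Who is in the kitchen?")
    return semantic_id                 # may be None if no usable answer was obtained
```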
  • the processing of the present embodiment is also applicable to the case where, as in the first embodiment, an utterance with natural content is made because the identification reliability based on the image is high, and the voice-based identification result then turns out to differ from the image-based identification result.
  • in the above description, the timing at which the feature collection device 100 (home robot) speaks to the user and the content of those utterances are described; this is processing for utterances whose purpose is to obtain a voice-based response from the user.
  • when the home robot is implemented as a communication robot, it is not necessary to apply the above processing to utterances made simply to communicate with the user.
  • in the above description, an utterance that does not include the name of the other party is cited as an example of a natural utterance. However, in a scene, or with a partner, for which an utterance including the name is natural, utterance content that includes the name constitutes a natural utterance.
  • the feature collection device 100 is mounted on a home robot, but it may be mounted on a surveillance camera or the like.
  • the personal identification device 110 does not need to be mounted on the feature collection device 100 that collects learning features, and may be used alone to identify an individual.
  • A personal identification device comprising:
  • voice output means (14, 114) for outputting voice;
  • voice input means (15, 115) for acquiring voice in response to the output voice;
  • image input means (11, 111);
  • detection means (12, 112) for detecting a person from a moving image input to the image input means; and
  • person identification means (13, 113) for identifying the person detected by the detection means,
  • wherein the person identification means includes first identification means (131, 1131) for identifying the person based on the moving image and second identification means (133, 1133) for identifying the person based on the input voice obtained as a response to the output voice, and identifies the person based on a first identification result that is the identification result by the first identification means and a second identification result that is the identification result by the second identification means.

Landscapes

  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Veterinary Medicine (AREA)
  • Animal Behavior & Ethology (AREA)
  • Public Health (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Pathology (AREA)
  • Biomedical Technology (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Surgery (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
  • Collating Specific Patterns (AREA)
  • User Interface Of Digital Computer (AREA)
  • Image Analysis (AREA)

Abstract

Provided is an individual identification device characterized by comprising a voice output means for outputting a voice, a voice input means for acquiring a voice in response to the outputted voice, an image input means, a detection means for detecting a person from a motion image inputted to the image input means, and a person identification means for identifying the person detected by the detection means, wherein the person identification means comprises a first identification means for identifying the person on the basis of the motion image and a second identification means for identifying the person on the basis of the inputted voice obtained as the response to the outputted voice, and identifies the person on the basis of a first identification result being the result of the identification performed by the first identification means and a second identification result being the result of the identification performed by the second identification means.

Description

Personal identification device and feature collection device
The present invention relates to a personal identification technique for identifying a person shown in a captured image.
There are technologies that use face recognition for personal identification, but a person cannot be identified when the subject does not face the camera. Within a household, there is also the problem that family members have genetically similar faces, which makes discrimination difficult. It is therefore conceivable to perform identification using human body features (posture, silhouette) and behavioral features (flow line patterns, places of stay) so that identification is possible even in scenes where the face cannot be captured.
In order to perform identification using human body features and behavioral features, it is necessary to collect the human body features and behavioral features of each target person and perform learning. In addition, since behavioral features may change with time and environment, it is desirable to continuously collect these features and respond to such changes by updating the registered information used for identification.
To collect human body features and behavioral features, it is necessary to identify who the detected person is, but there are situations in which accurate personal identification cannot be performed by image-based identification alone. As described above, face recognition cannot be used unless the face can be captured, and if learning of human body features and behavioral features is insufficient, identification based on these features is also inaccurate.
Patent Document 1 discloses outputting a call voice to a person and selecting the optimum face to be registered based on the change in the face before and after the call voice is output; for example, it discloses selecting the face that turned around in response to the call. However, Patent Document 1 merely discloses a method for acquiring face images for learning, and cannot address the problem that accurate personal identification cannot be performed unless an image showing the face is obtained at the time of identification.
Patent Document 1: JP 2013-182325 A
The present invention has been made in view of the above circumstances, and an object thereof is to provide a personal identification technique capable of accurately identifying a person in an image.
In order to achieve the above object, the present invention employs a technique of identifying a person based on both identification based on an image and identification based on a voice response to an output voice.
Specifically, a first aspect of the present invention includes voice output means, voice input means, image input means, detection means, and person identification means. The voice output means outputs voice. The voice input means acquires voice responding to the output voice. The image input means acquires a moving image. The detection means detects a human body from the input moving image; any human body detection method may be adopted. The person identification means includes first identification means that performs identification based on the image and second identification means that performs identification based on the voice. The first identification means can employ, for example, an identification method using facial features, human body features (posture, silhouette), or behavioral features (flow line patterns, places of stay), but other identification methods may be employed as long as they are based on features obtained from the image. The second identification means can employ identification based on acoustic features obtained by waveform analysis of the input voice or on features of words or sentences obtained by natural language analysis, but other techniques may be adopted as long as they are based on the input voice. The person identification means identifies the detected person based on the identification result by the first identification means (hereinafter, the first identification result) and the identification result by the second identification means (hereinafter, the second identification result).
According to such a configuration, even when a person cannot be identified accurately based only on the image, the person can be identified accurately by making the determination together with the voice-based identification result.
 本態様において、人物識別手段は、第1識別手段が信頼度高く識別できない場合に第2識別手段も考慮した識別を行うように構成してもよい。すなわち、前記人物識別手段は、前記第1識別手段による信頼度が第1閾値未満の場合に、前記第2識別手段による識別を行って、前記第1識別結果と前記第2識別結果とに基づいて前記人物を識別するように構成してもよい。人物識別手段は、第1識別結果と第2識別結果とに基づく識別では、例えば、第1識別結果と第2識別結果が一致した場合に識別結果を確定し、2つの識別結果が異なる場合には識別結果を未確定としてもよい。 In this aspect, the person identifying means may be configured to perform identification considering the second identifying means when the first identifying means cannot be identified with high reliability. That is, the person identifying means performs identification by the second identifying means when the reliability by the first identifying means is less than a first threshold, and based on the first identification result and the second identification result. The person may be identified. In the identification based on the first identification result and the second identification result, for example, the person identification means determines the identification result when the first identification result and the second identification result match, and the two identification results are different. The identification result may be unconfirmed.
 また本態様において、前記第1識別手段による信頼度が前記第1閾値以上の場合は、前記第1識別結果を、前記人物の識別結果としてもよい。 Further, in this aspect, when the reliability by the first identification means is equal to or higher than the first threshold, the first identification result may be the person identification result.
 このような構成によれば、画像のみから信頼度高く識別が行える場合に、処理量を削減できるとともに、音声での問いかけも省略されてユーザに対応を強いる必要がなくなる。 According to such a configuration, when the identification can be performed with high reliability only from the image, the processing amount can be reduced, and the question by voice is also omitted, so that it is not necessary to force the user to deal with it.
 本態様において、前記音声出力手段からの音声出力は所定のタイミングで行わればよい。この所定のタイミングは、人物の識別信頼度が閾値未満となったタイミング、人物が検出されたタイミングから第1の所定時間が経過したタイミング、人物の時間変化が略無い状態が第2の所定時間継続したタイミング、人物が撮像範囲外へ出るタイミング、情報処理装置と前記人物の間の距離が所定距離以下となったタイミング、の少なくとも何れかとすることができる。人物の識別信頼度は、第1識別手段による識別の信頼度であってもよいし、第1識別手段と第2識別手段の識別結果を統合した信頼度であってもよい。第1の所定時間は、固定の時間であってもよいし、画像から特徴を取得する場合には特徴の取得回数や特徴のデータ量に応じて決定される時間であってもよい。 In this aspect, the sound output from the sound output means may be performed at a predetermined timing. The predetermined timing includes the timing at which the person's identification reliability becomes less than the threshold, the timing at which the first predetermined time has elapsed from the timing at which the person is detected, and the state in which there is substantially no change in the time of the person at the second predetermined time. It can be at least one of the continued timing, the timing when the person goes out of the imaging range, and the timing when the distance between the information processing apparatus and the person becomes equal to or less than a predetermined distance. The identification reliability of the person may be the reliability of identification by the first identification means, or may be the reliability obtained by integrating the identification results of the first identification means and the second identification means. The first predetermined time may be a fixed time, or may be a time determined according to the number of feature acquisitions or the data amount of features when acquiring features from an image.
 In this aspect, the person identifying means preferably determines the content of the output voice according to the reliability of the first identifying means. For example, when the reliability is below a second threshold, the person identifying means may determine, as the content of the output voice, content that includes the name of the person given by the first identification result, or content that asks who the person is. The reliability may also be divided into three or more levels, with the content of the output voice determined for each level. For example, with three levels, the content may be natural conversation that does not include the person's name at the high level, content that includes the person's name at the middle level, and a direct question about who the person is at the low level. As one example: "What are you doing there?" at the high level, "Mom, what are you doing there?" at the middle level, and "Who is there?" at the low level.
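 As a rough illustration of the level-based selection of output-voice content, the sketch below hard-codes phrases like the examples above and uses the 0.8 and 0.5 level boundaries that appear as examples in the first embodiment; the function name, parameters, and English wording are illustrative assumptions.

```python
def choose_utterance(reliability: float, estimated_name: str, place: str) -> str:
    """Select output-voice content from the image-based identification reliability."""
    HIGH, LOW = 0.8, 0.5                       # example level boundaries from the text
    if reliability >= HIGH:                    # high: natural content, no name
        return f"What are you doing in the {place}?"
    if reliability >= LOW:                     # medium: include the estimated name
        return f"{estimated_name}, what are you doing in the {place}?"
    return f"Who is in the {place}?"           # low: ask directly who the person is

print(choose_utterance(0.9, "Mom", "kitchen"))   # natural small talk
print(choose_utterance(0.6, "Mom", "kitchen"))   # includes the name
print(choose_utterance(0.3, "Mom", "kitchen"))   # direct question
```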
 Also in this aspect, when the first identification result and the second identification result do not match, the person identifying means may output a new voice and perform identification with the second identifying means based on the input voice that responds to the new output voice. In this case, the content of the new output voice is preferably content that confirms the person more directly than the previous output voice. When the first and second identification results do not match, it is not possible to determine who the person is, so a voice that more directly confirms the person's identity should be output, and the person should be identified from the response.
 A second aspect of the present invention is a characteristic collection device comprising: the individual identification device described above; feature acquisition means for acquiring, from the moving image input to the image input means, at least one of features relating to the human body or behavior of the detected person; and feature registration means for registering the features acquired by the feature acquisition means in association with the person identified by the person identifying means. The features collected in this way can be used for training a classifier.
 The present invention can also be understood as an information processing system having at least part of the above configuration or functions. It can further be understood as an information processing method or a control method of an information processing system including at least part of the above processing, as a program for causing a computer to execute these methods, or as a non-transitory computer-readable recording medium on which such a program is recorded. The above configurations and processes can be combined with one another to constitute the present invention as long as no technical contradiction arises.
 According to the present invention, a person in an image can be identified accurately.
FIG. 1 is a block diagram showing a configuration example of an individual identification device to which the present invention is applied.
FIG. 2 is a block diagram showing a configuration example of the characteristic collection device according to the first embodiment.
FIG. 3 is a flowchart showing an example of the feature collection process in the first embodiment.
FIG. 4 is a flowchart showing an example of the utterance content determination process in the first embodiment.
FIGS. 5A and 5B are diagrams illustrating skeleton information of a human body detected from an input image and posture detection based on the skeleton.
FIG. 6 is a flowchart showing an example of the utterance content determination process in the second embodiment.
<Application Example>
 An application example of the present invention will be described. When collecting the human body features and behavioral features of an individual, it is necessary to identify who the person being observed is. Face recognition is a well-known image-based individual identification method, but a face cannot always be captured, so accurate identification is not always possible. Identification based on human body features or behavioral features is also possible; however, while features are still being collected for such identification, it is not realistic to rely entirely on identification based on those same features.
 FIG. 1 is a block diagram showing a configuration example of an individual identification device 10 to which the present invention is applied. The individual identification device 10 identifies a person in a moving image from the input moving image and voice. The individual identification device 10 includes an image input unit 11, a human body detection unit 12, a person identification unit 13, a voice output unit 14, and a voice input unit 15.
 The image input unit 11 acquires a captured moving image (moving image data). For example, the image input unit 11 is an input terminal to which image data is input. The image input unit 11 is an example of the image input means of the present invention.
 The human body detection unit 12 detects a region that appears to be a human body (human body region) from the frame image being processed in the moving image data. Human body detection may be performed using a detector trained by any existing algorithm. The human body detection unit 12 is an example of the detection means of the present invention.
 The voice output unit 14 converts the utterance content obtained from the person identification unit 13 into voice data and outputs it. The voice output unit 14 is an example of the voice output means of the present invention.
 The voice input unit 15 acquires voice data. The voice input unit 15 is an example of the voice input means of the present invention.
 The person identification unit 13 identifies a person in the moving image based on the moving image acquired by the image input unit 11 and the voice acquired by the voice input unit 15. The person identification unit 13 also determines the content of an utterance for eliciting a response voice used for voice-based identification, and the timing of that utterance. The person identification unit 13 is an example of the person identifying means of the present invention.
 The first identification unit 131 identifies the person detected by the human body detection unit 12 from the moving image acquired by the image input unit 11. The first identification unit 131 identifies the person based on the input image; specifically, the detected person is identified based on at least one of facial features, human body features, and behavioral features in the image. The first identification unit 131 is an example of the first identifying means of the present invention.
 The second identification unit 133 identifies the person detected by the human body detection unit 12 based on the voice acquired by the voice input unit 15. The second identification unit 133 may perform identification based on acoustic features obtained by waveform analysis of the voice, or based on linguistic features obtained by natural language analysis. Identification based on acoustic features can be performed according to the degree of match with pre-registered acoustic features. Identification based on linguistic features can be performed from the meaning of the content of the output voice and the content of the response voice. The second identification unit 133 is an example of the second identifying means of the present invention.
 The person identification unit 13 determines the timing at which the voice output unit 14 speaks and the content of the spoken sentence. In this application example, the voice output unit 14 speaks (outputs voice) in order to obtain a voice response from the detected person and perform identification based on that voice. Accordingly, the person identification unit 13 determines, as the utterance timing, a timing at which voice-based identification is assumed to be necessary. Typically, the person identification unit 13 may choose as the utterance timing a moment when the reliability of identification by the first identification unit 131 is low (below a threshold TH1). The person identification unit 13 may also determine the utterance timing so that utterances are made periodically, when the person shows no movement, or when the identification results of the first identification unit 131 and the second identification unit 133 differ. The utterance content may be determined according to the reliability of the person identification: for example, a natural conversational utterance when the reliability is high, and, when the reliability is low, an utterance including the estimated person's name or one that directly asks who the detected person is. Here, an utterance including the person's name and an utterance asking who the person is are both examples of utterances that confirm who the person is.
 The utterance and response may be repeated multiple times, and the second identification unit 133 may perform identification on the multiple response voices. In that case, for example, a natural conversational utterance is made the first time, and the second identification unit 133 performs identification based on the acoustic features of the response voice. If the identification results of the first identification unit 131 and the second identification unit 133 agree, or if the reliability of the result of the second identification unit 133 is high, the person is considered to have been identified correctly; a second utterance is then unnecessary, and natural conversation may simply continue. On the other hand, if the result of the second identification unit 133 based on the first response differs from that of the first identification unit 131, or if the identification reliability is low, the content of the second utterance should confirm the person more directly than the first utterance. Since the person can then be identified from the content of the response (semantic interpretation by natural language analysis), a highly reliable identification result can be expected.
 Thus, with the individual identification device 10 according to this application example, a person in an image is identified based on two identification methods, image and voice, so that accurate identification is possible even when the image alone does not allow identification with high reliability. In addition, since voice inquiries for individual identification are kept to a minimum, the user is less likely to feel bothered. Furthermore, by determining the utterance content as described above, natural communication is possible even when asking the user a question, which also helps keep the user from feeling bothered.
(First Embodiment)
 The first embodiment of the present invention is a characteristic collection device that collects human body features and behavioral features obtained by photographing a person, and is mounted on a home communication robot 1 (hereinafter also simply referred to as the robot 1). The characteristic collection device collects these features for use in training individual identification.
[Configuration]
 FIG. 2 is a diagram showing the configuration of the robot 1. The robot 1 includes a characteristic collection device 100, a camera 200, a speaker 300, and a microphone 400. The robot 1 has a processor (arithmetic unit) such as a CPU, a main storage device, an auxiliary storage device, a communication device, and the like, and each process of the characteristic collection device 100 is executed by the processor executing a program.
 The camera 200 performs continuous imaging using visible or invisible light (for example, infrared light) and inputs the captured image data to the characteristic collection device 100. The imaging range of the camera 200 may be fixed or variable. The imaging range may be changed by changing the orientation of the camera 200 or by the robot 1 moving autonomously.
 The speaker 300 converts voice data input from the characteristic collection device 100 into acoustic waves and outputs them. The microphone 400 converts acoustic waves, such as speech uttered by the user, into voice data and outputs it to the characteristic collection device 100. The microphone 400 may be configured as a microphone array so that sound sources can be separated.
 The characteristic collection device 100 acquires, from an image captured by the camera 200, at least one of the posture, gestures, silhouette, flow line, and stay location of the person shown in the image as features of that person. The characteristic collection device 100 also identifies the person in the image and stores the feature information in association with the individual. To perform these operations, the characteristic collection device 100 includes an individual identification device 110, a feature acquisition unit 120, and a feature registration unit 130.
 The individual identification device 110 identifies a person in a moving image from the moving image input from the camera 200 and the voice input from the microphone 400. The individual identification device 110 includes an image input unit 111, a human body detection unit 112, a person identification unit 113, a voice output unit 114, and a voice input unit 115. The person identification unit 113 includes a first identification unit 1131, an utterance control unit 1132, a second identification unit 1133, and a person specifying unit 1134.
 The image input unit 111 is an interface that receives image data from the camera 200. In this embodiment, the image input unit 111 receives image data directly from the camera 200, but it may instead receive image data via a communication device or via a recording medium.
 The human body detection unit 112 detects a human body region from the input image. Any existing algorithm can be used for human body detection. The human body detection unit 112 may also detect a human body that has already been detected once by tracking it. During tracking, the human body detection unit 112 may store the type, shape, color, and the like of the clothing and accessories of a person once detected, and use these features for detection for a predetermined period (for example, one to several hours). The type, shape, and color of clothing and accessories may also be used for identification by the first identification unit 1131. Since clothing is unlikely to change within a short period, these features can be used effectively for tracking.
 The person identification unit 113 identifies (specifies) who the person detected by the human body detection unit 112 (the detected person) is.
 The first identification unit 1131 identifies the detected person based on the image acquired by the image input unit 111. Specifically, the detected person is identified based on at least one of facial features, human body features, and behavioral features in the image. For these identifications, the facial features, human body features, and behavioral features of each person to be identified are registered in advance. As described above, when the human body detection unit 112 has acquired the type, shape, and color of the detected person's clothing and accessories, these features may also be used for identification. The first identification unit 1131 also calculates the reliability of the identification result (an index indicating how likely the identification result is to be correct).
 The utterance control unit 1132 determines the timing at which the voice output unit 114 speaks and the content of the spoken sentence. The reason for speaking (outputting voice) from the voice output unit 114 is to obtain a voice response from the detected person and perform identification based on that voice. Accordingly, a timing at which voice-based identification is assumed to be necessary is determined as the utterance timing. The utterance content should be content that does not annoy the user and that allows identification based on the response voice. Specific methods for determining the utterance timing and content are described later.
 The second identification unit 1133 identifies the person detected by the human body detection unit 112 based on the voice acquired by the voice input unit 115. The second identification unit 1133 may perform identification based on acoustic features obtained by waveform analysis of the voice, or based on linguistic features obtained by natural language analysis. Identification based on acoustic features can be performed according to the degree of match with pre-registered acoustic features. Identification based on linguistic features can be performed from the meaning of the content of the output voice and the content of the response voice. The second identification unit 1133 also calculates the reliability of the identification result.
 Based on the identification result of the first identification unit 1131 (first identification result) and the identification result of the second identification unit 1133 (second identification result), the person specifying unit 1134 finally determines who the detected person is. For example, when the first and second identification results agree, the person specifying unit 1134 may take that result as the final result, or it may adopt either the first or the second identification result. The person specifying unit 1134 may also specify who the detected person is by taking into account the identification reliabilities of the first identification unit 1131 and the second identification unit 1133. For example, when the identification reliability of the first identification unit 1131 is sufficiently high, the result of the first identification unit 1131 may be taken as the final result without performing identification by the second identification unit 1133. The person specifying unit 1134 also calculates the reliability of the identification result. This reliability is used, for example, by the utterance control unit 1132 to determine the utterance timing and content.
 The voice output unit 114 acquires the text data of the utterance from the utterance control unit 1132, converts it into voice data by speech synthesis, and outputs it from the speaker 300.
 The voice input unit 115 receives a voice signal from the microphone 400, converts it into voice data, and outputs the voice data to the second identification unit 1133. The voice input unit 115 may apply preprocessing such as noise removal and speaker separation.
 The feature acquisition unit 120 acquires, from the input image, the features of the person detected by the human body detection unit 112. The features acquired by the feature acquisition unit 120 include, for example, at least one of features relating to the human body (for example, features relating to body parts, skeleton, posture, and silhouette) and features relating to behavior (features relating to gestures, flow lines, and stay locations). Hereinafter, the features acquired by the feature acquisition unit 120 are also referred to as human body/behavior features. When these features have already been computed by the human body detection unit 112 or the first identification unit 1131, the feature acquisition unit 120 may simply acquire the computed features. Conversely, features computed by the feature acquisition unit 120 may be used by the first identification unit 1131.
 The acquisition of human body/behavior features will be described with reference to FIGS. 5A and 5B. The feature acquisition unit 120 acquires, from the input image 50, skeleton information indicating the skeleton of the detected person 51. The skeleton information is obtained using, for example, OpenPose. Skeleton information is feature information indicating the human body as a whole as well as feature information indicating its parts (head, neck, shoulders, elbows, hands, hips, knees, ankles, eyes, ears, fingertips, and so on). In FIG. 5A, the skeleton (skeleton information) 52 of the detected person 51 has been detected from the image 50.
 As shown in FIG. 5B, the shape of the skeleton depends on the posture. The feature acquisition unit 120 can therefore acquire the person's posture information from the positional relationship of the body parts (skeleton information). The posture information may be information listing the relative positional relationships of the parts of the human body, or it may be the result of classifying those positional relationships. Examples of posture classes include upright, stooped, bow-legged, and knock-kneed. The feature acquisition unit 120 may also acquire silhouette information of the detected person and may derive the posture from the silhouette information.
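 As a sketch of deriving a coarse posture class from skeleton information, the fragment below assumes that keypoints are already available as named (x, y) coordinates (for example from a pose estimator such as OpenPose) and classifies only upright versus stooped from the torso tilt; the keypoint names, the 20-degree threshold, and the two-class output are illustrative assumptions.

```python
import math

# Hypothetical keypoint format: name -> (x, y) in image coordinates.
def classify_posture(kp: dict) -> str:
    """Very rough posture classification from two keypoints.

    Measures how far the neck-to-hip line tilts from vertical: a nearly
    vertical torso is treated as "upright", a strongly tilted one as
    "stooped". A real system would use many more joints and classes.
    """
    nx, ny = kp["neck"]
    hx, hy = kp["mid_hip"]
    tilt = abs(math.degrees(math.atan2(nx - hx, hy - ny)))  # 0 deg = vertical torso
    return "upright" if tilt < 20 else "stooped"

sample = {"neck": (100, 60), "mid_hip": (102, 160)}
print(classify_posture(sample))  # -> "upright"
```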
 The skeleton information, posture information, and silhouette information described above all correspond to human body features in the present invention.
 The feature acquisition unit 120 can also detect motion information (gestures) from changes in the posture information across frames. The gesture detection results include, for example, information indicating walking, bending and stretching, lying down, folding the arms, and so on. The positional relationship between the upper arm and the forearm differs depending on whether the arms are folded or not; in this way, the positional relationship of each part depends on the gesture, so gestures can be detected from the positional relationships of the parts based on the skeleton information. Gestures involving movement, such as walking or bending and stretching, may be detected using multiple pieces of skeleton information corresponding to images captured at different times. For walking, information indicating the ratio of stride length to shoulder width may also be obtained.
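 The stride-to-shoulder-width ratio mentioned above can be sketched as follows; the per-frame keypoint format and the way stride is approximated (maximum horizontal ankle separation over the observed frames) are assumptions for illustration.

```python
# Hypothetical per-frame keypoints: name -> (x, y) in image coordinates.
def stride_to_shoulder_ratio(frames: list[dict]) -> float:
    """Estimate stride / shoulder-width from per-frame keypoints.

    Stride is approximated as the maximum horizontal distance between the
    two ankles observed over the frames; shoulder width is averaged over
    the frames.
    """
    stride = max(abs(f["l_ankle"][0] - f["r_ankle"][0]) for f in frames)
    shoulder = sum(abs(f["l_shoulder"][0] - f["r_shoulder"][0]) for f in frames) / len(frames)
    return stride / shoulder

frames = [
    {"l_ankle": (90, 300), "r_ankle": (130, 300), "l_shoulder": (95, 120), "r_shoulder": (135, 120)},
    {"l_ankle": (80, 300), "r_ankle": (150, 300), "l_shoulder": (96, 120), "r_shoulder": (136, 120)},
]
print(round(stride_to_shoulder_ratio(frames), 2))  # -> 1.75
```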
 The feature acquisition unit 120 may also acquire the detected person's stay location and flow line (movement path) as features. Features relating to the stay location can be obtained by computing, for each stay-determination area, the ratio of the stay time to the length of a predetermined period (the stay rate), based on the change over time of the person's position during that period, for example the past several minutes. The stay location result is obtained in the form of a stay map (heat map) indicating the stay rate of each stay-determination area. The flow-line feature is information indicating the stay positions in time series, and is obtained in a similar manner to the stay map.
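 A minimal sketch of the stay-map computation follows, assuming one position sample per frame and a simple square grid; the cell size and the position format are illustrative assumptions.

```python
from collections import Counter

def stay_map(positions: list[tuple[float, float]], cell: float = 1.0) -> dict:
    """Return the stay rate per grid cell over the observed period.

    Each position sample is binned into a cell; the stay rate of a cell is
    the fraction of samples falling into it.
    """
    counts = Counter((int(x // cell), int(y // cell)) for x, y in positions)
    total = len(positions)
    return {cell_id: n / total for cell_id, n in counts.items()}

track = [(0.2, 0.3), (0.4, 0.2), (2.5, 0.1), (2.6, 0.2)]
print(stay_map(track))  # -> {(0, 0): 0.5, (2, 0): 0.5}
```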
 The information on gestures, stay locations, and flow lines described above all corresponds to behavioral features in the present invention.
 The feature registration unit 130 registers the human body/behavior features acquired by the feature acquisition unit 120 in a storage unit, such as a matching database, in association with the person specified by the individual identification device 110 (person identification unit 113). The feature registration unit 130 may perform registration at any timing, for example when the person identification unit 113 has identified the person with high reliability, or when tracking of the person has been completed.
[Processing]
 FIG. 3 is a flowchart showing the overall flow of the feature collection process performed by the characteristic collection device 100. The feature collection process in this embodiment will be described below with reference to FIG. 3. Note that this flowchart conceptually explains the feature collection process in this embodiment, and the process does not need to be implemented exactly as shown.
 In step S10, the human body detection unit 112 detects a human body from the input image. If no human body is detected (S11-NO), the process returns to step S10 and human body detection is performed on the next frame image to be processed. If a human body is detected (S11-YES), the process proceeds to step S12.
 In step S12, the feature acquisition unit 120 acquires the human body/behavior features of the detected person.
 In step S13, the first identification unit 1131 performs individual identification based on the human body/behavior features. Facial features may also be acquired in step S12, in which case the first identification unit 1131 may perform individual identification based on the facial features in step S13.
 Here, the reliability of image-based identification is briefly explained. Identification based on facial features can be expected to be relatively accurate (highly reliable), but it can only be performed when the user is facing the robot 1. Identification based on human body features or behavioral features, on the other hand, can be performed as long as the user's body is visible, but it is not always accurate. In particular, while features are still being collected for the initial training of human body/behavior features, highly reliable identification cannot be expected.
 The first identification unit 1131 continues to identify the person contained in the moving image and computes the current identification reliability by combining the identification reliabilities of the individual frames. In doing so, the first identification unit 1131 may use, as the current identification reliability, a weighted average in which recently performed identifications are given larger weights.
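 The recency-weighted combination of per-frame reliabilities might look like the following sketch; the exponential decay factor is an assumption, since the text only requires that recent identifications receive larger weights.

```python
def current_reliability(frame_scores: list[float], decay: float = 0.8) -> float:
    """Weighted average of per-frame reliabilities, newest weighted most."""
    n = len(frame_scores)
    weights = [decay ** (n - 1 - i) for i in range(n)]  # newest frame -> weight 1.0
    return sum(w * s for w, s in zip(weights, frame_scores)) / sum(weights)

print(round(current_reliability([0.2, 0.3, 0.9, 0.95]), 3))  # recent frames dominate
```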
 In step S14, the utterance control unit 1132 determines whether it is time to make an utterance. The timing for utterance may be set in advance as conditions, and in step S14 the utterance control unit 1132 only needs to determine whether the set conditions are satisfied.
 The following conditions can be adopted as conditions for making an utterance.
(1) Time: for example, the first utterance is made after a predetermined time (for example, 10 minutes) has elapsed since the human body was detected, and further utterances are made at predetermined intervals thereafter.
(2) Data amount: for example, an utterance is made once features corresponding to a predetermined amount of data (for example, 100 acquisitions' worth) have been acquired.
(3) Stopped activity: the detected person's behavior has not changed for a certain period, for example when a certain time has passed after the person sat down on a sofa and started watching television.
(4) Movement out of the imaging range: the detected person is predicted to move out of the imaging range, for example when the detected person moves from the current room to another room or goes out.
(5) Situations in which conversation is easy: the robot is in a situation suitable for talking with the detected person, for example when the detected person and the robot are facing each other (the robot can detect the detected person's face) and the distance between them is within a predetermined distance (for example, 3 meters).
(6) Low identification reliability: the person identification unit 113 determines the final identification result using the results of both the first identification unit 1131 and the second identification unit 1133. If the identification reliability of the first identification unit 1131 is equal to or higher than the threshold TH1, that result may be fixed as the identification result of the person identification unit 113; if the identification reliability is below the threshold TH1, an utterance may be made so that the second identification unit 1133 can perform identification. Alternatively, an utterance may be made for further identification by the second identification unit 1133 when the identification reliability, computed taking both identification results into account, falls below the threshold TH1.
 A plurality of the above conditions may be combined: for example, an utterance may be made when any one of several of the conditions is satisfied, or when several of the conditions are satisfied simultaneously. Furthermore, in situations where it is not appropriate to speak, such as when the detected person is asleep or concentrating, the utterance may be withheld even if the above conditions are satisfied.
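 Conditions (1) to (6), together with the suppression rule for inappropriate situations, can be combined as in the sketch below. The state fields, the use of any() (speaking when any single condition holds), and the numeric values other than the 10-minute, 100-sample, 3-meter, and TH1 examples from the text are assumptions.

```python
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class ObservationState:
    detected_at: float                     # time the person was first detected
    last_utterance_at: Optional[float] = None
    feature_count: int = 0                 # number of collected feature samples
    still_seconds: float = 0.0             # how long the person has shown no change
    leaving_range: bool = False            # predicted to leave the imaging range
    face_visible: bool = False
    distance_m: float = 10.0
    image_reliability: float = 0.0         # reliability of the first identification unit
    inappropriate: bool = False            # e.g. the person is asleep or concentrating

def should_speak(state: ObservationState, now: Optional[float] = None,
                 th1: float = 0.8) -> bool:
    now = time.time() if now is None else now
    if state.inappropriate:
        return False                       # never speak in inappropriate situations
    conditions = [
        now - state.detected_at > 600 and (state.last_utterance_at is None
                                           or now - state.last_utterance_at > 600),  # (1)
        state.feature_count >= 100,        # (2) data amount
        state.still_seconds > 300,         # (3) stopped activity
        state.leaving_range,               # (4) about to leave the imaging range
        state.face_visible and state.distance_m <= 3.0,  # (5) easy to talk
        state.image_reliability < th1,     # (6) low identification reliability
    ]
    return any(conditions)
```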
 Since the utterances here are intended to obtain a response voice for individual identification, they are made at timings such as those described above, but utterances at other timings are not prohibited. For example, the robot may talk to the user for communication at timings that do not satisfy the above conditions.
 If it is determined in step S14 that the utterance timing condition is satisfied, the process proceeds to step S15; otherwise, the process proceeds to step S19.
 In step S15, the utterance control unit 1132 determines the content of the utterance (utterance text). In this embodiment, the utterance control unit 1132 determines the utterance content based on the current identification reliability. FIG. 4 is a flowchart explaining the utterance content determination process of step S15.
 As shown in FIG. 4, in step S151, the utterance control unit 1132 determines the utterance level according to the current identification reliability. In this embodiment, the reliability is classified into three levels: for example, high reliability when the identification reliability is 0.8 or higher, medium reliability when it is 0.5 or higher but below 0.8, and low reliability when it is below 0.5. These thresholds are merely examples and may be determined as appropriate according to system requirements; the thresholds may also vary depending on the situation. In this embodiment the level is divided into three stages, but two stages, or four or more stages, may also be used.
 When the identification reliability is high, the process proceeds to step S152, and the utterance control unit 1132 determines the utterance content with emphasis on natural communication. For example, the utterance control unit 1132 determines utterance content that does not include the detected person's name. If the place where the person was detected is the kitchen, the utterance control unit 1132 determines, for example, "What are you doing in the kitchen?" as the utterance content.
 When the identification reliability is medium, the process proceeds to step S153, and the utterance control unit 1132 determines utterance content that confirms who the detected person is without sounding unnatural. For example, the utterance control unit 1132 uses content that includes the name corresponding to the identification result of the first identification unit 1131. If the identification result of the first identification unit 1131 is "mother", the utterance control unit 1132 determines, for example, "Mom, what are you doing in the kitchen?" as the utterance content.
 When the identification reliability is low, the process proceeds to step S154, and the utterance control unit 1132 uses content that asks more directly who the detected person is. The utterance control unit 1132 determines, for example, "Who is in the kitchen?" as the utterance content.
 The text data of the utterance content determined in step S15 is passed from the utterance control unit 1132 to the voice output unit 114. In step S16, the voice output unit 114 converts the utterance text into voice data by speech synthesis and outputs it from the speaker 300. In step S17, the voice input unit 115 acquires the response to the system utterance from the microphone 400.
 In step S18, the second identification unit 1133 performs individual identification based on the input voice. The second identification unit 1133 performs identification based on acoustic features (acoustic analysis) and identification based on linguistic features (semantic analysis). Identification based on acoustic features yields a result whenever a response voice is obtained, whereas identification based on linguistic features may leave the person unknown depending on the content of the question and the response. However, since identification based on linguistic features can take into account meanings such as "(I) am the mother", its result is considered reliable.
 In step S19, the person specifying unit 1134 specifies the detected person, taking into account the image-based identification result of the first identification unit 1131 (S13) and the voice-based identification result of the second identification unit 1133 (S18).
 If the utterance content is "What are you doing in the kitchen?", a response such as "I'm cooking right now" can be expected. In this case, the second identification unit 1133 cannot perform identification based on linguistic features, but an identification result can be obtained based on acoustic features. If the identification result of the second identification unit 1133 matches that of the first identification unit 1131, the person specifying unit 1134 can confirm that the result of the first identification unit 1131 is correct and takes it as the final result. If the two identification results differ, the person specifying unit 1134 may adopt either result, or may treat the detected person as unknown. In that case, the person specifying unit 1134 may set the identification reliability low so that the next utterance confirms more directly who the detected person is.
 If the utterance content is "Mom, what are you doing in the kitchen?", then when the person actually is the mother, a response such as "I'm cooking right now" can be expected; when the person is not the mother, a response such as "I'm not Mom" or "I'm her sister" can be expected. In either case, the second identification unit 1133 can perform both identification based on acoustic features and identification based on linguistic features. In identification based on linguistic features, whether the other party is the mother can be determined from whether the response to a question using the name "Mom" contains a phrase denying it. If the response contains a phrase indicating who the person is, the detected person can be identified from it. The identification result based on acoustic features and the one based on linguistic features may differ; in that case the linguistic result may be given priority, or the decision may take the reliability of each into account.
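 The linguistic-feature check described here (detecting a denial of the name used in the question, or a self-introduction) can be sketched with simple pattern matching; an actual system would use proper natural language analysis, and the English phrase patterns, function name, and name list below are illustrative assumptions.

```python
import re

def identify_from_response(asked_name: str, response: str, known_names: list[str]):
    """Return the person's name inferred from the response text, or None."""
    text = response.lower()
    # Direct self-introduction, e.g. "I'm her sister".
    for name in known_names:
        if re.search(rf"\bi('m| am) (the |her |your )?{name}\b", text):
            return name
    # Denial of the name used in the question, e.g. "I'm not Mom".
    if re.search(rf"\bnot (the )?{asked_name.lower()}\b", text):
        return None                    # definitely not the asked person, identity still open
    return asked_name                  # no denial: assume the asked name was correct

print(identify_from_response("Mom", "I'm cooking right now", ["mom", "dad", "sister"]))        # -> 'Mom'
print(identify_from_response("Mom", "I'm not Mom, I'm her sister", ["mom", "dad", "sister"]))  # -> 'sister'
```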
 If the utterance content is "Who is in the kitchen?", a response of the form "I am ..." can be expected. The second identification unit 1133 can therefore identify the detected person by performing identification based on linguistic features. Since the utterance content directly asks who the detected person is, the detected person can be identified more reliably from the semantic content of the response. As above, the second identification unit 1133 may also perform identification based on acoustic features in this case.
 In step S20, it is determined whether feature collection is to be continued. If feature collection is to continue, the process returns to step S10 to process the next frame.
 When feature collection ends, in step S21 the feature registration unit 130 registers the acquired human body/behavior features in a storage unit (not shown) in association with the identification result of the detected person. In the flowchart of FIG. 3, feature registration is performed when feature collection ends, but it may instead be performed when tracking of the detected person is completed. The feature registration unit 130 registers all features obtained from the start to the end of tracking in association with a single person. However, the feature registration unit 130 may divide one tracking period into multiple sub-periods and, for each sub-period, register the obtained features in association with the person given by the identification result within that sub-period. This handling can occur when the human body detection unit 112 failed to recognize that one person was replaced by another partway through and treated them as the same person, but the person identification unit 113 identified them as different persons.
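 Splitting one tracking period by the identification result of each sub-period, as described above, can be sketched as follows; the (timestamp, person, feature) record format is an assumption for illustration.

```python
from collections import defaultdict

def register_by_period(samples: list[tuple[float, str, dict]]) -> dict:
    """Group collected features by the person identified in each sub-period."""
    registry = defaultdict(list)
    for _t, person, feature in sorted(samples, key=lambda s: s[0]):
        registry[person].append(feature)
    return dict(registry)

samples = [
    (1.0, "mother", {"stride_ratio": 1.6}),
    (2.0, "mother", {"posture": "upright"}),
    (9.0, "sister", {"posture": "stooped"}),   # person changed partway through tracking
]
print(register_by_period(samples))
```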
[Advantageous Effects of This Embodiment]
 According to this embodiment, individual identification based on images and individual identification based on voice are both performed, and the final identification result is obtained by combining the two, so accurate identification is possible. In particular, when accurate identification cannot be performed from the image alone, the system speaks, obtains a voice response from the user, and performs voice-based identification, which makes accurate identification possible. Furthermore, by speaking only when the image-based identification result is unreliable, and by determining the utterance content according to the image-based identification reliability, the annoyance felt by the user can be minimized.
(Second Embodiment)
 In the first embodiment, the utterance content is determined according to the identification reliability as shown in FIG. 4. In this embodiment, the utterance content determination process is changed from that of the first embodiment, taking into account that, once a voice response is obtained from the user, identification based at least on acoustic features is possible, and that multiple utterances can be made within a single dialogue. The differences from the first embodiment are mainly described below.
 FIG. 6 is a flowchart showing the flow of the utterance content determination process in this embodiment. Since this embodiment also assumes that utterances may be made multiple times, note that the process shown in FIG. 6 is not itself the process of step S15 in FIG. 3.
 When it is time to make an utterance, the utterance control unit 1132 determines utterance content that results in natural communication, as shown in step S31. The utterance content therefore does not need to include the detected person's name; content such as "What are you doing in the kitchen?" is determined as the utterance content.
 In step S32, the content determined in step S31 is output from the speaker 300, and the second identification unit 1133 performs identification on the user's response voice acquired by the microphone 400 as a result. Here, identification based at least on acoustic features is sufficient; identification based on linguistic features may of course also be performed if possible.
 In step S33, it is determined whether the identification result of the first identification unit 1131 matches that of the second identification unit 1133. If they match, the detected person can be specified and no further utterance is necessary. If the two identification results differ, further utterance content is determined in step S34 in order to identify the person more reliably. The utterance control unit 1132 determines the content of the second utterance so that, compared with the first utterance, it confirms more directly who the detected person is. In this case, for example, the utterance content "Isn't that Mom in the kitchen?" can be used. Alternatively, as in the medium- or low-reliability cases of the first embodiment, utterance content including the detected person's name (for example, "Mom, what are you doing there?") or content that directly asks who the person is (for example, "Who is in the kitchen?") may be determined as the second utterance content.
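 The two-round dialogue of this embodiment can be summarized in the following sketch; the helper callables standing in for the speaker, microphone, and second identification unit are assumptions, and only the control flow reflects steps S31 to S34.

```python
def identify_with_dialogue(image_guess: str, speak, listen, identify_by_voice) -> str:
    """Two-round identification dialogue (steps S31 to S34)."""
    # Round 1: natural utterance without the person's name (S31/S32).
    speak("What are you doing in the kitchen?")
    voice_guess = identify_by_voice(listen())        # acoustic features are enough here
    if voice_guess == image_guess:                   # S33: results agree, done
        return image_guess
    # Round 2: confirm the person more directly (S34).
    speak(f"Isn't that {image_guess} in the kitchen?")
    return identify_by_voice(listen())               # semantic analysis expected here
```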
 According to this embodiment, the same effects as in the first embodiment are obtained, and in addition a more natural dialogue is possible when performing voice-based identification.
 In the above description, the utterance is made twice in one dialogue, but the utterance may be made three or more times in one dialogue. In that case, natural utterances may be made for the first several turns, and an utterance that confirms who the detected person is may be made only when the person cannot be identified from the voice.
 The processing of this embodiment is also applicable to the case in the first embodiment where, after an utterance with natural content was made because the image-based identification reliability was high, the voice-based identification result differs from the image-based identification result.
(Others)
 Each embodiment described above is merely an illustration of the present invention. The present invention is not limited to the specific forms described above, and various modifications are possible within the scope of its technical idea.
 In the above description, the timing at which the characteristic collection device 100 (home robot) speaks to the user and the content of that speech have been described, but this processing concerns utterances made to obtain a voice-based response from the user. When the home robot is implemented as a communication robot, the above processing does not need to be applied to utterances made simply to communicate with the user. In addition, an utterance that does not include the other party's name was given above as an example of natural communication; however, in situations or with partners for whom an utterance including the name is natural, utterance content including the name is also a natural utterance.
 In the above embodiments, an example in which the feature collection device 100 is mounted on a home robot was described, but it may instead be mounted on a surveillance camera or the like. The personal identification device 110 also does not need to be mounted on the feature collection device 100 that collects learning features; it may be implemented on its own and used to identify individuals.
(Appendix)
 A personal identification device comprising:
 voice output means (14, 114) for outputting voice;
 voice input means (15, 115) for acquiring voice responding to the output voice;
 image input means (11, 111);
 detection means (12, 112) for detecting a person from a moving image input to the image input means; and
 person identification means (13, 113) for identifying the person detected by the detection means,
 wherein the person identification means includes
  first identification means (131, 1131) for identifying the person based on the moving image, and
  second identification means (133, 1133) for identifying the person based on input voice obtained as a response to the output voice, and
 identifies the person based on a first identification result, which is the identification result by the first identification means, and a second identification result, which is the identification result by the second identification means.
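 A minimal sketch of how the means listed above could fit together is shown below; the classifier objects, their identify() methods, and the tie-breaking rule are assumptions made for illustration and are not specified by this disclosure.

```python
from dataclasses import dataclass

@dataclass
class IdentificationResult:
    person: str
    confidence: float

class PersonalIdentificationDevice:
    def __init__(self, image_identifier, voice_identifier, speaker, microphone):
        self.image_identifier = image_identifier  # first identification means
        self.voice_identifier = voice_identifier  # second identification means
        self.speaker = speaker                    # voice output means
        self.microphone = microphone              # voice input means

    def identify(self, frames) -> str:
        first = self.image_identifier.identify(frames)     # image-based result
        self.speaker.say("Hi, what are you up to?")        # output voice
        response = self.microphone.record()                # voice responding to it
        second = self.voice_identifier.identify(response)  # voice-based result
        if first.person == second.person:
            return first.person
        # Illustrative tie-breaking rule: prefer the result with higher confidence.
        return first.person if first.confidence >= second.confidence else second.person
```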
DESCRIPTION OF SYMBOLS
10: personal identification device  11: image input unit  12: human body detection unit
13: person identification unit  131: first identification unit  133: second identification unit
14: voice output unit  15: voice input unit
100: feature collection device
110: personal identification device  111: image input unit  112: human body detection unit
113: person identification unit  1131: first identification unit  1132: utterance control unit
1133: second identification unit  1134: human body specifying unit
114: voice output unit  115: voice input unit
120: feature acquisition unit  130: feature registration unit
200: camera  300: speaker  400: microphone

Claims (15)

  1.  A personal identification device comprising:
      voice output means for outputting voice;
      voice input means for acquiring voice responding to the output voice;
      image input means;
      detection means for detecting a person from a moving image input to the image input means; and
      person identification means for identifying the person detected by the detection means,
      wherein the person identification means includes
       first identification means for identifying the person based on the moving image, and
       second identification means for identifying the person based on input voice obtained as a response to the output voice, and
      identifies the person based on a first identification result, which is the identification result by the first identification means, and a second identification result, which is the identification result by the second identification means.
  2.  The personal identification device according to claim 1, wherein, when the reliability of identification by the first identification means is less than a first threshold, the person identification means performs voice output from the voice output means and identification by the second identification means, and identifies the person based on the first identification result and the second identification result.
  3.  The personal identification device according to claim 2, wherein, when the reliability of identification by the first identification means is equal to or greater than the first threshold, the person identification means takes the first identification result as the identification result of the person.
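 A sketch of this threshold gate follows; the threshold value and the run_voice_identification callable are illustrative assumptions, and claims 2 and 3 do not prescribe a particular combination rule.

```python
FIRST_THRESHOLD = 0.8  # illustrative value; the claims do not fix a number

def identify_person(image_result, image_confidence, run_voice_identification):
    if image_confidence >= FIRST_THRESHOLD:
        # Claim 3: the image-based result alone serves as the identification result.
        return image_result
    # Claim 2: trigger the voice output and voice-based identification, then combine
    # the two results (here: agreement wins, otherwise trust the voice-based result).
    voice_result = run_voice_identification()
    return image_result if image_result == voice_result else voice_result
```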
  4.  The personal identification device according to any one of claims 1 to 3, wherein the output voice from the voice output means is output at a predetermined timing, and the predetermined timing is at least one of:
      a timing at which the identification reliability of the person falls below a threshold;
      a timing at which a first predetermined time has elapsed since the person was detected;
      a timing at which a state with substantially no temporal change in the person has continued for a second predetermined time;
      a timing at which the person moves out of the imaging range; and
      a timing at which the distance between the personal identification device and the person becomes equal to or less than a predetermined distance.
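 The five trigger timings could be checked as in the sketch below; the attribute names on state and the default values are hypothetical placeholders, not values given in the claims.

```python
def should_output_voice(state, *, reliability_threshold=0.5, first_wait=5.0,
                        still_wait=3.0, near_distance=1.5) -> bool:
    # Claim 4: speaking is triggered when at least one of the timings applies.
    return (
        state.identification_reliability < reliability_threshold  # reliability dropped
        or state.seconds_since_detection >= first_wait            # first predetermined time elapsed
        or state.seconds_without_change >= still_wait             # person substantially unchanged
        or state.leaving_imaging_range                             # about to leave the frame
        or state.distance_to_person <= near_distance               # close enough to address
    )
```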
  5.  The personal identification device according to any one of claims 1 to 4, wherein the person identification means determines the content of the output voice according to the reliability of identification by the first identification means.
  6.  The personal identification device according to claim 5, wherein, when the reliability is less than a second threshold, the person identification means determines, as the content of the output voice, content that includes the name of the person of the first identification result or content that asks who the person is.
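 Choosing the utterance content from the reliability could look like the sketch below; the two threshold values and the example phrasings are assumptions added for illustration.

```python
def choose_utterance(candidate_name: str, reliability: float,
                     high_threshold: float = 0.8, second_threshold: float = 0.5) -> str:
    if reliability >= high_threshold:
        return "What are you up to?"                          # natural content, no name
    if reliability >= second_threshold:
        return f"{candidate_name}, what are you doing there?" # include the candidate's name
    # Claim 6: below the second threshold, ask who the person is.
    return "Who is in the kitchen?"
```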
  7.  The personal identification device according to any one of claims 1 to 6, wherein, when the first identification result and the second identification result do not match, the person identification means outputs a new output voice and performs identification by the second identification means based on input voice responding to the new output voice, and the content of the new output voice confirms the person more directly than the content of the previous output voice.
  8.  The personal identification device according to any one of claims 1 to 7, wherein the second identification means identifies the person by performing at least one of waveform analysis and language analysis on the input voice.
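 A sketch combining the two kinds of analysis is given below; the waveform_analyzer and language_analyzer callables (each returning per-person scores) are placeholders, and the additive score fusion is an assumption rather than anything specified by claim 8.

```python
from typing import Callable, Dict, Optional

def identify_by_response(audio: bytes, text: str,
                         waveform_analyzer: Optional[Callable[[bytes], Dict[str, float]]] = None,
                         language_analyzer: Optional[Callable[[str], Dict[str, float]]] = None
                         ) -> Optional[str]:
    scores: Dict[str, float] = {}
    if waveform_analyzer is not None:
        for person, s in waveform_analyzer(audio).items():  # speaker characteristics of the reply
            scores[person] = scores.get(person, 0.0) + s
    if language_analyzer is not None:
        for person, s in language_analyzer(text).items():   # what the reply says, e.g. "Yes, it's me"
            scores[person] = scores.get(person, 0.0) + s
    return max(scores, key=scores.get) if scores else None
```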
  9.  The personal identification device according to any one of claims 1 to 8, wherein the first identification means identifies the person based on at least one of a facial feature, a human body feature, and a behavioral feature obtained from the moving image.
  10.  A feature collection device comprising:
       the personal identification device according to any one of claims 1 to 9;
       feature acquisition means for acquiring, from the moving image input to the image input means, at least one of features relating to the human body or the behavior of the detected person; and
       feature registration means for registering the feature acquired by the feature acquisition means in association with the person identified by the person identification means.
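 The registration side could be sketched as follows; the feature_extractor callable and the in-memory registry are assumptions for illustration only.

```python
from collections import defaultdict

class FeatureCollector:
    def __init__(self, identifier, feature_extractor):
        self.identifier = identifier              # personal identification device (claims 1 to 9)
        self.feature_extractor = feature_extractor
        self.registry = defaultdict(list)         # feature registration means

    def collect(self, frames) -> None:
        person = self.identifier.identify(frames)  # identify who appears in the moving image
        features = self.feature_extractor(frames)  # body / behaviour features from the same image
        self.registry[person].extend(features)     # register the features keyed by the person
```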
  11.  A personal identification method executed by a computer, comprising:
       a detection step of detecting a person from a moving image;
       a first identification step of identifying the person based on the moving image;
       a voice output step of outputting voice;
       a voice input step of acquiring voice responding to the output voice;
       a second identification step of identifying the person based on input voice obtained as a response to the output voice; and
       a third identification step of identifying the person based on a first identification result, which is the identification result in the first identification step, and a second identification result, which is the identification result in the second identification step.
  12.  The personal identification method according to claim 11, wherein the voice output step, the voice input step, and the second identification step are performed when the reliability in the first identification step is less than a first threshold.
  13.  The personal identification method according to claim 12, wherein, when the reliability in the first identification step is equal to or greater than the first threshold, the first identification result is taken as the identification result of the person in the third identification step.
  14.  The personal identification method according to any one of claims 11 to 13, wherein, in the voice output step, voice whose content corresponds to the reliability in the first identification step is output.
  15.  A program for executing each step of the method according to any one of claims 11 to 14.
PCT/JP2019/001488 2018-03-08 2019-01-18 Individual identification device and characteristic collection device WO2019171780A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2018042204A JP6819633B2 (en) 2018-03-08 2018-03-08 Personal identification device and feature collection device
JP2018-042204 2018-03-08

Publications (1)

Publication Number Publication Date
WO2019171780A1 true WO2019171780A1 (en) 2019-09-12

Family

ID=67845907

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/001488 WO2019171780A1 (en) 2018-03-08 2019-01-18 Individual identification device and characteristic collection device

Country Status (2)

Country Link
JP (1) JP6819633B2 (en)
WO (1) WO2019171780A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021095094A1 (en) * 2019-11-11 2021-05-20 日本電気株式会社 Person state detection device, person state detection method, and non-transient computer-readable medium in which program is contained

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7451130B2 (en) 2019-10-07 2024-03-18 キヤノン株式会社 Control device, control system, control method, and program
JP7430087B2 (en) 2020-03-24 2024-02-09 株式会社フジタ speech control device
KR20230164240A (en) * 2020-08-20 2023-12-01 라모트 앳 텔-아비브 유니버시티 리미티드 Dynamic identity authentication

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004259255A (en) * 2003-02-05 2004-09-16 Fuji Photo Film Co Ltd Authentication apparatus
JP2005182159A (en) * 2003-12-16 2005-07-07 Nec Corp Personal identification system and method
JP2007156688A (en) * 2005-12-02 2007-06-21 Mitsubishi Heavy Ind Ltd User authentication device and its method


Also Published As

Publication number Publication date
JP6819633B2 (en) 2021-01-27
JP2019154575A (en) 2019-09-19

Similar Documents

Publication Publication Date Title
WO2019171780A1 (en) Individual identification device and characteristic collection device
CN107799126B (en) Voice endpoint detection method and device based on supervised machine learning
JP4462339B2 (en) Information processing apparatus, information processing method, and computer program
US20110224978A1 (en) Information processing device, information processing method and program
US11854550B2 (en) Determining input for speech processing engine
KR101749100B1 (en) System and method for integrating gesture and sound for controlling device
WO2016150001A1 (en) Speech recognition method, device and computer storage medium
Varghese et al. Overview on emotion recognition system
JP2009031951A (en) Information processor, information processing method, and computer program
JP2002182680A (en) Operation indication device
JP4730404B2 (en) Information processing apparatus, information processing method, and computer program
EP3956883A1 (en) Identifying input for speech recognition engine
JP2012038131A (en) Information processing unit, information processing method, and program
JP2013104938A (en) Information processing apparatus, information processing method, and program
JP2010165305A (en) Information processing apparatus, information processing method, and program
JP4730812B2 (en) Personal authentication device, personal authentication processing method, program therefor, and recording medium
KR20200085696A (en) Method of processing video for determining emotion of a person
CN111326152A (en) Voice control method and device
US11682389B2 (en) Voice conversation system, control system for voice conversation system, and control program, and control method
WO2021166811A1 (en) Information processing device and action mode setting method
JP2009042910A (en) Information processor, information processing method, and computer program
WO2023193803A1 (en) Volume control method and apparatus, storage medium, and electronic device
Ktistakis et al. A multimodal human-machine interaction scheme for an intelligent robotic nurse
JP7347511B2 (en) Audio processing device, audio processing method, and program
KR20230154380A (en) System and method for providing heath-care services fitting to emotion states of users by behavioral and speaking patterns-based emotion recognition results

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19763366

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19763366

Country of ref document: EP

Kind code of ref document: A1