JP5323770B2 - User instruction acquisition device, user instruction acquisition program, and television receiver

Info

Publication number: JP5323770B2
Application number: JP2010149860A
Other versions: JP2012014394A (en)
Other languages: Japanese (ja)
Authority: JP (Japan)
Prior art keywords: user, face, users, plurality, voice
Inventor: 真人 藤井
Original Assignee: 日本放送協会 (NHK)
Priority: JP2010149860A, filed 2010-06-30
Legal status: Expired - Fee Related

Description

  The present invention relates to a user instruction acquisition device, a user instruction acquisition program, and a television receiver that acquire, from a user of a device such as a television, audio equipment, a personal computer, or other home appliances, an instruction for controlling that device.

  The most basic way for a device such as a television to receive instructions from a user is via a remote controller. To avoid the inconvenience of operating a remote controller, Patent Documents 1 and 2 propose apparatuses that receive instructions from the user by voice recognition or by gestures (motion recognition) instead of a remote controller.

Patent Document 1: JP 2004-192653 A
Patent Document 2: Japanese Patent No. 3886074

  However, the method using a remote controller has problems: the content that can be instructed to a device such as a television is fixed and lacks flexibility, and operating the remote controller is complicated and cumbersome. The devices using speech recognition and gestures proposed in Patent Documents 1 and 2 perform voice recognition at all times, so they cannot determine when the user is actually giving an instruction or which of multiple users is giving it, and they also react to noise. Furthermore, in the devices proposed in Patent Documents 1 and 2, the user must give instructions in an unnatural, non-everyday situation, for example while looking at an anthropomorphic agent image displayed on the screen, which is burdensome.

  The present invention has been made in view of the above points, and an object thereof is to provide a user instruction acquisition device, a user instruction acquisition program, and a television receiver that allow a device to be instructed and controlled accurately through the user's natural utterances and actions, acquiring an instruction only when the user is actually giving one.

  In order to solve the above problems, the user instruction acquisition device according to claim 1 identifies, from among a plurality of users who use a device, the user who is giving an instruction to control the device, and acquires the instruction from that user. It comprises: face analysis means for recognizing each of the plurality of users, registered in advance, from video captured by a camera, detecting changes in each user's face, and generating from those changes an utterance period indicating the period during which each user is speaking; hand motion analysis means for recognizing the users' hand movements from the video of the plurality of users; voice analysis means for detecting speech from sounds around the device on the basis of the utterance period generated by the face analysis means, and recognizing the content of the speech and the speaker using acoustic feature amounts registered in advance for each user; and command generation means for, when the speaker recognized by the voice analysis means is included among the plurality of users recognized by the face analysis means, identifying the speaker as the user who is giving the instruction and generating a predetermined command for that user's face change detected by the face analysis means, hand movement recognized by the hand motion analysis means, and speech content recognized by the voice analysis means.

  With this configuration, the face analysis means derives the period during which the user is speaking from changes in the user's face, and voice recognition is performed only during that period, which improves its accuracy. In addition, because the user who gave a voice instruction to the device can be identified by comparing the users recognized by face recognition with the speaker recognized by voice recognition, commands can be generated accurately even when a plurality of users are using the device.

  In the user instruction acquisition device according to claim 2, the face analysis means comprises: face area detection means for detecting the face areas of the plurality of users from the video; face recognition means for recognizing the user corresponding to each face area using facial feature amounts registered in advance for each user; face change detection means for detecting changes in the users' faces from those face areas; and utterance state estimation means for determining from the face changes whether each user is speaking and generating the utterance period when it is determined that the user is speaking.

  With this configuration, the utterance state estimation means determines whether the user is speaking, and the utterance period is generated and output to the voice analysis means only when the user is judged to be speaking, which further improves the accuracy of voice recognition.

  The user instruction acquisition program according to claim 3 causes a computer to identify, from among a plurality of users who use a device, the user who is giving an instruction to control the device and to acquire the instruction from that user, by making the computer function as: face analysis means for recognizing each of the plurality of users, registered in advance, from video captured by a camera, detecting changes in each user's face, and generating from those changes an utterance period indicating the period during which each user is speaking; hand motion analysis means for recognizing the users' hand movements from the video of the plurality of users; voice analysis means for detecting speech from sounds around the device on the basis of the utterance period generated by the face analysis means, and recognizing the content of the speech and the speaker using acoustic feature amounts registered in advance for each user; and command generation means for, when the speaker recognized by the voice analysis means is included among the plurality of users recognized by the face analysis means, identifying the speaker as the user who is giving the instruction and generating a predetermined command for that user's face change detected by the face analysis means, hand movement recognized by the hand motion analysis means, and speech content recognized by the voice analysis means.

  With this configuration, the user instruction acquisition program derives the period during which the user is speaking from changes in the user's face via the face analysis means, and voice recognition is performed only during that period, which improves its accuracy. The user who gave a voice instruction can likewise be identified by comparing the users recognized by face recognition with the speaker recognized by voice recognition, so commands can be generated accurately even when a plurality of users are using the device.

  The television receiver according to claim 4 provides broadcast programs to users and comprises the user instruction acquisition device according to claim 1 or claim 2, which acquires instructions given by the user's voice and actions by analyzing video from a camera and sound from a microphone installed in the television receiver.

  With this configuration, the television receiver determines from changes in the user's face whether the user is speaking, and voice recognition is performed only during that period, which improves its accuracy. The user who gave a voice instruction can be identified by comparing the users recognized by face recognition with the speaker recognized by voice recognition, so commands can be generated accurately even when a plurality of users are using the television receiver.

  According to the first to fourth aspects of the invention, the user's utterance state is determined automatically from changes in the user's face, and a command can be generated simply by the user giving an instruction to the device by voice and action. The instruction content can therefore be conveyed to the device and the device controlled as a natural extension of the user's everyday behavior, without any complicated operation.

FIG. 1 is a block diagram showing the overall configuration of the user instruction acquisition device according to the present invention.
FIG. 2A is a block diagram showing the specific configuration of the utterance state estimation means in the user instruction acquisition device according to the present invention, and FIG. 2B is a diagram showing an example of the utterance conditions held in advance by the utterance condition storage unit.
FIG. 3A is a block diagram showing the specific configuration of the command generation means in the user instruction acquisition device according to the present invention, and FIG. 3B is a diagram showing an example of the command conditions held in advance by the command condition storage unit.
FIG. 4 is a diagram showing an example of user instructions in the user instruction acquisition device according to the present invention.
FIG. 5 is a flowchart showing the operation of the user instruction acquisition device according to the present invention.
FIG. 6 is a schematic diagram showing an example of a television receiver equipped with the user instruction acquisition device according to the present invention.

  A user instruction acquisition device, a user instruction acquisition program, and a television receiver according to an embodiment of the present invention will be described with reference to the drawings. In the following description, identical components are given the same names and reference symbols, and duplicate description is omitted.

[User Instruction Acquisition Device]
The user instruction acquisition device 1 identifies, from among a plurality of users using a device such as a television, the user who is giving an instruction to control the device, and acquires that instruction from the user.

  For example, as shown in FIG. 6, the user instruction acquisition device 1 is connected to a television receiver (hereinafter, television) T that provides broadcast programs to users, and acquires the user's instructions by analyzing the user's video and audio input from a camera Cr and a microphone M installed on the top of the television T. As shown in FIG. 1, the user instruction acquisition device 1 then generates a corresponding command and outputs it to the control unit of the device. Note that the user instruction acquisition device 1 is not limited to being provided outside the television T as shown in FIG. 6; it may also be incorporated inside the television T.

  Here, as shown in FIG. 1, the user instruction acquisition device 1 comprises voice analysis means 10, face analysis means 20, hand motion analysis means 30, and command generation means 40. As described above, the user instruction acquisition device 1 also includes the camera Cr for capturing video of the users of the device and the microphone M for collecting sounds around the device. The camera Cr and the microphone M are installed, for example, on the top of the device as shown in FIG. 6, so that they can capture video of the users and the sound around the device. Each component of the user instruction acquisition device 1 is described in detail below.

  The voice analysis means 10 detects speech from the sounds around the device collected by the microphone M, and recognizes the content of the speech and the speaker using acoustic feature amounts registered in advance for each user. As shown in FIG. 1, the voice analysis means 10 includes voice detection means 11, voice recognition means 12, and speaker recognition means 13.

  The voice detection means 11 detects speech from the sounds around the device. As shown in FIG. 1, when sound around the device is input from the microphone M, the voice detection means 11 extracts speech from that sound using frequency characteristics of speech registered in advance. When the voice detection means 11 receives, from the utterance state estimation means 24 described later, an utterance period indicating the period during which a user is speaking, it outputs the speech detected during that utterance period to the voice recognition means 12 and the speaker recognition means 13. That is, the voice detection means 11 outputs the detected speech to the voice recognition means 12 and the speaker recognition means 13 only while the user is speaking. The voice detection means 11 includes a storage unit (not shown) that holds the speech frequency characteristic data in advance.
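
  For illustration only, the following sketch shows how such gating might look in code: short-time energy stands in for the registered frequency characteristics, and the utterance period is assumed to be given as a (start, end) pair in seconds. It is a rough illustration, not the patented implementation.

```python
import numpy as np

def detect_voice(samples, sample_rate, utterance_period, frame_ms=20, energy_threshold=1e-3):
    """Return the frames judged to contain speech that fall inside the utterance period.

    samples          : 1-D float array of microphone audio
    utterance_period : (start_sec, end_sec) supplied by the utterance state estimation step
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    start, end = utterance_period
    voiced = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        t = i / sample_rate
        if not (start <= t <= end):
            continue  # pass audio downstream only while the user is judged to be speaking
        frame = samples[i:i + frame_len]
        if np.mean(frame ** 2) > energy_threshold:  # crude short-time-energy speech decision
            voiced.append(frame)
    return np.concatenate(voiced) if voiced else np.empty(0)
```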

  The voice recognition means 12 recognizes the content of the speech from the voice. Specifically, the voice recognition means 12 extracts frequency characteristics, such as the low-order DCT components of the spectrum, from the speech waveform by acoustic analysis as acoustic feature amounts, collates them with acoustic models corresponding to the pronunciations of all registered words, and, using a language model (a frequency distribution of word sequences), obtains the acoustically and linguistically most likely word string as the recognition result. The voice recognition means 12 includes a storage unit (not shown) that holds the acoustic models and the language model in advance.

  As shown in FIG. 1, voice is input to the voice recognition unit 12 from the voice detection unit 11. Then, the voice recognition unit 12 extracts a word string from the voice by the above-described method, and outputs it as voice information to the information acquisition unit 41 of the command generation unit 40 (see FIG. 3A).
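
  The full recognizer (acoustic models, language model, decoding) is beyond a short example, but the feature extraction described above, low-order DCT components of the log spectrum, can be sketched as follows. The frame lengths, the use of SciPy, and the omission of mel filtering are assumptions of this illustration, not details taken from the patent.

```python
import numpy as np
from scipy.fft import dct

def acoustic_features(samples, sample_rate, frame_ms=25, hop_ms=10, n_coeffs=13):
    """Low-order DCT coefficients of the log power spectrum, one feature vector per frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    window = np.hamming(frame_len)
    features = []
    for i in range(0, len(samples) - frame_len + 1, hop):
        spectrum = np.abs(np.fft.rfft(samples[i:i + frame_len] * window)) ** 2
        log_spectrum = np.log(spectrum + 1e-10)                       # avoid log(0)
        features.append(dct(log_spectrum, norm='ortho')[:n_coeffs])   # keep low-order components
    return np.array(features)
```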

  The speaker recognition means 13 recognizes the speaker of the voice, that is, which user is producing it. Specifically, the speaker recognition means 13 extracts the same acoustic feature amounts from the voice as the voice recognition means 12 does, and compares them with speaker models registered in advance for specific speakers to determine who the speaker is.

  The Bayesian information criterion can be used for speaker determination in the speaker recognition means 13. It is also possible to classify the acoustic features into phoneme classes and collate them against a phoneme-class mixture model. The speaker recognition means 13 includes a storage unit (not shown) that holds the above speaker models in advance. A speaker model can be created, for example, by having the user utter specific words in advance and registering the user's voice in the storage unit together with a registered name such as the user's name or a nickname.

  As shown in FIG. 1, voice is input to the speaker recognition unit 13 from the voice detection unit 11. Then, the speaker recognition unit 13 determines a speaker from the voice by the above-described method, and outputs this as speaker information to the information acquisition unit 41 of the command generation unit 40 (see FIG. 3A).
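
  A minimal sketch of the speaker decision is shown below. It replaces the Bayesian-information-criterion or phoneme-class comparison described above with a simple nearest-model distance over enrolled mean feature vectors; the data layout is assumed for illustration.

```python
import numpy as np

def identify_speaker(features, speaker_models):
    """Return the registered name whose enrolled model is closest to this utterance.

    features       : (frames, n_coeffs) array from the acoustic analysis step
    speaker_models : dict of registered name -> mean feature vector stored at enrollment
                     (a stand-in for the speaker models in the text; a likelihood test
                     would replace the Euclidean distance used here)
    """
    utterance_mean = features.mean(axis=0)
    return min(speaker_models,
               key=lambda name: np.linalg.norm(utterance_mean - speaker_models[name]))
```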

  The face analysis means 20 recognizes each of a plurality of pre-registered users from the video captured by the camera Cr by face image recognition processing, and detects changes in each user's face. The face analysis means 20 further determines from those face changes whether each user is speaking, and generates an utterance period indicating the period during which each user is speaking. As shown in FIG. 1, the face analysis means 20 includes face area detection means 21, face change detection means 22, face recognition means 23, and utterance state estimation means 24.

  The face area detection means 21 detects human face areas from the video of the plurality of users. Specifically, the face area detection means 21 extracts universal facial features from the frames that make up the video and detects face regions from those features. The face area detection means 21 can operate at high speed by using Haar functions to extract these universal features from the images.

  As shown in FIG. 1, video of the plurality of users using the device is input to the face area detection means 21 from the camera Cr. The face area detection means 21 then detects the users' face areas from the video by the method described above and outputs them as face area information to the detection units of the face change detection means 22 and to the face recognition means 23.
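
  Haar-feature face detection of this kind is readily illustrated with OpenCV's bundled cascade; the sketch below shows only this detection step, and the cascade file and parameters are assumptions rather than the patented configuration.

```python
import cv2

# Frontal-face Haar cascade bundled with OpenCV; the cv2.data path assumes a standard install.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_face_areas(frame_bgr):
    """Return bounding boxes (x, y, w, h) of candidate face areas in one camera frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                         minSize=(60, 60))
```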

  The face change detection means 22 detects changes in each user's face from the face areas detected in the video of the plurality of users. For example, when the face area detection means 21 detects face areas for three people, the face change detection means 22 detects changes in each of those three face areas. As shown in FIG. 1, the face change detection means 22 includes a face orientation detection unit 221, a line-of-sight detection unit 222, an eye opening/closing detection unit 223, and a lip movement detection unit 224.

  The face orientation detection unit 221 detects the orientation of each user's face relative to the device. For example, when the device is a television T (see FIG. 6), the face orientation detection unit 221 detects, from the face area information described above, the horizontal and vertical angles by which the user's face is turned with respect to the center of the television screen. Specifically, the face orientation detection unit 221 holds templates for various face orientations, built from the arrangement of features around the eyes, nose, and mouth extracted with the Haar functions and the Gabor wavelets described later, and estimates the user's face orientation by matching against these templates.

  The line-of-sight detection unit 222 detects the direction of each user's gaze relative to the device. From the face area information described above, the line-of-sight detection unit 222 detects how far the direction of the user's gaze is rotated horizontally and vertically with respect to the head. Based on the user's face area detected by the face area detection means 21, the line-of-sight detection unit 222 estimates the positions of the user's eyes from the arrangement of facial parts, and estimates the gaze direction by matching against image patterns registered in advance for each gaze direction. When the device is, for example, a television T (see FIG. 6), the line-of-sight detection unit 222 can also estimate which part of the television screen the user is looking at by combining its result with the detection result of the face orientation detection unit 221 described above.

  The eye opening/closing detection unit 223 detects whether the user's eyes are open or closed. Like the line-of-sight detection unit 222, it estimates the positions of the user's eyes from the arrangement of facial parts within the face area detected by the face area detection means 21, and judges the eyes to be open when a dark region is present at those positions and closed when it is not.

  The lip movement detection unit 224 detects the movement of the user's lips. It estimates the position of the user's mouth from the arrangement of facial parts within the face area detected by the face area detection means 21, extracts lip motion vectors with a motion detection algorithm such as block matching or the Lucas-Kanade method, and judges that the user's lips are moving and the user is speaking when the power of the motion vectors exceeds a threshold and the power fluctuation is periodic.
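
  As a rough illustration of the motion-power test, the sketch below uses OpenCV's dense Farneback optical flow in place of the block matching or Lucas-Kanade estimation named above, and omits the periodicity check; the mouth region and threshold are assumed inputs.

```python
import cv2
import numpy as np

def lip_motion_power(prev_gray, curr_gray, mouth_roi):
    """Mean motion-vector power inside the mouth region between two consecutive frames.

    mouth_roi: (x, y, w, h) estimated upstream from the facial part arrangement
    (that estimation step is assumed and not shown here).
    """
    x, y, w, h = mouth_roi
    flow = cv2.calcOpticalFlowFarneback(prev_gray[y:y + h, x:x + w],
                                        curr_gray[y:y + h, x:x + w],
                                        None, 0.5, 3, 15, 3, 5, 1.2, 0)
    return float(np.mean(flow[..., 0] ** 2 + flow[..., 1] ** 2))

def lips_moving(power_history, power_threshold=0.5):
    """Judge the lips as moving when recent motion power stays above a threshold.

    The text additionally requires the power fluctuation to be periodic; that check
    (e.g. a peak in the power autocorrelation) is omitted here for brevity.
    """
    return float(np.mean(power_history)) > power_threshold
```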

  The face change detection means 22 then outputs the face changes detected for each face area by the face orientation detection unit 221, the line-of-sight detection unit 222, the eye opening/closing detection unit 223, and the lip movement detection unit 224 as face change information to the utterance state determination unit 241 of the utterance state estimation means 24 (see FIG. 2A) and to the information acquisition unit 41 of the command generation means 40 (see FIG. 3A).

  The face recognition means 23 recognizes the user contained in each face area detected from the video of the plurality of users. It applies face image recognition to the face areas detected by the face area detection means 21 and determines who is using the device. For example, when the device is a television T (see FIG. 6) and three users are watching it, the face recognition means 23 determines, for each of the three face areas, the registered name such as the name or nickname of the user whose face appears in that area.

  Specifically, the face recognition means 23 identifies the user from the face contained in the face area by template matching on features obtained from frequency analysis of local luminance components with Gabor wavelets. It extracts, as facial feature amounts, the feature values at positions determined around the eyes, nose, and mouth in the face area detected by the face area detection means 21, together with their arrangement, and identifies the user by collating them with image feature amounts registered in advance for each user.

  The face recognition means 23 can also use a technique that tolerates deformation of the positional relationships in the feature arrangement, so that recognition performance does not degrade under changes in facial expression and the like. The face recognition means 23 includes a storage unit (not shown) that holds the facial feature amounts described above in advance. The facial feature amounts can be created, for example, by photographing the user's face from specific angles in advance and registering the face images in the user instruction acquisition device 1 together with a registered name such as the user's name or a nickname.

  As shown in FIG. 1, face area information is input to the face recognition means 23 from the face area detection means 21. The face recognition means 23 then recognizes the users from the face area information by the method described above, and outputs the result, together with the detection time, as person information to the information acquisition unit 41 of the command generation means 40 (see FIG. 3A).
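
  A much-simplified sketch of Gabor-based face identification is given below: it matches a whole-face Gabor descriptor against enrolled descriptors by nearest distance, whereas the means described above uses feature amounts anchored to the eyes, nose, and mouth. Kernel parameters and the enrollment data layout are assumptions.

```python
import cv2
import numpy as np

# A small bank of Gabor kernels at four orientations (kernel parameters are assumptions).
GABOR_KERNELS = [cv2.getGaborKernel((15, 15), 4.0, theta, 8.0, 0.5, 0)
                 for theta in np.linspace(0, np.pi, 4, endpoint=False)]

def gabor_face_descriptor(face_gray):
    """Concatenate downsampled Gabor responses of a face crop as a local-frequency descriptor."""
    face = cv2.resize(face_gray, (64, 64)).astype(np.float32)
    responses = [cv2.filter2D(face, cv2.CV_32F, k) for k in GABOR_KERNELS]
    return np.concatenate([cv2.resize(r, (8, 8)).ravel() for r in responses])

def recognize_user(face_gray, registered_descriptors):
    """Return the registered name whose enrolled descriptor best matches this face.

    registered_descriptors: dict of name -> descriptor computed at enrollment from a
    frontal face photograph (the enrollment step is assumed).
    """
    query = gabor_face_descriptor(face_gray)
    return min(registered_descriptors,
               key=lambda name: np.linalg.norm(query - registered_descriptors[name]))
```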

  The utterance state estimation means 24 determines from the changes in the users' faces whether each user is speaking, and generates an utterance period indicating the period during which that user is speaking. As shown in FIG. 2A, the utterance state estimation means 24 includes an utterance state determination unit 241 and an utterance condition storage unit 242.

  The utterance state determination unit 241 determines whether or not a user is speaking. As shown in FIG. 2A, face change information from the face change detection means 22, covering the user's face orientation, line of sight, eye opening/closing, and lip movement, is input to the utterance state determination unit 241 together with the detection times (not shown) at which the face changes were detected. The utterance conditions are also input to the utterance state determination unit 241 from the utterance condition storage unit 242, which holds them in advance, as shown in FIG. 2A.

  Here, an utterance condition is a predetermined condition for judging that a user is speaking; as shown in FIG. 2B, it is defined in terms of the detection results for face changes such as the user's face orientation, line of sight, eye opening/closing, and lip movement. That is, the utterance state determination unit 241 judges that a user of the device is in the utterance state only when the changes in that user's face detected by the face change detection means 22 satisfy the utterance condition.

  As shown in FIG. 2B, the utterance condition here prescribes that the user is in the utterance state when the user's face is facing front at a time rate of 80% or more, the user's line of sight is directed toward the television screen at a time rate of 80% or more, the user's eyes are open at a time rate of 80% or more, and the user's lips are moving at a time rate of 50% or more. The time rate denotes the ratio of the duration of a face change to the detection time over which the face change is observed. For example, if the face change detection means 22 observes a user's face change over 2 seconds, the time rate is 50% when the change lasts 1 second and 80% when it lasts 1.6 seconds.

  Note that the utterance conditions shown in FIG. 2B are merely an example; the conditions and the time rates can be changed as appropriate depending on the type of device or of user. For example, face orientation, line of sight, and eye opening/closing could be excluded from the utterance condition of FIG. 2B, so that the user is judged to be in the utterance state whenever the user's lips are moving at or above a predetermined time rate.

  The utterance state determination unit 241 compares the face change information input from the face change detection means 22 with the utterance condition input from the utterance condition storage unit 242, and, when the utterance condition is satisfied, generates from the detection times of the face change information an utterance period indicating the period during which the user is speaking. The utterance state determination unit 241 then outputs the utterance period to the voice detection means 11, as shown in FIG. 1 and FIG. 2A.
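
  The time-rate test itself is straightforward; the sketch below applies the example thresholds of FIG. 2B to per-frame detection flags. The flag names and data layout are illustrative, not taken from the patent.

```python
def satisfies_utterance_condition(face_change_flags, min_rates=None):
    """Apply a FIG. 2B-style utterance condition to per-frame face-change flags.

    face_change_flags: dict of boolean lists sampled over the detection window, e.g.
        {"face_front": [...], "gaze_on_screen": [...], "eyes_open": [...], "lips_moving": [...]}
    min_rates: minimum time rate per item; the defaults mirror the example condition
               (80% / 80% / 80% / 50%). Key names are illustrative assumptions.
    """
    if min_rates is None:
        min_rates = {"face_front": 0.8, "gaze_on_screen": 0.8,
                     "eyes_open": 0.8, "lips_moving": 0.5}
    for key, min_rate in min_rates.items():
        flags = face_change_flags[key]
        time_rate = sum(flags) / len(flags)   # duration of the change / detection time
        if time_rate < min_rate:
            return False
    return True
```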

  The utterance state determination unit 241 preferably causes the device to display whether or not the user has been judged to be in the utterance state. For example, when the device is the television T (see FIG. 6) and the utterance state determination unit 241 judges that the user is in the utterance state, it displays that fact on the television screen. Displaying the judgment on the screen encourages the user to keep looking at the television T, which can improve the determination accuracy. The remaining configuration of the user instruction acquisition device 1 is described below with reference to FIG. 1.

  The hand motion analysis means 30 detects human hand regions from the video of the plurality of users captured by the camera Cr, and recognizes the users' hand movements. As shown in FIG. 1, the hand motion analysis means 30 includes hand region detection means 31 and hand motion recognition means 32.

  The hand region detection means 31 detects human hand regions from the video of the plurality of users. Specifically, it detects hand regions from the frames of the video using skin color and rough shape information. Alternatively, when it can be assumed that the user gives instructions by pointing or extending a hand, the hand region detection means 31 can identify the hand region by using a range image and cutting out the part that protrudes furthest.

  As shown in FIG. 1, video of the plurality of users using the device is input to the hand region detection means 31 from the camera Cr. The hand region detection means 31 then detects the users' hand regions from the video by the method described above and outputs them to the hand motion recognition means 32 as hand region information.
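
  A minimal skin-color hand detector in the spirit of the description might look like the following; the HSV thresholds and minimum area are rough assumptions and would need tuning per camera and lighting.

```python
import cv2
import numpy as np

def detect_hand_regions(frame_bgr, min_area=1500):
    """Candidate hand regions from a skin-color mask; thresholds are rough assumptions."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array([0, 40, 60]), np.array([25, 180, 255]))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))  # remove speckle
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > min_area]
```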

  The hand motion recognition means 32 recognizes the movement of the user's hand from the hand regions detected in the video of the plurality of users. It applies a motion recognition method to the regions detected by the hand region detection means 31 and recognizes hand movements that correspond to predetermined commands. Specifically, motion recognition can be performed by matching motion recognition templates, prepared in advance as time-series data of universal features such as SIFT or SURF tracked frame by frame in the hand region, against the time series of the same feature amounts extracted from the user's hand region. The hand motion recognition means 32 recognizes not only the presence or absence of a hand movement but also its type (pointing, finger-flicking, etc.).

  As shown in FIG. 1, hand region information is input to the hand motion recognition means 32 from the hand region detection means 31. The hand motion recognition means 32 then recognizes the user's hand movements by the method described above and outputs them, together with the detection time, as hand motion information to the information acquisition unit 41 of the command generation means 40 (see FIG. 3A).
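
  The template matching of feature time series can be illustrated, in very reduced form, by comparing a track of hand-center positions against prerecorded gesture tracks; SIFT/SURF feature tracking and time warping, which the description allows for, are omitted in this assumed sketch.

```python
import numpy as np

def classify_hand_motion(track, templates):
    """Match a fixed-length track of hand-center positions against gesture templates.

    track     : array of shape (T, 2) holding the hand centroid per frame
    templates : dict of gesture name -> reference track of the same shape, recorded in
                advance (an assumed enrollment step)
    """
    track = np.asarray(track, dtype=float)
    track = track - track[0]                     # make the comparison translation-invariant
    best_name, best_cost = None, np.inf
    for name, reference in templates.items():
        reference = np.asarray(reference, dtype=float)
        cost = np.mean(np.linalg.norm(track - (reference - reference[0]), axis=1))
        if cost < best_cost:
            best_name, best_cost = name, cost
    return best_name
```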

  When the speaker recognized by the voice analysis means 10 is included among the plurality of users recognized by the face analysis means 20, the command generation means 40 generates a predetermined command from that user's face change detected by the face analysis means 20, the user's hand movement recognized by the hand motion analysis means 30, and the content of the user's speech recognized by the voice analysis means 10. As shown in FIG. 3A, the command generation means 40 includes an information acquisition unit 41, a command generation unit 42, and a command condition storage unit 43.

  The information acquisition unit 41 acquires the information needed to generate a command for controlling the device. As shown in FIG. 3A, the information acquisition unit 41 receives face change information from the face change detection means 22, person information from the face recognition means 23, hand motion information from the hand motion recognition means 32, voice information from the voice recognition means 12, and speaker information from the speaker recognition means 13.

  When the speaker recognized by the speaker recognition means 13 is included among the persons recognized by the face recognition means 23, that is, when the plurality of users using the device includes the user who gave the device a voice instruction, the information acquisition unit 41 outputs that user's face change information, hand motion information, and voice information to the command generation unit 42, as shown in FIG. 3A. In this way, when a plurality of users are using the device, the information acquisition unit 41 can identify which of the users recognized by the face recognition means 23 is giving the instruction. The information acquisition unit 41 includes a storage unit (not shown) that temporarily stores the face change information, person information, hand motion information, voice information, and speaker information.

  Note that the face change information input to the information acquisition unit 41 from the face change detection means 22 is provided per face area detected by the face area detection means 21, and the person information input from the face recognition means 23 likewise gives a registered name, such as the user's name, per face area. The information acquisition unit 41 can therefore determine, from the face area, which user each piece of face change information belongs to.

  In addition, as described above, detection times are input to the information acquisition unit 41 together with the person information from the face recognition means 23 and together with the hand motion information from the hand motion recognition means 32. The information acquisition unit 41 can therefore determine, using the detection times as a reference, which user each piece of hand motion information belongs to.
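
  The core cross-check, identifying the instructing user only when the recognized speaker is also among the faces currently recognized, reduces to a simple membership test; the sketch below assumes the person and speaker information are keyed by registered name.

```python
def identify_instructing_user(person_info, speaker_name):
    """Return the user giving the instruction, or None if no recognized user is speaking.

    person_info  : dict of registered name -> face area, from the face recognition step
    speaker_name : registered name returned by the speaker recognition step
    The instructing user is identified only when the recognized speaker is also among
    the users currently seen by the camera, as described above.
    """
    return speaker_name if speaker_name in person_info else None
```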

  The command generation unit 42 generates a command corresponding to an instruction for controlling the device. As shown in FIG. 3A, the face change information, hand motion information, and voice information of the user who gave the device a voice instruction are input to the command generation unit 42 from the information acquisition unit 41. Command conditions are also input to the command generation unit 42 from the command condition storage unit 43, which holds them in advance, as shown in FIG. 3A.

  A command condition is a predetermined condition for generating a command; as shown in FIG. 3B, it is defined in terms of the detection results for the user's face orientation, line of sight, eye opening/closing, lip movement, hand movement, voice, and so on. That is, the command generation unit 42 generates a command only when the user's face change detected by the face change detection means 22, the hand movement recognized by the hand motion recognition means 32, and the speech recognized by the voice recognition means 12 satisfy a command condition.

  As shown in FIG. 3B, four patterns are defined as command conditions. The first pattern, in the first column of the detection results in FIG. 3B, specifies that when the user's face is facing front, the user's line of sight is directed at the television screen, the user's eyes are open, the user's lips are moving, the user is making a hand movement, and the user is speaking, a command is generated by analyzing both the voice instruction content and the hand movement instruction content. For example, a command is generated in a situation like that of user A shown in FIG. 4.

  The second pattern, in the second column of the detection results in FIG. 3B, specifies that when the user's face is facing front, the user's line of sight is directed at the television screen, the user's eyes are open, the user's lips are moving, the user is not making a hand movement, and the user is speaking, a command is generated by analyzing the voice instruction content. For example, a command is generated in a situation like that of user B shown in FIG. 4.

  The third pattern, in the third column of the detection results in FIG. 3B, specifies that when the user's face is turned sideways, the user's line of sight is directed sideways, the user's eyes are open, the user's lips are moving, the user is making a hand movement, and the user is speaking, a command is generated by analyzing both the voice instruction content and the hand movement instruction content. For example, a command is generated in a situation like that of user C shown in FIG. 4.

  The fourth pattern, in the fourth column of the detection results in FIG. 3B, specifies that when the user's face is turned sideways, the user's line of sight is directed sideways, the user's eyes are closed, the user's lips are moving, the user is not making a hand movement, and the user is speaking, a command is generated by analyzing the voice instruction content. For example, a command is generated in a situation like that of user D shown in FIG. 4.

  Note that the command conditions shown in FIG. 3B are merely an example and can be changed as appropriate depending on the type of device or of user. For example, face orientation, line of sight, and eye opening/closing could be excluded from the command conditions of FIG. 3B, so that only the user's lip movement and voice are used as the conditions for generating a command.

  The command generation unit 42 includes a database (not shown) that holds in advance a list of commands for controlling the device. The command generation unit 42 analyzes the user's voice instruction content and hand movement instruction content by searching this database for the commands corresponding to the speech content recognized by the voice recognition means 12 and to the hand movement recognized by the hand motion recognition means 32.

  The database associates commands with the natural words and actions that users produce in everyday life. For example, when the device is a television T (see FIG. 6), utterances indicating that the volume is too low, such as "the sound is tiny", "the sound is low", or "I can't hear well", are associated in the database with the command "increase the television volume". Similarly, the action of covering the ears, which a user performs when the television is too loud, is associated in the database with the command "decrease the television volume".
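
  A toy version of such a database is sketched below; the phrases about low volume and the ear-covering action follow the examples above, while the exact strings and dictionary structure are assumptions.

```python
# Toy stand-in for the command database: the volume phrases and the ear-covering gesture
# follow the examples in the text; the key strings and structure are assumptions.
VOICE_COMMANDS = {
    "the sound is tiny": "increase the television volume",
    "the sound is low": "increase the television volume",
    "I can't hear well": "increase the television volume",
}
HAND_MOTION_COMMANDS = {
    "cover_ears": "decrease the television volume",
}

def lookup_command(voice_text=None, hand_motion=None):
    """Resolve recognized speech and/or a recognized hand motion to a device command."""
    if voice_text and voice_text in VOICE_COMMANDS:
        return VOICE_COMMANDS[voice_text]
    if hand_motion and hand_motion in HAND_MOTION_COMMANDS:
        return HAND_MOTION_COMMANDS[hand_motion]
    return None                                   # no registered command matched
```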

  Thus, the database of the command generation unit 42 holds the command list corresponding to the user's natural utterances and actions, so that the user can instruct the device under more natural conditions.

  In the user instruction acquisition device 1 configured as described above, the face analysis means 20 determines from changes in the user's face whether the user is speaking, and voice recognition is performed only while the user is speaking, which improves its accuracy. In addition, because the user who gave a voice instruction to the device can be identified by comparing the users recognized by face recognition with the speaker recognized by voice recognition, commands can be generated accurately even when a plurality of users are using the device.

  Moreover, with the user instruction acquisition device 1, the user's utterance state is determined automatically from changes in the user's face, and a command is generated simply by the user giving the device an instruction by voice and action. The instruction content can therefore be conveyed to the device and the device controlled as a natural extension of the user's everyday behavior, without any complicated operation.

  The user instruction acquisition device 1 can be realized by operating a general-purpose computer with a program that causes it to function as each of the means described above. This program (the user instruction acquisition program) can be distributed via a communication line, or written to and distributed on a recording medium such as a CD-ROM.

[Operation of User Instruction Acquisition Device]
The operation of the user instruction acquisition device 1 will now be described briefly with reference to FIG. 5.
First, when the user instruction acquisition device 1 starts operating, the camera Cr captures video of the plurality of users using the device and outputs it to the face area detection means 21 and the hand region detection means 31, and the microphone M picks up the sound around the device and outputs it to the voice detection means 11. The voice detection means 11 detects speech from the sound around the device (step S1). Next, the face area detection means 21 detects human face areas from the video of the plurality of users and outputs them as face area information to the detection units of the face change detection means 22 and to the face recognition means 23 (step S2).

  Next, the detection units of the face change detection means 22 detect face changes such as each user's face orientation, line of sight, eye opening/closing, and lip movement from the face area information of the plurality of users, and output them as face change information to the utterance state determination unit 241 and the information acquisition unit 41 (step S3). The face recognition means 23 then recognizes, from the face area information of the plurality of users, the person, that is, the user, corresponding to the face contained in each area, and outputs the result to the information acquisition unit 41 as person information (step S4).

  In addition, the hand region detection means 31 detects human hand regions from the video of the plurality of users and outputs them to the hand motion recognition means 32 as hand region information (step S5). The hand motion recognition means 32 then recognizes the users' hand movements from the hand region information and outputs them to the information acquisition unit 41 as hand motion information (step S6).

  Next, the utterance state determination unit 241 checks whether the face change information of the users satisfies the utterance condition input from the utterance condition storage unit 242, thereby determining whether the users are speaking (step S7). When it determines that a user is speaking, the utterance state determination unit 241 generates an utterance period indicating the period during which the user is speaking and outputs it to the voice detection means 11, which then outputs the detected speech to the voice recognition means 12 and the speaker recognition means 13 (Yes in step S7). If it does not determine that any user is speaking, the utterance state determination unit 241 waits for new input (No in step S7).

  Next, the voice recognition unit 12 recognizes the content of the voice from the voice around the device, and outputs this as voice information to the information acquisition unit 41 (step S8). Moreover, the speaker recognition means 13 recognizes the speaker of the voice from the voice around the device, and outputs this to the information acquisition unit 41 as the speaker information (step S9).

  Next, when the speaker indicated by the speaker information is included among the persons indicated by the person information, the information acquisition unit 41 outputs the face change information, hand motion information, and voice information of the user who gave the voice instruction to the command generation unit 42. When that user's face change information, hand motion information, and voice information satisfy a command condition, the command generation unit 42 generates a command (step S10).

DESCRIPTION OF SYMBOLS
1 User instruction acquisition device
10 Voice analysis means
11 Voice detection means
12 Voice recognition means
13 Speaker recognition means
20 Face analysis means
21 Face area detection means
22 Face change detection means
23 Face recognition means
24 Utterance state estimation means
30 Hand motion analysis means
31 Hand region detection means
32 Hand motion recognition means
40 Command generation means
41 Information acquisition unit
42 Command generation unit
43 Command condition storage unit
221 Face orientation detection unit
222 Line-of-sight detection unit
223 Eye opening/closing detection unit
224 Lip movement detection unit
241 Utterance state determination unit
242 Utterance condition storage unit
Cr Camera
M Microphone
T Television receiver (television)

Claims (4)

  1. A user instruction acquisition device that identifies, from among a plurality of users who use a device, the user who is giving an instruction to control the device, and acquires the instruction from that user, the user instruction acquisition device comprising:
    face analysis means for recognizing each of the plurality of users, registered in advance, from video captured by a camera, detecting a change in the face of each of the plurality of users, and generating, from the change in the face, an utterance period indicating a period during which each of the plurality of users is speaking;
    hand motion analysis means for recognizing the movements of the hands of the plurality of users from the video of the plurality of users;
    voice analysis means for detecting speech from sounds around the device, based on the utterance period generated by the face analysis means, and recognizing the content of the speech and the speaker using acoustic feature amounts registered in advance for each user; and
    command generation means for, when the speaker recognized by the voice analysis means is included among the plurality of users recognized by the face analysis means, identifying the speaker as the user who is giving the instruction and generating a predetermined command for the change in that user's face detected by the face analysis means, the movement of that user's hand recognized by the hand motion analysis means, and the content of that user's voice recognized by the voice analysis means.
  2. The user instruction acquisition device according to claim 1, wherein the face analysis means comprises:
    face area detection means for detecting the areas of the faces of the plurality of users from the video;
    face recognition means for recognizing the user corresponding to each face area using facial feature amounts registered in advance for each user;
    face change detection means for detecting changes in the faces of the plurality of users from the areas of the faces of the plurality of users; and
    utterance state estimation means for determining, from the changes in the faces of the plurality of users, whether or not the plurality of users are speaking, and generating the utterance period when it is determined that a user is speaking.
  3. A user instruction acquisition program for identifying, from among a plurality of users who use a device, the user who is giving an instruction to control the device, and for acquiring the instruction from that user, the program causing a computer to function as:
    face analysis means for recognizing each of the plurality of users, registered in advance, from video captured by a camera, detecting a change in the face of each of the plurality of users, and generating, from the change in the face, an utterance period indicating a period during which each of the plurality of users is speaking;
    hand motion analysis means for recognizing the movements of the hands of the plurality of users from the video of the plurality of users;
    voice analysis means for detecting speech from sounds around the device, based on the utterance period generated by the face analysis means, and recognizing the content of the speech and the speaker using acoustic feature amounts registered in advance for each user; and
    command generation means for, when the speaker recognized by the voice analysis means is included among the plurality of users recognized by the face analysis means, identifying the speaker as the user who is giving the instruction and generating a predetermined command for the change in that user's face detected by the face analysis means, the movement of that user's hand recognized by the hand motion analysis means, and the content of that user's voice recognized by the voice analysis means.
  4. A television receiver that provides broadcast programs to users, comprising:
    the user instruction acquisition device according to claim 1 or claim 2, which acquires instructions given by the user's voice and actions by analyzing video from a camera and sound from a microphone installed in the television receiver.

JP2010149860A (priority and filing date: 2010-06-30): User instruction acquisition device, user instruction acquisition program, and television receiver; status: Expired - Fee Related; granted as JP5323770B2 (en)

Priority application: JP2010149860A, filed 2010-06-30

Publications:
JP2012014394A (en): 2012-01-19
JP5323770B2: 2013-10-23

Family ID: 45600756
Country status: JP


Legal Events

A621  Written request for application examination (effective 2013-01-07)
TRDD  Decision of grant or rejection written
A977  Report on retrieval (effective 2013-06-12)
A01   Written decision to grant a patent or to grant a registration (utility model) (effective 2013-06-18)
A61   First payment of annual fees (during grant procedure) (effective 2013-07-17)
R150  Certificate of patent or registration of utility model
LAPS  Cancellation because of no payment of annual fees