WO2018210219A1 - Human-computer interaction method and system based on front view - Google Patents

Human-computer interaction method and system based on front view Download PDF

Info

Publication number
WO2018210219A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
recognition
image data
intention
view
Prior art date
Application number
PCT/CN2018/086805
Other languages
English (en)
French (fr)
Inventor
刘国华
Original Assignee
刘国华
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 刘国华 filed Critical 刘国华
Priority to EP18803148.8A priority Critical patent/EP3627290A4/en
Priority to US16/614,694 priority patent/US11163356B2/en
Publication of WO2018210219A1 publication Critical patent/WO2018210219A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/012Head tracking input arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31User authentication
    • G06F21/32User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/002Specific input/output arrangements not covered by G06F3/01 - G06F3/16
    • G06F3/005Input arrangements through a video camera
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/013Eye tracking input arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/167Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41Structure of client; Structure of client peripherals
    • H04N21/422Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N21/42203Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS] sound input device, e.g. microphone
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41Structure of client; Structure of client peripherals
    • H04N21/422Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N21/4223Cameras
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/442Monitoring of processes or resources, e.g. detecting the failure of a recording device, monitoring the downstream bandwidth, the number of times a movie has been viewed, the storage space available from the internal hard disk
    • H04N21/44213Monitoring of end-user related data
    • H04N21/44218Detecting physical presence or behaviour of the user, e.g. using sensors to detect if the user is leaving the room or changes his face expression during a TV program
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions

Definitions

  • The invention relates to the technical field of human-computer interaction, and in particular to a human-computer interaction method and system based on front view.
  • Human-computer interaction refers to the process of information exchange between a person and a device, carried out in a certain interaction manner using a dialogue language shared by the person and the device.
  • Mainstream human-computer interaction methods currently fall into three types: the first is the traditional button mode; the second is activation by a specific spoken wake word, for example saying "Hello XiaoIce" before the dialogue so that the device recognizes the speech heard afterwards; the third is "raising a hand to speak", in which a specific gesture is first used to make the device start speech recognition.
  • Although these interaction methods can realize human-computer interaction to a certain extent, the interaction mode is single, a specific gesture or wake word must be set in advance, and the interaction process is not very natural, which to some extent makes operation inconvenient for the user.
  • A human-computer interaction method based on front view includes the steps of:
  • acquiring front view image data, collected by an image acquisition device, of a user and a device in a relatively front view state; collecting current image data of the user in real time through the image acquisition device, and comparing the currently collected image data with the front view image data;
  • when the user and the device are in a relatively front view state, recognizing the user's behavior and intention through computer visual recognition technology and voice recognition technology, and controlling the device, according to a preset correspondence between user behaviors and intentions and operations, to perform the operation corresponding to the user's current behavior and intention.
  • The computer visual recognition technology and voice recognition technology include face recognition, speech recognition, semantic understanding, gesture recognition, lip recognition, voiceprint recognition, expression recognition, age recognition, card recognition, face tracking, pupil recognition, and iris recognition.
  • A human-computer interaction system based on front view comprises:
  • an acquiring module, configured to acquire front view image data, collected by an image acquisition device, of the user and the device in a relatively front view state;
  • a comparison module, configured to collect current image data of the user in real time through the image acquisition device and compare the currently collected image data with the front view image data;
  • a determining module, configured to determine that the user and the device are in a relatively front view state when the currently collected image data is consistent with the front view image data;
  • a control module, configured to recognize the user's behavior and intention through computer visual recognition technology and voice recognition technology when the user and the device are in a relatively front view state, and to control the device, according to a preset correspondence between user behaviors and intentions and operations, to perform the operation corresponding to the user's current behavior and intention.
  • The computer visual recognition technology and voice recognition technology include face recognition, speech recognition, semantic understanding, gesture recognition, lip recognition, voiceprint recognition, expression recognition, age recognition, card recognition, face tracking, pupil recognition, and iris recognition.
  • According to the front view based human-computer interaction method and system of the invention, front view image data of the user and the device in a relatively front view state, collected by the image acquisition device, is acquired, the user's current image data is collected, and the currently collected image data is compared with the front view image data.
  • When they are consistent, the user's behavior and intention are recognized through computer visual recognition technology and voice recognition technology, and the device is controlled, according to the preset correspondence between user behaviors and intentions and operations, to perform the operation corresponding to the user's current behavior and intention.
  • Throughout the process, the front view determination is made on the basis of the image data collected by the image acquisition device, and the front view state of the user and the device is used as the precondition for human-computer interaction, ensuring that the current user really has an interaction requirement and keeping the whole interaction process natural.
  • In addition, a variety of recognition methods, including face recognition, speech recognition, gesture recognition, lip recognition, voiceprint recognition, expression recognition, age recognition, card recognition, pupil recognition, and iris recognition, are used to recognize the user's next action, enabling many styles of human-computer interaction and bringing convenient operation to the user.
  • FIG. 1 is a schematic flow chart of a first embodiment of a human-computer interaction method based on front view
  • FIG. 2 is a schematic flow chart of a second embodiment of a human-computer interaction method based on front view
  • FIG. 3 is a schematic structural diagram of a front view human-computer interaction system according to a first embodiment of the present invention
  • FIG. 4 is a schematic diagram of a specific application scenario of a human-computer interaction method and system based on the front view of the present invention.
  • a method for human-computer interaction based on front view includes the following steps:
  • S200 Obtain the front view image data of the user and the device collected by the image collection device in a relatively front view state.
  • The device may specifically be a television, an air conditioner, a computer, a robot, etc., and may also include an in-vehicle device or the like.
  • That the user and the device are in a relatively front view state means that the user is facing the device; for example, when the device is a television, the state in which the user faces the television is the state in which the user and the television are in a relatively front view state.
  • Since the image acquisition device generally cannot be placed at the exact center of the device, when it captures an image of the user and the device in a relatively front view state, the user's eyes or face are not directly facing the image acquisition device from its viewpoint, but generally appear at a certain angle.
  • To facilitate accurate determination of the front view state later, the front view image data of the user and the device in a relatively front view state, collected by the image acquisition device, is acquired first.
  • The front view image data in which the user and the device are in a relatively front view state may be data already collected in a history record or data collected on the spot.
  • The image acquisition device may be a device such as a camera.
  • The front view image data of the user and the device in a relatively front view state is collected by the image acquisition device, which may be disposed on the device itself or on an auxiliary or peripheral device of the device.
  • For example, when the device is a television, the image acquisition device may be disposed on the television or on a set-top box matched with the television. More specifically, after image processing and image object coordinate conversion are performed on the front view image data captured by the camera, the relative position of the device and the user's face can be determined, that is, the face image data of the user with the user and the device in a relatively front view state can be obtained.
  • Determining that the user and the device are in a relatively front view state may be implemented with techniques such as head pose estimation or gaze tracking.
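  • As a rough illustration of the front view determination described above, the following sketch (not part of the patent) assumes a hypothetical pose_fn helper, for example built on facial-landmark detection and head pose estimation, that returns the head's yaw and pitch relative to the camera; the stored front view image data serves as a calibration reference for the camera's off-centre mounting, and the 10-degree tolerance is an arbitrary assumption.

```python
class FrontViewDetector:
    """Decides whether the user is in a 'relatively front view' state.

    pose_fn(frame) -> (yaw_deg, pitch_deg) is a hypothetical helper (e.g. facial
    landmarks plus head pose estimation). A reference frame captured while the
    user deliberately faces the device calibrates the expected yaw/pitch offset
    caused by the camera not being at the centre of the device.
    """

    def __init__(self, pose_fn, reference_frame, tolerance_deg=10.0):
        self.pose_fn = pose_fn
        self.ref_yaw, self.ref_pitch = pose_fn(reference_frame)
        self.tolerance_deg = tolerance_deg

    def is_front_view(self, current_frame):
        yaw, pitch = self.pose_fn(current_frame)
        return (abs(yaw - self.ref_yaw) <= self.tolerance_deg
                and abs(pitch - self.ref_pitch) <= self.tolerance_deg)
```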
  • S400: The current image data of the user is collected in real time through the image acquisition device, and the currently collected image data is compared with the front view image data.
  • The current image data of the user is collected in real time by the same image acquisition device as in step S200, and the image data collected in real time is compared with the front view image data acquired in step S200, in order to determine whether the current user and the device are in a relatively front view state.
  • S600: When the currently collected image data is consistent with the front view image data, it is determined that the user and the device are in a relatively front view state; that is, when the front view image data acquired in step S200 is consistent with the image data collected in real time in step S400, the current user and the device are in a relatively front view state.
  • the control device performs the current behavior with the user according to the preset user's behavior and the intention and operation correspondence relationship.
  • the computer visual recognition technology and speech recognition technology include face recognition, speech recognition, semantic understanding, gesture recognition, lip recognition, voiceprint recognition, expression recognition, age recognition, card recognition, face tracking , pupil recognition and iris recognition.
  • the user's behavior and intention are recognized by the computer's visual recognition technology and voice recognition technology, and the device's current behavior and intention are controlled according to the preset user's behavior and the intention and operation correspondence.
  • Corresponding operation That is, only if it is determined that the user and the device are in a relatively front view state, the device will start to respond to the user operation, thus avoiding erroneous operations on the one hand, for example, avoiding the wrong startup of the television, incorrectly switching the TV program, etc.;
  • the user and the device are in a relatively front view state, there is a great possibility that the user operates the device to bring convenience to the user.
  • computer visual recognition technology and speech recognition technology mainly include face recognition, face detection, face tracking, speech recognition, gesture recognition, lip recognition, voiceprint recognition, expression recognition, age recognition, and card recognition. , pupil recognition and iris recognition.
  • Using this rich set of computer visual recognition and voice recognition technologies, human-computer interaction can be realized through the face, voice, pupils, gestures, and so on, further enriching the user's life and bringing convenient operation to the user.
  • According to the front view based human-computer interaction method of the invention, the front view image data of the user and the device in a relatively front view state, collected by the image acquisition device, is acquired, the user's current image data is collected, and the currently collected image data is compared with the front view image data.
  • When they are consistent, it is determined that the user and the device are in a relatively front view state, the user's behavior and intention are recognized through computer visual recognition technology and voice recognition technology, and the device is controlled, according to the preset correspondence between behaviors, intentions, and operations, to perform the operation corresponding to the user's current behavior and intention.
  • Throughout the process, the front view determination is made on the basis of the image data collected by the image acquisition device, and the front view state of the user and the device is used as the precondition for human-computer interaction, ensuring that the current user really has an interaction requirement and keeping the whole interaction process natural.
  • In addition, a variety of recognition methods, including face recognition, speech recognition, gesture recognition, lip recognition, voiceprint recognition, expression recognition, age recognition, card recognition, pupil recognition, and iris recognition, are used to recognize the user's next action, enabling many styles of human-computer interaction and bringing convenient operation to the user.
  • In one embodiment, step S800 includes:
  • S820: Timing how long the user and the device have been in a relatively front view state.
  • S840: When the time for which the user and the device have been in a relatively front view state is greater than a preset time, recognizing the user's behavior and intention through computer visual recognition technology and voice recognition technology, and controlling the device, according to the preset correspondence between behaviors, intentions, and operations, to perform the operation corresponding to the user's current behavior and intention.
  • The preset time is a time threshold set in advance, which may be set according to actual needs, for example to 2 seconds, 3 seconds, or 5 seconds.
  • When it is determined in step S600 that the user and the device are in a relatively front view state, timing of the front view state begins. When the time for which the user and the device have been in a relatively front view state exceeds the preset time, it is highly probable that the user currently needs the device to perform a next-step operation; at this point the user's behavior and intention are recognized through computer visual recognition technology and voice recognition technology, and the device is controlled, according to the preset correspondence between behaviors, intentions, and operations, to perform the operation corresponding to the user's current behavior and intention, for example starting the device.
  • Techniques such as face recognition, pupil recognition, and iris recognition can be used to determine that the user and the device are maintaining a relatively front view state; that is, maintaining the front view state is itself a type of user action. Optionally, after the device is started, face recognition technology is used to identify the user's identity, video image data matching the user's identity is searched for, and the device is controlled to display the found video image data.
  • In practical applications, when the device is a television, the time for which the user and the television remain in a relatively front view state, i.e. the time for which the user faces the TV screen, is timed; when the user has faced the television for longer than the preset time (for example, 2 seconds), the television is started, the user's identity is recognized, a television program preferred by the current user is searched for, and the television is controlled to switch to that program.
  • In a practical application scenario, the above embodiment is "front view state" + time: when the user has been "facing" the television for a certain time, for example 2 seconds, it can be assumed that the user wants to watch a television program, and the television can start playing a program from standby; the television can also actively greet the user and start communicating. It can also be "front view state" + time + "face recognition": knowing who the user is, the television can play a program that this user likes; the television can also actively call out to the user and communicate with the user.
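  • The "front view state" + time trigger can be sketched as follows; this is an illustrative assumption rather than the patented implementation, reusing the FrontViewDetector sketch above and an assumed 2-second dwell threshold.

```python
import time

class DwellTrigger:
    """Fires once the user has stayed in the front view state for dwell_s seconds."""

    def __init__(self, detector, dwell_s=2.0):
        self.detector = detector   # e.g. the FrontViewDetector sketched earlier
        self.dwell_s = dwell_s
        self._since = None         # timestamp at which the front view state began

    def update(self, frame):
        """Returns True exactly once, when the dwell threshold is first exceeded."""
        if self.detector.is_front_view(frame):
            if self._since is None:
                self._since = time.monotonic()
            elif time.monotonic() - self._since >= self.dwell_s:
                self._since = None          # fire once, then wait for a fresh dwell
                return True
        else:
            self._since = None              # user looked away: restart the timer
        return False
```
  • On a television, the True return could, for example, wake the set from standby, trigger identification of the user, or start a greeting, as in the examples above.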
  • In one embodiment, the step of recognizing the user's behavior and intention through computer visual recognition technology and voice recognition technology and controlling the device, according to the preset correspondence between user behaviors and intentions and operations, to perform the operation corresponding to the user's current behavior and intention includes:
  • Step 1: performing speech recognition and lip language recognition on the user;
  • Step 2: when the speech recognition result and the lip recognition result are consistent, controlling the device to respond to the user's voice operation.
  • Lip language recognition is performed on the user who is in the "front view" state in front of the device, and at the same time speech recognition is performed on the detected voice information.
  • The lip recognition result is compared with the speech recognition result. If the results are consistent, it can be determined that the user in the front view state is talking to the device (for example a television), and the device is controlled to respond accordingly; if the results are inconsistent, the device does not respond.
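  • A minimal sketch of the consistency check between the two recognizers is given below; it is illustrative only, the 0.8 similarity threshold and the execute callback are assumptions, and a real system would compare recognition hypotheses rather than raw transcripts.

```python
import difflib

def texts_match(asr_text: str, lip_text: str, threshold: float = 0.8) -> bool:
    """Crude agreement test between the speech recognition transcript and the
    lip-reading transcript of the front-view user."""
    ratio = difflib.SequenceMatcher(None, asr_text.strip(), lip_text.strip()).ratio()
    return ratio >= threshold

def handle_utterance(asr_text: str, lip_text: str, execute) -> None:
    # Respond only when the front-view user's lip movements and the detected audio
    # agree, i.e. the speech was addressed to the device rather than to someone else.
    if texts_match(asr_text, lip_text):
        execute(asr_text)
```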
  • In another embodiment, the step of recognizing the user's behavior and intention through the computer's visual recognition technology and voice recognition technology and controlling the device to perform the operation corresponding to the user's current behavior and intention includes:
  • Step 1: performing speech recognition and semantic understanding on the user;
  • Step 2: when the speech recognition result and the result of semantic understanding match the current scene of the device, controlling the device to respond to the user's voice operation.
  • In this embodiment the user's intention is further understood through speech recognition and semantic understanding, and the device is controlled to respond to the user's voice operation only when the recognized speech and its meaning match the device's current scene. For example, while the user is watching TV, if what he says is "I am off tomorrow", this is obviously not an operation on the television and the television does not respond; if the user says "CCTV-1", then obviously the television should switch to CCTV-1.
  • In practical applications, taking a television as the device, speech recognition and lip language recognition are performed on user A: on the one hand the voice information uttered by user A is collected, and on the other hand, based on the front view state, lip language recognition is performed on user A.
  • When the speech recognition result and the lip recognition result are consistent, it is determined that user A is interacting with the television, and the television is controlled to respond accordingly, for example by switching the television program or adjusting the TV volume.
  • In one embodiment, before the step of determining that the user and the device are in a relatively front view state, the method further comprises:
  • Step 1: when a user is detected, locating the position of the user's face as the sound source position;
  • Step 2: aiming the sound collection device at the sound source position.
  • In this case, the step of recognizing the user's behavior and intention through the computer's visual recognition technology and voice recognition technology and controlling the device, according to the preset correspondence between behaviors, intentions, and operations, to perform the operation corresponding to the user's current behavior and intention includes:
  • collecting user voice data through the sound collection device, and, when the collected user voice data carries a voice operation instruction, extracting the voice operation instruction and controlling the device to perform the operation corresponding to the voice operation instruction.
  • When a user is detected, the position of the user's face is located as the sound source position, and the sound collection device is aimed at that sound source position, ready to collect the user's voice data.
  • Specifically, this process may detect the position of the user's face based on face detection and face tracking technology, and set that position as the sound source position.
  • In subsequent operations, when it is determined that the current user and the device are in a relatively front view state, the user's voice data is collected and speech recognition is performed; when the collected user voice data carries a voice operation instruction, the voice operation instruction is extracted and the device is controlled to perform the operation corresponding to the voice operation instruction.
  • In addition, the user may be detected by means of face detection, face tracking, human body detection, and so on; when the face position is detected, the user's face position is set as the sound source position.
  • In practical applications, the sound collection device may be an array microphone; the array microphone is aimed directly at the sound source position and collects the user's voice data, and when the collected user voice data carries a voice operation instruction (for example "next channel"), the voice operation instruction is extracted and the device is controlled to perform the corresponding operation.
  • More specifically, in a practical application scenario in which several people are watching TV, all of them facing the television, and several of them speak at the same time, a future array microphone (which, like a radar, can track multiple targets) can record multiple sound sources. The number and positions of the users, that is, the number and positions of the target sound sources, are detected by means of face detection, and the position information of the target sound sources is provided to the array microphone. Combined with face-based identity recognition, the voices of several people can be collected at the same time and it can be distinguished who said what; when the voice data uttered by a user carries a "next channel" operation instruction, the television is controlled to switch to the next channel.
  • According to this front view based human-computer interaction method, the front view state is used as the "switch" for subsequent processing: only when the user and the device are in a relatively front view state are subsequent operations carried out, including starting the recording, starting speech recognition, or acting on the speech recognition result.
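  • The face-position-to-sound-source mapping can be illustrated with a simple pinhole-camera calculation; this sketch is an assumption, it ignores the calibration between the camera and the microphone array, and horizontal_fov_deg and the set_beam_angle callback are hypothetical parameters.

```python
import math

def face_bearing_deg(face_center_x: float, image_width: int,
                     horizontal_fov_deg: float = 60.0) -> float:
    """Maps the horizontal pixel position of a detected face to a bearing angle
    in degrees (0 = straight ahead) under a pinhole-camera assumption."""
    half_w = image_width / 2.0
    focal_px = half_w / math.tan(math.radians(horizontal_fov_deg / 2.0))
    return math.degrees(math.atan((face_center_x - half_w) / focal_px))

def steer_beams(face_boxes, image_width, set_beam_angle):
    # One beam per detected face: the detected face positions are the target
    # sound-source positions handed to the array microphone.
    for (x, y, w, h) in face_boxes:            # boxes from any face detector
        set_beam_angle(face_bearing_deg(x + w / 2.0, image_width))
```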
  • In another embodiment, after the step of determining that the user and the device are in a relatively front view state, the method further comprises:
  • Step 1: receiving an operation instruction input by the user, the operation instruction including a non-front-view-state operation instruction and a front-view-state operation instruction;
  • Step 2: when it is detected that the user is no longer in the front view state, responding to the non-front-view-state operation instruction input by the user;
  • Step 3: when it is detected that the user enters the front view state again, responding to the front-view-state operation instruction input by the user.
  • In practical applications the television receives an operation instruction input by the user; specifically, the user may input the operation instruction through a remote control, by directly touching a button, or by tapping a touch display area provided on the television. The operation instruction is divided into a non-front-view-state operation instruction and a front-view-state operation instruction: when it is detected that the user is no longer in the front view state, the non-front-view-state operation instruction input by the user is responded to; when it is detected that the user enters the front view state again, the front-view-state operation instruction input by the user is responded to.
  • For example, through a voice instruction or in some other way, the television is put into a "record my back" state: when the person turns from facing the television to a side view, the television automatically turns on recording mode; the person turns around one full circle, and when facing the television again the recording stops, video playback mode is turned on, and the video just recorded is played.
  • In one embodiment, after the step of collecting the user's current image data in real time through the image acquisition device, the method further includes:
  • Step 1: acquiring image data of the user facing the device;
  • Step 2: comparing the image data of the user facing the device with the currently collected image data;
  • Step 3: when the image data of the user facing the device is consistent with the currently collected image data, starting the computer's visual recognition technology and voice recognition technology, and/or a preset operation.
  • Specifically, the preset corresponding computer visual recognition and voice recognition functions are started only when it is detected that the user is facing the device. Whether the user is facing the device can be detected by comparing the image data of the user facing the device with the currently collected image data: when they are consistent, it indicates that the current user is facing the device, and the computer's visual recognition and voice recognition functions (for example gesture recognition, face recognition, and speech recognition) are started; when they are inconsistent, it indicates that the current user is not yet facing the device, and these functions are not started.
  • In practical applications, taking an air conditioner as the device, the camera collects the user's current image data in real time and the image data of the user facing the air conditioner is acquired; the image data of the user facing the air conditioner is compared with the currently collected image data, and when the two are consistent it indicates that the current user is facing the air conditioner, so speech recognition, face recognition, and gesture recognition are started: speech recognition is used to recognize the user's voice instructions, face recognition is used to recognize the user's identity, and gesture recognition is used to recognize the user's gesture instructions.
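  • The "start recognition only when the user faces the device" behavior amounts to gating the recognizers behind the front view check; the sketch below is an assumption about how such a gate could look, reusing the FrontViewDetector sketched earlier.

```python
class GatedAssistant:
    """Runs the heavier recognizers only while the user is facing the device."""

    def __init__(self, detector, recognizers):
        self.detector = detector        # e.g. the FrontViewDetector sketched earlier
        self.recognizers = recognizers  # callables such as speech/face/gesture recognizers
        self.active = False

    def on_frame(self, frame):
        facing = self.detector.is_front_view(frame)
        if facing and not self.active:
            self.active = True          # user turned to the device: switch recognizers on
        elif not facing and self.active:
            self.active = False         # user turned away: switch recognizers off
        if self.active:
            for recognize in self.recognizers:
                recognize(frame)
```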
  • As shown in FIG. 3, a human-computer interaction system based on front view includes:
  • an obtaining module 200, configured to acquire front view image data, collected by an image acquisition device, of the user and the device in a relatively front view state;
  • a comparison module 400, configured to collect the user's current image data in real time through the image acquisition device and compare the currently collected image data with the front view image data;
  • a determining module 600, configured to determine that the user and the device are in a relatively front view state when the currently collected image data is consistent with the front view image data;
  • a control module 800, configured to recognize the user's behavior and intention through computer visual recognition technology and voice recognition technology when the user and the device are in a relatively front view state, and to control the device, according to the preset correspondence between user behaviors and intentions and operations, to perform the operation corresponding to the user's current behavior and intention; the computer visual recognition technology and voice recognition technology include face recognition, speech recognition, gesture recognition, lip recognition, voiceprint recognition, expression recognition, age recognition, card recognition, pupil recognition, and iris recognition.
  • In the front view based human-computer interaction system of the invention, the obtaining module 200 acquires the front view image data, collected by the image acquisition device, of the user and the device in a relatively front view state; the comparison module 400 collects the user's current image data and compares the currently collected image data with the front view image data; when they are consistent, the determining module 600 determines that the user and the device are in a relatively front view state; and the control module 800 recognizes the user's behavior and intention through computer visual recognition technology and voice recognition technology and, according to the preset correspondence between user behaviors and intentions and operations, controls the device to perform the operation corresponding to the user's current behavior and intention.
  • Throughout the process, the front view determination is made on the basis of the image data collected by the image acquisition device, and the front view state of the user and the device is used as the precondition for human-computer interaction, ensuring that the current user really has an interaction requirement and keeping the whole interaction process natural.
  • In addition, a variety of recognition methods, including face recognition, speech recognition, gesture recognition, lip recognition, pupil recognition, and iris recognition, are used to recognize the user's next action, enabling many styles of human-computer interaction and bringing convenient operation to the user.
  • In one embodiment, the control module 800 includes:
  • a timing unit, configured to time how long the user and the device have been in a relatively front view state and, when that time is greater than the preset time, to recognize the user's behavior and intention through computer visual recognition technology and voice recognition technology and control the device, according to the preset correspondence between behaviors, intentions, and operations, to perform the operation corresponding to the user's current behavior and intention.
  • In one embodiment, the control module 800 further includes:
  • a search control unit, configured to search for preset video image data matching the user's identity and control the device to display the found video image data.
  • In one embodiment, the control module 800 includes:
  • a recognition unit, configured to perform speech recognition and lip language recognition on the user;
  • a control unit, configured to control the device to respond to the user's voice operation when the speech recognition result and the lip recognition result are consistent.
  • In one embodiment, the control module 800 includes:
  • a positioning unit, configured to locate the user's face position as the sound source position when a user is detected;
  • an adjusting unit, configured to aim the sound collection device at the sound source position and collect the user's voice data;
  • an extraction control unit, configured to extract the voice operation instruction when the collected user voice data carries a voice operation instruction, and to control the device to perform the operation corresponding to the voice operation instruction.
  • Front view image data of the user and the television in a relatively front view state, collected by the camera shown in FIG. 4, is acquired.
  • Current image data is collected in real time by the camera shown in FIG. 4, and the data collected in real time is compared with the front view image data of the user and the television in a relatively front view state.
  • When the user has faced the television for a certain time, for example 2 seconds, it can be assumed that the user wants to watch a television program; the TV can start playing a program from standby and can also actively greet the user.
  • Knowing who the user is, and knowing his or her expression, the television can actively communicate with the user and even provide a corresponding service. If a child is crying in front of the television, the television can automatically place a video call to the mother; her video soon appears on the television so that the child can talk with her.
  • When face recognition confirms that there is only one user on site, the television can regard the result of speech recognition as something that user said to the television, and respond and give feedback accordingly.
  • When face recognition confirms that there are several users on site, it is judged whether each user is in the "front view" state, the lip movements of the "front view" user are detected, lip language recognition is performed on that user, and speech recognition is performed on the detected voice information.
  • The lip recognition result is compared with the speech recognition result. If the results are consistent, it can be determined that the front view user is talking to the television and the television responds accordingly; if the results are inconsistent, the television does not respond.
  • When the user looks at an air conditioner, the air-conditioning management system confirms through head pose estimation that the user is in the "front view" state; the air conditioner then starts face recognition (knowing who the user is, it turns on and adjusts to the state this user likes), starts gesture recognition (so it can accept the user's gesture operations), and starts recording and speech recognition (so it can accept the user's voice instruction operations).

Abstract

The present invention provides a human-computer interaction method and system based on front view. Front view image data of a user and a device in a relatively front view state, collected by an image acquisition device, is acquired; the user's current image data is collected, and the currently collected image data is compared with the front view image data. When they are consistent, the user's behavior and intention are recognized through computer visual recognition technology and voice recognition technology, and the device is controlled, according to a preset correspondence between user behaviors and intentions and operations, to perform the operation corresponding to the user's current behavior and intention. Throughout the process, the front view determination is made on the basis of the image data collected by the image acquisition device, and the front view state of the user and the device is used as the precondition for human-computer interaction, so the whole interaction process is natural; in addition, a variety of computer visual recognition and voice recognition technologies, including face recognition, speech recognition, gesture recognition, lip recognition, pupil recognition, and iris recognition, are used to recognize the user's next action, enabling many styles of human-computer interaction.

Description

Human-computer interaction method and system based on front view — Technical Field
The present invention relates to the technical field of human-computer interaction, and in particular to a human-computer interaction method and system based on front view.
Background
Human-computer interaction refers to the process of information exchange between a person and a device, carried out in a certain interaction manner using a dialogue language shared by the person and the device, in order to accomplish a given task.
With the development of science and technology, the fields in which human-computer interaction technology is applied have become ever broader, from something as small as the play button of a radio to something as large as the instrument panel of an aircraft or the control room of a power plant; in all of them the user can communicate with the system and operate it through a human-computer interaction interface. Among current human-computer interaction technologies, the mainstream interaction methods fall into three types: the first is the traditional button mode; the second is activation by a specific spoken wake word, for example saying "Hello XiaoIce" before the dialogue so that the device recognizes the speech heard afterwards; the third is "raising a hand to speak", in which a specific gesture is first used to make the device start speech recognition.
Although the above interaction methods can realize human-computer interaction to a certain extent, the interaction mode is single, a specific gesture must be set in advance, and the interaction process is not very natural, which to some extent makes operation inconvenient for the user.
Summary of the Invention
On this basis, in view of the problem that common human-computer interaction methods are single and unnatural and therefore inconvenient for users to operate, it is necessary to provide a front view based human-computer interaction method and system that offers diverse interaction modes, a natural interaction process, and convenient operation for the user.
A human-computer interaction method based on front view includes the steps of:
acquiring front view image data, collected by an image acquisition device, of a user and a device in a relatively front view state;
collecting current image data of the user in real time through the image acquisition device, and comparing the currently collected image data with the front view image data;
when the currently collected image data is consistent with the front view image data, determining that the user and the device are in a relatively front view state;
when the user and the device are in a relatively front view state, recognizing the user's behavior and intention through computer visual recognition technology and voice recognition technology, and controlling the device, according to a preset correspondence between user behaviors and intentions and operations, to perform the operation corresponding to the user's current behavior and intention, the computer visual recognition technology and voice recognition technology including face recognition, speech recognition, semantic understanding, gesture recognition, lip recognition, voiceprint recognition, expression recognition, age recognition, card recognition, face tracking, pupil recognition, and iris recognition.
A human-computer interaction system based on front view includes:
an acquiring module, configured to acquire front view image data, collected by an image acquisition device, of the user and the device in a relatively front view state;
a comparison module, configured to collect current image data of the user in real time through the image acquisition device and compare the currently collected image data with the front view image data;
a determining module, configured to determine that the user and the device are in a relatively front view state when the currently collected image data is consistent with the front view image data;
a control module, configured to recognize the user's behavior and intention through computer visual recognition technology and voice recognition technology when the user and the device are in a relatively front view state, and to control the device, according to a preset correspondence between user behaviors and intentions and operations, to perform the operation corresponding to the user's current behavior and intention, the computer visual recognition technology and voice recognition technology including face recognition, speech recognition, semantic understanding, gesture recognition, lip recognition, voiceprint recognition, expression recognition, age recognition, card recognition, face tracking, pupil recognition, and iris recognition.
According to the front view based human-computer interaction method and system of the present invention, front view image data of the user and the device in a relatively front view state, collected by the image acquisition device, is acquired, the user's current image data is collected, and the currently collected image data is compared with the front view image data; when they are consistent, it is determined that the user and the device are in a relatively front view state, the user's behavior and intention are recognized through computer visual recognition technology and voice recognition technology, and the device is controlled, according to the preset correspondence between user behaviors and intentions and operations, to perform the operation corresponding to the user's current behavior and intention. Throughout the process, the front view determination is made on the basis of the image data collected by the image acquisition device, and the front view state of the user and the device is used as the precondition for human-computer interaction, ensuring that the current user really has an interaction requirement and keeping the whole interaction process natural; in addition, a variety of recognition methods, including face recognition, speech recognition, gesture recognition, lip recognition, voiceprint recognition, expression recognition, age recognition, card recognition, pupil recognition, and iris recognition, are used to recognize the user's next action, enabling many styles of human-computer interaction and bringing convenient operation to the user.
Brief Description of the Drawings
FIG. 1 is a schematic flow chart of a first embodiment of the front view based human-computer interaction method of the present invention;
FIG. 2 is a schematic flow chart of a second embodiment of the front view based human-computer interaction method of the present invention;
FIG. 3 is a schematic structural diagram of a first embodiment of the front view based human-computer interaction system of the present invention;
FIG. 4 is a schematic diagram of a specific application scenario of the front view based human-computer interaction method and system of the present invention.
Detailed Description of the Embodiments
As shown in FIG. 1, a human-computer interaction method based on front view includes the steps of:
S200: acquiring front view image data, collected by an image acquisition device, of a user and a device in a relatively front view state.
The device may specifically be a television, an air conditioner, a computer, a robot, etc., and may also include an in-vehicle device or the like. That the user and the device are in a relatively front view state means that the user is facing the device; for example, when the device is a television, the state in which the user faces the television is the state in which the user and the television are in a relatively front view state. Since the image acquisition device generally cannot be placed at the exact center of the device, when it captures an image of the user and the device in a relatively front view state, the user's eyes or face are not directly facing the image acquisition device from its viewpoint, but generally appear at a certain angle. To facilitate accurate determination of the front view state later, the front view image data of the user and the device in a relatively front view state, collected by the image acquisition device, is acquired first. Specifically, the front view image data in which the user and the device are in a relatively front view state may be data already collected in a history record or data collected on the spot. The image acquisition device may be a device such as a camera; here, the front view image data of the user and the device in a relatively front view state is collected by the image acquisition device, which may be disposed on the device itself or on an auxiliary device or peripheral device of the device. For example, when the device is a television, the image acquisition device may be disposed on the television or on a set-top box matched with the television. More specifically, after image processing and image object coordinate conversion are performed on the front view image data captured by the camera with the user and the device in a relatively front view state, the relative position of the device and the user's face can be determined, that is, the face image data of the user with the user and the device in a relatively front view state can be obtained. Determining that the user and the device are in a relatively front view state may be implemented with techniques such as head pose estimation or gaze tracking.
S400: collecting current image data of the user in real time through the image acquisition device, and comparing the currently collected image data with the front view image data.
The current image data of the user is collected in real time by the same image acquisition device as in step S200, and the image data collected in real time is compared with the front view image data acquired in step S200, in order to determine whether the current user and the device are in a relatively front view state.
S600: when the currently collected image data is consistent with the front view image data, determining that the user and the device are in a relatively front view state.
When the front view image data acquired in step S200 is consistent with the image data collected in real time in step S400, it indicates that the current user and the device are in a relatively front view state.
S800: when the user and the device are in a relatively front view state, recognizing the user's behavior and intention through computer visual recognition technology and voice recognition technology, and controlling the device, according to the preset correspondence between user behaviors and intentions and operations, to perform the operation corresponding to the user's current behavior and intention, the computer visual recognition technology and voice recognition technology including face recognition, speech recognition, semantic understanding, gesture recognition, lip recognition, voiceprint recognition, expression recognition, age recognition, card recognition, face tracking, pupil recognition, and iris recognition.
On the premise that the user and the device are in a relatively front view state, the user's behavior and intention are recognized through the computer's visual recognition technology and voice recognition technology, and the device is controlled, according to the preset correspondence between user behaviors and intentions and operations, to perform the operation corresponding to the user's current behavior and intention. In other words, the device starts to respond to user operations only when it is determined that the user and the device are in a relatively front view state; on the one hand this avoids erroneous operations, for example the television being started or TV programs being switched by mistake, and on the other hand, when the user and the device are in a relatively front view state there is a great likelihood that the user intends to operate the device, which brings convenience to the user. Specifically, the computer visual recognition technology and voice recognition technology may mainly include face recognition, face detection, face tracking, speech recognition, gesture recognition, lip recognition, voiceprint recognition, expression recognition, age recognition, card recognition, pupil recognition, iris recognition, and so on. Using this rich set of computer visual recognition and voice recognition technologies, human-computer interaction can be realized through the face, voice, pupils, gestures, and so on, further enriching the user's life and bringing convenient operation to the user.
According to the front view based human-computer interaction method of the present invention, front view image data of the user and the device in a relatively front view state, collected by the image acquisition device, is acquired, the user's current image data is collected, and the currently collected image data is compared with the front view image data; when they are consistent, it is determined that the user and the device are in a relatively front view state, the user's behavior and intention are recognized through computer visual recognition technology and voice recognition technology, and the device is controlled, according to the preset correspondence between user behaviors and intentions and operations, to perform the operation corresponding to the user's current behavior and intention. Throughout the process, the front view determination is made on the basis of the image data collected by the image acquisition device, and the front view state of the user and the device is used as the precondition for human-computer interaction, ensuring that the current user really has an interaction requirement and keeping the whole interaction process natural; in addition, a variety of recognition methods, including face recognition, speech recognition, gesture recognition, lip recognition, voiceprint recognition, expression recognition, age recognition, card recognition, pupil recognition, and iris recognition, are used to recognize the user's next action, enabling many styles of human-computer interaction and bringing convenient operation to the user.
As shown in FIG. 2, in one embodiment, step S800 includes:
S820: timing how long the user and the device have been in a relatively front view state.
S840: when the time for which the user and the device have been in a relatively front view state is greater than a preset time, recognizing the user's behavior and intention through computer visual recognition technology and voice recognition technology, and controlling the device, according to the preset correspondence between user behaviors and intentions and operations, to perform the operation corresponding to the user's current behavior and intention.
The preset time is a time threshold set in advance, which may be set according to actual needs, for example to 2 seconds, 3 seconds, or 5 seconds. When step S600 determines that the user and the device are in a relatively front view state, timing of the front view state begins; when the time for which the user and the device have been in a relatively front view state exceeds the preset time, it is highly probable that the user currently needs the device to perform a next-step operation. At this point the user's behavior and intention are recognized through computer visual recognition technology and voice recognition technology, and the device is controlled, according to the preset correspondence between behaviors, intentions, and operations, to perform the operation corresponding to the user's current behavior and intention, for example starting the device. Techniques such as face recognition, pupil recognition, and iris recognition may be used to determine that the user and the device are maintaining a relatively front view state; that is, maintaining the front view state is itself a type of user action. Optionally, after the device is started, face recognition technology is used to identify the user's identity, video image data matching the user's identity is searched for, and the device is controlled to display the found video image data. In practical applications, when the device is a television, the time for which the user and the television remain in a relatively front view state, i.e. the time for which the user faces the TV screen, is timed; when the user has faced the television for longer than the preset time (for example, 2 seconds), the television is started, the user's identity is recognized, a television program preferred by the current user is searched for, and the television is controlled to switch to that program.
Specifically, in a practical application scenario, the above embodiment is "front view state" + time: when the user has been "facing" the television for a certain time, for example 2 seconds, it can be assumed that the user wants to watch a television program, and the television can start playing a program from standby; the television can also actively greet the user and start communicating. It can also be "front view state" + time + "face recognition": knowing who the user is, the television can play a program that this user likes; the television can also actively call out to the user and communicate with the user.
In one embodiment, the step of recognizing the user's behavior and intention through computer visual recognition technology and voice recognition technology and controlling the device, according to the preset correspondence between user behaviors and intentions and operations, to perform the operation corresponding to the user's current behavior and intention includes:
Step 1: performing speech recognition and lip language recognition on the user.
Step 2: when the speech recognition result and the lip recognition result are consistent, controlling the device to respond to the user's voice operation.
Lip language recognition is performed on the user who is in the "front view" state in front of the device, and at the same time speech recognition is performed on the detected voice information. The lip recognition result is compared with the speech recognition result; if the results are consistent, it can be determined that the user in the front view state is talking to the device (for example a television) and the device is controlled to respond accordingly, and if the results are inconsistent, the device does not respond.
The step of recognizing the user's behavior and intention through computer visual recognition technology and voice recognition technology and controlling the device, according to the preset correspondence between user behaviors and intentions and operations, to perform the operation corresponding to the user's current behavior and intention may also include:
Step 1: performing speech recognition and semantic understanding on the user.
Step 2: when the speech recognition result and the result of semantic understanding match the current scene of the device, controlling the device to respond to the user's voice operation.
In this embodiment, speech recognition and semantic understanding are also performed on the user in order to understand the user's intention; only when the speech recognition result and the result of semantic understanding match the device's current scene is the device controlled to respond to the user's voice operation. For example, while the user is watching TV, if what he says is "I am off tomorrow", this is obviously not an operation on the television and the television does not respond; if the user says "CCTV-1", then obviously the television should switch to CCTV-1.
In practical applications, taking a television as the device, speech recognition and lip language recognition are performed on user A: on the one hand the voice information uttered by user A is collected, and on the other hand, based on the front view state, lip language recognition is performed on user A. When the speech recognition result and the lip recognition result are consistent, it is determined that user A is interacting with the television, and the television is controlled to respond accordingly, for example by switching the television program or adjusting the TV volume.
In one embodiment, before the step of determining that the user and the device are in a relatively front view state when the currently collected image data is consistent with the front view image data, the method further comprises:
Step 1: when a user is detected, locating the position of the user's face as the sound source position;
Step 2: aiming the sound collection device at the sound source position.
The step of recognizing the user's behavior and intention through computer visual recognition technology and voice recognition technology and controlling the device, according to the preset correspondence between user behaviors and intentions and operations, to perform the operation corresponding to the user's current behavior and intention then includes:
collecting user voice data through the sound collection device, and, when the collected user voice data carries a voice operation instruction, extracting the voice operation instruction and controlling the device to perform the operation corresponding to the voice operation instruction.
When a user is detected, the position of the user's face is located as the sound source position and the sound collection device is aimed at that sound source position, ready to collect the user's voice data. Specifically, this process may detect the position of the user's face based on face detection and tracking technology and set that position as the sound source position. In subsequent operations, when it is determined that the current user and the device are in a relatively front view state, the user's voice data is collected and speech recognition is performed; when the collected user voice data carries a voice operation instruction, the voice operation instruction is extracted and the device is controlled to perform the operation corresponding to the voice operation instruction. In addition, the user may be detected by detection methods such as face detection, face tracking, and human body detection; when the face position is detected, the user's face position is set as the sound source position. In practical applications, the sound collection device may be an array microphone: the array microphone is aimed directly at the sound source position and collects the user's voice data, and when the collected user voice data carries a voice operation instruction (for example "next channel"), the voice operation instruction is extracted and the device is controlled to perform the corresponding operation. More specifically, in a practical application scenario in which several people are watching TV, all of them facing the television, and several of them speak at the same time, a future array microphone (which, like a radar, can track multiple targets) can record multiple sound sources. The number and positions of the users, that is, the number and positions of the target sound sources, are detected by means of face detection, the position information of the target sound sources is provided to the array microphone, and, combined with face-based identity recognition, the voices of several people can be collected at the same time and it can be distinguished who said what; when the voice data uttered by a user carries a "next channel" operation instruction, the television is controlled to switch to the next channel. In addition, face-based identity recognition may also be used to check the legitimacy of the user's identity, so that only voice data uttered by a legitimate user (one who holds control authority) is collected and subjected to subsequent operations.
According to the front view based human-computer interaction method of the present invention, the front view state is used as the "switch" for subsequent processing: only when it is determined that the user and the device are in a relatively front view state are subsequent operations carried out, including starting the recording, starting speech recognition, or acting on the speech recognition result.
In addition, in one embodiment, after the step of determining that the user and the device are in a relatively front view state when the currently collected image data is consistent with the front view image data, the method further comprises:
Step 1: receiving an operation instruction input by the user, the operation instruction including a non-front-view-state operation instruction and a front-view-state operation instruction.
Step 2: when it is detected that the user is no longer in the front view state, responding to the non-front-view-state operation instruction input by the user.
Step 3: when it is detected that the user enters the front view state again, responding to the front-view-state operation instruction input by the user.
In practical applications the television receives an operation instruction input by the user; specifically, the user may input the operation instruction through a remote control, by directly touching a button, or by tapping a touch display area provided on the television. The operation instruction is divided into a non-front-view-state operation instruction and a front-view-state operation instruction: when it is detected that the user is no longer in the front view state, the non-front-view-state operation instruction input by the user is responded to, and when it is detected that the user enters the front view state again, the front-view-state operation instruction input by the user is responded to. For example, through a voice instruction or in some other way, the television is put into a "record my back" state: when the person turns from facing the television to a side view, the television automatically turns on recording mode; the person turns around one full circle, and when facing the television again the recording stops, video playback mode is turned on, and the video just recorded is played.
In one embodiment, after the step of collecting the user's current image data in real time through the image acquisition device, the method further includes:
Step 1: acquiring image data of the user facing the device.
Step 2: comparing the image data of the user facing the device with the currently collected image data.
Step 3: when the image data of the user facing the device is consistent with the currently collected image data, starting the computer's visual recognition technology and voice recognition technology, and/or a preset operation.
Specifically, the preset corresponding computer visual recognition and voice recognition functions are started only when it is detected that the user is facing the device. Whether the user is facing the device can be detected by comparing the image data of the user facing the device with the currently collected image data: when they are consistent, it indicates that the current user is facing the device, and the computer's visual recognition and voice recognition functions (for example gesture recognition, face recognition, and speech recognition) are started; when they are inconsistent, it indicates that the current user is not yet facing the device, and these functions are not started. In practical applications, taking an air conditioner as the device, the camera collects the user's current image data in real time and the image data of the user facing the air conditioner is acquired; the image data of the user facing the air conditioner is compared with the currently collected image data, and when the two are consistent it indicates that the current user is facing the air conditioner, so speech recognition, face recognition, and gesture recognition are started: speech recognition is used to recognize the user's voice instructions, face recognition is used to recognize the user's identity, and gesture recognition is used to recognize the user's gesture instructions.
As shown in FIG. 3, a human-computer interaction system based on front view includes:
an obtaining module 200, configured to acquire front view image data, collected by an image acquisition device, of the user and the device in a relatively front view state;
a comparison module 400, configured to collect the user's current image data in real time through the image acquisition device and compare the currently collected image data with the front view image data;
a determining module 600, configured to determine that the user and the device are in a relatively front view state when the currently collected image data is consistent with the front view image data;
a control module 800, configured to recognize the user's behavior and intention through computer visual recognition technology and voice recognition technology when the user and the device are in a relatively front view state, and to control the device, according to the preset correspondence between user behaviors and intentions and operations, to perform the operation corresponding to the user's current behavior and intention, the computer visual recognition technology and voice recognition technology including face recognition, speech recognition, gesture recognition, lip recognition, voiceprint recognition, expression recognition, age recognition, card recognition, pupil recognition, and iris recognition.
In the front view based human-computer interaction system of the present invention, the obtaining module 200 acquires the front view image data, collected by the image acquisition device, of the user and the device in a relatively front view state; the comparison module 400 collects the user's current image data and compares the currently collected image data with the front view image data; when they are consistent, the determining module 600 determines that the user and the device are in a relatively front view state; and the control module 800 recognizes the user's behavior and intention through computer visual recognition technology and voice recognition technology and, according to the preset correspondence between user behaviors and intentions and operations, controls the device to perform the operation corresponding to the user's current behavior and intention. Throughout the process, the front view determination is made on the basis of the image data collected by the image acquisition device, and the front view state of the user and the device is used as the precondition for human-computer interaction, ensuring that the current user really has an interaction requirement and keeping the whole interaction process natural; in addition, a variety of recognition methods, including face recognition, speech recognition, gesture recognition, lip recognition, pupil recognition, and iris recognition, are used to recognize the user's next action, enabling many styles of human-computer interaction and bringing convenient operation to the user.
In one embodiment, the control module 800 includes:
a timing unit, configured to time how long the user and the device have been in a relatively front view state and, when that time is greater than the preset time, to recognize the user's behavior and intention through computer visual recognition technology and voice recognition technology and control the device, according to the preset correspondence between user behaviors and intentions and operations, to perform the operation corresponding to the user's current behavior and intention.
In one embodiment, the control module 800 further includes:
a search control unit, configured to search for preset video image data matching the user's identity and control the device to display the found video image data.
In one embodiment, the control module 800 includes:
a recognition unit, configured to perform speech recognition and lip language recognition on the user;
a control unit, configured to control the device to respond to the user's voice operation when the speech recognition result and the lip recognition result are consistent.
In one embodiment, the control module 800 includes:
a positioning unit, configured to locate the user's face position as the sound source position when a user is detected;
an adjusting unit, configured to aim the sound collection device at the sound source position and collect the user's voice data;
an extraction control unit, configured to extract the voice operation instruction when the collected user voice data carries a voice operation instruction, and to control the device to perform the operation corresponding to the voice operation instruction.
To explain the technical solution of the front view based human-computer interaction method and system of the present invention in further detail, several specific application examples simulating different practical scenarios are described below with reference to FIG. 4; in the following application examples the device is a television.
Front view image data of the user and the television in a relatively front view state, collected by the camera shown in FIG. 4, is acquired.
Current image data is collected in real time by the camera shown in FIG. 4, and the data collected in real time is compared with the front view image data of the user and the television in a relatively front view state.
When they are consistent, it is determined that the user and the television are in a relatively front view state.
Application example 1: front view state + time
When the user has faced the television for a certain time, for example 2 seconds, it can be assumed that the user wants to watch a television program; the television can start playing a program from standby, and can also actively greet the user and start communicating.
Application example 2: front view state + time + face recognition
Knowing who the user is, the television can play a program that this user likes; the television can also actively call out to the user and communicate with the user.
Application example 3: front view state + face-based identity recognition + expression recognition
Obviously, knowing who the user is and knowing his or her expression, the television can actively communicate with the user and even provide a corresponding service. If a child is crying in front of the television, the television can automatically place a video call to the mother; her video soon appears on the television so that the child can talk with her.
Application example 4: front view state + face recognition + speech recognition
When face recognition confirms that there is only one user on site, the television can regard the result of speech recognition as something that user said to the television, and respond and give feedback accordingly.
Application example 5: front view state + face recognition + lip recognition + speech recognition
When face recognition confirms that there are several users on site, it is judged whether each user is in the "front view" state, the lip movements of the "front view" user are detected and lip language recognition is performed on that user, and at the same time speech recognition is performed on the detected voice information. The lip recognition result is compared with the speech recognition result: if the results are consistent, it can be determined that the front view user is talking to the television and the television responds accordingly; if the results are inconsistent, the television does not respond.
Application example 6: front view state + array microphone + face recognition (or voiceprint recognition)
For example, when several people are watching TV, all of them are facing the television. If several people speak at the same time, a future array microphone (which, like a radar, can track multiple targets) can record multiple sound sources. Front view recognition can determine how many targets there are and provide the position information of the target sound sources to the array microphone; combined with face-based identity recognition, the voices of several people can be collected at the same time and it can be distinguished who said what.
Application example 7: application to an air conditioner
The user looks at the air conditioner, and the air-conditioning management system confirms through head pose estimation that the user is in the "front view" state. The air conditioner starts face recognition (knowing who the user is, it turns on and adjusts to the state this user likes), starts gesture recognition (so it can accept the user's gesture operations), and starts recording and speech recognition (so it can accept the user's voice instruction operations).
The above embodiments express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be understood as limiting the scope of the invention patent. It should be noted that, for a person of ordinary skill in the art, several variations and improvements can be made without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this invention patent shall be subject to the appended claims.

Claims (9)

  1. A human-computer interaction method based on front view, characterized by comprising the steps of:
    acquiring front view image data, collected by an image acquisition device, of a user and a device in a relatively front view state;
    collecting current image data of the user in real time through the image acquisition device, and comparing the currently collected image data with the front view image data;
    when the currently collected image data is consistent with the front view image data, determining that the user and the device are in a relatively front view state;
    when the user and the device are in a relatively front view state, recognizing the user's behavior and intention through computer visual recognition technology and voice recognition technology, and controlling the device, according to a preset correspondence between user behaviors and intentions and operations, to perform the operation corresponding to the user's current behavior and intention, the computer visual recognition technology and voice recognition technology including face recognition, speech recognition, semantic understanding, gesture recognition, lip recognition, voiceprint recognition, expression recognition, age recognition, card recognition, face tracking, pupil recognition, and iris recognition;
    wherein, before the step of determining that the user and the device are in a relatively front view state when the currently collected image data is consistent with the front view image data, the method further comprises:
    when a user is detected, locating the position of the user's face as a sound source position;
    aiming a sound collection device at the sound source position;
    and the step of recognizing the user's behavior and intention through computer visual recognition technology and voice recognition technology and controlling the device, according to the preset correspondence between user behaviors and intentions and operations, to perform the operation corresponding to the user's current behavior and intention comprises:
    collecting user voice data through the sound collection device, and, when the collected user voice data carries a voice operation instruction, extracting the voice operation instruction and controlling the device to perform the operation corresponding to the voice operation instruction.
  2. The front view based human-computer interaction method according to claim 1, characterized in that the step of recognizing the user's behavior and intention through computer visual recognition technology and voice recognition technology and controlling the device, according to the preset correspondence between user behaviors and intentions and operations, to perform the operation corresponding to the user's current behavior and intention comprises:
    timing how long the user and the device have been in a relatively front view state;
    when the time for which the user and the device have been in a relatively front view state is greater than a preset time, recognizing the user's behavior and intention through computer visual recognition technology and voice recognition technology, and controlling the device, according to the preset correspondence between user behaviors and intentions and operations, to perform the operation corresponding to the user's current behavior and intention.
  3. The front view based human-computer interaction method according to claim 2, characterized in that, after the step of recognizing the user's behavior and intention through computer visual recognition technology and voice recognition technology when the time for which the user and the device have been in a relatively front view state is greater than the preset time, and controlling the device, according to the preset correspondence between user behaviors and intentions and operations, to perform the operation corresponding to the user's current behavior and intention, the method further comprises:
    searching for preset video image data matching the user's identity, and controlling the device to display the found video image data.
  4. The front view based human-computer interaction method according to claim 1, characterized in that the step of recognizing the user's behavior and intention through computer visual recognition technology and voice recognition technology and controlling the device, according to the preset correspondence between user behaviors and intentions and operations, to perform the operation corresponding to the user's current behavior and intention comprises:
    performing speech recognition and lip language recognition on the user;
    when the speech recognition result and the lip recognition result are consistent, controlling the device to respond to the user's voice operation.
  5. The front view based human-computer interaction method according to claim 1, characterized in that the step of recognizing the user's behavior and intention through computer visual recognition technology and voice recognition technology and controlling the device, according to the preset correspondence between user behaviors and intentions and operations, to perform the operation corresponding to the user's current behavior and intention comprises:
    performing speech recognition and semantic understanding on the user;
    when the speech recognition result and the result of semantic understanding match the current scene of the device, controlling the device to respond to the user's voice operation.
  6. The front view based human-computer interaction method according to claim 1, characterized in that, after the step of determining that the user and the device are in a relatively front view state when the currently collected image data is consistent with the front view image data, the method further comprises:
    receiving an operation instruction input by the user, the operation instruction including a non-front-view-state operation instruction and a front-view-state operation instruction;
    when it is detected that the user is no longer in the front view state, responding to the non-front-view-state operation instruction input by the user;
    when it is detected that the user enters the front view state again, responding to the front-view-state operation instruction input by the user.
  7. The front view based human-computer interaction method according to claim 1, characterized in that, after the step of collecting the user's current image data in real time through the image acquisition device, the method further comprises:
    acquiring image data of the user facing the device;
    comparing the image data of the user facing the device with the currently collected image data;
    when the image data of the user facing the device is consistent with the currently collected image data, starting the computer's visual recognition technology and voice recognition technology, and/or a preset operation, the preset operation including recording and playing video.
  8. A human-computer interaction system based on front view, characterized by comprising:
    an acquiring module, configured to acquire front view image data, collected by an image acquisition device, of a user and a device in a relatively front view state;
    a comparison module, configured to collect current image data of the user in real time through the image acquisition device and compare the currently collected image data with the front view image data;
    a determining module, configured to determine that the user and the device are in a relatively front view state when the currently collected image data is consistent with the front view image data;
    a control module, configured to recognize the user's behavior and intention through computer visual recognition technology and voice recognition technology when the user and the device are in a relatively front view state, and to control the device, according to a preset correspondence between user behaviors and intentions and operations, to perform the operation corresponding to the user's current behavior and intention, the computer visual recognition technology and voice recognition technology including face recognition, speech recognition, semantic understanding, gesture recognition, lip recognition, voiceprint recognition, expression recognition, age recognition, card recognition, face tracking, pupil recognition, and iris recognition;
    wherein the control module comprises:
    a positioning unit, configured to locate the user's face position as a sound source position when a user is detected;
    an adjusting unit, configured to aim a sound collection device at the sound source position and collect the user's voice data;
    an extraction control unit, configured to extract the voice operation instruction when the collected user voice data carries a voice operation instruction, and to control the device to perform the operation corresponding to the voice operation instruction.
  9. The front view based human-computer interaction system according to claim 8, characterized in that the control module comprises:
    a recognition unit, configured to perform speech recognition and lip language recognition on the user;
    a control unit, configured, when the speech recognition result and the lip recognition result are consistent, to recognize the user's behavior and intention through computer visual recognition technology and voice recognition technology and to control the device, according to the preset correspondence between user behaviors and intentions and operations, to perform the operation corresponding to the user's current behavior and intention.
PCT/CN2018/086805 2017-05-18 2018-05-15 Human-computer interaction method and system based on front view WO2018210219A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP18803148.8A EP3627290A4 (en) 2017-05-18 2018-05-15 DEVICE SIDE HUMAN COMPUTER INTERACTION METHOD AND SYSTEM
US16/614,694 US11163356B2 (en) 2017-05-18 2018-05-15 Device-facing human-computer interaction method and system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710354064.5A CN107239139B (zh) 2017-05-18 2017-05-18 基于正视的人机交互方法与系统
CN201710354064.5 2017-05-18

Publications (1)

Publication Number Publication Date
WO2018210219A1 true WO2018210219A1 (zh) 2018-11-22

Family

ID=59984389

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/086805 WO2018210219A1 (zh) 2017-05-18 2018-05-15 基于正视的人机交互方法与系统

Country Status (4)

Country Link
US (1) US11163356B2 (zh)
EP (1) EP3627290A4 (zh)
CN (1) CN107239139B (zh)
WO (1) WO2018210219A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109920436A (zh) * 2019-01-28 2019-06-21 武汉恩特拉信息技术有限公司 一种提供辅助服务的装置及方法
CN110767226A (zh) * 2019-10-30 2020-02-07 山西见声科技有限公司 具有高准确度的声源定位方法、装置、语音识别方法、系统、存储设备及终端

Families Citing this family (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239139B (zh) * 2017-05-18 2018-03-16 刘国华 基于正视的人机交互方法与系统
CN109754814B (zh) * 2017-11-08 2023-07-28 阿里巴巴集团控股有限公司 一种声音处理方法、交互设备
CN108052079B (zh) * 2017-12-12 2021-01-15 北京小米移动软件有限公司 设备控制方法、装置、设备控制装置及存储介质
CN109976506B (zh) * 2017-12-28 2022-06-24 深圳市优必选科技有限公司 一种电子设备的唤醒方法、存储介质及机器人
CN108363557B (zh) * 2018-02-02 2020-06-12 刘国华 人机交互方法、装置、计算机设备和存储介质
CN108509890B (zh) * 2018-03-27 2022-08-16 百度在线网络技术(北京)有限公司 用于提取信息的方法和装置
CN108428453A (zh) * 2018-03-27 2018-08-21 王凯 一种基于唇语识别的智能终端操控系统
CN111713087A (zh) 2018-03-29 2020-09-25 华为技术有限公司 一种设备间的数据迁移的方法和设备
US20190332848A1 (en) * 2018-04-27 2019-10-31 Honeywell International Inc. Facial enrollment and recognition system
CN108632373B (zh) * 2018-05-09 2021-11-30 方超 设备控制方法和系统
CN108897589B (zh) * 2018-05-31 2020-10-27 刘国华 显示设备中人机交互方法、装置、计算机设备和存储介质
CN109032345B (zh) * 2018-07-04 2022-11-29 百度在线网络技术(北京)有限公司 设备控制方法、装置、设备、服务端和存储介质
CN110857067B (zh) * 2018-08-24 2023-04-07 上海汽车集团股份有限公司 一种人车交互装置和人车交互方法
CN109410957B (zh) * 2018-11-30 2023-05-23 福建实达电脑设备有限公司 基于计算机视觉辅助的正面人机交互语音识别方法及系统
CN109815804A (zh) * 2018-12-19 2019-05-28 平安普惠企业管理有限公司 基于人工智能的交互方法、装置、计算机设备及存储介质
CN109977811A (zh) * 2019-03-12 2019-07-05 四川长虹电器股份有限公司 基于嘴部关键位置特征检测实现免语音唤醒的系统及方法
CN110221693A (zh) * 2019-05-23 2019-09-10 南京双路智能科技有限公司 一种基于人机交互的智能零售终端操作系统
CN110196642B (zh) * 2019-06-21 2022-05-17 济南大学 一种基于意图理解模型的导航式虚拟显微镜
CN110288016B (zh) * 2019-06-21 2021-09-28 济南大学 一种多模态意图融合方法及应用
CN110266806A (zh) * 2019-06-28 2019-09-20 北京金山安全软件有限公司 内容推送方法、装置及电子设备
KR20210035968A (ko) * 2019-09-24 2021-04-02 엘지전자 주식회사 사용자의 표정이나 발화를 고려하여 마사지 동작을 제어하는 인공 지능 마사지 장치 및 그 방법
CN110689889B (zh) * 2019-10-11 2021-08-17 深圳追一科技有限公司 人机交互方法、装置、电子设备及存储介质
CN111145739A (zh) * 2019-12-12 2020-05-12 珠海格力电器股份有限公司 一种基于视觉的免唤醒语音识别方法、计算机可读存储介质及空调
CN111128157B (zh) * 2019-12-12 2022-05-27 珠海格力电器股份有限公司 一种智能家电的免唤醒语音识别控制方法、计算机可读存储介质及空调
CN111541951B (zh) * 2020-05-08 2021-11-02 腾讯科技(深圳)有限公司 基于视频的交互处理方法、装置、终端及可读存储介质
CN111625094B (zh) * 2020-05-25 2023-07-14 阿波罗智联(北京)科技有限公司 智能后视镜的交互方法、装置、电子设备和存储介质
CN112381001A (zh) * 2020-11-16 2021-02-19 四川长虹电器股份有限公司 基于专注度的智能电视用户识别方法及装置
CN113221699B (zh) * 2021-04-30 2023-09-08 杭州海康威视数字技术股份有限公司 一种提高识别安全性的方法、装置、识别设备
CN113485617B (zh) * 2021-07-02 2024-05-03 广州博冠信息科技有限公司 动画展示方法、装置、电子设备及存储介质
CN114035689A (zh) * 2021-11-26 2022-02-11 朱芳程 一种基于人工智能的可追随飞行人机交互系统和方法
CN114265499A (zh) * 2021-12-17 2022-04-01 交控科技股份有限公司 应用于客服终端的交互方法和系统
CN116434027A (zh) * 2023-06-12 2023-07-14 深圳星寻科技有限公司 一种基于图像识别人工智能交互系统

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1215658A2 (en) * 2000-12-05 2002-06-19 Hewlett-Packard Company Visual activation of voice controlled apparatus
CN102324035A (zh) * 2011-08-19 2012-01-18 广东好帮手电子科技股份有限公司 口型辅助语音识别术在车载导航中应用的方法及系统
CN103413467A (zh) * 2013-08-01 2013-11-27 袁苗达 可控强制引导型自主学习系统
CN105700683A (zh) * 2016-01-12 2016-06-22 厦门施米德智能科技有限公司 一种智能窗及其控制方法
CN106125771A (zh) * 2016-08-16 2016-11-16 江西联创宏声电子有限公司 声频定向扬声器及其转向方法
CN107239139A (zh) * 2017-05-18 2017-10-10 刘国华 基于正视的人机交互方法与系统

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7382405B2 (en) * 2001-12-03 2008-06-03 Nikon Corporation Electronic apparatus having a user identification function and user identification method
CN100343867C (zh) * 2005-06-15 2007-10-17 北京中星微电子有限公司 一种判别视线方向的方法和装置
US20100217590A1 (en) * 2009-02-24 2010-08-26 Broadcom Corporation Speaker localization system and method
KR101092820B1 (ko) * 2009-09-22 2011-12-12 현대자동차주식회사 립리딩과 음성 인식 통합 멀티모달 인터페이스 시스템
US9823742B2 (en) 2012-05-18 2017-11-21 Microsoft Technology Licensing, Llc Interaction and management of devices using gaze detection
US8965170B1 (en) * 2012-09-04 2015-02-24 Google Inc. Automatic transition of content based on facial recognition
US8970656B2 (en) * 2012-12-20 2015-03-03 Verizon Patent And Licensing Inc. Static and dynamic video calling avatars
JP2014153663A (ja) * 2013-02-13 2014-08-25 Sony Corp 音声認識装置、および音声認識方法、並びにプログラム
US9384751B2 (en) 2013-05-06 2016-07-05 Honeywell International Inc. User authentication of voice controlled devices
CN105183169B (zh) * 2015-09-22 2018-09-25 小米科技有限责任公司 视线方向识别方法及装置
US10048765B2 (en) * 2015-09-25 2018-08-14 Apple Inc. Multi media computing or entertainment system for responding to user presence and activity
CN106356057A (zh) * 2016-08-24 2017-01-25 安徽咪鼠科技有限公司 一种基于计算机应用场景语义理解的语音识别系统
US10467509B2 (en) * 2017-02-14 2019-11-05 Microsoft Technology Licensing, Llc Computationally-efficient human-identifying smart assistant computer
US11145299B2 (en) * 2018-04-19 2021-10-12 X Development Llc Managing voice interface devices
US11152001B2 (en) * 2018-12-20 2021-10-19 Synaptics Incorporated Vision-based presence-aware voice-enabled device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1215658A2 (en) * 2000-12-05 2002-06-19 Hewlett-Packard Company Visual activation of voice controlled apparatus
CN102324035A (zh) * 2011-08-19 2012-01-18 广东好帮手电子科技股份有限公司 口型辅助语音识别术在车载导航中应用的方法及系统
CN103413467A (zh) * 2013-08-01 2013-11-27 袁苗达 可控强制引导型自主学习系统
CN105700683A (zh) * 2016-01-12 2016-06-22 厦门施米德智能科技有限公司 一种智能窗及其控制方法
CN106125771A (zh) * 2016-08-16 2016-11-16 江西联创宏声电子有限公司 声频定向扬声器及其转向方法
CN107239139A (zh) * 2017-05-18 2017-10-10 刘国华 基于正视的人机交互方法与系统

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3627290A4 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109920436A (zh) * 2019-01-28 2019-06-21 武汉恩特拉信息技术有限公司 一种提供辅助服务的装置及方法
CN110767226A (zh) * 2019-10-30 2020-02-07 山西见声科技有限公司 具有高准确度的声源定位方法、装置、语音识别方法、系统、存储设备及终端
CN110767226B (zh) * 2019-10-30 2022-08-16 山西见声科技有限公司 具有高准确度的声源定位方法、装置、语音识别方法、系统、存储设备及终端

Also Published As

Publication number Publication date
EP3627290A1 (en) 2020-03-25
US11163356B2 (en) 2021-11-02
EP3627290A4 (en) 2021-03-03
US20200209950A1 (en) 2020-07-02
CN107239139A (zh) 2017-10-10
CN107239139B (zh) 2018-03-16

Similar Documents

Publication Publication Date Title
WO2018210219A1 (zh) 基于正视的人机交互方法与系统
WO2019149160A1 (zh) 人机交互方法、装置、计算机设备和存储介质
US11580983B2 (en) Sign language information processing method and apparatus, electronic device and readable storage medium
JP6428954B2 (ja) 情報処理装置、情報処理方法およびプログラム
CN108052079B (zh) 设备控制方法、装置、设备控制装置及存储介质
WO2021135685A1 (zh) 身份认证的方法以及装置
CN110730115B (zh) 语音控制方法及装置、终端、存储介质
WO2015154419A1 (zh) 一种人机交互装置及方法
CN104092936A (zh) 自动对焦方法及装置
CN110335600A (zh) 家电设备的多模态交互方法及系统
KR20120072244A (ko) 디바이스 제어를 위한 제스처/음향 융합 인식 시스템 및 방법
CN108766438A (zh) 人机交互方法、装置、存储介质及智能终端
TW200809768A (en) Method of driving a speech recognition system
WO2020079941A1 (ja) 情報処理装置及び情報処理方法、並びにコンピュータプログラム
WO2021017096A1 (zh) 一种将人脸信息录入数据库的方法和装置
CN111583937A (zh) 一种语音控制唤醒方法及存储介质、处理器、语音设备、智能家电
JPH11249773A (ja) マルチモーダルインタフェース装置およびマルチモーダルインタフェース方法
CN104423992A (zh) 显示器语音辨识的启动方法
Thermos et al. Audio-visual speech activity detection in a two-speaker scenario incorporating depth information from a profile or frontal view
EP4276818A1 (en) Speech operation method for device, apparatus, and electronic device
US10838741B2 (en) Information processing device, information processing method, and program
CN115604513A (zh) 一种系统模式切换方法、电子设备及计算机可读存储介质
CN115691498A (zh) 语音交互方法、电子设备及介质
JP2021145198A (ja) アイコンタクトによる機器操作システム
CN111739528A (zh) 一种交互方法、装置和耳机

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18803148

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2018803148

Country of ref document: EP

Effective date: 20191217