WO2020192247A1 - Human-computer interaction method and system, medium, and computer system - Google Patents

Human-computer interaction method and system, medium, and computer system

Info

Publication number
WO2020192247A1
WO2020192247A1 (PCT/CN2020/071188)
Authority
WO
WIPO (PCT)
Prior art keywords
user
semantics
current context
human
image information
Prior art date
Application number
PCT/CN2020/071188
Other languages
English (en)
French (fr)
Inventor
苏晓文
Original Assignee
北京京东尚科信息技术有限公司
科大讯飞股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京京东尚科信息技术有限公司, 科大讯飞股份有限公司 filed Critical 北京京东尚科信息技术有限公司
Publication of WO2020192247A1

Classifications

    • G06F 3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011: Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F 16/33: Querying (information retrieval of unstructured textual data)
    • G06F 16/332: Query formulation
    • H04N 21/234: Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs

Definitions

  • the present disclosure relates to the field of computer technology, and more specifically, to a human-computer interaction method, a human-computer interaction system, a computer system, and a computer-readable storage medium.
  • the present disclosure provides a human-computer interaction method and a human-computer interaction system that enable smart devices both to "hear" human voices and to "see" human appearances during human-computer interaction.
  • One aspect of the present disclosure provides a human-computer interaction method, including: obtaining the user's image information and voice information during the human-computer interaction process; determining the current context based on the image information; perceiving, in the current context, the semantics that the user actually wants to express through the voice information; and responding to the user based on the perceived semantics.
  • the foregoing determining of the current context based on the image information includes: performing face recognition on the user based on the image information to determine the user's current facial expression, and/or performing body-movement recognition on the user to determine the user's current state; and determining the current context based on the user's current expression and/or current state.
  • perceiving the semantics that the user actually wants to express through the voice information includes: determining at least one semantics that the voice information can express; judging whether, among the at least one semantics, there are one or more semantics matching the current context; and, if so, performing one of the following operations: taking at least one of the one or more matching semantics as the semantics that the user actually wants to express through the voice information; taking any one of the one or more matching semantics as that semantics; or taking all of the one or more matching semantics as that semantics.
  • the above method further includes, during the human-computer interaction process: judging whether the user's image information can be obtained; and, if it cannot be obtained, reminding the user to adjust his or her pose.
  • the above method further includes, during the human-computer interaction process: judging whether an obstruction partially or completely blocks the user; and, if so, adjusting the image acquisition device so that it avoids the obstruction and collects the user's image information.
  • Another aspect of the present disclosure provides a human-computer interaction system, including: an acquisition module for acquiring the user's image information and voice information during the human-computer interaction process; a determining module for determining the current context based on the image information; a perception module for perceiving, in the current context, the semantics that the user actually wants to express through the voice information; and a response module for responding to the user based on the perceived semantics.
  • the above-mentioned determining module includes: an identification unit, configured to perform face recognition on the user based on the image information to determine the user's current expression, and/or perform body-movement recognition on the user to determine the user's current state; and a first determining unit, configured to determine the current context based on the user's current expression and/or current state.
  • the above-mentioned perception module includes: a second determining unit, configured to determine at least one semantics that the voice information can express; a judging unit, configured to judge whether, among the at least one semantics, there are one or more semantics matching the current context; and an execution unit, configured to perform, when such matching semantics exist, one of the following operations: take at least one of the one or more matching semantics as the semantics that the user actually wants to express through the voice information; take any one of the one or more matching semantics as that semantics; or take all of the one or more matching semantics as that semantics.
  • the above-mentioned system further includes: a first judgment module for judging whether the user's image information can be obtained during the human-computer interaction process; and a reminding module for reminding the user to adjust his or her pose when the image information cannot be obtained.
  • the above-mentioned system further includes: a second judgment module for judging whether an obstruction partially or completely blocks the user during the human-computer interaction; and an adjustment module for adjusting the image capture device, when such an obstruction exists, so that it collects the user's image information without being blocked by the obstruction.
  • Another aspect of the present disclosure provides a computer-readable storage medium that stores computer-executable instructions, and the instructions are used to implement the method described in any one of the above when executed.
  • Another aspect of the present disclosure provides a computer program, where the computer program includes computer-executable instructions which, when executed, are used to implement the method described in any one of the above.
  • Another aspect of the present disclosure provides a computer system, including: one or more processors; and a memory for storing one or more programs, where the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described in any one of the above.
  • Through the embodiments of the present disclosure, because the smart device in the human-computer interaction scenario is used to acquire both the voice and the image of its communication object, it at least partially overcomes the technical problem that smart devices in the related art, during interaction and especially human-computer dialogue, can only "hear" human voices but cannot "see" human appearances and therefore have weak semantic understanding ability, thereby achieving the technical effect of enhancing the semantic understanding ability of smart devices.
  • Fig. 1 schematically shows an exemplary system architecture to which the human-computer interaction method and system of the present disclosure can be applied;
  • Fig. 2 schematically shows an application scenario of a human-computer interaction method and system according to an embodiment of the present disclosure
  • Fig. 3 schematically shows a flowchart of a human-computer interaction method according to an embodiment of the present disclosure;
  • Fig. 5 schematically shows a block diagram of a human-computer interaction system according to an embodiment of the present disclosure
  • Fig. 6 schematically shows a block diagram of a determining module according to an embodiment of the present disclosure
  • Fig. 7 schematically shows a block diagram of a perception module according to an embodiment of the present disclosure.
  • Fig. 8 schematically shows a block diagram of a computer system suitable for implementing the human-computer interaction method and system according to an embodiment of the present disclosure.
  • "A system having at least one of A, B, and C" shall include, but not be limited to, systems having A alone, B alone, C alone, A and B, A and C, B and C, and/or A, B, and C together.
  • the embodiments of the present disclosure provide a human-computer interaction method and a human-computer interaction system that enable smart devices to "hear" human voices and "see” human appearances during human-computer interaction.
  • the method includes: obtaining the user's image information and voice information during the human-computer interaction process; determining the current context based on the image information; perceiving, in the current context, the semantics that the user actually wants to express through the voice information; and responding to the user based on the perceived semantics.
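  • The four operations above (numbered S310 to S340 in the flowchart of Fig. 3) can be sketched as a minimal pipeline. This is an illustrative assumption, not the disclosed implementation: all function names, the expression-to-context lookup table, and the candidate-semantics format are hypothetical.

```python
def determine_context(image_info):
    """S320: map a recognized expression to a context label (assumption: a
    simple lookup; the real system would use face/body recognition)."""
    expression = image_info.get("expression")
    return {"smile": "affirmative", "frown": "negative"}.get(expression, "neutral")

def perceive_semantics(voice_info, context):
    """S330: keep the candidate semantics whose context tag matches the
    current context; fall back to the first candidate if none match."""
    candidates = voice_info["candidates"]  # [(semantics, context_tag), ...]
    matched = [s for s, tag in candidates if tag == context]
    return matched if matched else [candidates[0][0]]

def respond(semantics):
    """S340: produce a reply from the perceived semantics."""
    return "OK: " + ", ".join(semantics)

def interact(image_info, voice_info):
    context = determine_context(image_info)              # S320
    semantics = perceive_semantics(voice_info, context)  # S330
    return respond(semantics)                            # S340

reply = interact(
    {"expression": "smile"},
    {"candidates": [("confirm order", "affirmative"), ("cancel order", "negative")]},
)
```

  • In a real system, `determine_context` would be backed by the face and body-movement recognition described below, and `perceive_semantics` by speech recognition and natural language processing.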
  • Fig. 1 schematically shows an exemplary system architecture to which the human-computer interaction method and system of the present disclosure can be applied. It should be noted that Fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure can be applied, intended to help those skilled in the art understand the technical content of the present disclosure; it does not mean that the embodiments of the present disclosure cannot be used in other devices, systems, environments, or scenarios.
  • the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105.
  • the network 104 is used to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105.
  • the network 104 may include various connection types, such as wired and/or wireless communication links, and so on.
  • the user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and so on.
  • Various communication client applications may be installed on the terminal devices 101, 102, 103, such as shopping applications, web browser applications, search applications, instant messaging tools, email clients, and/or social platform software (examples only).
  • the terminal devices 101, 102, 103 may be various electronic devices with a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and so on.
  • the server 105 may be a server that provides various services, for example, a background management server (just an example) that provides support for websites browsed by users using the terminal devices 101, 102, and 103.
  • the background management server may analyze and process the received user request and other data, and feed back the processing result (for example, webpage, information, or data obtained or generated according to the user request) to the terminal device.
  • the human-computer interaction method provided in the embodiments of the present disclosure may also be executed by the terminal device 101, 102, or 103, or may also be executed by another terminal device different from the terminal device 101, 102, or 103.
  • the human-computer interaction system provided by the embodiments of the present disclosure may also be set in the terminal device 101, 102, or 103, or set in another terminal device different from the terminal device 101, 102, or 103.
  • It should be understood that the numbers of terminal devices, networks, and servers in Fig. 1 are merely illustrative. According to implementation needs, there can be any number of terminal devices, networks, and servers.
  • Fig. 2 schematically shows an application scenario of the human-computer interaction method and system according to the embodiments of the present disclosure.
  • the user can use the smart speaker 201 for human-computer interaction.
  • the smart speaker 201 can not only recognize the user's voice but also capture the user's image, so that it can both hear the user's voice and see the user's appearance. It is no longer like a smart speaker in the prior art, which, like a blind person, can only hear the user's voice but cannot see the user's appearance.
  • It should be noted that Fig. 2 is only an example of an application scenario to which the embodiments of the present disclosure can be adapted, intended to help those skilled in the art understand the technical content of the present disclosure; it does not mean that the embodiments of the present disclosure cannot be used in other devices, systems, environments, or scenarios.
  • Fig. 3 schematically shows a flowchart of a human-computer interaction method according to an embodiment of the present disclosure.
  • the method includes operations S310 to S340, where:
  • smart devices and users serve as communication objects with each other.
  • the smart device can see the user's appearance because it can obtain the user's image information, and at the same time the smart device can also hear the user's voice because it can obtain the user's voice information.
  • the smart device includes a microphone array through which it collects the user's voice information. More specifically, when collecting the user's voice information, the smart device can accurately locate the direction from which the user is speaking, thereby enhancing the beam energy in that direction and improving recognition accuracy.
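  • The direction-dependent beam enhancement mentioned here is, in its simplest form, delay-and-sum beamforming: each microphone channel is advanced by its steering delay so that sound from the target direction adds coherently. The sketch below is a hedged illustration with integer-sample delays and circular indexing, not the patent's actual signal chain.

```python
def delay_and_sum(channels, delays):
    """Advance each channel by its steering delay (in samples) and average.
    Sound arriving from the steered direction adds in phase, boosting its
    beam energy relative to other directions. Circular indexing is used
    purely to keep this toy example exact at the edges."""
    n = len(channels[0])
    out = []
    for i in range(n):
        acc = 0.0
        for ch, d in zip(channels, delays):
            acc += ch[(i + d) % n]
        out.append(acc / len(channels))
    return out

# Two-microphone toy case: the second mic hears the same signal 2 samples later.
s0 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
s1 = s0[-2:] + s0[:-2]              # delayed copy of s0
beam = delay_and_sum([s0, s1], [0, 2])  # steer toward the source: recovers s0
```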
  • the smart device can collect the user's image information through an image acquisition device such as a camera.
  • the image acquisition device can be installed on a smart device as a component of the smart device, or can be installed outside the smart device as an independent device.
  • the current context is determined according to the image information.
  • Human facial expressions and body language also convey certain information and render a corresponding context. Therefore, when strengthening a smart device's understanding of human voice language, analyzing human facial expressions and/or body language may be considered.
  • face recognition may be performed based on the acquired image information, and/or the body movements of the communication object may be analyzed based on the acquired image information. More specifically, in the recognition area, the smart device can collect face data through the camera, including the orientation and contour of the face and the contours of the eyes, eyebrows, lips, and nose, so as to analyze the current user's (that is, the communication object's) emotions, expressions, and other information and thereby determine the corresponding context. For example, pleasure generally means affirmation and agreement, while helplessness and frustration generally mean negation, and so on.
  • determining the current context based on the image information includes: performing face recognition on the user based on the image information to determine the user's current facial expression, and/or performing body-movement recognition on the user based on the image information to determine the user's current state; and determining the current context based on the user's current facial expression and/or current state.
  • the image recognition system performs face recognition, that is, analyzes facial attributes to determine the angle and expression of the face, and/or recognizes body movements to estimate the user's current state.
  • a state information table covering the user and the user's communication object (such as the smart device) is generated, and a threshold corresponding to each context is then assigned based on the user's facial expression and/or current state.
  • the voice data collected by the microphone array is subjected to beam analysis and natural language processing to generate the user's language information; processing results are then produced according to the different preset contexts, and the corresponding semantic thresholds are assigned.
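  • One way to read the "semantic thresholds" step: each candidate semantics carries a per-context score, and only candidates whose score for the current context reaches that context's threshold survive. The data layout and names below are assumptions, not the patent's format.

```python
def semantics_above_threshold(candidates, context, thresholds, default=0.5):
    """candidates: {semantics: {context_label: score}}.
    Keep the semantics whose score for `context` meets that context's
    threshold (falling back to `default` for unlisted contexts)."""
    t = thresholds.get(context, default)
    return [s for s, scores in candidates.items() if scores.get(context, 0.0) >= t]

candidates = {
    "confirm order": {"affirmative": 0.9, "negative": 0.1},
    "cancel order": {"affirmative": 0.2, "negative": 0.8},
}
survivors = semantics_above_threshold(candidates, "affirmative", {"affirmative": 0.6})
```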
  • the system combines face recognition and speech recognition, matches the context, selects a reasonable semantic analysis result, and interacts with the user.
  • visual interaction is added on the basis of auditory interaction: image recognition technology is used to obtain the user's current expression and/or state, which is combined with audio analysis to give more reasonable semantic recognition results, thereby achieving stronger semantic understanding ability.
  • perceiving the semantics that the user actually wants to express through the voice information includes: determining at least one semantics that the voice information can express; judging whether, among the at least one semantics, there are one or more semantics matching the current context; and, if so, performing one of the following operations: taking at least one of the one or more matching semantics as the semantics that the user actually wants to express through the voice information; taking any one of the one or more matching semantics as that semantics; or taking all of the one or more matching semantics as that semantics.
  • one context may match one or more semantics at the same time, in which case there are several possible operations: as shown in Fig. 4A, at least one of the matched semantics can be selected as the semantics the user most wants to express; or, as shown in Fig. 4B, one of the matched semantics can be chosen, such as the one with the highest matching degree; or, as shown in Fig. 4C, all of the matched semantics can be taken as the semantics the user most wants to express.
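  • The three operations of Figs. 4A to 4C can be summarized as selection strategies over the matched list, assumed here to be sorted by descending match degree; the strategy names are illustrative, not from the disclosure.

```python
def select_semantics(matched, strategy="all"):
    """matched: semantics sorted by descending match degree (assumption)."""
    if not matched:
        return []
    if strategy == "some":   # Fig. 4A: at least one of the matches (here, the first)
        return matched[:1]
    if strategy == "one":    # Fig. 4B: any single match, e.g. the highest-scoring one
        return [matched[0]]
    return list(matched)     # Fig. 4C: all of the matches
```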
  • the above method further includes, during the human-computer interaction process: judging whether the user's image information can be obtained; and, if it cannot be obtained, reminding the user to adjust his or her posture.
  • Since image collection is involved, the smart device is preferably placed, during use, in a position with a wide field of view and no obvious obstruction, so as to better collect the user's image information. In addition, reducing occlusion allows the user's speaking direction to be located more accurately, strengthening the beam energy in that direction and improving speech recognition.
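  • The "judge, then remind" step above might be reduced to a check on whether any face was detected in the current frame. The face-box representation and the reminder text are assumptions for illustration.

```python
def visibility_reminder(detected_faces):
    """detected_faces: list of face bounding boxes from the camera frame,
    or None/empty when no frame or no user is visible. Returns a reminder
    string when the user should adjust position, else None."""
    if not detected_faces:
        return "Please adjust your position so the camera can see you."
    return None
```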
  • the smart device can see the appearance of the communication object during the interaction process.
  • the above method further includes, during the human-computer interaction process: judging whether an obstruction partially or completely blocks the user; and, if so, adjusting the image acquisition device so that it avoids the obstruction and collects the user's image information.
  • Occlusion can also be resolved by adjusting the image acquisition device. Specifically, the shooting angle of the image acquisition device or the telescopic state of the camera can be adjusted so that it avoids or bypasses the obstruction and collects the user's image information.
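  • Adjusting the shooting angle to dodge an obstruction could be as simple as scanning a list of candidate pan offsets and taking the first unobstructed one. The candidate angles and the `occluded` set are hypothetical; a real device would drive this from its occlusion detector.

```python
def find_clear_pan(occluded, candidates=(0, -15, 15, -30, 30)):
    """Return the first pan offset (degrees) whose view is not in the set
    of occluded angles, or None if every candidate angle is blocked."""
    for angle in candidates:
        if angle not in occluded:
            return angle
    return None
```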
  • Through the above embodiments, the smart device can see, and see clearly, the appearance of its communication object during the interaction process.
  • Fig. 5 schematically shows a block diagram of a human-computer interaction system according to an embodiment of the present disclosure.
  • the human-computer interaction system 500 includes an acquisition module 510, a determination module 520, a perception module 530, and a response module 540, where:
  • the obtaining module 510 is used to obtain the user's image information and voice information during the human-computer interaction process.
  • the determining module 520 is used to determine the current context according to the image information.
  • the perception module 530 is used to perceive the semantics that the user actually wants to express through the voice information in the current context.
  • the response module 540 is used to respond to the user based on the perceived semantics.
  • the determining module 520 includes an identifying unit 521 and a first determining unit 522.
  • the recognition unit 521 is configured to perform face recognition on the user according to the image information to determine the user's current facial expression, and/or perform body-movement recognition on the user to determine the user's current state; the first determining unit 522 is configured to determine the current context based on the user's current expression and/or current state.
  • visual interaction is added on the basis of auditory interaction: image recognition technology is used to obtain the user's current expression and/or state, which is combined with audio analysis to give more reasonable semantic recognition results, thereby achieving stronger semantic understanding ability.
  • the perception module 530 includes a second determination unit 531, a judgment unit 532 and an execution unit 533.
  • the second determining unit 531 is configured to determine at least one semantic that can be expressed by the voice information;
  • the determining unit 532 is configured to determine whether one or more semantics matching the current context exists in the at least one semantic;
  • the executing unit 533 is configured to, when there are one or more semantics matching the current context, perform one of the following operations: take at least one of the one or more matching semantics as the semantics that the user actually wants to express through the voice information; take any one of the one or more matching semantics as that semantics; or take all of the one or more matching semantics as that semantics.
  • the human-computer interaction system 500 further includes a first judgment module and a reminder module.
  • the first judgment module is used for judging whether the user's image information can be obtained in the human-computer interaction process; and the reminding module is used for reminding the user to adjust the posture when the user's image information cannot be obtained.
  • the smart device can see the appearance of the communication object during the interaction process.
  • the human-computer interaction system 500 further includes a second judgment module and an adjustment module.
  • the second judgment module is used for judging whether an obstruction partially or completely blocks the user during the human-computer interaction; and the adjustment module is used for adjusting the image capture device, when such an obstruction exists, so that it avoids the obstruction and collects the user's image information.
  • Through the above embodiments, the smart device can see, and see clearly, the appearance of its communication object during the interaction process.
  • any number of the modules and units, or at least part of the functions of any number of them, may be implemented in one module. Any one or more of the modules and units according to the embodiments of the present disclosure may be split into multiple modules for implementation. Any one or more of the modules and units according to the embodiments of the present disclosure may be at least partially implemented as a hardware circuit, such as a field programmable gate array (FPGA), a programmable logic array (PLA), a system-on-chip, a system-on-substrate, a system-on-package, or an application-specific integrated circuit (ASIC), or implemented by hardware or firmware in any other reasonable way of integrating or packaging a circuit, or implemented in any one of, or an appropriate combination of, the three implementation modes of software, hardware, and firmware. Alternatively, one or more of the modules and units according to the embodiments of the present disclosure may be at least partially implemented as a computer program module which, when run, performs the corresponding function.
  • any of the acquisition module 510, the determination module 520, the perception module 530, and the response module 540 may be combined into one module/unit/subunit for implementation, or any one of these modules/units/subunits may be split into multiple modules/units/subunits. Alternatively, at least part of the functions of one or more of these modules/units/subunits may be combined with at least part of the functions of other modules/units/subunits and implemented in one module/unit/subunit.
  • At least one of the acquisition module 510, the determination module 520, the perception module 530, and the response module 540 may be at least partially implemented as a hardware circuit, such as a field programmable gate array (FPGA), a programmable logic array (PLA), a system-on-chip, a system-on-substrate, a system-on-package, or an application-specific integrated circuit (ASIC), or implemented by hardware or firmware in any other reasonable way of integrating or packaging a circuit, or implemented in any one of, or an appropriate combination of, the three implementation modes of software, hardware, and firmware.
  • at least one of the acquisition module 510, the determination module 520, the perception module 530, and the response module 540 may be at least partially implemented as a computer program module, and when the computer program module is run, it may perform a corresponding function.
  • Fig. 8 schematically shows a block diagram of a computer system suitable for implementing the human-computer interaction method and system according to an embodiment of the present disclosure.
  • the computer system shown in FIG. 8 is only an example, and should not bring any limitation to the function and scope of use of the embodiments of the present disclosure.
  • a computer system 800 includes a processor 801, which can execute various appropriate actions and processing according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage part 808 into a random access memory (RAM) 803.
  • the processor 801 may include, for example, a general-purpose microprocessor (for example, a CPU), an instruction set processor and/or a related chipset and/or a special purpose microprocessor (for example, an application specific integrated circuit (ASIC)), and so on.
  • the processor 801 may also include on-board memory for caching purposes.
  • the processor 801 may include a single processing unit or multiple processing units for performing different actions of a method flow according to an embodiment of the present disclosure.
  • the processor 801, the ROM 802, and the RAM 803 are connected to each other through a bus 804.
  • the processor 801 executes various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 802 and/or RAM 803. It should be noted that the program may also be stored in one or more memories other than ROM 802 and RAM 803.
  • the processor 801 may also execute various operations of the method flow according to the embodiments of the present disclosure by executing programs stored in the one or more memories.
  • the system 800 may further include an input/output (I/O) interface 805, and the input/output (I/O) interface 805 is also connected to the bus 804.
  • the system 800 may also include one or more of the following components connected to the I/O interface 805: an input part 806 including a keyboard, a mouse, and the like; an output part 807 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage part 808 including a hard disk and the like; and a communication part 809 including a network interface card such as a LAN card, a modem, and the like.
  • the communication section 809 performs communication processing via a network such as the Internet.
  • the driver 810 is also connected to the I/O interface 805 as needed.
  • a removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is installed on the drive 810 as needed, so that the computer program read from it is installed into the storage section 808 as needed.
  • the method flow according to the embodiment of the present disclosure may be implemented as a computer software program.
  • an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a computer-readable storage medium, and the computer program contains program code for executing the method shown in the flowchart.
  • the computer program may be downloaded and installed from the network through the communication section 809, and/or installed from the removable medium 811. When the computer program is executed by the processor 801, the above-described functions defined in the system of the embodiments of the present disclosure are performed.
  • the above-described systems, devices, devices, modules, units, etc. may be implemented by computer program modules.
  • the present disclosure also provides a computer-readable storage medium.
  • the computer-readable storage medium may be included in the device/apparatus/system described in the above embodiments; or it may exist on its own without being assembled into that device/apparatus/system.
  • the aforementioned computer-readable storage medium carries one or more programs, and when the aforementioned one or more programs are executed, the method according to the embodiments of the present disclosure is implemented.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium.
  • it may include, but is not limited to: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device.
  • a computer-readable storage medium may include the ROM 802 and/or RAM 803 described above, and/or one or more memories other than the ROM 802 and RAM 803.
  • each block in the flowchart or block diagram may represent a module, program segment, or part of code, which contains one or more executable instructions for implementing the specified logical function.
  • in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved.
  • each block in the block diagram or flowchart, and combinations of blocks in the block diagram or flowchart, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Mathematical Physics (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present disclosure provides a human-computer interaction method, comprising: acquiring image information and voice information of a user during human-computer interaction; determining a current context according to the image information; perceiving, in the current context, the semantics the user actually intends to express through the voice information; and responding to the user based on the perceived semantics. The present disclosure further discloses a human-computer interaction system, a computer system, and a computer-readable storage medium.

Description

Human-computer interaction method and system, medium, and computer system

Technical field

The present disclosure relates to the field of computer technology, and more particularly to a human-computer interaction method, a human-computer interaction system, a computer system, and a computer-readable storage medium.

Background

With the continuous development of human-machine dialogue technology, more and more smart devices (such as smart speakers and smartphones) can now carry out human-machine dialogue.

However, in the course of conceiving the present disclosure, the inventors found that existing smart devices, during human-computer interaction and especially human-machine dialogue, can only "hear" a person's voice and cannot "see" what the person looks like, so their semantic understanding is weak.

Summary

In view of this, the present disclosure provides a human-computer interaction method and human-computer interaction system that enable a smart device both to "hear" a person's voice and to "see" the person during human-computer interaction.
One aspect of the present disclosure provides a human-computer interaction method, comprising: acquiring image information and voice information of a user during human-computer interaction; determining a current context according to the image information; perceiving, in the current context, the semantics the user actually intends to express through the voice information; and responding to the user based on the perceived semantics.

According to an embodiment of the present disclosure, determining the current context according to the image information comprises: performing face recognition on the user according to the image information to determine the user's current expression, and/or performing body-movement recognition on the user to determine the user's current state; and determining the current context based on the user's current expression and/or current state.

According to an embodiment of the present disclosure, perceiving, in the current context, the semantics the user actually intends to express through the voice information comprises: determining at least one semantics that the voice information can express; judging whether the at least one semantics includes one or more semantics matching the current context; and if so, performing one of the following operations: taking at least one of the one or more context-matching semantics as the semantics the user actually intends to express through the voice information; taking any one of the one or more context-matching semantics as that semantics; or taking all of the one or more context-matching semantics as that semantics.

According to an embodiment of the present disclosure, the method further comprises, during human-computer interaction: judging whether the user's image information can be acquired; and if it cannot be acquired, reminding the user to adjust his or her position and pose.

According to an embodiment of the present disclosure, the method further comprises, during human-computer interaction: judging whether an obstruction partially or completely blocks the user; and if so, adjusting the image acquisition device so that it can avoid the obstruction and capture the user's image information.
Another aspect of the present disclosure provides a human-computer interaction system, comprising: an acquisition module for acquiring image information and voice information of a user during human-computer interaction; a determination module for determining a current context according to the image information; a perception module for perceiving, in the current context, the semantics the user actually intends to express through the voice information; and a response module for responding to the user based on the perceived semantics.

According to an embodiment of the present disclosure, the determination module comprises: a recognition unit for performing face recognition on the user according to the image information to determine the user's current expression, and/or performing body-movement recognition on the user to determine the user's current state; and a first determination unit for determining the current context based on the user's current expression and/or current state.

According to an embodiment of the present disclosure, the perception module comprises: a second determination unit for determining at least one semantics that the voice information can express; a judgment unit for judging whether the at least one semantics includes one or more semantics matching the current context; and an execution unit for performing, when one or more context-matching semantics exist, one of the following operations: taking at least one of the one or more context-matching semantics as the semantics the user actually intends to express through the voice information; taking any one of them as that semantics; or taking all of them as that semantics.

According to an embodiment of the present disclosure, the system further comprises: a first judgment module for judging, during human-computer interaction, whether the user's image information can be acquired; and a reminder module for reminding the user to adjust his or her position and pose when the image information cannot be acquired.

According to an embodiment of the present disclosure, the system further comprises: a second judgment module for judging, during human-computer interaction, whether an obstruction partially or completely blocks the user; and an adjustment module for adjusting the image acquisition device, when the user is partially or completely blocked, so that it can avoid the obstruction and capture the user's image information.

Another aspect of the present disclosure provides a computer-readable storage medium storing computer-executable instructions which, when executed, implement the method of any one of the above.

Another aspect of the present disclosure provides a computer program comprising computer-executable instructions which, when executed, implement the method of any one of the above.

Another aspect of the present disclosure provides a computer system, comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of the above.

According to embodiments of the present disclosure, because the smart device acquires both the voice and the image of its interlocutor in a human-computer interaction scenario, the embodiments at least partially overcome the technical problem in the related art that a smart device, during human-computer interaction and especially human-machine dialogue, can only "hear" a person's voice and cannot "see" the person, and thus has weak semantic understanding; the technical effect of strengthening the smart device's semantic understanding is thereby achieved.
Brief description of the drawings

The above and other objects, features, and advantages of the present disclosure will become clearer from the following description of its embodiments with reference to the accompanying drawings, in which:

Fig. 1 schematically shows an exemplary system architecture to which the human-computer interaction method and system of the present disclosure may be applied;

Fig. 2 schematically shows an application scenario of the human-computer interaction method and system according to an embodiment of the present disclosure;

Fig. 3 schematically shows a flowchart of the human-computer interaction method according to an embodiment of the present disclosure;

Figs. 4A to 4C schematically show diagrams of determining semantics according to an embodiment of the present disclosure;

Fig. 5 schematically shows a block diagram of the human-computer interaction system according to an embodiment of the present disclosure;

Fig. 6 schematically shows a block diagram of the determination module according to an embodiment of the present disclosure;

Fig. 7 schematically shows a block diagram of the perception module according to an embodiment of the present disclosure; and

Fig. 8 schematically shows a block diagram of a computer system suitable for implementing the human-computer interaction method and system according to an embodiment of the present disclosure.

Detailed description
Embodiments of the present disclosure will be described below with reference to the accompanying drawings. It should be understood that these descriptions are merely exemplary and are not intended to limit the scope of the present disclosure. In the following detailed description, numerous specific details are set forth for ease of explanation, in order to provide a thorough understanding of the embodiments of the present disclosure. It is evident, however, that one or more embodiments may be practiced without these specific details. Moreover, descriptions of well-known structures and techniques are omitted below to avoid unnecessarily obscuring the concepts of the present disclosure.

The terms used herein are for describing specific embodiments only and are not intended to limit the present disclosure. Terms such as "include" and "comprise" indicate the presence of the stated features, steps, operations, and/or components, but do not exclude the presence or addition of one or more other features, steps, operations, or components.

All terms used herein (including technical and scientific terms) have the meanings commonly understood by those skilled in the art, unless otherwise defined. It should be noted that the terms used here should be interpreted as having meanings consistent with the context of this specification, and should not be interpreted in an idealized or overly rigid manner.

Where an expression such as "at least one of A, B, and C" is used, it should generally be interpreted in the sense commonly understood by those skilled in the art (e.g., "a system having at least one of A, B, and C" includes, but is not limited to, systems having A alone, B alone, C alone, A and B, A and C, B and C, and/or A, B, and C). Where an expression such as "at least one of A, B, or C" is used, it should likewise be interpreted in that commonly understood sense (e.g., "a system having at least one of A, B, or C" includes, but is not limited to, systems having A alone, B alone, C alone, A and B, A and C, B and C, and/or A, B, and C).
Embodiments of the present disclosure provide a human-computer interaction method and human-computer interaction system that enable a smart device both to "hear" a person's voice and to "see" the person during human-computer interaction. The method comprises: acquiring image information and voice information of a user during human-computer interaction; determining a current context according to the image information; perceiving, in the current context, the semantics the user actually intends to express through the voice information; and responding to the user based on the perceived semantics.

Fig. 1 schematically shows an exemplary system architecture to which the human-computer interaction method and system of the present disclosure may be applied. Note that Fig. 1 is merely an example of a system architecture to which the embodiments of the present disclosure may be applied, provided to help those skilled in the art understand the technical content of the present disclosure; it does not mean that the embodiments cannot be used in other devices, systems, environments, or scenarios.

As shown in Fig. 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 serves as a medium providing communication links between the terminal devices 101, 102, 103 and the server 105, and may include various connection types, such as wired and/or wireless communication links.

A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104, to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as shopping applications, web browser applications, search applications, instant messaging tools, email clients, and/or social platform software (examples only).

The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, laptop computers, and desktop computers.

The server 105 may be a server providing various services, for example a back-end management server (an example only) that supports websites browsed by users with the terminal devices 101, 102, 103. The back-end management server may analyze and otherwise process received data such as user requests, and feed processing results (e.g., web pages, information, or data acquired or generated according to the user requests) back to the terminal devices.

It should be noted that the human-computer interaction method provided by the embodiments of the present disclosure may also be executed by the terminal device 101, 102, or 103, or by another terminal device different from the terminal devices 101, 102, and 103. Correspondingly, the human-computer interaction system provided by the embodiments may also be arranged in the terminal device 101, 102, or 103, or in another terminal device different from them.

It should be understood that the numbers of terminal devices, networks, and servers in Fig. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers according to implementation needs.
Fig. 2 schematically shows an application scenario of the human-computer interaction method and system according to an embodiment of the present disclosure.

As shown in Fig. 2, in this application scenario a user can carry out human-computer interaction with a smart speaker 201. Using the technical solution provided by the embodiments of the present disclosure, the smart speaker 201 can not only recognize the user's speech during the interaction but also capture the user's image, so that it can both hear the user's voice and see what the user looks like, instead of being effectively blind, as prior-art smart speakers are, able to hear the user's voice but unable to see the user.

It should be understood that Fig. 2 is merely an example of an application scenario suitable for the embodiments of the present disclosure, provided to help those skilled in the art understand the technical content of the present disclosure; it does not mean that the embodiments cannot be used in other devices, systems, environments, or scenarios.

Fig. 3 schematically shows a flowchart of the human-computer interaction method according to an embodiment of the present disclosure.

As shown in Fig. 3, the method includes operations S310 to S340, in which:

In operation S310, image information and voice information of a user are acquired during human-computer interaction.

During human-computer interaction, the smart device and the user act as each other's interlocutors. In the embodiments of the present disclosure, because the smart device can acquire the user's image information, it can see what the user looks like; and because it can also acquire the user's voice information, it can hear the user's voice.
Specifically, the smart device includes a microphone array with which it collects the user's voice information. More specifically, when collecting the voice information, the smart device can accurately locate the direction from which the user is speaking, and can then strengthen the beam energy in that direction to improve recognition accuracy.
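The disclosure does not specify a beamforming algorithm; as an illustration only, strengthening the beam energy in the speaker's direction can be sketched as a minimal delay-and-sum beamformer (real microphone arrays use fractional delays and adaptive weighting):

```python
import numpy as np

def delay_and_sum(signals, mic_positions, direction, fs=16000, c=343.0):
    """Minimal delay-and-sum beamformer: align and average the microphone
    signals so that energy arriving from `direction` (a unit vector) is
    reinforced. A sketch only, not the method of the disclosure."""
    out = np.zeros_like(signals[0], dtype=float)
    for sig, pos in zip(signals, mic_positions):
        delay_s = np.dot(pos, direction) / c   # geometric delay for this mic
        shift = int(round(delay_s * fs))       # nearest whole-sample shift
        out += np.roll(sig, -shift)            # advance to compensate delay
    return out / len(signals)
```

With co-located microphones and identical signals, the output reproduces the input, which is a quick sanity check of the alignment logic.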
In addition, the smart device may collect the user's image information through an image acquisition device such as a camera. More specifically, the image acquisition device may either be arranged on the smart device as one of its components, or arranged outside the smart device as an independent apparatus.

In operation S320, the current context is determined according to the image information.

Since a person's facial expressions and body language also convey information and color the corresponding context, parsing the person's facial expressions and/or body language can be considered when strengthening the smart device's understanding of the person's spoken language.

Specifically, face recognition may be performed based on the acquired image information, and/or the interlocutor's body movements may be analyzed based on the acquired image information. More specifically, within the recognition region the smart device can collect face data through the camera, including the orientation and contour of the face and the contours of the eyes, eyebrows, lips, and nose, in order to analyze information such as the current user's (i.e., the interlocutor's) emotion and expression, and thereby determine the corresponding context. For example, pleasure generally indicates affirmation or agreement, while helplessness or frustration generally indicates negation, and so on.
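The mapping from recognized expression to context can be pictured with a small lookup sketch. The expression names, context labels, and the `infer_context` helper are all illustrative assumptions, not terms defined by the disclosure:

```python
# Hypothetical mapping from a recognized facial expression to a context
# label, following the examples in the text (pleasure -> affirmation,
# helplessness/frustration -> negation).
EXPRESSION_TO_CONTEXT = {
    "pleased": "affirmative",
    "helpless": "negative",
    "frustrated": "negative",
}

def infer_context(expression, body_state=None):
    """Derive the current context from the expression and, optionally,
    a body-language state (e.g. "nodding"); unknown faces read neutral."""
    context = EXPRESSION_TO_CONTEXT.get(expression, "neutral")
    if context == "neutral" and body_state == "nodding":
        context = "affirmative"   # body language can refine a neutral face
    return context
```

A real system would of course produce these labels from a face/pose recognition model rather than a dictionary; the table only fixes the idea of expression-to-context mapping.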
In operation S330, the semantics the user actually intends to express through the voice information is perceived in the current context.

Because much voice information expresses different, sometimes diametrically opposite, semantics in different contexts, perceiving the semantics the user actually intends to express through a given piece of voice information in combination with the context can strengthen semantic understanding.

For example, during a chat, if one party says "heh" and appears pleased, this is generally taken as agreeing with the other party; if the party says "heh" and appears helpless, it is generally taken as disagreeing with the other party; and if the party says "heh" with no visible sign of either pleasure or helplessness, "heh" is generally taken to be a mere filler word.

In operation S340, the user is responded to based on the perceived semantics.

In the correct context, not only can the true meaning expressed by the interlocutor be perceived, but the smart device is also endowed with the ability to read the interlocutor's expressions and moods; it can therefore speak to what the interlocutor really cares about and raise the interlocutor's interest in the conversation.
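The four operations S310 to S340 can be sketched as one dialogue turn. Every callable below is an assumed component interface, injected as a parameter so the sketch stays self-contained; none of these names come from the disclosure:

```python
def interact(capture_image, record_speech, derive_context,
             resolve_semantics, reply):
    """One dialogue turn following operations S310-S340 (a sketch under
    the assumption that each stage is available as a callable)."""
    image = capture_image()                        # S310: acquire image ...
    speech = record_speech()                       # ... and voice information
    context = derive_context(image)                # S320: context from image
    meaning = resolve_semantics(speech, context)   # S330: intended semantics
    return reply(meaning)                          # S340: answer the user
```

Wiring in stub components, e.g. a context deriver that always reads "affirmative", shows the data flow without committing to any particular recognition model.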
Compared with prior-art smart devices, which during human-computer interaction and especially human-machine dialogue can only hear a person's voice and cannot see the person, as if blind, so that their semantic understanding suffers, the embodiments of the present disclosure add visual interaction, i.e., introduce face recognition and/or body-movement recognition, and collect both the image and the voice of the interlocutor (the person), so that the smart device can both listen and see during human-computer interaction, thereby strengthening its semantic understanding.

As an optional embodiment, determining the current context according to the image information includes: performing face recognition on the user according to the image information to determine the user's current expression, and/or performing body-movement recognition on the user according to the image information to determine the user's current state; and determining the current context based on the user's current expression and/or current state.

In other words, in actual operation, face recognition alone may be performed to determine the interlocutor's expression and then the current context from that expression; body-movement recognition alone may be performed to determine the interlocutor's current state and then the current context from that state; or both face recognition and body-movement recognition may be performed to determine the interlocutor's expression and current state, with the current context then determined from both.

In a specific interaction scenario, the image recognition system performs face recognition, i.e., analyzes the facial attributes of the face to determine its angle and expression, and/or performs body-movement recognition, thereby computing the user's current state and generating a state information table covering the user and the user's interlocutor (e.g., the smart device); based on the person's expression and/or current state, it then gives a threshold corresponding to each context. At the same time, the voice data collected by the microphone array undergoes beam analysis and natural language processing to generate the user's language information; according to the different contexts set in advance, processing results are produced and a threshold is given for each corresponding semantics. Finally, face recognition and speech recognition are combined to match the context, select a reasonable semantic parsing result, and interact with the user.

Through the embodiments of the present disclosure, visual interaction is added on top of auditory interaction: image recognition technology is used to acquire the user's current expression and/or state, which is then combined with audio analysis to produce a more reasonable semantic recognition result, and thus stronger semantic understanding.
As an optional embodiment, perceiving, in the current context, the semantics the user actually intends to express through the voice information includes: determining at least one semantics that the voice information can express; judging whether the at least one semantics includes one or more semantics matching the current context; and if so, performing one of the following operations: taking at least one of the one or more context-matching semantics as the semantics the user actually intends to express through the voice information; taking any one of the one or more context-matching semantics as that semantics; or taking all of the one or more context-matching semantics as that semantics.

Since much information corresponds to different semantics in different contexts, when determining the semantics of a given piece of information, all the semantics it can express may first be determined, and the semantics matching the current context is then selected from among them.

Since one context may match one or more semantics at the same time, several modes of operation are possible in that case: as shown in Fig. 4A, at least one of the matched semantics may be selected as what the user most wants to express; or, as shown in Fig. 4B, any one of the matched semantics, such as the one with the highest matching degree, may be selected as what the user most wants to express; or, as shown in Fig. 4C, all the matched semantics may be taken as what the user most wants to express.
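The candidate-then-match selection above can be sketched as follows. The data shapes, the score dictionaries, and the strategy names are illustrative assumptions; only the selection logic (pick the best match, Fig. 4B, or keep every match, Fig. 4C) mirrors the text:

```python
def pick_semantics(candidates, context, strategy="best"):
    """Select the candidate meaning(s) matching the current context.

    candidates: list of (meaning, {context: match_score}) pairs
    strategy:   "best" -> the single highest-scoring match (cf. Fig. 4B)
                "all"  -> every matching meaning (cf. Fig. 4C)
    """
    matched = [(m, scores[context]) for m, scores in candidates
               if scores.get(context, 0.0) > 0.0]
    if not matched:
        return []                       # no meaning fits this context
    if strategy == "best":
        return [max(matched, key=lambda x: x[1])[0]]
    return [m for m, _ in matched]      # strategy == "all"

# The "heh" example from the text, with made-up scores:
candidates = [
    ("agreement", {"pleased": 0.9}),      # "heh" said while pleased
    ("disagreement", {"helpless": 0.8}),  # "heh" said in resignation
    ("filler word", {"neutral": 0.5}),    # "heh" with no readable expression
]
print(pick_semantics(candidates, "pleased"))  # prints ['agreement']
```

The "at least one" variant of Fig. 4A would simply take any non-empty subset of `matched`, so it is omitted for brevity.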
Through the embodiments of the present disclosure, the semantics matching the current context can be found among multiple candidate semantics and used as the basis for the response, so that the user's intention is grasped accurately and the user experience is improved.

As an optional embodiment, the above method further includes, during human-computer interaction: judging whether the user's image information can be acquired; and if it cannot be acquired, reminding the user to adjust his or her position and pose.

Since image acquisition is involved, it is preferable to place the smart device, during use, in a position with an open field of view and no obvious obstruction, so as to better collect the user's image information. In addition, reducing obstruction allows the direction from which the user is speaking to be located more accurately, so that the beam energy in that direction can be strengthened and speech recognition accuracy improved.

Therefore, during the interaction, whether the user's image information can be acquired may be checked continuously; if it cannot, the user may be reminded to adjust his or her position and pose, and if it can, no action need be taken.

Through the embodiments of the present disclosure, the smart device is guaranteed to be able to see its interlocutor during the interaction.
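The check-and-remind loop can be sketched as below. `camera.capture_user()` (returning a frame or `None`) and `speaker.say()` are assumed interfaces, and the bounded retry count is an illustrative choice, since the disclosure only says the check may run continuously:

```python
import time

def ensure_user_visible(camera, speaker, max_checks=3, interval_s=0.0):
    """Try to acquire the user's image during the interaction; whenever no
    usable frame is obtained, remind the user to adjust position and pose."""
    for _ in range(max_checks):
        frame = camera.capture_user()
        if frame is not None:
            return frame                    # user visible: no action needed
        speaker.say("I can't see you. Please adjust your position.")
        time.sleep(interval_s)              # give the user time to move
    return None                             # still not visible after retries
```

In a deployed device this would run as a background task for the whole dialogue rather than a fixed number of retries.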
As an optional embodiment, the above method further includes, during human-computer interaction: judging whether an obstruction partially or completely blocks the user; and if so, adjusting the image acquisition device so that it can avoid the obstruction and capture the user's image information.

Specifically, when an obstruction is found to partially or completely block the user, so that no image, only a partial image, or no clear image of the user can be captured, the problem can be solved not only by reminding the user to adjust position and pose, but also by adjusting the image acquisition device. Specifically, the shooting angle of the image acquisition device or the telescoping state of the camera may be adjusted so that it can avoid or bypass the obstruction and capture the user's image information.

Through the embodiments of the present disclosure, the smart device is likewise guaranteed to be able to see, and to see clearly, its interlocutor during the interaction.
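One way to picture adjusting the shooting angle is a small search over pan angles. `camera.capture()`, `camera.pan(angle)`, the `is_occluded(frame)` predicate, and the angle values (degrees) are all illustrative assumptions:

```python
def capture_unobstructed(camera, is_occluded, pan_angles=(-15, 15, -30, 30)):
    """If an obstruction blocks the user, try a few camera pan angles
    until an unobstructed frame is captured; a sketch only."""
    frame = camera.capture()
    if not is_occluded(frame):
        return frame                  # nothing blocking: keep current angle
    for angle in pan_angles:
        camera.pan(angle)             # adjust the shooting angle
        frame = camera.capture()
        if not is_occluded(frame):
            return frame              # obstruction avoided
    return None                       # fall back: remind the user to move
```

A telescoping camera would add a second search dimension (extension), but the control pattern is the same: probe, test the frame, stop at the first clear view.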
Fig. 5 schematically shows a block diagram of the human-computer interaction system according to an embodiment of the present disclosure.

As shown in Fig. 5, the human-computer interaction system 500 includes an acquisition module 510, a determination module 520, a perception module 530, and a response module 540, in which:

The acquisition module 510 is configured to acquire image information and voice information of a user during human-computer interaction.

The determination module 520 is configured to determine the current context according to the image information.

The perception module 530 is configured to perceive, in the current context, the semantics the user actually intends to express through the voice information.

The response module 540 is configured to respond to the user based on the perceived semantics.

Compared with prior-art smart devices, which during human-computer interaction and especially human-machine dialogue can only hear a person's voice and cannot see the person, as if blind, so that their semantic understanding suffers, the embodiments of the present disclosure add visual interaction, i.e., introduce face recognition and/or body-movement recognition, and collect both the image and the voice of the interlocutor (the person), so that the smart device can both listen and see during human-computer interaction, thereby strengthening its semantic understanding.
As an optional embodiment, as shown in Fig. 6, the determination module 520 includes a recognition unit 521 and a first determination unit 522. The recognition unit 521 is configured to perform face recognition on the user according to the image information to determine the user's current expression, and/or to perform body-movement recognition on the user to determine the user's current state; the first determination unit 522 is configured to determine the current context based on the user's current expression and/or current state.

Through the embodiments of the present disclosure, visual interaction is added on top of auditory interaction: image recognition technology is used to acquire the user's current expression and/or state, which is then combined with audio analysis to produce a more reasonable semantic recognition result, and thus stronger semantic understanding.

As an optional embodiment, as shown in Fig. 7, the perception module 530 includes a second determination unit 531, a judgment unit 532, and an execution unit 533. The second determination unit 531 is configured to determine at least one semantics that the voice information can express; the judgment unit 532 is configured to judge whether the at least one semantics includes one or more semantics matching the current context; and the execution unit 533 is configured to perform, when one or more context-matching semantics exist, one of the following operations: taking at least one of the one or more context-matching semantics as the semantics the user actually intends to express through the voice information; taking any one of them as that semantics; or taking all of them as that semantics.

Through the embodiments of the present disclosure, the semantics matching the current context can be found among multiple candidate semantics and used as the basis for the response, so that the user's intention is grasped accurately and the user experience is improved.

As an optional embodiment, the human-computer interaction system 500 further includes a first judgment module and a reminder module. The first judgment module is configured to judge, during human-computer interaction, whether the user's image information can be acquired; and the reminder module is configured to remind the user to adjust his or her position and pose when the image information cannot be acquired.

Through the embodiments of the present disclosure, the smart device is guaranteed to be able to see its interlocutor during the interaction.

As an optional embodiment, the human-computer interaction system 500 further includes a second judgment module and an adjustment module. The second judgment module is configured to judge, during human-computer interaction, whether an obstruction partially or completely blocks the user; and the adjustment module is configured to adjust the image acquisition device, when the user is partially or completely blocked, so that it can avoid the obstruction and capture the user's image information.

Through the embodiments of the present disclosure, the smart device is likewise guaranteed to be able to see, and to see clearly, its interlocutor during the interaction.
Any number of the modules and units according to the embodiments of the present disclosure, or at least part of the functions of any number of them, may be implemented in one module. Any one or more of the modules and units according to the embodiments of the present disclosure may be split into multiple modules for implementation. Any one or more of them may be implemented at least partially as a hardware circuit, such as a field-programmable gate array (FPGA), a programmable logic array (PLA), a system on chip, a system on substrate, a system on package, or an application-specific integrated circuit (ASIC), or may be implemented in hardware or firmware by any other reasonable means of integrating or packaging circuits, or implemented in any one of, or an appropriate combination of, the three implementation modes of software, hardware, and firmware. Alternatively, one or more of the modules and units according to the embodiments of the present disclosure may be implemented at least partially as a computer program module which, when run, performs the corresponding function.

For example, any number of the acquisition module 510, the determination module 520, the perception module 530, and the response module 540 may be combined and implemented in one module/unit/subunit, or any one of them may be split into multiple modules/units/subunits. Alternatively, at least part of the functions of one or more of these modules/units/subunits may be combined with at least part of the functions of other modules/units/subunits and implemented in one module/unit/subunit. According to an embodiment of the present disclosure, at least one of the acquisition module 510, the determination module 520, the perception module 530, and the response module 540 may be implemented at least partially as a hardware circuit, such as an FPGA, a PLA, a system on chip, a system on substrate, a system on package, or an ASIC, or may be implemented in hardware or firmware by any other reasonable means of integrating or packaging circuits, or implemented in any one of, or an appropriate combination of, the three implementation modes of software, hardware, and firmware. Alternatively, at least one of them may be implemented at least partially as a computer program module which, when run, performs the corresponding function.

It should be noted that the system-part embodiments of the present disclosure correspond to, and are the same as or similar to, the method-part embodiments; for the description of the system-part embodiments, please refer to the description of the method-part embodiments, which is not repeated here.
Fig. 8 schematically shows a block diagram of a computer system suitable for implementing the human-computer interaction method and system according to an embodiment of the present disclosure. The computer system shown in Fig. 8 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of the present disclosure.

As shown in Fig. 8, the computer system 800 according to an embodiment of the present disclosure includes a processor 801, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage section 808 into a random access memory (RAM) 803. The processor 801 may include, for example, a general-purpose microprocessor (e.g., a CPU), an instruction set processor and/or a related chipset, and/or a special-purpose microprocessor (e.g., an application-specific integrated circuit (ASIC)), and so on. The processor 801 may also include on-board memory for caching. The processor 801 may include a single processing unit, or multiple processing units, for performing the different actions of the method flow according to the embodiments of the present disclosure.

The RAM 803 stores various programs and data needed for the operation of the system 800. The processor 801, the ROM 802, and the RAM 803 are connected to one another through a bus 804. The processor 801 performs the various operations of the method flow according to the embodiments of the present disclosure by executing programs stored in the ROM 802 and/or the RAM 803. Note that the programs may also be stored in one or more memories other than the ROM 802 and RAM 803; the processor 801 may likewise perform those operations by executing programs stored in that one or more memories.

According to an embodiment of the present disclosure, the system 800 may further include an input/output (I/O) interface 805, which is also connected to the bus 804. The system 800 may also include one or more of the following components connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a cathode ray tube (CRT), a liquid crystal display (LCD), and the like, as well as a speaker; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card or a modem. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as needed. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as needed, so that a computer program read from it can be installed into the storage section 808 as needed.

According to an embodiment of the present disclosure, the method flow according to the embodiments of the present disclosure may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product comprising a computer program carried on a computer-readable storage medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 809, and/or installed from the removable medium 811. When the computer program is executed by the processor 801, the above-described functions defined in the system of the embodiments of the present disclosure are performed. According to the embodiments of the present disclosure, the systems, devices, apparatuses, modules, units, and the like described above may be implemented by computer program modules.
The present disclosure also provides a computer-readable storage medium, which may be included in the device/apparatus/system described in the above embodiments, or may exist on its own without being assembled into that device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to the embodiments of the present disclosure.

According to an embodiment of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, including but not limited to a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by, or in combination with, an instruction execution system, apparatus, or device.

For example, according to an embodiment of the present disclosure, the computer-readable storage medium may include the ROM 802 and/or the RAM 803 described above, and/or one or more memories other than the ROM 802 and RAM 803.

The flowcharts and block diagrams in the drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or part of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams or flowcharts, and combinations of blocks in the block diagrams or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

Those skilled in the art will understand that the features recited in the various embodiments and/or claims of the present disclosure can be combined in many ways, even if such combinations are not explicitly recited in the present disclosure. In particular, without departing from the spirit and teachings of the present disclosure, the features recited in the various embodiments and/or claims can be combined in many ways. All such combinations fall within the scope of the present disclosure.

The embodiments of the present disclosure have been described above. These embodiments, however, are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments have been described separately above, this does not mean that the measures in the various embodiments cannot be advantageously combined. The scope of the present disclosure is defined by the appended claims and their equivalents. Those skilled in the art may make various substitutions and modifications without departing from the scope of the present disclosure, and all such substitutions and modifications fall within the scope of the present disclosure.

Claims (12)

  1. A human-computer interaction method, comprising:
    acquiring image information and voice information of a user during human-computer interaction;
    determining a current context according to the image information;
    perceiving, in the current context, the semantics the user actually intends to express through the voice information; and
    responding to the user based on the perceived semantics.
  2. The method according to claim 1, wherein determining the current context according to the image information comprises:
    performing face recognition on the user according to the image information to determine the user's current expression, and/or performing body-movement recognition on the user to determine the user's current state; and
    determining the current context based on the user's current expression and/or current state.
  3. The method according to claim 1, wherein perceiving, in the current context, the semantics the user actually intends to express through the voice information comprises:
    determining at least one semantics that the voice information can express;
    judging whether the at least one semantics includes one or more semantics matching the current context; and
    if so, performing one of the following operations:
    taking at least one of the one or more context-matching semantics as the semantics the user actually intends to express through the voice information;
    taking any one of the one or more context-matching semantics as the semantics the user actually intends to express through the voice information;
    taking all of the one or more context-matching semantics as the semantics the user actually intends to express through the voice information.
  4. The method according to claim 1, further comprising, during human-computer interaction:
    judging whether the user's image information can be acquired; and
    if the user's image information cannot be acquired, reminding the user to adjust his or her position and pose.
  5. The method according to claim 1, further comprising, during human-computer interaction:
    judging whether an obstruction partially or completely blocks the user; and
    if an obstruction partially or completely blocks the user, adjusting the image acquisition device so that it can avoid the obstruction and capture the user's image information.
  6. A human-computer interaction system, comprising:
    an acquisition module for acquiring image information and voice information of a user during human-computer interaction;
    a determination module for determining a current context according to the image information;
    a perception module for perceiving, in the current context, the semantics the user actually intends to express through the voice information; and
    a response module for responding to the user based on the perceived semantics.
  7. The system according to claim 6, wherein the determination module comprises:
    a recognition unit for performing face recognition on the user according to the image information to determine the user's current expression, and/or performing body-movement recognition on the user to determine the user's current state; and
    a first determination unit for determining the current context based on the user's current expression and/or current state.
  8. The system according to claim 6, wherein the perception module comprises:
    a second determination unit for determining at least one semantics that the voice information can express;
    a judgment unit for judging whether the at least one semantics includes one or more semantics matching the current context; and
    an execution unit for performing, when one or more context-matching semantics exist, one of the following operations:
    taking at least one of the one or more context-matching semantics as the semantics the user actually intends to express through the voice information;
    taking any one of the one or more context-matching semantics as the semantics the user actually intends to express through the voice information;
    taking all of the one or more context-matching semantics as the semantics the user actually intends to express through the voice information.
  9. The system according to claim 6, further comprising:
    a first judgment module for judging, during human-computer interaction, whether the user's image information can be acquired; and
    a reminder module for reminding the user to adjust his or her position and pose when the user's image information cannot be acquired.
  10. The system according to claim 6, further comprising:
    a second judgment module for judging, during human-computer interaction, whether an obstruction partially or completely blocks the user; and
    an adjustment module for adjusting the image acquisition device, when an obstruction partially or completely blocks the user, so that it can avoid the obstruction and capture the user's image information.
  11. A computer system, comprising:
    one or more processors; and
    a memory for storing one or more programs,
    wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1 to 5.
  12. A computer-readable storage medium having executable instructions stored thereon which, when executed by a processor, cause the processor to implement the method according to any one of claims 1 to 5.
PCT/CN2020/071188 2019-03-22 2020-01-09 Human-computer interaction method and system, medium and computer system WO2020192247A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910221207.4A CN111722702A (zh) 2019-03-22 2019-03-22 Human-computer interaction method and system, medium and computer system
CN201910221207.4 2019-03-22

Publications (1)

Publication Number Publication Date
WO2020192247A1 true WO2020192247A1 (zh) 2020-10-01

Family

ID=72562621

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/071188 WO2020192247A1 (zh) 2019-03-22 2020-01-09 人机交互方法及系统、介质和计算机系统

Country Status (2)

Country Link
CN (1) CN111722702A (zh)
WO (1) WO2020192247A1 (zh)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491435A * 2017-08-14 2017-12-19 深圳狗尾草智能科技有限公司 Method and apparatus for automatically recognizing user emotion by computer
CN108833941A * 2018-06-29 2018-11-16 北京百度网讯科技有限公司 Human-computer interaction processing method, apparatus, user terminal, processing server, and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105093986A * 2015-07-23 2015-11-25 百度在线网络技术(北京)有限公司 Artificial-intelligence-based humanoid robot control method and system, and humanoid robot
CN107301168A * 2017-06-01 2017-10-27 深圳市朗空亿科科技有限公司 Intelligent robot and emotion interaction method and system thereof

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491435A * 2017-08-14 2017-12-19 深圳狗尾草智能科技有限公司 Method and apparatus for automatically recognizing user emotion by computer
CN108833941A * 2018-06-29 2018-11-16 北京百度网讯科技有限公司 Human-computer interaction processing method, apparatus, user terminal, processing server, and system

Also Published As

Publication number Publication date
CN111722702A (zh) 2020-09-29

Similar Documents

Publication Publication Date Title
US11561621B2 (en) Multi media computing or entertainment system for responding to user presence and activity
US10585486B2 (en) Gesture interactive wearable spatial audio system
US11849256B2 (en) Systems and methods for dynamically concealing sensitive information
US9390726B1 (en) Supplementing speech commands with gestures
US10360876B1 (en) Displaying instances of visual content on a curved display
US20190217425A1 (en) Object Recognition and Presentation for the Visually Impaired
US9317113B1 (en) Gaze assisted object recognition
US20140379351A1 (en) Speech detection based upon facial movements
US20120259638A1 (en) Apparatus and method for determining relevance of input speech
US11205426B2 (en) Information processing device, information processing method, and program
US10255690B2 (en) System and method to modify display of augmented reality content
TW201901527A (zh) 視訊會議裝置與視訊會議管理方法
KR102193029B1 (ko) 디스플레이 장치 및 그의 화상 통화 수행 방법
KR102591555B1 (ko) 자동 어시스턴트를 위한 시각적 단서들의 선택적 검출
CN111144266B (zh) 人脸表情的识别方法及装置
TWI734246B (zh) 人臉辨識的方法及裝置
US20220319063A1 (en) Method and apparatus for video conferencing
US11663851B2 (en) Detecting and notifying for potential biases in artificial intelligence applications
WO2020192247A1 (zh) 人机交互方法及系统、介质和计算机系统
CN111010526A (zh) 一种视频通讯中的互动方法及装置
CN113325951B (zh) 基于虚拟角色的操作控制方法、装置、设备以及存储介质
US20210200500A1 (en) Telepresence device action selection
KR20220111574A (ko) 전자 장치 및 그 제어 방법
WO2020102943A1 (zh) 手势识别模型的生成方法、装置、存储介质及电子设备
US20240020872A1 (en) Provision of audio and video streams based onhand distances

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20778136

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 17.02.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 20778136

Country of ref document: EP

Kind code of ref document: A1