CN117116259A - Man-machine interaction method and related device based on gesture information - Google Patents

Man-machine interaction method and related device based on gesture information

Info

Publication number
CN117116259A
Authority
CN
China
Prior art keywords
user
content
machine
server
man
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310460765.2A
Other languages
Chinese (zh)
Inventor
王一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Renma Interactive Technology Co Ltd
Original Assignee
Shenzhen Renma Interactive Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Renma Interactive Technology Co Ltd filed Critical Shenzhen Renma Interactive Technology Co Ltd
Priority to CN202310460765.2A priority Critical patent/CN117116259A/en
Publication of CN117116259A publication Critical patent/CN117116259A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/24 Speech recognition using non-acoustical features
    • G10L 15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiment of the application discloses a man-machine interaction method based on gesture information and a related device, wherein the method comprises the following steps: outputting machine output content in a first man-machine dialogue scenario node from a server; determining that the input content of the user is not acquired within a preset time period; acquiring gesture information of the user; sending the gesture information of the user to the server; and outputting machine output content in a second man-machine dialogue scenario node in response to the first instruction sent by the server, wherein the second man-machine dialogue scenario node is a man-machine dialogue scenario node associated with the target reply content among a plurality of man-machine dialogue scenario nodes. The embodiment of the application enables the user to interact with the terminal device through gestures in scenarios where voice, text or instruction interaction cannot be used, thereby improving the user's man-machine interaction experience.

Description

Man-machine interaction method and related device based on gesture information
Technical Field
This application belongs to the technical field of general data processing in the Internet industry, relates to the processing of voice data, gesture information and the like, and particularly relates to a man-machine interaction method based on gesture information and a related device.
Background
With the popularization of electronic devices and the development of the Internet, interactive products such as interactive novels, interactive books and interactive games have appeared on the market. Users can use these interactive products by interacting with electronic devices. However, the existing man-machine interaction modes are limited to voice, text or instruction interaction, and when a user is in a special scene (such as a crowded subway or a quiet library), these conventional man-machine interaction modes are difficult to use, which undoubtedly reduces the user's man-machine interaction experience.
Disclosure of Invention
The embodiment of the application discloses a human-computer interaction method and a related device based on gesture information, which aim to enable a user to carry out gesture interaction with terminal equipment in a scene where voice, text or instruction interaction cannot be used, so that human-computer interaction experience of the user is improved.
In a first aspect, an embodiment of the present application provides a human-computer interaction method based on gesture information, which is applied to a terminal device, and the method includes:
outputting machine output content from a first human-machine dialogue scenario node of a server, wherein the first human-machine dialogue scenario node is one human-machine dialogue scenario node of a plurality of human-machine dialogue scenario nodes in the server;
Determining that input content of a user is not acquired within a preset time period, wherein the input content of the user is reply content of the user aiming at machine output content in the first man-machine conversation scenario node, and the input content of the user comprises one or more of voice information, text information and operation instructions of the user;
acquiring gesture information of the user, wherein the gesture information of the user comprises head actions and/or facial expressions of the user;
sending the gesture information of the user to the server;
and outputting machine output content in a second human-machine dialogue scenario node in response to the first instruction sent by the server, wherein the second human-machine dialogue scenario node is a human-machine dialogue scenario node associated with target reply content in the plurality of human-machine dialogue scenario nodes, and the target reply content comprises expected reply content which is determined by the server according to gesture information of the user in a plurality of items of expected reply content in the first human-machine dialogue scenario node and meets user requirements.
In the method, when the terminal device does not acquire, within the preset time period, the input content of the user for the first man-machine dialogue scenario node, the server can determine, according to the gesture information of the user, the expected reply content that meets the user's requirement, and the terminal device then outputs the machine output content in the second man-machine dialogue scenario node, so that the user can continue man-machine interaction with the terminal device in a scene where voice, text or instruction interaction cannot be used. Therefore, the method can improve the user's man-machine interaction experience.
It should be noted that the preset time period is an empirically set time threshold, and may be a smaller time threshold, for example, 15 seconds, 20 seconds, or 30 seconds.
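By way of illustration only, the terminal-side flow described in the first aspect may be sketched in Python as follows; the helper names (input_queue, capture_gesture, server) and the instruction format are assumptions of this sketch rather than part of the application.

```python
# Illustrative sketch only; the helper objects and names are assumptions,
# not the application's actual implementation.
import queue

PRESET_TIMEOUT_S = 20  # empirically chosen threshold, e.g. 15, 20 or 30 seconds


def run_first_aspect_step(node_output, input_queue, capture_gesture, server):
    """Output one node's machine content, then wait for the user's reply."""
    print(node_output)  # stand-in for voice playback and/or on-screen text

    try:
        # Voice, text or instruction input arriving within the preset period.
        user_input = input_queue.get(timeout=PRESET_TIMEOUT_S)
        server.send_input(user_input)
    except queue.Empty:
        # No input within the preset period: fall back to gesture interaction.
        gesture = capture_gesture()  # head action and/or facial expression
        server.send_gesture(gesture)

    # The server answers with the "first instruction", which names the second
    # man-machine dialogue scenario node whose machine output is shown next.
    first_instruction = server.wait_for_instruction()
    print(first_instruction["machine_output"])
```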
With reference to the first aspect, in a possible implementation manner, the acquiring gesture information of the user includes:
if the type of the machine output content in the first man-machine conversation scenario node is the selection type content, acquiring the head action of the user, wherein the selection type content comprises a plurality of selectable items for the user to select;
and if the type of the machine output content in the first man-machine conversation scenario node is non-selective content, acquiring the facial expression of the user.
Considering that, in a scenario where voice, text or instruction interaction is not available, people often indicate their selection intent through head actions (e.g., nodding or shaking the head) when encountering selection-type content, the terminal device in this method collects the user's head action when the type of the machine output content in the first man-machine dialogue scenario node is selection-type content, so that the server can subsequently determine the user's selection intent from the head action.
When people encounter non-selective content, the answer is usually a statement rather than a single option, and the intentions reflected by different statements may differ, so a head action alone can hardly express such an intention. Compared with head actions, facial expressions are more varied and can convey different emotions. Therefore, when the type of the machine output content in the first man-machine dialogue scenario node is non-selective content, the terminal device in this method collects the user's facial expression so that the server can subsequently determine the user's statement intention from it.
Therefore, the method can conduct gesture interaction with the user in a way that matches the user's reply habits in real scenarios, further improving the user's man-machine interaction experience.
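A minimal sketch of this modality choice is given below, assuming a simple dictionary representation of a scenario node and hypothetical capture helpers; none of these names are defined by the application.

```python
# Sketch only: node["type"] and the two capture helpers are assumed names.
def acquire_gesture_info(node, capture_head_action, capture_facial_expression):
    """Choose the gesture modality from the type of the node's machine output."""
    if node["type"] == "selection":
        # Selection-type content: a nod or head shake expresses the choice.
        return {"head_action": capture_head_action()}
    # Non-selection (statement-type) content: the facial expression carries intent.
    return {"facial_expression": capture_facial_expression()}
```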
With reference to the first aspect, or any one of the foregoing possible implementation manners of the first aspect, in another possible implementation manner, the acquiring the head action of the user if the type of the machine output content in the first man-machine dialogue scenario node is selection-type content includes:
responding to a second instruction sent by the server, outputting guide type content and first prompt information, wherein the second instruction is an instruction generated by the server under the condition that the number of selectable items in the selection type content is more than two, the first prompt information is used for prompting the user to respond to the guide type content through head actions, the head actions comprise a nodding action and a head shaking action, the guide type content is determined by the server according to the selection type content, and the guide type content is used for guiding the user to respectively judge whether each selectable item in the selection type content is a selectable item which the user wants to select;
or, outputting second prompt information in response to a third instruction sent by the server, where the third instruction is an instruction generated by the server when the number of selectable items in the selective content is determined to be two, and the second prompt information is used to prompt the user to select one selectable item in the selective content through head action.
Considering that, when the type of the machine output content in the first man-machine dialogue scenario node is selection-type content with more than two options, if the terminal device directly outputs the selection-type content, the user can hardly express a selection intent with a single nod or head shake, the terminal device of the above method can output guided content converted from the selection-type content. For example, the selection-type content "Do you like A, B or C?" can be converted into the guided content "Do you like A?", "Do you like B?" and "Do you like C?", making it convenient for the user to confirm, item by item, whether each selectable item is the one the user wants to select. Therefore, the method can enhance the operability of gesture interaction between the terminal device and the user, and further improve the user's man-machine interaction experience.
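The conversion from selection-type content to guided content could look roughly like the following sketch; the question wording is an illustrative assumption.

```python
# Sketch of converting one multi-option question into per-option yes/no
# guided content, as in the "Do you like A, B or C?" example above.
def to_guided_content(options):
    return [f"Do you choose {opt}? (nod = yes, head shake = no)" for opt in options]


print(to_guided_content(["A", "B", "C"]))
# ['Do you choose A? (nod = yes, head shake = no)', ...]
```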
With reference to the first aspect, or any one of the foregoing possible implementation manners of the first aspect, in a further possible implementation manner, the acquiring, if the type of the machine output content in the first man-machine dialogue scenario node is non-selective content, a facial expression of the user includes:
and responding to a fourth instruction sent by the server, and outputting third prompt information, wherein the fourth instruction is an instruction generated by the server under the condition that the machine output content in the first man-machine conversation scenario node is determined to be non-selective content, and the third prompt information is used for prompting the user to respond to the non-selective content through facial expression.
With reference to the first aspect, or any one of the foregoing possible implementation manners of the first aspect, in a further possible implementation manner, before the outputting, in response to the first instruction sent by the server, machine output content in a second human-machine conversation scenario node, the method further includes:
responding to a fifth instruction sent by the server, outputting fourth prompt information, wherein the fifth instruction is an instruction generated when the server receives the facial expression of the user, the fourth prompt information is used for prompting the user to judge whether first expected reply content is the target reply content or not through head action, the first expected reply content is one expected reply content corresponding to the user's intention information in the first man-machine conversation scenario node, the user's intention information is determined by the server according to the facial expression of the user, and a preset corresponding relation exists between the user's intention information and a plurality of expected reply contents in the first man-machine conversation scenario node;
transmitting the head action of the user to the server;
the first instruction is sent by the server under the condition that the preset condition is met;
The preset conditions include: the head movements of the user are head shaking movements, and other expected reply contents confirmed by the user through the head gestures and sent by the terminal equipment are received, wherein the other expected reply contents are expected reply contents except the first expected reply contents in the first man-machine conversation scenario node, and the other expected reply contents are the target reply contents;
or the preset condition includes: the head action of the user is a nodding action, and the first expected reply content is the target reply content.
The terminal device determines, through the server, the intention information of the user according to the facial expression of the user, further determines the first expected reply content, and prompts the user to judge through a head action whether the first expected reply content meets the user's requirement, i.e., whether the first expected reply content is what the user wants to express. If the user nods, the terminal device can output the output content in the man-machine dialogue scenario node associated with the first expected reply content; if the user shakes the head, the terminal device can output the expected reply contents other than the first expected reply content, determine again through the user's head action the expected reply content that meets the user's requirement, and then output the output content in the associated man-machine dialogue scenario node.
Therefore, the method can enhance the accuracy of the gesture interaction between the terminal equipment and the user, and further improve the man-machine interaction experience of the user.
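A sketch of this confirmation loop follows; the read_head_action helper is a hypothetical stand-in for the camera-based recognizer and is not part of the application.

```python
# Sketch only: candidates are the expected reply contents of the current node,
# ordered so that the first expected reply content comes first.
def confirm_target_reply(candidates, read_head_action):
    """Return the expected reply the user confirms with a nod, or None."""
    for reply in candidates:
        print(f"Did you mean: {reply}? (nod = yes, head shake = no)")
        if read_head_action() == "nod":
            return reply  # this is the target reply content
        # A head shake moves on to the next expected reply content.
    return None
```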
With reference to the first aspect, or any one of the foregoing possible implementation manners of the first aspect, in a further possible implementation manner, the target reply content further includes the expected reply content closest to the input content of the user in the first man-machine dialogue scenario node; before the determining that the input content of the user is not acquired within the preset time period, the method further comprises:
determining that the input content of the user is acquired within the preset time period;
and sending the input content of the user to the server.
For the scene in which a user can interact using voice, text or instructions, in the method the terminal device, through the server, takes the expected reply content closest to the input content of the user as the target reply content, and then outputs the output content in the man-machine dialogue scenario node associated with the target reply content.
That is, the terminal device in the method can enable the user to perform man-machine interaction with the terminal device in different scenes.
With reference to the first aspect, or any one of the foregoing possible implementation manners of the first aspect, in a further possible implementation manner, before the acquiring gesture information of the user, the method further includes:
and responding to a seventh instruction sent by the server, outputting sixth prompt information, wherein the sixth prompt information is used for prompting the user to confirm, through a head action, whether to conduct human-computer interaction with the terminal device by means of the user's gesture information.
Considering that the user failing to input voice, text or instructions to the terminal device within the preset time period is not necessarily caused by the limitation of a special scene, and may be caused by other factors, for example the user needing to interrupt the human-computer interaction temporarily while forgetting to close the human-computer interaction display interface, the terminal device in the method can send prompt information to the user before acquiring the gesture information of the user, so that the user confirms the interaction mode to be used subsequently, further improving the user's experience of reading the interactive novel.
In a second aspect, an embodiment of the present application provides a human-computer interaction method based on gesture information, which is applied to a server, and the method includes:
determining that input content of a user sent by a terminal device is not acquired within a preset time period, wherein the input content of the user is reply content of the user for the machine output content in a first man-machine dialogue scenario node, the first man-machine dialogue scenario node is one of a plurality of man-machine dialogue scenario nodes in the server, and the input content of the user comprises one or more of a voice message, a text message and an operation instruction of the user;
acquiring gesture information of the user sent by the terminal device, wherein the gesture information of the user comprises head actions and/or facial expressions of the user;
determining a target reply content meeting the user requirement in a plurality of expected reply contents in the first man-machine conversation scenario node according to the gesture information of the user;
determining a second human-machine dialogue scenario node as a human-machine dialogue scenario node after the first human-machine dialogue scenario node, wherein the second human-machine dialogue scenario node is a human-machine dialogue scenario node associated with the target reply content in the plurality of human-machine dialogue scenario nodes;
And sending a first instruction to the terminal equipment, wherein the first instruction is used for instructing the terminal equipment to output the machine output content in the second man-machine conversation scenario node.
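The server-side steps of the second aspect can be sketched as follows, under an assumed in-memory layout in which each expected reply content maps to the id of its associated scenario node; infer_target_reply stands in for the gesture-based matching described above and is not a definitive implementation.

```python
# Sketch only; the node layout and infer_target_reply() are assumptions.
def handle_timeout_with_gesture(node, gesture, infer_target_reply, scenario_nodes):
    """Determine the target reply from the gesture and pick the next node."""
    target_reply = infer_target_reply(node, gesture)        # head action / expression
    next_node_id = node["expected_replies"][target_reply]   # associated scenario node
    next_node = scenario_nodes[next_node_id]
    # "First instruction": tell the terminal device what to output next.
    return {"instruction": "first", "machine_output": next_node["machine_output"]}
```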
With reference to the second aspect, in a possible implementation manner, the acquiring gesture information of the user sent by the terminal device includes:
determining the type of machine output content in the first man-machine conversation scenario node, wherein the type comprises selective content and non-selective content, and the selective content comprises a plurality of selectable items for a user to select;
if the machine output content in the first man-machine conversation scenario node is the selection type content, acquiring the head action of the user sent by the terminal equipment;
and if the machine output content in the first man-machine dialogue scenario node is the non-selective content, acquiring the facial expression of the user sent by the terminal device.
With reference to the second aspect, or any one of the foregoing possible implementation manners of the second aspect, in another possible implementation manner, the acquiring, if the machine output content in the first man-machine dialogue scenario node is the selective content, the head action of the user sent by the terminal device includes:
If the number of selectable items in the selective content is greater than two, converting the selective content into guide type content, wherein the guide type content is used for guiding the user to respectively judge whether each selectable item in the selective content is a selectable item which the user wants to select;
sending a second instruction to the terminal equipment, wherein the second instruction is used for instructing the terminal equipment to output the guiding content and first prompting information, and the first prompting information is used for prompting the user to respond to the guiding content through head action;
receiving the head action of the user sent by the terminal equipment;
if the number of selectable items in the selective content is two, a third instruction is sent to the terminal device, wherein the third instruction is used for instructing the terminal device to output second prompt information, the second prompt information is used for prompting the user to select one selectable item in the selective content through a head action, and the head action comprises a nodding action and a head-shaking action;
and receiving the head action of the user sent by the terminal equipment.
With reference to the second aspect, or any one of the foregoing possible implementation manners of the second aspect, in a further possible implementation manner, the acquiring, if the machine output content in the first man-machine dialogue scenario node is the non-selective content, the facial expression of the user sent by the terminal device includes:
If the machine output content in the first man-machine conversation scenario node is non-selective content, sending a fourth instruction to the terminal equipment, wherein the fourth instruction is used for instructing the terminal equipment to output third prompt information, and the third prompt information is used for prompting the user to respond to the non-selective content through facial expression;
and receiving the facial expression of the user sent by the terminal equipment.
With reference to the second aspect, or any one of the foregoing possible implementation manners of the second aspect, in a further possible implementation manner, the determining, according to the gesture information of the user, a target reply content that meets the user's requirement from a plurality of expected reply contents in the first man-machine dialogue scenario node includes:
if the machine output content in the first man-machine dialogue scenario node is the non-selective content, determining the intention information of the user according to the facial expression of the user, wherein a preset correspondence exists between the intention information of the user and the plurality of expected reply contents in the first man-machine dialogue scenario node;
sending a fifth instruction to the terminal device, wherein the fifth instruction is used for instructing the terminal device to output fourth prompt information, the fourth prompt information is used for prompting the user to judge through a head action whether a first expected reply content is the target reply content, and the first expected reply content is the expected reply content corresponding to the intention information of the user in the first man-machine dialogue scenario node;
Acquiring the head action of the user sent by the terminal equipment;
if the head action of the user is a nodding action, determining that the first expected reply content is the target reply content;
and if the head motion of the user is a head shaking motion, sending a sixth instruction to the terminal equipment, wherein the sixth instruction is used for indicating the terminal equipment to output fifth prompt information, the fifth prompt information is used for prompting the user to judge whether other expected reply contents are the target reply contents or not through the head motion, and the other expected reply contents are the expected reply contents except the first expected reply contents in the first man-machine conversation scenario node.
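A minimal sketch of the preset correspondence between the recognized facial expression, the user's intention information and the first expected reply content is given below; the expression labels, intent labels and per-node mapping are all illustrative assumptions.

```python
# Sketch only: labels and the per-node mapping are assumptions used to
# illustrate the preset correspondence described above.
EXPRESSION_TO_INTENT = {"smile": "positive", "frown": "negative", "neutral": "uncertain"}


def first_expected_reply(node, facial_expression):
    """Map the recognized expression to intent, then to one expected reply."""
    intent = EXPRESSION_TO_INTENT.get(facial_expression, "uncertain")
    return node["intent_to_reply"][intent]   # preset correspondence inside the node


node = {"intent_to_reply": {"positive": "I like it",
                            "negative": "I do not like it",
                            "uncertain": "I am not sure"}}
print(first_expected_reply(node, "smile"))   # -> "I like it"
```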
With reference to the second aspect, or any one of the foregoing possible implementation manners of the second aspect, in a further possible implementation manner, before the determining that the user input content sent by the terminal device is not acquired within a preset period of time, the method further includes:
determining that the user input content sent by the terminal equipment is acquired within the preset time period;
and determining one expected reply content closest to the user input content in the first man-machine conversation scenario node as the target reply content.
With reference to the second aspect, or any one of the foregoing possible implementation manners of the second aspect, in a further possible implementation manner, before the acquiring gesture information of the user sent by the terminal device, the method includes:
and sending a seventh instruction to the terminal device, wherein the seventh instruction is used for instructing the terminal device to output sixth prompt information, and the sixth prompt information is used for prompting the user to confirm, through a head action, whether to conduct human-computer interaction with the terminal device by means of the user's gesture information.
In a third aspect, an embodiment of the present application provides a human-machine interaction device, where the device includes:
the first output unit is used for outputting machine output content in a first man-machine dialogue scenario node from a server, wherein the first man-machine dialogue scenario node is one man-machine dialogue scenario node in a plurality of man-machine dialogue scenario nodes in the server;
the determining unit is used for determining that the input content of the user is not acquired within a preset time period, wherein the input content of the user is reply content of the user aiming at the machine output content in the first man-machine conversation scenario node, and the input content of the user comprises one or more of voice information, text information and operation instructions of the user;
an obtaining unit, configured to obtain gesture information of the user, where the gesture information of the user includes a head action and/or a facial expression of the user;
a sending unit, configured to send gesture information of the user to the server;
the second output unit is used for responding to the first instruction sent by the server and outputting machine output content in a second man-machine dialogue scenario node, wherein the second man-machine dialogue scenario node is a man-machine dialogue scenario node which is associated with target reply content in the man-machine dialogue scenario nodes, and the target reply content comprises expected reply content which is determined by the server according to gesture information of the user in multiple expected reply contents in the first man-machine dialogue scenario node and meets the requirement of the user.
With reference to the third aspect, in one possible implementation manner, in the acquiring gesture information of the user, the acquiring unit is specifically configured to:
if the type of the machine output content in the first man-machine conversation scenario node is the selection type content, acquiring the head action of the user, wherein the selection type content comprises a plurality of selectable items for the user to select;
And if the type of the machine output content in the first man-machine conversation scenario node is non-selective content, acquiring the facial expression of the user.
With reference to the third aspect, or any one of the foregoing possible implementation manners of the third aspect, in another possible implementation manner, in terms of acquiring the head action of the user if the type of the machine output content in the first man-machine dialogue scenario node is selection-type content, the acquiring unit is specifically configured to:
responding to a second instruction sent by the server, outputting guide type content and first prompt information, wherein the second instruction is an instruction generated by the server under the condition that the number of selectable items in the selection type content is more than two, the first prompt information is used for prompting the user to respond to the guide type content through head actions, the head actions comprise a nodding action and a head shaking action, the guide type content is determined by the server according to the selection type content, and the guide type content is used for guiding the user to respectively judge whether each selectable item in the selection type content is a selectable item which the user wants to select;
Or, outputting second prompt information in response to a third instruction sent by the server, where the third instruction is an instruction generated by the server when the number of selectable items in the selective content is determined to be two, and the second prompt information is used to prompt the user to select one selectable item in the selective content through head action.
With reference to the third aspect, or any one of the foregoing possible implementation manners of the third aspect, in a further possible implementation manner, in terms of acquiring the facial expression of the user if the type of the machine output content in the first man-machine dialogue scenario node is non-selective content, the acquiring unit is specifically configured to:
and responding to a fourth instruction sent by the server, and outputting third prompt information, wherein the fourth instruction is an instruction generated by the server under the condition that the machine output content in the first man-machine conversation scenario node is determined to be non-selective content, and the third prompt information is used for prompting the user to respond to the non-selective content through facial expression.
With reference to the third aspect, or any one of the foregoing possible implementation manners of the third aspect, in a further possible implementation manner, before the machine output content in the second man-machine dialogue scenario node is output in response to the first instruction sent by the server, the apparatus further includes a third output unit. The third output unit is configured to output fourth prompt information in response to a fifth instruction sent by the server, where the fifth instruction is an instruction generated when the server receives the facial expression of the user, the fourth prompt information is used to prompt the user to determine, through a head action, whether a first expected reply content is the target reply content, the first expected reply content is the expected reply content corresponding to the intention information of the user in the first man-machine dialogue scenario node, the intention information of the user is determined by the server according to the facial expression of the user, and a preset correspondence exists between the intention information of the user and the plurality of expected reply contents in the first man-machine dialogue scenario node; and the sending unit is specifically configured to send the head action of the user to the server.
The first instruction is sent by the server under the condition that the preset condition is met;
the preset conditions include: the head movements of the user are head shaking movements, and other expected reply contents confirmed by the user through the head gestures and sent by the terminal equipment are received, wherein the other expected reply contents are expected reply contents except the first expected reply contents in the first man-machine conversation scenario node, and the other expected reply contents are the target reply contents;
or the preset condition includes: the head action of the user is a nodding action, and the first expected reply content is the target reply content.
With reference to the third aspect, or any one of the foregoing possible implementation manners of the third aspect, in a further possible implementation manner, the target reply content further includes the expected reply content closest to the input content of the user in the first man-machine dialogue scenario node. The determining unit is further configured to determine that the input content of the user is acquired within the preset time period before the determination that the input content of the user is not acquired within the preset time period, and the sending unit is further configured to send the input content of the user to the server.
With reference to the third aspect, or any one of the foregoing possible implementation manners of the third aspect, in a further possible implementation manner, before the acquiring gesture information of the user, the apparatus further includes:
and the fourth output unit is used for outputting sixth prompt information in response to a seventh instruction sent by the server, wherein the sixth prompt information is used for prompting the user to confirm, through a head action, whether to conduct human-computer interaction with the terminal device by means of the user's gesture information.
In a fourth aspect, an embodiment of the present application provides a terminal device, where the terminal device includes a processor, a memory, and a communication interface, where the communication interface is configured to perform a receiving and/or transmitting operation under control of the processor, and the memory is configured to store a computer program, and the processor is configured to invoke the computer program to implement a method described in the first aspect or any possible implementation manner of the first aspect.
In a fifth aspect, an embodiment of the present application provides a server, where the server includes a processor, a memory, and a communication interface, where the communication interface is configured to perform a receiving and/or transmitting operation under control of the processor, and the memory is configured to store a computer program, and the processor is configured to invoke the computer program to implement the method described in the second aspect or any possible implementation manner of the second aspect.
In a sixth aspect, embodiments of the present application provide a computer readable storage medium having a computer program stored therein for implementing the method described in the first aspect or any one of the possible implementations of the first aspect when it is run on a processor.
In a seventh aspect, embodiments of the present application provide a computer readable storage medium having a computer program stored therein for implementing the method of the second aspect or any one of the possible embodiments of the second aspect when it is run on a processor.
The advantages of the related methods or apparatuses provided in the second to seventh aspects of the present application may refer to the advantages of the technical solutions of the first aspect, and are not described herein.
Drawings
The drawings that are used in the description of the embodiments of the present application will be briefly described as follows.
FIG. 1 is a schematic diagram of a human-computer interaction system according to an embodiment of the present application;
fig. 2 is a schematic flow chart of a human-computer interaction method based on gesture information according to an embodiment of the present application;
FIG. 3A is a schematic flow chart of a human-machine dialogue scenario provided by an embodiment of the present application;
FIG. 3B is a schematic view of an operation interface according to an embodiment of the present application;
FIG. 4A is a schematic view of a scenario of yet another operation interface provided by an embodiment of the present application;
FIG. 4B is a schematic view of a scenario of yet another operation interface provided by an embodiment of the present application;
FIG. 5 is a schematic view of a scenario of yet another operation interface provided by an embodiment of the present application;
fig. 6 is a schematic structural diagram of a man-machine interaction device 60 according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a terminal device 70 according to an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a schematic diagram of a human-computer interaction system according to an embodiment of the application. The architecture includes a terminal device 101 and a server 102. The number of terminal devices is not strictly limited in the present application, and the embodiment shown in fig. 1 is only an example.
The terminal device 101 is a device having both data processing and data transceiving capabilities, and is configured to present an operation interface to a user, so that the user can use an interactive product through the operation interface. The interactive product includes multiple scenario branches corresponding to one or more endings. When using the interactive product on the terminal device 101, the user can place himself or herself in the virtual world and virtual character setting of the interactive product through man-machine interaction, and can autonomously decide how certain plots unfold. The man-machine interaction mode can be voice interaction, text interaction, instruction interaction or gesture interaction. Optionally, the interactive product may be an interactive novel, an interactive game or an interactive picture book.
Server 102 is a device having computing power and storage space. Optionally, the storage space may store one or more human-machine dialogue scripts corresponding to the interactive product. The human-computer dialogue scenario comprises a plurality of human-computer dialogue scenario nodes, each human-computer dialogue scenario node in the plurality of human-computer dialogue scenario nodes comprises machine output content and a plurality of expected reply content, each item of expected reply content is associated with one human-computer dialogue scenario node, the machine output content comprises machine dialogue voice and/or machine dialogue text, and the expected reply content comprises a preset reply to the machine output content.
In the process of man-machine interaction, the terminal device 101 may receive the machine output content in the first scenario node sent by the server 102, and display the machine output content to the user through the operation interface. The terminal device 101 sends the acquired input content (one or more of a voice message, a text message and an operation instruction) or gesture information of the user to the server 102, and the server 102 may determine, according to the input content or gesture information of the user, an expected reply content of the output content of the machine, which meets the user requirement, so as to determine a second man-machine conversation scenario node. Further, the terminal device 101 may output the machine output content in the second human-machine conversation scenario node under the direction of the server 102.
In an alternative embodiment, the terminal device 101 performs voice interaction, text interaction or instruction interaction with the user by default, and may continue to perform gesture interaction with the user when the terminal device 101 does not acquire the input content (such as one or more of a voice message, a text message and an operation instruction) of the user within a preset period of time.
In an alternative embodiment, the user may set itself to change the way in which the user interacts with the terminal device 101.
Alternatively, the terminal device 101 may be a stand-alone device such as a handheld terminal, a desktop terminal, or a wearable device, or may be a component (e.g., a chip or an integrated circuit) included in the stand-alone device, and when the terminal device 101 is a handheld terminal, it may be a mobile phone, a tablet computer, a computer (e.g., a notebook computer, a palm computer, etc.), or the like.
Alternatively, the server 102 may be a physical device such as a server or a host, or may be a virtual device such as a virtual machine or a container. Alternatively, the server 102 may be in the cloud, for example a single server or a server cluster formed by multiple servers in the cloud, or may be a local device, for example a single server or a server cluster formed by multiple servers locally.
Alternatively, the terminal device 101 and the server 102 may be directly connected through wired communication, such as a tangible medium such as a metal wire or an optical fiber, or may be indirectly connected through wireless communication, such as an intangible medium such as a wireless lan or bluetooth.
Alternatively, the terminal device 101 may perform some or all of the operations performed by the server 102 described above. That is, some or all of the operations performed by the server 102 may be performed by the terminal device 101 instead thereof.
In the architecture shown in fig. 1, the terminal device can facilitate the user to read the interactive product in a man-machine interaction manner, so that the interest and immersion of the user in experiencing the interactive product are enhanced.
Referring to fig. 2, fig. 2 is a flow chart of a human-computer interaction method based on gesture information according to an embodiment of the present application, and the method may be implemented based on the architecture shown in fig. 1, and includes, but is not limited to, the following steps:
step S201: the terminal device outputs machine output content in the first ergonomic scenario node from the server.
The terminal device may be the terminal device 101 in the embodiment described in fig. 1, or may be another device that can perform the same operations as the terminal device. The machine output content in the first man-machine dialogue scenario node is sent by the server to the terminal device. The server may be the server 102 in the embodiment depicted in fig. 1, or may be another device capable of performing the same operations as the server.
In an alternative embodiment, the server may store a human-machine conversation scenario corresponding to the interactive product, the human-machine conversation scenario including a plurality of human-machine conversation scenario nodes, each of the plurality of human-machine conversation scenario nodes including machine output content and a plurality of expected reply content. The machine output content includes machine dialogue speech and/or machine dialogue text, and the expected reply content includes a preset reply to the machine output content. It should be noted that the man-machine conversation scenario is a scenario generated based on a scenario of an interactive product, and each man-machine conversation scenario node in the man-machine conversation scenario is at least associated with one other man-machine conversation scenario node. Different interactive products correspond to different human-machine conversation scripts.
For easy understanding, please refer to fig. 3A, which is a schematic flow chart of a man-machine dialogue scenario according to an embodiment of the present application. The man-machine dialogue scenario comprises man-machine dialogue scenario nodes 301 to 306, and the order of the man-machine dialogue scenario nodes is determined by the plot of the interactive product. Each item of expected reply content in each man-machine dialogue scenario node may be regarded as a jump condition for jumping from the current man-machine dialogue scenario node to the next man-machine dialogue scenario node. For easy understanding, please refer to Table 1, which is a man-machine dialogue scenario list provided in an embodiment of the present application.
Table 1 man-machine conversation scenario list
Specifically, in the process of man-machine interaction, the terminal device may receive machine output content in a first man-machine dialogue scenario node sent by the server, where the first man-machine dialogue scenario node is one of a plurality of man-machine dialogue scenario nodes. For example, the first human-machine conversation scenario node may be one of human-machine conversation scenario nodes 301-306 in fig. 3A.
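The scenario structure described above might be held in memory roughly as in the following sketch; the node ids follow fig. 3A, while the dataclass layout and the node texts are paraphrased from the fig. 3B example and are otherwise assumptions.

```python
# Sketch only: the dataclass layout and the node texts are assumptions.
from dataclasses import dataclass, field


@dataclass
class ScenarioNode:
    node_id: int
    machine_output: str                     # machine dialogue voice and/or text
    expected_replies: dict = field(default_factory=dict)  # reply -> next node id


scenario = {
    301: ScenarioNode(301,
                      "Two new shops have opened downstairs, a hot-pot restaurant "
                      "and a bakery. Which one would you prefer to try?",
                      {"hot pot": 302, "bakery": 303}),
    302: ScenarioNode(302, "You walk into the hot-pot restaurant..."),
    303: ScenarioNode(303, "You walk into the bakery..."),
}
```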
The terminal device may have built-in control software, which may be a software module consisting of program code, such as an application (APP), for presenting an operation interface to the user and for information interaction with the server via an application programming interface (API). The terminal device may default to preferentially performing voice interaction, text interaction or instruction interaction with the user.
For ease of understanding, in the following the first man-machine dialogue scenario node is described by taking the man-machine dialogue scenario node 301 as an example, where the man-machine dialogue scenario node 301 is one man-machine dialogue scenario node in the man-machine dialogue scenario corresponding to the "XXX interactive product". Referring to fig. 3B, fig. 3B is a schematic view of an operation interface according to an embodiment of the present application. The terminal device may output the machine output content 307 (i.e., "You are a foodie. Two new shops have opened downstairs, a hot-pot restaurant and a bakery. Which one would you prefer to try?") on the operation interface 315. For the machine output content 307, the user may reply by clicking the voice icon 316 and inputting a voice message 317 (whose voice content is "I select the bakery") to the terminal device. Accordingly, the operation interface 315 may display the duration of the voice message 317.
Optionally, the user may also reply by entering a text message 319 (i.e., "I select the bakery") into the terminal device via the text entry field 318.
Alternatively, the user may also input an operation instruction to the terminal device by selecting one of two selectable items of the hover frame 320 displayed by the terminal device as a reply. The two alternatives may be the expected reply content 308 (i.e., "hot pot") and the expected reply content 309 (i.e., "bakery") corresponding to the machine output content 307, respectively.
Optionally, the terminal device may play the machine output content 307 by voice, or may display the machine output content 307 as text accompanied by a voice broadcast. The user can customize the manner in which the terminal device outputs the machine output content 307.
Step S202: the terminal equipment determines that the input content of the user is not acquired within a preset time period.
For special scenes (such as a crowded subway or a quiet library) in which it is difficult for the user to interact through voice, text or instructions, the terminal device can interact with the user through gestures, so that the user can continue to experience man-machine interaction.
Specifically, the terminal device executes step S203 after determining that the input content of the user is not acquired within the preset period of time, where the input content of the user is a reply content for the machine output content in the first ergonomic scenario node, and the input content of the user may be one or more of a voice message, a text message, and an operation instruction. The preset time period is an empirically set time threshold and may be a small time threshold, such as 15 seconds, 20 seconds, or 30 seconds.
In an alternative embodiment, for a scenario in which the user can interact using voice, text or instructions, after the terminal device obtains the input content of the user within the preset time period, the terminal device sends the input content of the user to the server, and the server can match the input content of the user against the expected reply contents in the first man-machine dialogue scenario node and determine the man-machine dialogue scenario node associated with the expected reply content that is semantically closest to the input content of the user as the next man-machine dialogue scenario node. Further, the server can instruct the terminal device to output the machine output content in the next man-machine dialogue scenario node.
It should be noted that the matching between the input content of the user and the expected reply content may be semantic matching. By way of example, with reference to fig. 3A, fig. 3B and Table 1, the terminal device sends the voice message 317 sent by the user to the server, and the server converts the voice message 317 into the text "I select the bakery" using speech recognition technology. The server may then use a semantic recognition algorithm to semantically match the text "I select the bakery" with the expected reply content 308 (i.e., "hot pot") and the expected reply content 309 (i.e., "bakery") corresponding to the machine output content 307. The server determines the man-machine dialogue scenario node 303 associated with the expected reply content 309 (i.e., "bakery"), which is semantically closest to the text "I select the bakery", as the next man-machine dialogue scenario node. Further, the terminal device may output the machine output content 312 in the man-machine dialogue scenario node 303 under the direction of the server.
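A compact sketch of this matching step is given below; difflib string similarity is only a stand-in for the speech-recognition and semantic-recognition components the description refers to, and the result is illustrative.

```python
# Sketch only: string similarity replaces the semantic recognition algorithm.
import difflib


def closest_expected_reply(user_text, expected_replies):
    """Return the expected reply content whose wording is closest to the input."""
    return max(expected_replies,
               key=lambda reply: difflib.SequenceMatcher(
                   None, user_text.lower(), reply.lower()).ratio())


print(closest_expected_reply("I select the bakery", ["hot pot", "bakery"]))
# -> "bakery", whose associated node (303 in fig. 3A) is output next
```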
Step S203: the terminal equipment acquires gesture information of a user.
After determining that the input content of the user is not acquired within the preset time period (i.e., the terminal device does not send the input content of the user to the server within the preset time period), the terminal device may output prompt information under the instruction of the server, where the prompt information is used to prompt the user to perform gesture interaction with the terminal device, so as to acquire the gesture information of the user. The gesture information of the user includes head actions and/or facial expressions of the user. The gesture information of the user is the user's reply to the machine output content in the first man-machine dialogue scenario node.
The terminal device is provided with an image capture unit that can collect the user's gesture information while the user uses the terminal device.
Case one: the type of the machine output content in the first man-machine dialogue scenario node is selection-type content, and the terminal device can collect the head action of the user. The selection-type content includes a plurality of selectable items for the user to select.
In an alternative embodiment, if the server does not receive the input content of the user sent by the terminal device and determines that the number of selectable items in the selection-type content is greater than two, a second instruction may be sent to the terminal device. In response to the second instruction, the terminal device may output the guided content and the first prompt information.
It should be noted that the first prompt information is used for prompting the user to respond to the guided content through head actions, where the head actions include a nodding action and a head-shaking action; the guided content is determined by the server according to the selection-type content and is used for guiding the user to judge, one by one, whether each selectable item in the selection-type content is the selectable item the user wants to select. Optionally, a selectable item of the selection-type content may be its corresponding expected reply content.
For easy understanding, please refer to fig. 4A, which is a schematic view of still another operation interface according to an embodiment of the present application. The operation interface 401 of the terminal device displays the machine output content 402 in the first man-machine dialogue scenario node (i.e., "Please select the identity you want to take on from among swordsman, escort, merchant and constable"). The server does not acquire, through the terminal device, any input content from the user replying to the machine output content 402 within the preset time period. Since the server determines that the machine output content 402 is selection-type content and the number of selectable items (i.e., "swordsman", "escort", "merchant" and "constable") is greater than two, the server converts the machine output content 402 into guided content including question 403 (i.e., "Do you want to be a swordsman?"), and then sends a second instruction to the terminal device. Under the instruction of the second instruction, the terminal device may output the first prompt information 404, namely "Please reply to the following guided content with head actions while facing the operation interface; a nod means yes and a head shake means no." The terminal device may further sequentially output, under the direction of the second instruction, a plurality of questions about the selectable items in the guided content, so as to guide the user to determine whether each selectable item is the one the user wants to select.
Alternatively, for the case where the selection-type content contains a large number of selectable items, the selectable items may first be grouped. The guided content may then guide the user to determine the target group containing the selectable item the user wants, and subsequently guide the user to judge whether each selectable item in the target group is the one the user wants to select.
For example, the machine output content in the first man-machine dialogue scenario node is "Select your favorite fruit from apple, pear, watermelon, grape and orange." The terminal device may first output the guided content "Is your favorite fruit one of apple, pear and watermelon?". If the user nods, the terminal device may sequentially output the guided content "Is your favorite fruit an apple?", "Is your favorite fruit a pear?" and "Is your favorite fruit a watermelon?" until the user makes a nodding action; if the user shakes the head, the terminal device outputs the guided content "Is your favorite fruit one of grape and orange?".
In this way, the embodiment of the application can save the time the user spends judging whether each selectable item is the one the user wants to select, and further enhance the reading experience when the user performs gesture interaction with the terminal device.
In an alternative embodiment, if the server does not receive the input content of the user sent by the terminal device and determines that the number of selectable items in the selection type content is two, the server may send a third instruction to the terminal device. In response to the third instruction, the terminal device outputs second prompt information, where the second prompt information is used for prompting the user to select one selectable item in the selection type content through a head action.
For example, the machine output content in the first man-machine conversation scenario node is "Do you like the sea now?". The server does not acquire, through the terminal device, any input content from the user replying to the machine output content within a preset period of time. Since the server determines that the machine output content is selection type content and the number of selectable items (i.e., "like" and "dislike") is two, the server sends a third instruction to the terminal device. Under the instruction of the third instruction, the terminal device may output the second prompt information, namely, "Please select a selectable item in the foregoing output content with a head action while facing the operation interface; a nod represents affirmative, and a shake represents negative."
As another example, the machine output content in the first man-machine conversation scenario node is "Do you like cloudy days or sunny days?". The server does not acquire, through the terminal device, any input content from the user replying to the machine output content within a preset period of time. Since the server determines that the machine output content is selection type content and the number of selectable items (i.e., "cloudy days" and "sunny days") is two, the server sends a third instruction to the terminal device. Under the instruction of the third instruction, the terminal device may output the second prompt information "Please select a selectable item in the foregoing output content with a head action while facing the operation interface; a nod represents selecting a cloudy day, and a shake represents selecting a sunny day."
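The two-option case therefore reduces to a direct mapping from head action to selectable item. A minimal sketch, assuming the convention of the examples above (a nod selects the first item, a shake selects the second); the names are illustrative only.

```python
# Map a nod/shake to one of exactly two selectable items, as in the
# cloudy/sunny example above. This convention is an assumption for
# illustration, not a fixed rule of the embodiment.

from typing import Tuple


def resolve_two_option_choice(head_action: str, options: Tuple[str, str]) -> str:
    """Return the selectable item indicated by the head action."""
    if head_action == "nod":
        return options[0]
    if head_action == "shake":
        return options[1]
    raise ValueError(f"unrecognized head action: {head_action}")


print(resolve_two_option_choice("nod", ("cloudy day", "sunny day")))    # cloudy day
print(resolve_two_option_choice("shake", ("cloudy day", "sunny day")))  # sunny day
```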
It should be noted that, in scenarios where voice, text, or instruction interaction cannot be used, people often indicate their selection intent through head movements (e.g., nodding or shaking the head) when encountering selection type content. Therefore, when the type of the machine output content in the first man-machine conversation scenario node is selection type content, the terminal device in the embodiment of the application collects the head action of the user, so that the server can subsequently judge the selection intention of the user according to the head action.
The second case: the type of the machine output content in the first man-machine conversation scenario node is non-selection type content, and the terminal device may acquire the facial expression of the user. Non-selection type content may be understood as statement type content that does not directly provide selectable items for the user to choose from. The facial expression of the user may include the user's expression and/or facial movements.
Specifically, if the server does not receive the input content of the user sent by the terminal device and determines that the type of the machine output content in the first man-machine conversation scenario node is non-selection type content, the server may send a fourth instruction to the terminal device. In response to the fourth instruction, the terminal device outputs third prompt information, where the third prompt information is used for prompting the user to respond to the non-selection type content through a facial expression.
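Across the first and second cases, the instruction the server sends when no user input arrives thus depends on the content type and on the number of selectable items. A minimal server-side sketch under these rules; the instruction names and data shapes are assumptions for illustration, not defined by this embodiment.

```python
# Choose which instruction to send to the terminal device when no user input
# is received within the preset period, based on the machine output content.

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MachineOutput:
    text: str
    selectable_items: Optional[List[str]] = None  # None => non-selection content


def choose_instruction(content: MachineOutput) -> str:
    if content.selectable_items is None:
        return "FOURTH_INSTRUCTION"   # prompt a facial-expression reply
    if len(content.selectable_items) == 2:
        return "THIRD_INSTRUCTION"    # prompt a direct nod/shake choice
    return "SECOND_INSTRUCTION"       # output guidance content + first prompt


print(choose_instruction(MachineOutput("Do you like the sea now?", ["like", "dislike"])))
print(choose_instruction(MachineOutput("What would you say to her?")))
```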
For ease of understanding, please refer to fig. 4B, which is a schematic view of still another operation interface according to an embodiment of the present application. The operation interface 405 of the terminal device displays the machine output content 406 in the first man-machine conversation scenario node (i.e., "You have arranged with your friend Xiao Li to watch a movie together at three in the afternoon. Facing Xiao Li, who has arrived half an hour late, what will you say to her?"). The server does not acquire, through the terminal device, any input content from the user replying to the machine output content 406 within a preset period of time. Since the server determines that the machine output content 406 is non-selection type content, the server sends a fourth instruction to the terminal device. Under the instruction of the fourth instruction, the terminal device may output the third prompt information 407, that is, "Please face the operation interface and reply to the foregoing output content through a facial expression (such as smiling, frowning, etc.)."
It should be noted that, when people encounter non-selection type content, their answer is usually a statement sentence rather than a single option, and the intentions reflected by different statement sentences may differ, so it is difficult for people to indicate such an intention through head movements alone. Compared with head movements, facial expressions can convey a richer and more varied range of emotions. Therefore, when the type of the machine output content in the first man-machine conversation scenario node is non-selection type content, the terminal device in the embodiment of the application collects the facial expression of the user so that the server can subsequently judge the user's intention according to the facial expression.
Therefore, the embodiment of the application can perform gesture interaction with the user based on the reply habit of the user in the actual scene, and further improve the man-machine interaction experience of the user.
In an alternative embodiment, the terminal device may further receive a seventh instruction sent by the server between step S202 and step S203, i.e., before the terminal device acquires the gesture information of the user. In response to the seventh instruction, the terminal device outputs sixth prompt information. The sixth prompt information is used for prompting the user to determine, through a head action, whether to perform human-computer interaction with the terminal device through the gesture information of the user.
For example, the sixth prompt information may be "No user input has been detected within 15 seconds. Please face the operation interface and confirm with a head action whether to continue reading in gesture interaction mode; a nod represents affirmative, and a shake represents negative."
That is, the fact that the user does not input voice, text, or instructions to the terminal device within the preset time period is not necessarily caused by a special scenario; it may also be caused by other factors, for example, the user temporarily interrupting the reading of the interactive novel and forgetting to close the man-machine interaction display interface. In the embodiment of the application, the terminal device can therefore send prompt information to the user before acquiring the gesture information of the user, so that the user confirms the interaction mode for subsequently reading the novel, thereby further improving the man-machine interaction experience of the user.
In an alternative embodiment, the terminal device may further receive an eighth instruction sent by the server between step S202 and step S203, i.e., before the terminal device acquires the gesture information of the user. In response to the eighth instruction, the terminal device may output seventh prompt information. The seventh prompt information is used for providing personalized reply content for the user and prompting the user to determine, through a head action, whether the personalized reply content meets the user requirement. It should be noted that the personalized reply content is reply content that matches the user's preferences, and may be reply content generated by the server, according to the stored historical reply records of the user, for the machine output content in the first man-machine conversation scenario node.
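A loose sketch of how such personalized reply content might be derived from the stored historical reply records; the frequency-based scoring rule below is purely an assumption and is not specified by this embodiment.

```python
# Prefer the expected reply that the user has chosen most often in the past.
# This scoring rule is an illustrative assumption, not the patent's method.

from collections import Counter
from typing import List


def personalized_reply(expected_replies: List[str], history: List[str]) -> str:
    """Pick the expected reply with the highest count in the user's history."""
    counts = Counter(history)
    return max(expected_replies, key=lambda reply: counts.get(reply, 0))


history = ["Why are you so late", "Why are you so late", "Let's buy tickets again"]
print(personalized_reply(
    ["It doesn't matter, we can still watch the movie",
     "Why are you so late",
     "Let's buy tickets again"],
    history,
))  # -> "Why are you so late"
```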
Step S204: and the terminal equipment sends the gesture information of the user to the server.
Specifically, after receiving the gesture information of the user sent by the terminal device, the server may determine, according to the gesture information of the user, a target reply content that meets the user requirement from the multiple items of expected reply content in the first man-machine conversation scenario node. The server then determines a second man-machine conversation scenario node as the man-machine conversation scenario node subsequent to the first man-machine conversation scenario node. The second man-machine conversation scenario node is the man-machine conversation scenario node associated with the target reply content among the plurality of man-machine conversation scenario nodes in the man-machine conversation scenario.
For ease of understanding, how the server determines the target reply content is described below.
In an alternative embodiment, the type of the machine output content in the first man-machine conversation scenario node is selection type content. As described in the first case in step S203, the server may acquire the head action sent by the terminal device. Further, the server takes the selectable item corresponding to the nodding action as the target selectable item, and performs semantic matching between the target selectable item and the expected reply content in the first man-machine conversation scenario node. Finally, the server may take the item of expected reply content whose semantics are closest to the target selectable item as the target reply content, and further determine the man-machine conversation scenario node associated with the target reply content as the second man-machine conversation scenario node, that is, the man-machine conversation scenario node after the first man-machine conversation scenario node.
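A small illustrative sketch of this matching step; the bag-of-words overlap score below merely stands in for whatever semantic matching the server actually performs, and all names are assumptions.

```python
# Pick the expected reply content whose wording is closest to the selectable
# item the user confirmed with a nod. A real system would use a semantic
# model; lexical overlap is used here only to keep the sketch self-contained.

from typing import List


def similarity(a: str, b: str) -> float:
    """Crude lexical overlap between two strings (Jaccard on lowercase tokens)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)


def pick_target_reply(target_item: str, expected_replies: List[str]) -> str:
    """Return the expected reply content closest to the target selectable item."""
    return max(expected_replies, key=lambda reply: similarity(target_item, reply))


expected = ["I want to be a businessman", "I want to be a dart"]
print(pick_target_reply("businessman", expected))  # -> "I want to be a businessman"
```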
It should be noted that the plurality of selectable items in the selection type content may respectively correspond to items of expected reply content in the first man-machine conversation scenario node.
In an alternative embodiment, the type of the machine output content in the first man-machine conversation scenario node is non-selection type content. As described in the second case in step S203, the server may acquire the facial expression transmitted by the terminal device. It should be noted that the facial expression of the user has a preset first correspondence with the intention information of the user, and the multiple items of expected reply content in each man-machine conversation scenario node have a preset second correspondence with the intention information of the user.
First, the server may determine the intention information of the user according to the facial expression of the user. For ease of understanding, please refer to table 2, which is a list of the correspondence between facial expressions and intention information provided by an embodiment of the present application.
Table 2 List of the correspondence between facial expressions and intention information

Facial expression | Intention information
Smiling, raised eyebrows, … | Positive
Frowning, pouting, … | Negative
Eyes closed, mouth open, … | Neutral
As can be seen from table 2, the intention information corresponding to the facial expressions of the user can be classified into three categories, namely "positive", "negative", and "neutral". Each category of intention information may correspond to one or more facial expressions.
Further, the server may determine, according to the intention information of the user, the item of expected reply content corresponding to the intention information of the user in the first man-machine conversation scenario node, and take this expected reply content as the target reply content. For ease of understanding, please refer to table 3, which is a list of the correspondence between intention information and expected reply content provided by an embodiment of the present application.
Table 3 List of the correspondence between intention information and expected reply content

Intention information | Expected reply content
Positive | It doesn't matter, we can still watch the movie
Negative | Why are you so late
Neutral | Let's buy tickets again
For example, the machine output content in the first man-machine conversation scenario node is "You have arranged with your friend Xiao Li to watch a movie together at three in the afternoon. Facing Xiao Li, who has arrived half an hour late, what will you say to her?". If the server determines that the facial expression of the user is "smiling" or "raised eyebrows", it can determine from table 2 that the intention information of the user is "positive", and then determine from table 3 that the target reply content is "It doesn't matter, we can still watch the movie". If the server determines that the facial expression of the user is "frowning" or "pouting", it can determine from table 2 that the intention information of the user is "negative", and then determine from table 3 that the target reply content is "Why are you so late". If the server determines that the facial expression of the user is "eyes closed" or "mouth open", it can determine from table 2 that the intention information of the user is "neutral", and then determine from table 3 that the target reply content is "Let's buy tickets again". After that, the server determines the man-machine conversation scenario node associated with the target reply content as the second man-machine conversation scenario node, that is, the man-machine conversation scenario node after the first man-machine conversation scenario node.
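The two-step lookup through table 2 and table 3 can be sketched as follows; the dictionary contents are reconstructed from the movie example above and are illustrative only.

```python
# Map a facial expression to intention information (table 2), then map that
# intention to an expected reply content of the current scenario node (table 3).
# Both dictionaries are example data, not fixed by the patent.

EXPRESSION_TO_INTENTION = {
    "smiling": "positive",
    "raised eyebrows": "positive",
    "frowning": "negative",
    "pouting": "negative",
    "eyes closed": "neutral",
    "mouth open": "neutral",
}

INTENTION_TO_REPLY = {
    "positive": "It doesn't matter, we can still watch the movie",
    "negative": "Why are you so late",
    "neutral": "Let's buy tickets again",
}


def target_reply_from_expression(expression: str) -> str:
    """Resolve the target reply content via the two preset correspondences."""
    intention = EXPRESSION_TO_INTENTION[expression]
    return INTENTION_TO_REPLY[intention]


print(target_reply_from_expression("frowning"))  # -> "Why are you so late"
```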
Optionally, before determining the target reply content, the server may first take the item of expected reply content corresponding to the facial expression of the user as the first expected reply content and send a fifth instruction to the terminal device. After receiving the fifth instruction, the terminal device may output fourth prompt information, where the fourth prompt information is used for prompting the user to judge, through a head action, whether the first expected reply content is the target reply content.
The terminal device may then send the head action of the user to the server. If the head action of the user is a nodding action, the server takes the first expected reply content as the target reply content; if the head action of the user is a head shaking action, the server may send a sixth instruction to the terminal device. After receiving the sixth instruction, the terminal device may output fifth prompt information, where the fifth prompt information is used for prompting the user to judge, through head actions, whether the other expected reply contents are the target reply content. It should be noted that the other expected reply contents are the items of expected reply content in the first man-machine conversation scenario node other than the first expected reply content.
For ease of understanding, please refer to fig. 5, which is a schematic view of still another operation interface according to an embodiment of the present application. The first expected reply content is "Why are you so late", and the other expected reply content is "It doesn't matter, we can still watch the movie". The terminal device may display fourth prompt information 502 in the operation interface 501, namely, "Based on your facial expression, it is determined that you want to say: 'Why are you so late'. Please face the operation interface and confirm with a head action whether this reply is correct." The user makes a head shaking action with respect to the fourth prompt information 502, and the terminal device may then display fifth prompt information 503 in the operation interface 501, namely, "Please face the operation interface and confirm with a head action whether you want to say: 'It doesn't matter, we can still watch the movie'."
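A sketch of this confirmation loop, assuming a simple callback that returns the user's head action for each prompt; the prompt texts and names are illustrative rather than prescribed by the embodiment.

```python
# Ask the user to confirm candidate replies one by one: a nod accepts the
# current candidate, a shake moves on to the next expected reply content.

from typing import Callable, List, Optional


def confirm_reply(candidates: List[str],
                  get_head_action: Callable[[str], str]) -> Optional[str]:
    """Return the first candidate confirmed by a nod, or None if all are rejected."""
    for candidate in candidates:
        prompt = f"Do you want to say: '{candidate}'? (nod = yes, shake = no)"
        if get_head_action(prompt) == "nod":
            return candidate
    return None


# Simulated user who rejects the first candidate and accepts the second.
answers = iter(["shake", "nod"])
picked = confirm_reply(
    ["Why are you so late", "It doesn't matter, we can still watch the movie"],
    get_head_action=lambda prompt: next(answers),
)
print(picked)  # -> "It doesn't matter, we can still watch the movie"
```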
Therefore, the embodiment of the application can enhance the accuracy of the gesture interaction between the terminal equipment and the user, and further improve the man-machine interaction experience of the user.
Step S205: and the terminal equipment responds to the first instruction sent by the server and outputs the machine output content in the second man-machine conversation scenario node.
Specifically, after determining, according to the gesture information of the user, the second man-machine conversation scenario node associated with the target reply content, the server may send a first instruction to the terminal device. After receiving the first instruction, the terminal device can output the machine output content in the second man-machine conversation scenario node.
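From the terminal device's point of view, steps S203 to S205 thus form a short request/response cycle. A minimal end-to-end sketch with a stand-in server follows; the message shapes, node identifiers, and class names are assumptions for illustration only.

```python
# Terminal-side cycle: send the captured gesture to the server (step S204),
# receive the first instruction, and output the machine output content of the
# second man-machine conversation scenario node it names (step S205).

from dataclasses import dataclass
from typing import Dict


@dataclass
class FirstInstruction:
    next_node_id: str


class FakeServer:
    """Stand-in for the real server; maps gestures to the next scenario node."""

    def __init__(self, nodes: Dict[str, str]):
        self.nodes = nodes

    def handle_gesture(self, gesture: Dict[str, str]) -> FirstInstruction:
        node_id = "node_positive" if gesture.get("head") == "nod" else "node_negative"
        return FirstInstruction(next_node_id=node_id)

    def machine_output(self, node_id: str) -> str:
        return self.nodes[node_id]


def terminal_step(server: FakeServer, gesture: Dict[str, str]) -> str:
    instruction = server.handle_gesture(gesture)             # step S204
    return server.machine_output(instruction.next_node_id)   # step S205


server = FakeServer({"node_positive": "Great, the story continues...",
                     "node_negative": "Alright, let's try another path..."})
print(terminal_step(server, {"head": "nod"}))
```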
Alternatively, the terminal device in the embodiment described in fig. 2 may perform some or all of the operations performed by the server instead of the server.
In summary, the embodiment of the application enables the user to continue interacting with the terminal device through gestures in scenarios where voice, text, or instruction interaction cannot be used, so as to keep reading the interactive novel, and can improve the user's reading experience when reading the interactive novel.
The foregoing details of the method of embodiments of the present application are provided for the purpose of better implementing the foregoing aspects of embodiments of the present application, and accordingly, the following provides an apparatus of embodiments of the present application.
It can be understood that, in order to implement the functions in the above method embodiments, the apparatus provided in the embodiments of the present application, for example, a man-machine interaction apparatus, includes a hardware structure, a software module, or a combination of a hardware structure and a software structure, which perform the respective functions.
Those of skill in the art will readily appreciate that the elements and steps of the various examples described in connection with the embodiments disclosed herein may be implemented as hardware or a combination of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. The skilled person may implement the foregoing method embodiments in different usage scenarios using different device implementations, which should not be considered to be outside the scope of the embodiments of the present application.
The embodiment of the application can divide the functional modules of the device. For example, each functional module may be divided corresponding to each function, or two or more functions may be integrated in one functional module. The integrated modules may be implemented in hardware or in software functional modules. It should be noted that, in the embodiment of the present application, the division of the modules is schematic, which is merely a logic function division, and other division manners may be implemented in actual implementation. For example, taking the case of dividing the respective functional modules of the apparatus by integration as an example, the present application exemplifies several possible processing apparatuses.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a man-machine interaction device 60 according to an embodiment of the present application, where the man-machine interaction device 60 may be the terminal device 101 shown in fig. 1 or a device in the terminal device 101; the human-computer interaction device 60 may include a first output unit 601, a determining unit 602, an obtaining unit 603, a sending unit 604, and a second output unit 605, where the respective units are connected through a bus, and a detailed description of the respective units is as follows:
a first output unit 601, configured to output machine output content from a first human-machine dialogue scenario node in a server, where the first human-machine dialogue scenario node is one of a plurality of human-machine dialogue scenario nodes in the server;
a determining unit 602, configured to determine that input content of a user is not acquired within a preset period of time, where the input content of the user is reply content of the user to machine output content in the first man-machine conversation scenario node, and the input content of the user includes one or more of a voice message, a text message, and an operation instruction of the user;
an obtaining unit 603, configured to obtain gesture information of the user, where the gesture information of the user includes a head action and/or a facial expression of the user;
A transmitting unit 604, configured to transmit gesture information of the user to the server;
and a second output unit 605, configured to output, in response to the first instruction sent by the server, machine output content in a second man-machine dialogue scenario node, where the second man-machine dialogue scenario node is a man-machine dialogue scenario node associated with target reply content in the plurality of man-machine dialogue scenario nodes, and the target reply content includes an expected reply content that is determined by the server according to gesture information of the user in multiple expected reply contents in the first man-machine dialogue scenario node and meets a user requirement.
In one possible implementation manner, in the acquiring the gesture information of the user, the acquiring unit 603 is specifically configured to:
if the type of the machine output content in the first man-machine conversation scenario node is the selection type content, acquiring the head action of the user, wherein the selection type content comprises a plurality of selectable items for the user to select;
and if the type of the machine output content in the first man-machine conversation scenario node is non-selective content, acquiring the facial expression of the user.
In another possible implementation manner, in the aspect of acquiring the head action of the user if the type of the machine output content in the first man-machine conversation scenario node is selection type content, the acquiring unit 603 is specifically configured to:
responding to a second instruction sent by the server, outputting guide type content and first prompt information, wherein the second instruction is an instruction generated by the server under the condition that the number of selectable items in the selection type content is more than two, the first prompt information is used for prompting the user to respond to the guide type content through head actions, the head actions comprise a nodding action and a head shaking action, the guide type content is determined by the server according to the selection type content, and the guide type content is used for guiding the user to respectively judge whether each selectable item in the selection type content is a selectable item which the user wants to select;
or, outputting second prompt information in response to a third instruction sent by the server, where the third instruction is an instruction generated by the server when the number of selectable items in the selective content is determined to be two, and the second prompt information is used to prompt the user to select one selectable item in the selective content through head action.
In yet another possible implementation manner, in the aspect of acquiring the facial expression of the user if the type of the machine output content in the first man-machine conversation scenario node is non-selective content, the acquiring unit 603 is specifically configured to:
and responding to a fourth instruction sent by the server, and outputting third prompt information, wherein the fourth instruction is an instruction generated by the server under the condition that the machine output content in the first man-machine conversation scenario node is determined to be non-selective content, and the third prompt information is used for prompting the user to respond to the non-selective content through facial expression.
In yet another possible implementation, the human-machine interaction device 60 further includes a third output unit before the machine output content in the second human-machine dialogue scenario node is output in response to the first instruction sent by the server. The third output unit is configured to output fourth prompting information in response to a fifth instruction sent by the server, where the fifth instruction is an instruction generated when the server receives a facial expression of the user, and the fourth prompting information is configured to prompt the user to determine, through a head action, whether first expected reply content is the target reply content, where the first expected reply content is one item of expected reply content corresponding to intention information of the user in the first man-machine conversation scenario node, and the intention information of the user is determined by the server according to the facial expression of the user, where a preset correspondence exists between the intention information of the user and multiple items of expected reply content in the first man-machine conversation scenario node; the sending unit 604 is specifically configured to send the head action of the user to the server;
The first instruction is sent by the server under the condition that the preset condition is met;
the preset conditions include: the head movements of the user are head shaking movements, and other expected reply contents confirmed by the user through the head gestures and sent by the terminal equipment are received, wherein the other expected reply contents are expected reply contents except the first expected reply contents in the first man-machine conversation scenario node, and the other expected reply contents are the target reply contents;
or the preset condition includes: the head action of the user is a nodding action, and the first expected reply content is the target reply content.
In yet another possible implementation, the target reply content further includes an item of expected reply content in the first man-machine conversation scenario node that is closest to the input content of the user. The determining unit 602 is further configured to determine, before the determination that the input content of the user is not acquired within the preset period of time, that the input content of the user is acquired within the preset period of time, and the sending unit 604 is further configured to send the input content of the user to the server.
In yet another possible implementation manner, before the acquiring the gesture information of the user, the man-machine interaction device 60 further includes:
and the fourth output unit is used for responding to a seventh instruction sent by the server and outputting sixth prompt information, wherein the sixth prompt information is used for prompting the user to determine whether human-computer interaction is performed with the terminal equipment through the gesture information of the user through head action.
It should be noted that, in the embodiment of the present application, the specific implementation and the technical effect of each unit may also correspond to those described with reference to the corresponding method embodiment of fig. 2.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a terminal device 70 according to an embodiment of the present application, where the terminal device 70 includes a processor 701, a memory 702, and a communication interface 703, and the processor 701, the memory 702, and the communication interface 703 are connected to each other through a bus.
The processor 701 may be one or more central processing units (central processing unit, CPU), and in the case where the processor 701 is one CPU, the CPU may be a single-core CPU or a multi-core CPU.
Memory 702 includes, but is not limited to, random access memory (random access memory, RAM), read-only memory (ROM), erasable programmable read-only memory (erasable programmable read only memory, EPROM), or portable read-only memory (compact disc read-only memory, CD-ROM), and memory 702 is used for associated computer programs and data.
The communication interface 703 is used to receive and output data. The communication interface 703 may receive gesture information of a user and transmit input contents of the user to the processor 701; the communication interface may also output machine output content in the human-machine conversation scenario node to the user.
The processor 701 is configured to read the computer program code stored in the memory 702, and perform the following operations:
outputting machine output content from a first human-machine dialogue scenario node of a server, wherein the first human-machine dialogue scenario node is one human-machine dialogue scenario node of a plurality of human-machine dialogue scenario nodes in the server;
determining that input content of a user is not acquired within a preset time period, wherein the input content of the user is reply content of the user aiming at machine output content in the first man-machine conversation scenario node, and the input content of the user comprises one or more of voice information, text information and operation instructions of the user;
acquiring gesture information of the user, wherein the gesture information of the user comprises head actions and/or facial expressions of the user;
sending the gesture information of the user to the server;
And outputting machine output content in a second human-machine dialogue scenario node in response to the first instruction sent by the server, wherein the second human-machine dialogue scenario node is a human-machine dialogue scenario node associated with target reply content in the plurality of human-machine dialogue scenario nodes, and the target reply content comprises expected reply content which is determined by the server according to gesture information of the user in a plurality of items of expected reply content in the first human-machine dialogue scenario node and meets user requirements.
In one possible implementation manner, in the acquiring gesture information of the user, the processor 701 is specifically configured to:
if the type of the machine output content in the first man-machine conversation scenario node is the selection type content, acquiring the head action of the user, wherein the selection type content comprises a plurality of selectable items for the user to select;
and if the type of the machine output content in the first man-machine conversation scenario node is non-selective content, acquiring the facial expression of the user.
In another possible implementation manner, in the aspect of acquiring the head action of the user if the type of the machine output content in the first man-machine conversation scenario node is selection type content, the processor 701 is specifically configured to:
Responding to a second instruction sent by the server, outputting guide type content and first prompt information, wherein the second instruction is an instruction generated by the server under the condition that the number of selectable items in the selection type content is more than two, the first prompt information is used for prompting the user to respond to the guide type content through head actions, the head actions comprise a nodding action and a head shaking action, the guide type content is determined by the server according to the selection type content, and the guide type content is used for guiding the user to respectively judge whether each selectable item in the selection type content is a selectable item which the user wants to select;
or, outputting second prompt information in response to a third instruction sent by the server, where the third instruction is an instruction generated by the server when the number of selectable items in the selective content is determined to be two, and the second prompt information is used to prompt the user to select one selectable item in the selective content through head action.
In yet another possible implementation manner, in the aspect of acquiring the facial expression of the user if the type of the machine output content in the first man-machine conversation scenario node is non-selective content, the processor 701 is specifically configured to:
And responding to a fourth instruction sent by the server, and outputting third prompt information, wherein the fourth instruction is an instruction generated by the server under the condition that the machine output content in the first man-machine conversation scenario node is determined to be non-selective content, and the third prompt information is used for prompting the user to respond to the non-selective content through facial expression.
In yet another possible implementation, before the outputting of the machine output content in the second human-machine conversation scenario node in response to the first instruction sent by the server, the processor 701 is further configured to:
responding to a fifth instruction sent by the server, outputting fourth prompt information, wherein the fifth instruction is an instruction generated when the server receives the facial expression of the user, the fourth prompt information is used for prompting the user to judge whether first expected reply content is the target reply content or not through head action, the first expected reply content is one expected reply content corresponding to the user's intention information in the first man-machine conversation scenario node, the user's intention information is determined by the server according to the facial expression of the user, and a preset corresponding relation exists between the user's intention information and a plurality of expected reply contents in the first man-machine conversation scenario node;
Transmitting the head action of the user to the server;
the first instruction is sent by the server under the condition that the preset condition is met;
the preset conditions include: the head movements of the user are head shaking movements, and other expected reply contents confirmed by the user through the head gestures and sent by the terminal equipment are received, wherein the other expected reply contents are expected reply contents except the first expected reply contents in the first man-machine conversation scenario node, and the other expected reply contents are the target reply contents;
or the preset condition includes: the head action of the user is a nodding action, and the first expected reply content is the target reply content.
In yet another possible implementation, the target reply content further includes an item of expected reply content in the first man-machine conversation scenario node that is closest to the input content of the user. Before the determining that the input content of the user is not acquired within the preset time period, the processor 701 is further configured to:
determining that the input content of the user is acquired within the preset time period;
And sending the input content of the user to the server.
In yet another possible implementation, before the acquiring the gesture information of the user, the processor 701 is further configured to:
and outputting sixth prompt information in response to a seventh instruction sent by the server, wherein the sixth prompt information is used for prompting the user to determine whether human-computer interaction is performed with the terminal equipment through gesture information of the user through head action.
It should be noted that the implementation of each operation may also correspond to the corresponding description of the corresponding method embodiment with reference to fig. 2.
Embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, which when run on a processor, implements the method flow shown in fig. 2.
The "plurality" mentioned in the embodiments of the present application refers to two or more, and "and/or" describes association relationships of association objects, which means that three relationships may exist, for example, a and/or B may mean: three cases of a alone, a and B together, and B alone, wherein A, B may be singular or plural. And, unless otherwise indicated, the use of ordinal numbers such as "first," "second," etc., by embodiments of the present application is used for distinguishing between multiple objects and is not used for limiting a sequence, timing, priority, or importance of the multiple objects. For example, the first instruction and the second instruction are only for distinguishing between different instructions, and are not intended to represent differences in timing, importance, etc. of the two instructions.
While the application has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made and equivalents will be apparent to those skilled in the art without departing from the scope of the application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims (10)

1. A human-computer interaction method based on gesture information, which is characterized by being applied to terminal equipment, the method comprising:
outputting machine output content from a first human-machine dialogue scenario node of a server, wherein the first human-machine dialogue scenario node is one human-machine dialogue scenario node of a plurality of human-machine dialogue scenario nodes in the server;
determining that input content of a user is not acquired within a preset time period, wherein the input content of the user is reply content of the user aiming at machine output content in the first man-machine conversation scenario node, and the input content of the user comprises one or more of voice information, text information and operation instructions of the user;
acquiring gesture information of the user, wherein the gesture information of the user comprises head actions and/or facial expressions of the user;
Sending the gesture information of the user to the server;
and outputting machine output content in a second human-machine dialogue scenario node in response to the first instruction sent by the server, wherein the second human-machine dialogue scenario node is a human-machine dialogue scenario node associated with target reply content in the plurality of human-machine dialogue scenario nodes, and the target reply content comprises expected reply content which is determined by the server according to gesture information of the user in a plurality of items of expected reply content in the first human-machine dialogue scenario node and meets user requirements.
2. The method of claim 1, wherein the obtaining gesture information of the user comprises:
if the type of the machine output content in the first man-machine conversation scenario node is the selection type content, acquiring the head action of the user, wherein the selection type content comprises a plurality of selectable items for the user to select;
and if the type of the machine output content in the first man-machine conversation scenario node is non-selective content, acquiring the facial expression of the user.
3. The method of claim 2, wherein the acquiring the head action of the user if the type of the machine output content in the first man-machine conversation scenario node is selection type content comprises:
Responding to a second instruction sent by the server, outputting guide type content and first prompt information, wherein the second instruction is an instruction generated by the server under the condition that the number of selectable items in the selection type content is more than two, the first prompt information is used for prompting the user to respond to the guide type content through head actions, the head actions comprise a nodding action and a head shaking action, the guide type content is determined by the server according to the selection type content, and the guide type content is used for guiding the user to respectively judge whether each selectable item in the selection type content is a selectable item which the user wants to select;
or, outputting second prompt information in response to a third instruction sent by the server, where the third instruction is an instruction generated by the server when the number of selectable items in the selective content is determined to be two, and the second prompt information is used to prompt the user to select one selectable item in the selective content through head action.
4. A method according to claim 2 or 3, wherein if the type of the machine output content in the first man-machine conversation scenario node is non-selective content, acquiring the facial expression of the user comprises:
And responding to a fourth instruction sent by the server, and outputting third prompt information, wherein the fourth instruction is an instruction generated by the server under the condition that the machine output content in the first man-machine conversation scenario node is determined to be non-selective content, and the third prompt information is used for prompting the user to respond to the non-selective content through facial expression.
5. A method according to claim 2 or 3, wherein before outputting the machine output content in the second human-machine conversation scenario node in response to the first instruction sent by the server, the method further comprises:
responding to a fifth instruction sent by the server, outputting fourth prompt information, wherein the fifth instruction is an instruction generated when the server receives the facial expression of the user, the fourth prompt information is used for prompting the user to judge whether first expected reply content is the target reply content or not through head action, the first expected reply content is one expected reply content corresponding to the user's intention information in the first man-machine conversation scenario node, the user's intention information is determined by the server according to the facial expression of the user, and a preset corresponding relation exists between the user's intention information and a plurality of expected reply contents in the first man-machine conversation scenario node;
Transmitting the head action of the user to the server;
the first instruction is sent by the server under the condition that the preset condition is met;
the preset conditions include: the head movements of the user are head shaking movements, and other expected reply contents confirmed by the user through the head gestures and sent by the terminal equipment are received, wherein the other expected reply contents are expected reply contents except the first expected reply contents in the first man-machine conversation scenario node, and the other expected reply contents are the target reply contents;
or the preset condition includes: the head action of the user is a nodding action, and the first expected reply content is the target reply content.
6. A method according to any one of claims 1-3, wherein prior to said determining that no user entered content has been obtained within a preset period of time, the method further comprises:
determining that the input content of the user is acquired within the preset time period;
and sending the input content of the user to the server.
7. A human-computer interaction method based on gesture information, which is characterized by being applied to a server, the method comprising:
Determining that input content of a user sent by terminal equipment is not acquired within a preset time period, wherein the input content of the user is reply content of the user aiming at machine output content in first man-machine conversation scenario nodes, the first man-machine conversation scenario nodes are one man-machine conversation scenario nodes in a plurality of man-machine conversation scenario nodes in the server, and the user input content comprises one or more of voice messages, text messages and operation instructions of the user;
acquiring gesture information of the user sent by the terminal equipment, wherein the gesture information of the user comprises head actions and/or facial expressions of the user;
determining a target reply content meeting the user requirement in a plurality of expected reply contents in the first man-machine conversation scenario node according to the gesture information of the user;
determining a second human-machine dialogue scenario node as a human-machine dialogue scenario node after the first human-machine dialogue scenario node, wherein the second human-machine dialogue scenario node is a human-machine dialogue scenario node associated with the target reply content in the plurality of human-machine dialogue scenario nodes;
And sending a first instruction to the terminal equipment, wherein the first instruction is used for instructing the terminal equipment to output the machine output content in the second man-machine conversation scenario node.
8. A human-machine interaction device, the device comprising:
the first output unit is used for outputting machine output content in a first man-machine dialogue scenario node from a server, wherein the first man-machine dialogue scenario node is one man-machine dialogue scenario node in a plurality of man-machine dialogue scenario nodes in the server;
the determining unit is used for determining that the input content of the user is not acquired within a preset time period, wherein the input content of the user is reply content of the user aiming at the machine output content in the first man-machine conversation scenario node, and the input content of the user comprises one or more of voice information, text information and operation instructions of the user;
an obtaining unit, configured to obtain gesture information of the user, where the gesture information of the user includes a head motion and/or a facial expression of the user;
a sending unit, configured to send gesture information of the user to the server;
The second output unit is used for responding to the first instruction sent by the server and outputting machine output content in a second man-machine dialogue scenario node, wherein the second man-machine dialogue scenario node is a man-machine dialogue scenario node which is associated with target reply content in the man-machine dialogue scenario nodes, and the target reply content comprises expected reply content which is determined by the server according to gesture information of the user in multiple expected reply contents in the first man-machine dialogue scenario node and meets the requirement of the user.
9. A terminal device comprising a processor, a memory, a communication interface, wherein the communication interface is adapted to perform receiving and/or transmitting operations under control of the processor, the memory is adapted to store a computer program, and the processor is adapted to invoke the computer program to implement the method of any of claims 1-6.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when run on a processor, implements the method of any of claims 1-6.
CN202310460765.2A 2023-04-21 2023-04-21 Man-machine interaction method and related device based on gesture information Pending CN117116259A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310460765.2A CN117116259A (en) 2023-04-21 2023-04-21 Man-machine interaction method and related device based on gesture information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310460765.2A CN117116259A (en) 2023-04-21 2023-04-21 Man-machine interaction method and related device based on gesture information

Publications (1)

Publication Number Publication Date
CN117116259A true CN117116259A (en) 2023-11-24

Family

ID=88809887

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310460765.2A Pending CN117116259A (en) 2023-04-21 2023-04-21 Man-machine interaction method and related device based on gesture information

Country Status (1)

Country Link
CN (1) CN117116259A (en)

Similar Documents

Publication Publication Date Title
CN110609620B (en) Human-computer interaction method and device based on virtual image and electronic equipment
US10210002B2 (en) Method and apparatus of processing expression information in instant communication
EP3321787B1 (en) Method for providing application, and electronic device therefor
CN110113646B (en) AI voice-based intelligent interactive processing method, system and storage medium
CN107977928B (en) Expression generation method and device, terminal and storage medium
CN107632706B (en) Application data processing method and system of multi-modal virtual human
US20140351720A1 (en) Method, user terminal and server for information exchange in communications
US10239202B1 (en) Robot interaction system and method
EP3550812B1 (en) Electronic device and method for delivering message by same
KR20160071732A (en) Method and apparatus for processing voice input
CN111565143B (en) Instant messaging method, equipment and computer readable storage medium
TW201916005A (en) Interaction method and device
WO2019214463A1 (en) Man-machine dialog method, client, electronic device and storage medium
CN111538456A (en) Human-computer interaction method, device, terminal and storage medium based on virtual image
US20220165013A1 (en) Artificial Reality Communications
CN107463247A (en) A kind of method, apparatus and terminal of text reading processing
CN110737335B (en) Interaction method and device of robot, electronic equipment and storage medium
US20200099634A1 (en) Interactive Responding Method and Computer System Using the Same
CN112000766A (en) Data processing method, device and medium
CN117116259A (en) Man-machine interaction method and related device based on gesture information
CN114666291A (en) Message sending method and device
CN113470614A (en) Voice generation method and device and electronic equipment
CN116844385A (en) Method and related device for determining man-machine interaction mode
US20220055223A1 (en) Electronic device for providing reaction on basis of user state and operating method therefor
CN117675741A (en) Information interaction method, electronic device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination