WO2018006369A1 - Method and system for synchronizing speech and virtual actions, and robot - Google Patents

Method and system for synchronizing speech and virtual actions, and robot

Info

Publication number
WO2018006369A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
length
time
voice
robot
Prior art date
Application number
PCT/CN2016/089213
Other languages
French (fr)
Chinese (zh)
Inventor
邱楠 (Qiu Nan)
杨新宇 (Yang Xinyu)
王昊奋 (Wang Haofen)
Original Assignee
深圳狗尾草智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳狗尾草智能科技有限公司
Priority to PCT/CN2016/089213 priority Critical patent/WO2018006369A1/en
Priority to CN201680001720.7A priority patent/CN106471572B/en
Priority to JP2017133167A priority patent/JP6567609B2/en
Publication of WO2018006369A1 publication Critical patent/WO2018006369A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J 13/00 Controls for manipulators
    • B25J 13/003 Controls for manipulators by means of an audio-responsive input
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/04 Time compression or expansion
    • G10L 21/055 Time compression or expansion for synchronising with other signals, e.g. video signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/226 Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics

Definitions

  • The present invention relates to the field of robot interaction technologies, and in particular to a method, a system, and a robot for synchronizing speech and virtual actions.
  • Robots are used more and more widely; for example, elderly people and children can interact with robots through dialogue and entertainment.
  • The inventor has developed a virtual robot display device and imaging system that can form a 3D animated image. The virtual robot's host accepts human instructions, such as voice, to interact with humans, and the virtual 3D animated image responds with sound and actions according to those instructions. This makes the robot more anthropomorphic: it can interact with humans not only through sound and expressions but also through actions, improving the interaction experience.
  • A method for synchronizing speech and virtual actions, including:
  • acquiring multimodal information of a user; generating interactive content according to the multimodal information and variable parameters, the interactive content including at least voice information and action information;
  • adjusting the time length of the voice information and the time length of the action information to be the same.
  • The specific steps of adjusting the time length of the voice information and the time length of the action information to be the same include:
  • if the difference between the time length of the voice information and the time length of the action information is not greater than a threshold, then when the time length of the voice information is less than the time length of the action information, accelerating the playback speed of the action information so that the time length of the action information equals the time length of the voice information;
  • and when the time length of the voice information is greater than the time length of the action information, accelerating the playback speed of the voice information or/and slowing down the playback speed of the action information so that the time length of the action information equals the time length of the voice information.
  • The specific steps of adjusting the time length of the voice information and the time length of the action information to be the same further include:
  • if the difference between the time length of the voice information and the time length of the action information is greater than the threshold, then when the time length of the voice information is greater than the time length of the action information, sorting and combining at least two sets of action information so that the time length of the combined action information equals the time length of the voice information;
  • and when the time length of the voice information is less than the time length of the action information, selecting part of the action information so that the time length of the selected part equals the time length of the voice information.
  • The method for generating the robot's variable parameters includes: fitting the parameters of the robot's self-cognition with the parameters of the scenes in the variable parameters to generate the variable parameters of the robot.
  • The variable parameters include at least the user's original behavior, the changed behavior that replaces it, and parameter values representing the original behavior and the behavior after the change.
  • The step of generating the interactive content according to the multimodal information and the variable parameters specifically includes: generating the interactive content according to the multimodal information, the variable parameters, and the fitting curve of the parameter change probability.
  • The method for generating the fitting curve of the parameter change probability includes: using a probability algorithm, making probability estimates among the robot's parameters on the life time axis with a network, and, after the robot's scene parameters on the life time axis change, forming the fitting curve of the parameter change probability from the probability of each parameter change.
  • A system for synchronizing speech and virtual actions, including:
  • an obtaining module, configured to acquire multimodal information of a user;
  • an artificial intelligence module, configured to generate interactive content according to the multimodal information of the user and variable parameters, where the interactive content includes at least voice information and action information; and
  • a control module, configured to adjust the time length of the voice information and the time length of the action information to be the same.
  • The control module is specifically configured to:
  • if the difference between the time length of the voice information and the time length of the action information is not greater than a threshold, then when the time length of the voice information is less than the time length of the action information, accelerate the playback speed of the action information so that the time length of the action information equals the time length of the voice information;
  • and when the time length of the voice information is greater than the time length of the action information, accelerate the playback speed of the voice information or/and slow down the playback speed of the action information so that the time length of the action information equals the time length of the voice information.
  • The control module is further specifically configured to:
  • if the difference between the time length of the voice information and the time length of the action information is greater than the threshold, then when the time length of the voice information is greater than the time length of the action information, combine at least two sets of action information so that the time length of the combined action information equals the time length of the voice information;
  • and when the time length of the voice information is less than the time length of the action information, select part of the action information so that the time length of the selected part equals the time length of the voice information.
  • The system further comprises a processing module for fitting the robot's self-cognitive parameters with the parameters of the scenes in the variable parameters to generate the variable parameters.
  • The variable parameters include at least the user's original behavior, the changed behavior that replaces it, and parameter values representing the original behavior and the behavior after the change.
  • The artificial intelligence module is specifically configured to: generate the interactive content according to the multimodal information, the variable parameters, and the fitting curve of the parameter change probability.
  • The system further includes a fitting-curve generating module, configured to: using a probability algorithm, make probability estimates among the robot's parameters on the life time axis with a network, and, after the robot's scene parameters on the life time axis change, form the fitting curve of the parameter change probability from the probability of each parameter change.
  • The invention also discloses a robot, comprising a system for synchronizing speech and virtual actions as described above.
  • Compared with the prior art, the method for synchronizing speech and virtual actions of the present invention includes: acquiring multimodal information of a user; generating interactive content according to the multimodal information and variable parameters, the interactive content including at least voice information and action information; and adjusting the time length of the voice information and the time length of the action information to be the same. In this way, interactive content can be generated from one or more kinds of the user's multimodal information, such as the user's voice, expressions, and actions.
  • Because the interactive content includes at least voice information and action information, and the two time lengths are adjusted to be the same so that sound and action are synchronized when the robot plays them, the robot not only speaks during interaction but also presents matching actions and other forms of expression. The robot's forms of expression are thus more diverse, the robot is more anthropomorphic, and the user's experience of interacting with the robot is improved.
  • FIG. 1 is a flowchart of a method for synchronizing speech and virtual actions according to Embodiment 1 of the present invention;
  • FIG. 2 is a schematic diagram of a system for synchronizing voice and virtual actions according to Embodiment 2 of the present invention.
  • Computer devices include user devices and network devices.
  • the user equipment or the client includes but is not limited to a computer, a smart phone, a PDA, etc.;
  • The network device includes but is not limited to a single network server, a server group composed of multiple network servers, or a cloud, based on cloud computing, composed of a large number of computers or network servers.
  • the computer device can operate alone to carry out the invention, and can also access the network and implement the invention through interoperation with other computer devices in the network.
  • the network in which the computer device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a VPN network, and the like.
  • The terms "first," "second," and the like may be used herein to describe various elements, but the elements should not be limited by these terms; the terms are used only to distinguish one element from another.
  • the term “and/or” used herein includes any and all combinations of one or more of the associated listed items. When a unit is referred to as being “connected” or “coupled” to another unit, it can be directly connected or coupled to the other unit, or an intermediate unit can be present.
  • As shown in FIG. 1, this embodiment discloses a method for synchronizing speech and virtual actions, including:
  • S101: acquiring multimodal information of a user;
  • S102: generating interactive content according to the multimodal information of the user and the variable parameter 300, where the interactive content includes at least voice information and action information;
  • S103: adjusting the time length of the voice information and the time length of the action information to be the same.
  • In this embodiment, the method for synchronizing speech and virtual actions includes: acquiring multimodal information of a user; generating interactive content according to the multimodal information and variable parameters, the interactive content including at least voice information and action information; and adjusting the time length of the voice information and the time length of the action information to be the same.
  • The interactive content can thus be generated from one or more kinds of the user's multimodal information, such as the user's voice, expressions, and actions.
  • Because the interactive content includes at least voice information and action information, the two time lengths are adjusted to be the same so that sound and action match synchronously when the robot plays them; the robot then not only speaks during interaction but also presents matching actions and other forms of expression, making its expression more diverse, the robot more anthropomorphic, and the user's experience of interacting with the robot better.
  • The multimodal information in this embodiment may be one or more of user expression, voice information, gesture information, scene information, image information, video information, face information, pupil and iris information, light-sensing information, and fingerprint information.
  • The variable parameters specifically capture sudden changes affecting the person and the machine. For example, a day on the time axis may consist of eating, sleeping, interacting, running, eating, and sleeping. If the robot's scene suddenly changes, for example it is taken to the beach at the time it would normally be running, such human-initiated parameters act on the robot as variable parameters and cause the robot's self-cognition to change.
  • The life time axis and the variable parameters can be used to change attributes of the robot's self-cognition, such as a mood value or a fatigue value, and can also automatically add new self-cognition information. For example, if there was previously no anger value, a scene involving a variable factor on the life time axis can automatically add one to the robot's self-cognition, based on scenes that previously simulated human self-cognition.
  • The robot records such a change as one of the variable parameters.
  • For example, the robot will then generate interactive content from going out shopping at 12 noon instead of from eating at 12 noon as it did before.
  • When specifically generating the interactive content, the robot combines the acquired multimodal information of the user, such as voice information, video information, and picture information, with the variable parameters. In this way, some unexpected events in human life can be added to the robot's life axis, making the robot's interaction more anthropomorphic.
  • In this embodiment, the specific steps of adjusting the time length of the voice information and the time length of the action information to be the same include:
  • if the difference between the time length of the voice information and the time length of the action information is not greater than a threshold, then when the time length of the voice information is less than the time length of the action information, accelerating the playback speed of the action information so that the time length of the action information equals the time length of the voice information;
  • and when the time length of the voice information is greater than the time length of the action information, accelerating the playback speed of the voice information or/and slowing down the playback speed of the action information so that the time length of the action information equals the time length of the voice information.
  • Here, adjusting specifically means compressing or stretching the time length of the voice information or/and the time length of the action information, that is, speeding up or slowing down playback, for example multiplying the playback speed of the voice information by 2, or multiplying the playback time of the action information by 0.8, and so on.
  • For example, the threshold for the difference between the time length of the voice information and the time length of the action information is one minute.
  • Suppose the time length of the voice information is 1 minute and the time length of the action information is 2 minutes; the difference does not exceed the threshold.
  • The playback speed of the action information can then be doubled, so that after adjustment the action information plays for 1 minute and is synchronized with the voice information.
  • Alternatively, the playback speed of the voice information can be slowed to 0.5 times the original speed, so that after adjustment the voice information plays for 2 minutes and is synchronized with the action information.
  • Of course, both can be adjusted at once, for example slowing the voice information and accelerating the action information so that each plays for 1 minute 30 seconds; voice and action are then also synchronized.
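The speed-based adjustment described above can be sketched as a pair of playback-rate multipliers. This is only an illustration: the function name, the default 60-second threshold (taken from the example), and the choice to always match the shorter track are assumptions, since the text allows several strategies.

```python
def playback_rates(voice_len, action_len, threshold=60.0):
    """Return (voice_rate, action_rate) playback-speed multipliers that
    make voice and action finish together. A rate of 2.0 means the
    track plays twice as fast (half the wall-clock time).

    Strategy (one of several the text allows): keep the shorter track's
    speed unchanged and speed up the longer one to match it.
    """
    if abs(voice_len - action_len) > threshold:
        raise ValueError("difference exceeds threshold; combine or cut actions instead")
    if voice_len <= action_len:
        # voice is shorter: accelerate the action down to the voice's length
        return 1.0, action_len / voice_len
    # voice is longer: accelerate the voice down to the action's length
    return voice_len / action_len, 1.0
```

With the 1-minute voice and 2-minute action from the example, `playback_rates(60, 120)` yields `(1.0, 2.0)`: the action plays at double speed, matching the doubling described above.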
  • The specific steps of adjusting the time length of the voice information and the time length of the action information to be the same further include:
  • if the difference between the time length of the voice information and the time length of the action information is greater than the threshold, then when the time length of the voice information is greater than the time length of the action information, sorting and combining at least two sets of action information so that the time length of the combined action information equals the time length of the voice information;
  • and when the time length of the voice information is less than the time length of the action information, selecting part of the action information so that the time length of the selected part equals the time length of the voice information.
  • Here, adjusting means adding or deleting part of the action information so that the time length of the action information becomes the same as the time length of the voice information.
  • For example, the threshold for the difference between the time length of the voice information and the time length of the action information is 30 seconds.
  • Suppose the time length of the voice information is 3 minutes and the time length of the action information is 1 minute; the difference exceeds the threshold.
  • Other action information then needs to be added to the original action information, for example another set with a time length of 2 minutes; after the two sets of action information are sorted and combined, their total time length matches the 3-minute time length of the voice information.
  • Conversely, when the voice information is the shorter one, part of the action information is selected, for example a 2-minute part, so that its time length matches the time length of the voice information.
  • When selecting action information, the action information closest to the time length of the voice information may be chosen according to the time length of the voice information, or the voice information closest in time length may be chosen according to the time length of the action information.
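The combine-or-select branch can be sketched greedily. Everything here is illustrative: the function name, the greedy longest-first choice, and the truncation of the final clip are assumptions; the text only says action sets are sorted and combined, or partly selected.

```python
def fit_actions(voice_len, current_clips, clip_pool):
    """Make the total action duration match voice_len (seconds).

    If the actions are shorter than the voice, greedily append clips
    from clip_pool (longest first); if longer, keep only a leading
    part of the clips and truncate the last one.
    """
    total = sum(current_clips)
    clips = list(current_clips)
    if total < voice_len:
        for clip in sorted(clip_pool, reverse=True):
            if total + clip <= voice_len:
                clips.append(clip)
                total += clip
            if total == voice_len:
                break
    elif total > voice_len:
        kept, acc = [], 0
        for clip in clips:
            if acc + clip <= voice_len:
                kept.append(clip)
                acc += clip
            else:
                kept.append(voice_len - acc)  # truncate the final clip
                break
        clips = kept
    return clips
```

With the 3-minute voice and 1-minute action from the example, `fit_actions(180, [60], [120, 45])` appends the 2-minute clip, giving `[60, 120]` for a 3-minute total.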
  • In this way, the control module can conveniently adjust the time lengths of the voice information and the action information; adjusting them to be equal is easier, and the adjusted playback is more natural and smooth.
  • After the adjusting step, the method further includes: outputting the adjusted voice information and action information to a virtual image for presentation.
  • That is, once the voice information and action information have been made consistent, they are output on the virtual image, making the virtual robot more anthropomorphic and improving the user experience.
  • In this embodiment, the method for generating the robot's variable parameters includes: fitting the parameters of the robot's self-cognition with the parameters of the scenes in the variable parameters to generate the variable parameters of the robot.
  • In other words, the parameters of the robot's self-cognition are matched with the parameters of the scenes on the variable-parameter time axis, producing an anthropomorphic effect.
  • The variable parameters include at least the user's original behavior, the changed behavior that replaces it, and parameter values representing the original behavior and the behavior after the change.
  • According to the original plan, the user would be in one state; a sudden change puts the user in another state. A variable parameter represents this change of behavior or state, together with the user's state or behavior after the change. For example, the user originally planned to run at 5 p.m., but something else came up, such as going out to play; the change from running to playing is then a variable parameter, and the probability of such a change is also studied.
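One way to picture such a variable parameter is as a small record holding the planned behavior, the changed behavior, and the change probability. The field names and the 0.2 probability are purely illustrative; the patent does not define a data structure.

```python
from dataclasses import dataclass

@dataclass
class VariableParameter:
    """A learned deviation on the life time axis: the planned behavior,
    the behavior it changed into, and how likely that change is."""
    time_slot: str        # e.g. "17:00" on the life time axis
    original: str         # the user's originally planned behavior
    changed_to: str       # the behavior after the sudden change
    change_prob: float    # estimated probability of this change

# The running-to-playing example from the text (probability is made up):
p = VariableParameter("17:00", "running", "playing", 0.2)
```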
  • The step of generating the interactive content according to the multimodal information and the variable parameters specifically includes: generating the interactive content according to the multimodal information, the variable parameters, and the fitting curve of the parameter change probability.
  • The fitting curve can be generated in advance by probability training on the variable parameters, and the robot's interactive content is then generated accordingly.
  • The method for generating the fitting curve of the parameter change probability includes: using a probability algorithm, making probability estimates among the robot's parameters on the life time axis with a network, and, after the robot's scene parameters on the life time axis change, forming the fitting curve of the parameter change probability from the probability of each parameter change.
  • Preferably, the probability algorithm can be a Bayesian probability algorithm.
  • The parameters of the robot's self-cognition are matched with the parameters of the scenes on the variable-parameter time axis, producing an anthropomorphic effect.
  • For example, the robot knows its geographical location and changes the way it generates interactive content according to the geographical environment it is in.
  • Using the Bayesian probability algorithm, probability estimates are made among the robot's parameters with a Bayesian network, and the probability of each parameter change is calculated after the robot's own time-axis scene parameters change on the life time axis.
  • The resulting curve dynamically affects the robot's own self-cognition.
  • This innovative module gives the robot itself a human lifestyle; in its expression, the robot can change according to the location scene, for example.
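As a minimal stand-in for the probability estimate behind the fitting curve, the sketch below computes a Laplace-smoothed (Beta(1,1) posterior-mean) change probability per slot of the life time axis. The patent names a Bayesian network without specifying its structure, so this simple Bayesian estimate is an assumption, not the claimed method.

```python
def change_probability_curve(changes, trials):
    """For each slot on the life time axis, estimate the probability
    that the planned scene parameter changes, given `changes` observed
    changes out of `trials` observations per slot.

    Uses the posterior mean of a Beta(1,1) prior (Laplace smoothing);
    the returned list is a discrete form of the 'fitting curve of the
    parameter change probability'.
    """
    return [(c + 1) / (n + 2) for c, n in zip(changes, trials)]
```

For example, a slot where the plan changed 3 times in 8 observations gets probability (3+1)/(8+2) = 0.4, while an unchanged slot still gets a small nonzero probability rather than zero.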
  • As shown in FIG. 2, this embodiment discloses a system for synchronizing speech and virtual actions, including:
  • the obtaining module 201, configured to acquire multimodal information of a user;
  • the artificial intelligence module 202, configured to generate interactive content according to the multimodal information of the user and variable parameters, where the interactive content includes at least voice information and action information, and the variable parameters are generated by the variable parameter module 301; and
  • the control module 203, configured to adjust the time length of the voice information and the time length of the action information to be the same.
  • The interactive content can be generated from one or more kinds of the user's multimodal information, such as the user's voice, expressions, and actions.
  • Because the interactive content includes at least voice information and action information, the time length of the voice information and the time length of the action information are adjusted to be the same so that sound and action match synchronously when the robot plays them; the robot then not only speaks during interaction but also presents matching actions and other forms of expression, making its expression more diverse, the robot more anthropomorphic, and the user's experience of interacting with the robot better.
  • The multimodal information in this embodiment may be one or more of user expression, voice information, gesture information, scene information, image information, video information, face information, pupil and iris information, light-sensing information, and fingerprint information.
  • The variable parameters specifically capture sudden changes affecting the person and the machine. For example, a day on the time axis may consist of eating, sleeping, interacting, running, eating, and sleeping. If the robot's scene suddenly changes, for example it is taken to the beach at the time it would normally be running, such human-initiated parameters act on the robot as variable parameters and cause the robot's self-cognition to change.
  • The life time axis and the variable parameters can be used to change attributes of the robot's self-cognition, such as a mood value or a fatigue value, and can also automatically add new self-cognition information. For example, if there was previously no anger value, a scene involving a variable factor on the life time axis can automatically add one to the robot's self-cognition, based on scenes that previously simulated human self-cognition.
  • The robot records such a change as one of the variable parameters.
  • For example, the robot will then generate interactive content from going out shopping at 12 noon instead of from eating at 12 noon as it did before.
  • When specifically generating the interactive content, the robot combines the acquired multimodal information of the user, such as voice information, video information, and picture information, with the variable parameters. In this way, some unexpected events in human life can be added to the robot's life axis, making the robot's interaction more anthropomorphic.
  • The control module is specifically configured to:
  • if the difference between the time length of the voice information and the time length of the action information is not greater than the threshold, then when the time length of the voice information is less than the time length of the action information, accelerate the playback speed of the action information so that the time length of the action information equals the time length of the voice information;
  • and when the time length of the voice information is greater than the time length of the action information, accelerate the playback speed of the voice information or/and slow down the playback speed of the action information so that the time length of the action information equals the time length of the voice information.
  • Here, adjusting specifically means compressing or stretching the time length of the voice information or/and the time length of the action information, that is, speeding up or slowing down playback, for example multiplying the playback speed of the voice information by 2, or multiplying the playback time of the action information by 0.8, and so on.
  • For example, the threshold for the difference between the time length of the voice information and the time length of the action information is one minute.
  • Suppose the time length of the voice information is 1 minute and the time length of the action information is 2 minutes; the difference does not exceed the threshold.
  • The playback speed of the action information can then be doubled, so that after adjustment the action information plays for 1 minute and is synchronized with the voice information.
  • Alternatively, the playback speed of the voice information can be slowed to 0.5 times the original speed, so that after adjustment the voice information plays for 2 minutes and is synchronized with the action information.
  • Of course, both can be adjusted at once, for example slowing the voice information and accelerating the action information so that each plays for 1 minute 30 seconds; voice and action are then also synchronized.
  • The control module is further specifically configured to:
  • if the difference between the time length of the voice information and the time length of the action information is greater than the threshold, then when the time length of the voice information is greater than the time length of the action information, combine at least two sets of action information so that the time length of the combined action information equals the time length of the voice information;
  • and when the time length of the voice information is less than the time length of the action information, select part of the action information so that the time length of the selected part equals the time length of the voice information.
  • Here, adjusting means adding or deleting part of the action information so that the time length of the action information becomes the same as the time length of the voice information.
  • For example, the threshold for the difference between the time length of the voice information and the time length of the action information is 30 seconds.
  • Suppose the time length of the voice information is 3 minutes and the time length of the action information is 1 minute; the difference exceeds the threshold.
  • Other action information then needs to be added to the original action information, for example another set with a time length of 2 minutes; after the two sets of action information are sorted and combined, their total time length matches the 3-minute time length of the voice information.
  • Conversely, when the voice information is the shorter one, part of the action information is selected, for example a 2-minute part, so that its time length matches the time length of the voice information.
  • In this embodiment, the artificial intelligence module may be specifically configured to: select the action information closest to the time length of the voice information according to the time length of the voice information, or select the voice information closest in time length according to the time length of the action information.
  • in this way the control module can conveniently adjust the time lengths of the voice information and the motion information; it is easier to adjust them to be consistent, and the adjusted playback is more natural and smooth.
  • the system further includes an output module 204 for outputting the adjusted voice information and motion information to the virtual image for presentation.
  • output occurs after the adjustment makes them consistent, and can be presented on the virtual image, thereby making the virtual robot more anthropomorphic and improving the user experience.
  • the system further includes a processing module for fitting the self-cognitive parameters of the robot with the parameters of the scene in the variable parameters to generate variable parameters.
  • the variable parameter includes at least the user's original behavior and the behavior after the change, as well as parameter values representing the change from the user's original behavior to the behavior after the change.
  • a variable parameter captures the case where, according to the original plan, the user is in one state, and a sudden change puts the user in another state; the variable parameter represents this change of behavior or state, as well as the user's state or behavior after the change. For example, the user was originally running at 5 p.m. and something else suddenly comes up, such as going to play ball; the change from running to playing ball is then a variable parameter, and the probability of such a change is also studied.
  • the artificial intelligence module is specifically configured to generate interactive content according to the multimodal information, the variable parameter, and the fitting curve of parameter change probability.
  • the fitting curve can be generated by probability training on the variable parameters, and the robot's interactive content is then generated from it.
  • the system includes a fitting-curve generation module for using a probability algorithm to make a network-based probability estimate of the robot's parameters and to calculate, after the scene parameters of the robot on the life time axis change, the probability of each parameter changing, forming the fitting curve of parameter change probability.
  • the probability algorithm can adopt the Bayesian probability algorithm.
  • the parameters in the self-cognition are fitted with the parameters of the scenes used in the variable parameters, producing an anthropomorphic effect.
  • the robot will know its geographical location, and will change the way the interactive content is generated according to the geographical environment in which it is located.
  • a Bayesian probability algorithm is used to estimate the robot's parameters with a Bayesian network, and the probability of each parameter changing is calculated after the scene parameters of the robot itself on the life time axis change.
  • the curve dynamically affects the self-recognition of the robot itself.
  • This module gives the robot itself a human-like lifestyle; in terms of expression, the robot can change according to the scene of its location.
  • the invention discloses a robot comprising a system for synchronizing speech and virtual actions as described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Data Mining & Analysis (AREA)
  • Manipulator (AREA)

Abstract

A method for synchronizing speech and virtual actions, comprising: obtaining multimodal information of a user (S101); generating interactive content according to the multimodal information of the user and a variable parameter (300), the interactive content at least comprising speech information and action information (S102); and adjusting the length of time of the speech information and the length of time of the action information to be the same (S103). The interactive content is generated according to one or more types of the multimodal information of the user, such as user's speech, a user's expression, and a user's action. Moreover, to synchronize the speech information and the action information, the length of time of the speech information and the length of time of the action information are adjusted to be the same, so that sound and actions of a robot can be synchronized and matched during playing, and the robot can use not only speech but also multiple other expression forms, such as actions, for interaction. Therefore, the expression forms of the robot are further diversified, the robot is more humanized, and the user experience in interaction with the robot is also improved.

Description

Method, system and robot for synchronizing voice and virtual actions
Technical field
The present invention relates to the field of robot interaction technologies, and in particular to a method, system and robot for synchronizing voice and virtual actions.
Background art
As an interactive tool with humans, robots are used in more and more situations; for example, elderly people or children who feel lonely can interact with a robot through dialogue, entertainment, and so on. To make robots more anthropomorphic when interacting with humans, the inventors developed a display device and imaging system for a virtual robot that can form a 3D animated image. The host of the virtual robot receives human instructions, such as voice, to interact with humans, and the virtual 3D animated image then replies with sound and actions according to the host's instructions. This makes the robot more anthropomorphic: it can interact with humans not only through sound and expressions but also through actions, greatly improving the interaction experience.
However, how a virtual robot synchronizes the voice and the virtual actions in its reply is a rather complicated problem; if the voice and the actions do not match, the user's interaction experience is greatly affected.
Therefore, providing a method, system and robot for synchronizing voice and virtual actions that improve the human-computer interaction experience has become a technical problem in urgent need of a solution.
Summary of the invention
It is an object of the present invention to provide a method, system and robot for synchronizing voice and virtual actions that improve the human-computer interaction experience.
The object of the present invention is achieved by the following technical solutions:
A method for synchronizing voice and virtual actions, comprising:
acquiring multimodal information of a user;
generating interactive content according to the user's multimodal information and a variable parameter, the interactive content comprising at least voice information and action information;
adjusting the time length of the voice information and the time length of the action information to be the same.
Preferably, the specific steps of adjusting the time length of the voice information and the time length of the action information to be the same comprise:
if the difference between the time length of the voice information and the time length of the action information is not greater than a threshold, and the time length of the voice information is less than the time length of the action information, accelerating the playback speed of the action information so that the time length of the action information equals the time length of the voice information.
Preferably, when the time length of the voice information is greater than the time length of the action information, the playback speed of the voice information is accelerated and/or the playback speed of the action information is slowed, so that the time length of the action information equals the time length of the voice information.
Preferably, the specific steps of adjusting the time length of the voice information and the time length of the action information to be the same comprise:
if the difference between the time length of the voice information and the time length of the action information is greater than a threshold, and the time length of the voice information is greater than the time length of the action information, sorting and combining at least two sets of action information so that the time length of the combined action information equals the time length of the voice information.
Preferably, when the time length of the voice information is less than the time length of the action information, part of the actions in the action information are selected so that the time length of the selected part equals the time length of the voice information.
Preferably, the method for generating the robot's variable parameter comprises: fitting the parameters of the robot's self-cognition with the parameters of the scenes in the variable parameter to generate the robot's variable parameter.
Preferably, the variable parameter comprises at least the user's original behavior and the behavior after a change, as well as parameter values representing the change from the user's original behavior to the behavior after the change.
Preferably, the step of generating interactive content according to the multimodal information and the variable parameter specifically comprises: generating the interactive content according to the multimodal information, the variable parameter, and a fitting curve of parameter change probability.
Preferably, the method for generating the fitting curve of parameter change probability comprises: using a probability algorithm to make a network-based probability estimate of the robot's parameters, and calculating, after the scene parameters of the robot on the life time axis change, the probability of each parameter changing, thereby forming the fitting curve of parameter change probability.
A system for synchronizing voice and virtual actions, comprising:
an acquisition module, configured to acquire multimodal information of a user;
an artificial intelligence module, configured to generate interactive content according to the user's multimodal information and a variable parameter, the interactive content comprising at least voice information and action information;
a control module, configured to adjust the time length of the voice information and the time length of the action information to be the same.
Preferably, the control module is specifically configured to:
if the difference between the time length of the voice information and the time length of the action information is not greater than a threshold, and the time length of the voice information is less than the time length of the action information, accelerate the playback speed of the action information so that the time length of the action information equals the time length of the voice information.
Preferably, when the time length of the voice information is greater than the time length of the action information, the playback speed of the voice information is accelerated and/or the playback speed of the action information is slowed, so that the time length of the action information equals the time length of the voice information.
Preferably, the control module is specifically configured to:
if the difference between the time length of the voice information and the time length of the action information is greater than a threshold, and the time length of the voice information is greater than the time length of the action information, combine at least two sets of action information so that the time length of the combined action information equals the time length of the voice information.
Preferably, when the time length of the voice information is less than the time length of the action information, part of the actions in the action information are selected so that the time length of the selected part equals the time length of the voice information.
Preferably, the system further comprises a processing module, configured to fit the parameters of the robot's self-cognition with the parameters of the scenes in the variable parameter to generate the variable parameter.
Preferably, the variable parameter comprises at least the user's original behavior and the behavior after a change, as well as parameter values representing the change from the user's original behavior to the behavior after the change.
Preferably, the artificial intelligence module is specifically configured to generate the interactive content according to the multimodal information, the variable parameter, and a fitting curve of parameter change probability.
Preferably, the system comprises a fitting-curve generation module, configured to use a probability algorithm to make a network-based probability estimate of the robot's parameters and to calculate, after the scene parameters of the robot on the life time axis change, the probability of each parameter changing, thereby forming the fitting curve of parameter change probability.
The present invention discloses a robot comprising a system for synchronizing voice and virtual actions as described in any of the above.
Compared with the prior art, the present invention has the following advantages. The method for synchronizing voice and virtual actions of the present invention comprises: acquiring multimodal information of a user; generating interactive content according to the user's multimodal information and a variable parameter, the interactive content comprising at least voice information and action information; and adjusting the time length of the voice information and the time length of the action information to be the same. In this way, interactive content can be generated from one or more kinds of the user's multimodal information, such as the user's voice, expressions and actions, and comprises at least voice information and action information. To synchronize the voice information and the action information, their time lengths are adjusted to be the same, so that the robot's sound and actions are matched during playback. The robot thus interacts not only through voice but also through various other forms of expression such as actions, which makes its forms of expression more diverse, makes the robot more anthropomorphic, and improves the user's experience when interacting with the robot.
Brief description of the drawings
Figure 1 is a flowchart of a method for synchronizing voice and virtual actions according to Embodiment 1 of the present invention;
Figure 2 is a schematic diagram of a system for synchronizing voice and virtual actions according to Embodiment 2 of the present invention.
Detailed description
Although the flowcharts describe the operations as sequential processing, many of the operations can be implemented in parallel, concurrently, or simultaneously. The order of the operations can be rearranged. Processing may be terminated when its operations are completed, but it may also have additional steps not included in the figures. Processing may correspond to methods, functions, procedures, subroutines, subprograms, and the like.
Computer devices include user devices and network devices. User devices or clients include, but are not limited to, computers, smartphones, PDAs, and the like; network devices include, but are not limited to, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of computers or network servers based on cloud computing. A computer device can operate alone to carry out the invention, or it can access a network and carry out the invention through interoperation with other computer devices in the network. The network in which the computer device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a VPN network, and the like.
The terms "first", "second", and so on may be used here to describe various units, but these units should not be limited by these terms; the terms are used only to distinguish one unit from another. The term "and/or" as used here includes any and all combinations of one or more of the associated listed items. When a unit is described as being "connected" or "coupled" to another unit, it can be directly connected or coupled to the other unit, or an intermediate unit may be present.
The terminology used here is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments. Unless the context clearly indicates otherwise, the singular forms "a" and "an" as used here are also intended to include the plural. It should also be understood that the terms "comprising" and/or "including" as used here specify the presence of the stated features, integers, steps, operations, units and/or components, without excluding the presence or addition of one or more other features, integers, steps, operations, units, components and/or combinations thereof.
The invention is further described below with reference to the drawings and preferred embodiments.
Embodiment 1
As shown in Figure 1, this embodiment discloses a method for synchronizing voice and virtual actions, comprising:
S101: acquiring multimodal information of a user;
S102: generating interactive content according to the user's multimodal information and a variable parameter 300, the interactive content comprising at least voice information and action information;
S103: adjusting the time length of the voice information and the time length of the action information to be the same.
The method for synchronizing voice and virtual actions of the present invention comprises: acquiring multimodal information of a user; generating interactive content according to the user's multimodal information and a variable parameter, the interactive content comprising at least voice information and action information; and adjusting the time length of the voice information and the time length of the action information to be the same. In this way, interactive content can be generated from one or more kinds of the user's multimodal information, such as the user's voice, expressions and actions. The interactive content comprises at least voice information and action information, and to synchronize the two, their time lengths are adjusted to be the same, so that the robot's sound and actions are matched during playback. The robot thus interacts not only through voice but also through various other forms of expression such as actions, which makes its forms of expression more diverse, makes the robot more anthropomorphic, and improves the user's experience when interacting with the robot.
The multimodal information in this embodiment may be one or more of user expressions, voice information, gesture information, scene information, image information, video information, face information, pupil and iris information, light-sensing information, fingerprint information, and the like.
In this embodiment, the variable parameter specifically captures sudden changes involving the human and the machine. For example, a day on the time axis consists of eating, sleeping, interacting, running, eating, and sleeping. If the robot's scene is suddenly changed, for example it is taken to the seaside during the running time slot, such human-initiated changes to the robot's parameters serve as variable parameters, and these changes alter the robot's self-cognition. The life time axis and the variable parameters can modify attributes in the self-cognition, such as the mood value and the fatigue value, and can also automatically add new self-cognition information. For example, if there was previously no anger value, scenes based on the life time axis and the variable factors will automatically add to the robot's self-cognition according to scenes that previously simulated human self-cognition.
For example, according to the life time axis, 12 noon should be mealtime. If this scene is changed, for example the user goes out shopping at 12 noon, the robot writes this as one of the variable parameters. When the user interacts with the robot during this time period, the robot generates the interactive content in combination with going out shopping at 12 noon, rather than with the previous scene of eating at 12 noon. When specifically generating the interactive content, the robot combines the acquired multimodal information of the user, such as voice information, video information and picture information, with the variable parameter. In this way, sudden events from human life can be added to the robot's life axis, making the robot's interaction more anthropomorphic.
In this embodiment, the specific steps of adjusting the time length of the voice information and the time length of the action information to be the same include:
if the difference between the time length of the voice information and the time length of the action information is not greater than the threshold, and the time length of the voice information is less than the time length of the action information, accelerating the playback speed of the action information so that the time length of the action information equals the time length of the voice information.
When the time length of the voice information is greater than the time length of the action information, the playback speed of the voice information is accelerated and/or the playback speed of the action information is slowed, so that the time length of the action information equals the time length of the voice information.
Therefore, when the difference between the time length of the voice information and the time length of the action information is not greater than the threshold, adjustment specifically means compressing or stretching the time length of the voice information and/or the action information, or accelerating or slowing the playback speed, for example multiplying the playback speed of the voice information by 2, or multiplying the playback time of the action information by 0.8, and so on.
For example, suppose the threshold for the difference between the time length of the voice information and the time length of the action information is one minute, and in the interactive content generated by the robot according to the user's multimodal information, the time length of the voice information is 1 minute and the time length of the action information is 2 minutes. The playback speed of the action information can then be doubled, so that the adjusted playback time of the action information is 1 minute, synchronizing it with the voice information. Alternatively, the playback speed of the voice information can be slowed to 0.5 times the original speed, so that the voice information is stretched to 2 minutes and synchronized with the action information. Both can also be adjusted, for example slowing the voice information while accelerating the action information so that both become 1 minute and 30 seconds, which also synchronizes the voice and the action.
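The small-difference branch described above can be sketched as a simple retiming rule. This is a minimal illustration, not the patent's implementation: the function name `sync_by_speed` and the choice to retime only the action track are assumptions made for the example.

```python
def sync_by_speed(voice_len, action_len, threshold):
    """Return (voice_rate, action_rate) playback multipliers that make both
    tracks finish together, for the case |voice_len - action_len| <= threshold.

    A rate of 2.0 means "play twice as fast"; 0.5 means "play at half speed".
    In this sketch only the action track is retimed to match the voice track.
    """
    if abs(voice_len - action_len) > threshold:
        raise ValueError("difference exceeds threshold; combine or trim actions instead")
    # Playing the action at rate action_len / voice_len makes its
    # playback duration equal to the voice duration.
    return 1.0, action_len / voice_len

# Voice is 60 s, action is 120 s, threshold 60 s: the action plays at 2x speed.
print(sync_by_speed(60, 120, 60))   # (1.0, 2.0)
```

The same rule covers the opposite case from the description: with a 120 s voice and a 60 s action, the action rate becomes 0.5, i.e. the action is slowed to half speed, which matches "slowing the playback speed of the action information".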
In addition, in this embodiment, the specific steps of adjusting the time length of the voice information and the time length of the action information to be the same include:
if the difference between the time length of the voice information and the time length of the action information is greater than the threshold, and the time length of the voice information is greater than the time length of the action information, sorting and combining at least two sets of action information so that the time length of the combined action information equals the time length of the voice information.
When the time length of the voice information is less than the time length of the action information, part of the actions in the action information are selected so that the time length of the selected part equals the time length of the voice information.
Therefore, when the difference between the time length of the voice information and the time length of the action information is greater than the threshold, adjustment means adding or deleting part of the action information so that the time length of the action information is the same as the time length of the voice information.
For example, suppose the threshold for the difference between the time length of the voice information and the time length of the action information is 30 seconds, and in the interactive content generated by the robot according to the user's multimodal information, the time length of the voice information is 3 minutes and the time length of the action information is 1 minute. Other action information then needs to be added to the original action information; for example, a piece of action information with a length of 2 minutes is found, and the two sets of action information are sorted and combined so that they match the time length of the voice information. If no action information with a length of 2 minutes is found but a piece with a length of two and a half minutes is found, part of the actions in that two-and-a-half-minute action information (which may be part of its frames) can be selected so that the selected action information has a time length of 2 minutes, again matching the time length of the voice information.
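The large-difference branch, combining clips and trimming the last one when the pool overshoots, can be sketched as follows. The greedy take-in-order strategy and the function name are assumptions for illustration; the description only requires that the combined (or trimmed) actions match the voice duration.

```python
def fit_actions_to_voice(voice_len, clips):
    """Assemble a sequence of action clips (given as durations in seconds)
    whose total length equals voice_len.

    Clips are taken in order; if the pool would overshoot, the final clip is
    trimmed, mirroring the "select part of the actions" case in the text.
    """
    chosen, total = [], 0
    for clip in clips:
        if total >= voice_len:
            break
        take = min(clip, voice_len - total)  # trim the final clip if needed
        chosen.append(take)
        total += take
    if total < voice_len:
        raise ValueError("not enough action material to cover the voice")
    return chosen

# Voice is 180 s; a 60 s clip and a 150 s clip are combined,
# with the second clip trimmed down to 120 s.
print(fit_actions_to_voice(180, [60, 150]))   # [60, 120]
```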
In this embodiment, the action information closest to the time length of the voice information can be selected according to the time length of the voice information, or the closest voice information can be selected according to the time length of the action information.
Selecting according to the time length of the voice information in this way makes it easier for the control module to adjust the time lengths of the voice information and the action information to be consistent, and the adjusted playback is more natural and smooth.
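The "closest by duration" selection can be expressed in a few lines. The action names and the dictionary shape below are purely illustrative assumptions; only the nearest-duration criterion comes from the text.

```python
def closest_by_duration(target_len, candidates):
    """Pick the candidate whose duration (in seconds) is closest to target_len.

    candidates maps an action name to its duration; a smaller absolute
    difference from the target means a smaller required adjustment later.
    """
    return min(candidates, key=lambda name: abs(candidates[name] - target_len))

actions = {"wave": 55, "bow": 70, "dance": 130}
print(closest_by_duration(60, actions))   # wave
```

The same function works in the other direction described in the text: pass an action's duration as `target_len` and a pool of voice clips as `candidates`.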
根据其中一个示例,在将语音信息的时间长度和动作信息的时间长度调整到相同的步骤之后还包括:将调整后的语音信息和动作信息输出到虚拟影像进行展示。According to one example, after adjusting the time length of the voice information and the time length of the motion information to the same step, the method further includes: outputting the adjusted voice information and the motion information to the virtual image for display.
这样就可以在调整一致后进行输出,输出可以是在虚拟影像上进行输出,从而使虚拟机器人更加拟人化,提高用户体验度。In this way, the output can be output after the adjustment is consistent, and the output can be output on the virtual image, thereby making the virtual robot more anthropomorphic and improving the user experience.
According to one example, the method for generating the robot's variable parameters includes: fitting the parameters of the robot's self-cognition to the parameters of the scenes in the variable parameters to generate the robot's variable parameters. By extending the robot's self-cognition in scenes that incorporate the variable parameters, and fitting the parameters of its self-cognition to the parameters of the scenes used in the variable parameters, an anthropomorphic effect is produced.
According to one example, the variable parameters include at least the user's original behavior and the behavior after a change, together with parameter values representing the original behavior and the changed behavior.
A variable parameter captures a departure from the original plan: the user was in one state, and a sudden change puts the user in another. The variable parameter represents this change of behavior or state, as well as the user's state or behavior after the change. For example, the user originally planned to run at 5 p.m. but something else came up, such as going to play ball; the change from running to playing ball is a variable parameter, and the probability of such a change is also studied.
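The description above suggests a simple representation for one such change. The following sketch is illustrative only: the patent does not specify a data layout, and the field names and the example probability value are assumptions:

```python
from dataclasses import dataclass


@dataclass
class VariableParameter:
    """One sudden change on the life timeline: the originally planned
    behavior, the behavior after the change, and the studied probability
    of that change occurring."""
    original_behavior: str
    changed_behavior: str
    change_probability: float


# The 5 p.m. example: running is replaced by playing ball.
change = VariableParameter("running", "playing ball", 0.2)
print(change.changed_behavior)  # playing ball
```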
According to one example, the step of generating the interactive content according to the multimodal information and the variable parameters specifically includes: generating the interactive content according to the multimodal information, the variable parameters, and a fitted curve of parameter-change probabilities.
The fitted curve can thus be produced by probability training on the variable parameters, and the robot's interactive content generated from it.
According to one example, the method for generating the fitted curve of parameter-change probabilities includes: using a probability algorithm, estimating the parameters between robots with a network, and computing, for a robot on the life timeline, the probability that each parameter changes after the scene parameters on the life timeline change, thereby forming the fitted curve of parameter-change probabilities. The probability algorithm may be a Bayesian probability algorithm.
By extending the robot's self-cognition in scenes that incorporate the variable parameters, and fitting the parameters of its self-cognition to the parameters of the scenes used in the variable parameters, an anthropomorphic effect is produced. Combined with recognition of the location scene, the robot knows its geographical position and changes how it generates interactive content according to the geographical environment it is in. In addition, a Bayesian probability algorithm is used to estimate the parameters between robots with a Bayesian network and to compute the probability that each parameter changes after the scene parameters on the robot's own timeline change, forming a fitted curve that dynamically influences the robot's self-cognition. This module gives the robot a human-like way of life; its expressions, for instance, can change according to the location scene it is in.
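The patent names a Bayesian network for the probability estimation but gives no further detail. As an illustrative stand-in only, the following sketch fits a per-hour change probability from observed frequencies; the function name, data shape, and sample observations are all assumptions:

```python
from collections import Counter


def fit_change_curve(observations):
    """observations: iterable of (hour_on_life_timeline, changed) pairs.
    Returns {hour: estimated probability that the scene parameter changes},
    a crude frequency estimate standing in for the Bayesian-network
    estimation described in the text."""
    totals, changes = Counter(), Counter()
    for hour, changed in observations:
        totals[hour] += 1
        if changed:
            changes[hour] += 1
    return {h: changes[h] / totals[h] for h in totals}


# Three observations at 5 p.m. (two changes) and one unchanged one at noon.
curve = fit_change_curve([(17, True), (17, False), (17, True), (12, False)])
print(curve[17])  # 0.6666666666666666
```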
Embodiment 2
As shown in FIG. 2, this embodiment discloses a system for synchronizing speech and virtual actions, including:
an acquisition module 201, configured to acquire multimodal information of a user;
an artificial intelligence module 202, configured to generate interactive content according to the user's multimodal information and variable parameters, the interactive content including at least voice information and action information, where the variable parameters are generated by a variable parameter module 301; and
a control module 203, configured to adjust the duration of the voice information and the duration of the action information to be the same.
Interactive content can thus be generated from one or more kinds of the user's multimodal information, such as the user's speech, expressions, or actions. The interactive content includes at least voice information and action information, and to keep them synchronized, the duration of the voice information and the duration of the action information are adjusted to be the same. The robot can then keep sound and action matched during playback, so that its interaction is expressed not only through speech but also through actions and other forms. This makes the robot's forms of expression more varied and more anthropomorphic, and improves the user's experience when interacting with the robot.
The multimodal information in this embodiment may be one or more of user expressions, voice information, gesture information, scene information, image information, video information, face information, pupil and iris information, light-sensing information, and fingerprint information.
In this embodiment, the variable parameters are specifically sudden changes arising between the human and the machine. For example, a day on the timeline consists of eating, sleeping, interacting, running, eating, and sleeping. If the robot's scene is suddenly changed, for example it is taken to the seaside during the time slot for running, such human-initiated changes to the robot's parameters serve as variable parameters, and these changes alter the robot's self-cognition. The life timeline and the variable parameters can modify attributes in the self-cognition, such as the mood value or the fatigue value, and can also automatically add new self-cognition information: for instance, if there was previously no anger value, scenes based on the life timeline and the variable factors will automatically add one to the robot's self-cognition, following the scenes that previously simulated human self-cognition.
For example, according to the life timeline, 12 noon should be mealtime. If that scene changes, for example the user goes shopping at 12 noon, the robot writes this as one of the variable parameters. When the user interacts with the robot during that period, the robot generates interactive content based on going shopping at 12 noon rather than on the previous mealtime at 12 noon. When actually generating the interactive content, the robot combines the acquired multimodal information of the user, such as voice information, video information, and picture information, with the variable parameters. In this way, unexpected events from human life can be incorporated into the robot's life axis, making its interaction more anthropomorphic.
In this embodiment, the control module is specifically configured to:
if the difference between the duration of the voice information and the duration of the action information is not greater than a threshold, and the duration of the voice information is less than the duration of the action information, speed up the playback of the action information so that the duration of the action information equals the duration of the voice information; and
if the duration of the voice information is greater than the duration of the action information, speed up the playback of the voice information or/and slow down the playback of the action information so that the duration of the action information equals the duration of the voice information.
Therefore, when the difference between the duration of the voice information and the duration of the action information does not exceed the threshold, the adjustment specifically means compressing or stretching the duration of the voice information or/and the action information, or speeding up or slowing down playback, for example multiplying the playback speed of the voice information by 2, or multiplying the playback time of the action information by 0.8.
For example, suppose the threshold is one minute, and in the interactive content the robot generates from the user's multimodal information the voice lasts 1 minute and the action lasts 2 minutes. The playback of the action information can then be accelerated to twice its original speed, so that its adjusted playback time is 1 minute and it is synchronized with the voice information. Alternatively, the playback of the voice information can be slowed to 0.5 times its original speed, stretching it to 2 minutes and synchronizing it with the action information. Both can also be adjusted, for example slowing the voice while speeding up the action so that each lasts 1 minute 30 seconds, again synchronizing speech and action.
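A minimal sketch of this small-difference case, assuming durations in seconds. The rate convention (normalizing both tracks to the voice duration) and the names are illustrative, not prescribed by the patent:

```python
def scale_to_match(voice_len: float, action_len: float, threshold: float):
    """Return playback-rate multipliers (voice_rate, action_rate) that make
    both tracks finish in the same wall-clock time.

    Applies only when |voice_len - action_len| <= threshold; the patent's
    large-difference case (adding or dropping action segments) is handled
    separately.
    """
    if abs(voice_len - action_len) > threshold:
        raise ValueError("difference exceeds threshold; recombine actions instead")
    # Play both tracks over the voice track's duration: the voice keeps its
    # natural rate and the action track is sped up or slowed down to fit.
    voice_rate = 1.0
    action_rate = action_len / voice_len  # >1 speeds up, <1 slows down
    return voice_rate, action_rate


# Voice 60 s, action 120 s: the action plays at 2x so both last 60 s.
print(scale_to_match(60, 120, 60))  # (1.0, 2.0)
```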
In addition, in this embodiment, the control module is specifically configured to:
if the difference between the duration of the voice information and the duration of the action information is greater than the threshold, and the duration of the voice information is greater than the duration of the action information, combine at least two sets of action information so that the duration of the combined action information equals the duration of the voice information; and
if the duration of the voice information is less than the duration of the action information, select a subset of the actions in the action information so that the duration of the selected subset equals the duration of the voice information.
Therefore, when the difference between the duration of the voice information and the duration of the action information exceeds the threshold, the adjustment means adding or deleting part of the action information so that its duration matches that of the voice information.
For example, suppose the threshold is 30 seconds, and in the interactive content the robot generates from the user's multimodal information the voice lasts 3 minutes while the action lasts 1 minute. Additional action information must then be appended to the original action: for instance, find an action segment 2 minutes long and order and combine the two segments so that the combined action matches the duration of the voice. Of course, if no 2-minute segment is found but a 2.5-minute one is, a subset of its actions (for example, a subset of its frames) can be selected so that the selected portion lasts 2 minutes, again matching the duration of the voice.
In this embodiment, the artificial intelligence module may be specifically configured to select, according to the duration of the voice information, the action information whose duration is closest to it, or to select the closest voice information according to the duration of the action information.
Selecting by duration in this way makes it easier for the control module to adjust the durations of the voice information and the action information to match, and the adjusted playback is more natural and smooth.
According to one example, the system further includes an output module 204, configured to output the adjusted voice information and action information to a virtual image for presentation.
The output thus occurs only after the durations have been matched, and can be rendered on a virtual image, making the virtual robot more anthropomorphic and improving the user experience.
According to one example, the system further includes a processing module, configured to fit the parameters of the robot's self-cognition to the parameters of the scenes in the variable parameters to generate the variable parameters.
By extending the robot's self-cognition in scenes that incorporate the variable parameters, and fitting the parameters of its self-cognition to the parameters of the scenes used in the variable parameters, an anthropomorphic effect is produced.
According to one example, the variable parameters include at least the user's original behavior and the behavior after a change, together with parameter values representing the original behavior and the changed behavior.
A variable parameter captures a departure from the original plan: the user was in one state, and a sudden change puts the user in another. The variable parameter represents this change of behavior or state, as well as the user's state or behavior after the change. For example, the user originally planned to run at 5 p.m. but something else came up, such as going to play ball; the change from running to playing ball is a variable parameter, and the probability of such a change is also studied.
According to one example, the artificial intelligence module is specifically configured to generate the interactive content according to the multimodal information, the variable parameters, and a fitted curve of parameter-change probabilities.
The fitted curve can thus be produced by probability training on the variable parameters, and the robot's interactive content generated from it.
According to one example, the system includes a fitted-curve generation module, configured to use a probability algorithm to estimate the parameters between robots with a network and to compute, for a robot on the life timeline, the probability that each parameter changes after the scene parameters on the life timeline change, thereby forming the fitted curve of parameter-change probabilities. The probability algorithm may be a Bayesian probability algorithm.
By extending the robot's self-cognition in scenes that incorporate the variable parameters, and fitting the parameters of its self-cognition to the parameters of the scenes used in the variable parameters, an anthropomorphic effect is produced. Combined with recognition of the location scene, the robot knows its geographical position and changes how it generates interactive content according to the geographical environment it is in. In addition, a Bayesian probability algorithm is used to estimate the parameters between robots with a Bayesian network and to compute the probability that each parameter changes after the scene parameters on the robot's own timeline change, forming a fitted curve that dynamically influences the robot's self-cognition. This module gives the robot a human-like way of life; its expressions, for instance, can change according to the location scene it is in.
The present invention discloses a robot, including a system for synchronizing speech and virtual actions as described in any of the above.
The foregoing is a further detailed description of the present invention in connection with specific preferred embodiments, and the specific implementation of the present invention is not to be considered limited to these descriptions. A person of ordinary skill in the art to which the present invention belongs may make a number of simple deductions or substitutions without departing from the concept of the present invention, all of which shall be regarded as falling within the protection scope of the present invention.

Claims (19)

  1. A method for synchronizing speech and virtual actions, comprising:
    acquiring multimodal information of a user;
    generating interactive content according to the user's multimodal information and variable parameters, the interactive content including at least voice information and action information; and
    adjusting the duration of the voice information and the duration of the action information to be the same.
  2. The method according to claim 1, wherein the step of adjusting the duration of the voice information and the duration of the action information to be the same comprises:
    if the difference between the duration of the voice information and the duration of the action information is not greater than a threshold, and the duration of the voice information is less than the duration of the action information, speeding up the playback of the action information so that the duration of the action information equals the duration of the voice information.
  3. The method according to claim 2, wherein, when the duration of the voice information is greater than the duration of the action information, the playback of the voice information is sped up or/and the playback of the action information is slowed down so that the duration of the action information equals the duration of the voice information.
  4. The method according to claim 1, wherein the step of adjusting the duration of the voice information and the duration of the action information to be the same comprises:
    if the difference between the duration of the voice information and the duration of the action information is greater than a threshold, and the duration of the voice information is greater than the duration of the action information, ordering and combining at least two sets of action information so that the duration of the combined action information equals the duration of the voice information.
  5. The method according to claim 4, wherein, when the duration of the voice information is less than the duration of the action information, a subset of the actions in the action information is selected so that the duration of the selected subset equals the duration of the voice information.
  6. The method according to claim 1, wherein the method for generating the robot's variable parameters comprises: fitting the parameters of the robot's self-cognition to the parameters of the scenes in the variable parameters to generate the robot's variable parameters.
  7. The method according to claim 6, wherein the variable parameters include at least the user's original behavior and the behavior after a change, together with parameter values representing the original behavior and the changed behavior.
  8. The method according to claim 1, wherein the step of generating the interactive content according to the multimodal information and the variable parameters comprises: generating the interactive content according to the multimodal information, the variable parameters, and a fitted curve of parameter-change probabilities.
  9. The method according to claim 8, wherein the method for generating the fitted curve of parameter-change probabilities comprises: using a probability algorithm, estimating the parameters between robots with a network, and computing, for a robot on the life timeline, the probability that each parameter changes after the scene parameters on the life timeline change, thereby forming the fitted curve of parameter-change probabilities.
  10. A system for synchronizing speech and virtual actions, comprising:
    an acquisition module, configured to acquire multimodal information of a user;
    an artificial intelligence module, configured to generate interactive content according to the user's multimodal information and variable parameters, the interactive content including at least voice information and action information; and
    a control module, configured to adjust the duration of the voice information and the duration of the action information to be the same.
  11. The system according to claim 10, wherein the control module is specifically configured to:
    if the difference between the duration of the voice information and the duration of the action information is not greater than a threshold, and the duration of the voice information is less than the duration of the action information, speed up the playback of the action information so that the duration of the action information equals the duration of the voice information.
  12. The system according to claim 11, wherein, when the duration of the voice information is greater than the duration of the action information, the playback of the voice information is sped up or/and the playback of the action information is slowed down so that the duration of the action information equals the duration of the voice information.
  13. The system according to claim 10, wherein the control module is specifically configured to:
    if the difference between the duration of the voice information and the duration of the action information is greater than a threshold, and the duration of the voice information is greater than the duration of the action information, combine at least two sets of action information so that the duration of the combined action information equals the duration of the voice information.
  14. The system according to claim 13, wherein, when the duration of the voice information is less than the duration of the action information, a subset of the actions in the action information is selected so that the duration of the selected subset equals the duration of the voice information.
  15. The system according to claim 10, further comprising a processing module, configured to fit the parameters of the robot's self-cognition to the parameters of the scenes in the variable parameters to generate the variable parameters.
  16. The system according to claim 15, wherein the variable parameters include at least the user's original behavior and the behavior after a change, together with parameter values representing the original behavior and the changed behavior.
  17. The system according to claim 10, wherein the artificial intelligence module is specifically configured to generate the interactive content according to the multimodal information, the variable parameters, and a fitted curve of parameter-change probabilities.
  18. The system according to claim 17, wherein the system comprises a fitted-curve generation module, configured to use a probability algorithm to estimate the parameters between robots with a network and to compute, for a robot on the life timeline, the probability that each parameter changes after the scene parameters on the life timeline change, thereby forming the fitted curve of parameter-change probabilities.
  19. A robot, comprising a system for synchronizing speech and virtual actions according to any one of claims 10 to 18.
PCT/CN2016/089213 2016-07-07 2016-07-07 Method and system for synchronizing speech and virtual actions, and robot WO2018006369A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/CN2016/089213 WO2018006369A1 (en) 2016-07-07 2016-07-07 Method and system for synchronizing speech and virtual actions, and robot
CN201680001720.7A CN106471572B (en) 2016-07-07 2016-07-07 Method, system and the robot of a kind of simultaneous voice and virtual acting
JP2017133167A JP6567609B2 (en) 2016-07-07 2017-07-06 Synchronizing voice and virtual motion, system and robot body

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/089213 WO2018006369A1 (en) 2016-07-07 2016-07-07 Method and system for synchronizing speech and virtual actions, and robot

Publications (1)

Publication Number Publication Date
WO2018006369A1

Family

ID=58230946

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/089213 WO2018006369A1 (en) 2016-07-07 2016-07-07 Method and system for synchronizing speech and virtual actions, and robot

Country Status (3)

Country Link
JP (1) JP6567609B2 (en)
CN (1) CN106471572B (en)
WO (1) WO2018006369A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110610703A (en) * 2019-07-26 2019-12-24 深圳壹账通智能科技有限公司 Speech output method, device, robot and medium based on robot recognition

Families Citing this family (6)

Publication number Priority date Publication date Assignee Title
CN107457787B (en) * 2017-06-29 2020-12-08 杭州仁盈科技股份有限公司 Service robot interaction decision-making method and device
CN107577661B (en) * 2017-08-07 2020-12-11 北京光年无限科技有限公司 Interactive output method and system for virtual robot
CN107784355A (en) * 2017-10-26 2018-03-09 北京光年无限科技有限公司 The multi-modal interaction data processing method of visual human and system
CN109822587B (en) * 2019-03-05 2022-05-31 哈尔滨理工大学 Control method for head and neck device of voice diagnosis guide robot for factory and mine hospitals
WO2021085193A1 (en) * 2019-10-30 2021-05-06 ソニー株式会社 Information processing device and command processing method
CN115497499A (en) * 2022-08-30 2022-12-20 阿里巴巴(中国)有限公司 Method for synchronizing voice and action time

Citations (8)

Publication number Priority date Publication date Assignee Title
US5111409A (en) * 1989-07-21 1992-05-05 Elon Gasper Authoring and use systems for sound synchronized animation
CN101364309A (en) * 2008-10-09 2009-02-11 中国科学院计算技术研究所 Cartoon generating method for mouth shape of source virtual characters
US20090044112A1 (en) * 2007-08-09 2009-02-12 H-Care Srl Animated Digital Assistant
CN101968894A (en) * 2009-07-28 2011-02-09 上海冰动信息技术有限公司 Method for automatically realizing sound and lip synchronization through Chinese characters
CN103596051A (en) * 2012-08-14 2014-02-19 金运科技股份有限公司 A television apparatus and a virtual emcee display method thereof
CN104574478A (en) * 2014-12-30 2015-04-29 北京像素软件科技股份有限公司 Method and device for editing mouth shapes of animation figures
CN104866101A (en) * 2015-05-27 2015-08-26 世优(北京)科技有限公司 Real-time interactive control method and real-time interactive control device of virtual object
CN104883557A (en) * 2015-05-27 2015-09-02 世优(北京)科技有限公司 Real time holographic projection method, device and system

Family Cites Families (15)

Publication number Priority date Publication date Assignee Title
JPH10143351A (en) * 1996-11-13 1998-05-29 Sharp Corp Interface unit
EP2175665B1 (en) * 1996-12-04 2012-11-21 Panasonic Corporation Optical disk for high resolution and three-dimensional video recording, optical disk reproduction apparatus, and optical disk recording apparatus
JP3792882B2 (en) * 1998-03-17 2006-07-05 株式会社東芝 Emotion generation device and emotion generation method
JP4032273B2 (en) * 1999-12-28 2008-01-16 ソニー株式会社 Synchronization control apparatus and method, and recording medium
JP4670136B2 (en) * 2000-10-11 2011-04-13 ソニー株式会社 Authoring system, authoring method, and storage medium
JP3930389B2 (en) * 2002-07-08 2007-06-13 三菱重工業株式会社 Motion program generation device and robot during robot utterance
JP2005003926A (en) * 2003-06-11 2005-01-06 Sony Corp Information processor, method, and program
EP1845724A1 (en) * 2005-02-03 2007-10-17 Matsushita Electric Industrial Co., Ltd. Recording/reproduction device, recording/reproduction method, recording/reproduction apparatus and recording/reproduction method, and recording medium storing recording/reproduction program, and integrated circuit for use in recording/reproduction apparatus
JP2008040726A (en) * 2006-08-04 2008-02-21 Univ Of Electro-Communications User support system and user support method
JP5045519B2 (en) * 2008-03-26 2012-10-10 トヨタ自動車株式会社 Motion generation device, robot, and motion generation method
WO2010038063A2 (en) * 2008-10-03 2010-04-08 Bae Systems Plc Assisting with updating a model for diagnosing failures in a system
CN101604204B (en) * 2009-07-09 2011-01-05 北京科技大学 Distributed cognitive technology for intelligent emotional robot
JP2011054088A (en) * 2009-09-04 2011-03-17 National Institute Of Information & Communication Technology Information processor, information processing method, program, and interactive system
JP2012215645A (en) * 2011-03-31 2012-11-08 Speakglobal Ltd Foreign language conversation training system using computer
CN105598972B (en) * 2016-02-04 2017-08-08 北京光年无限科技有限公司 A kind of robot system and exchange method

Also Published As

Publication number Publication date
JP2018001403A (en) 2018-01-11
CN106471572B (en) 2019-09-03
JP6567609B2 (en) 2019-08-28
CN106471572A (en) 2017-03-01

Similar Documents

Publication Publication Date Title
WO2018006369A1 (en) Method and system for synchronizing speech and virtual actions, and robot
WO2018006370A1 (en) Interaction method and system for virtual 3d robot, and robot
WO2018006371A1 (en) Method and system for synchronizing speech and virtual actions, and robot
TWI778477B (en) Interaction methods, apparatuses thereof, electronic devices and computer readable storage media
JP7109408B2 (en) Wide range simultaneous remote digital presentation world
KR101306221B1 (en) Method and apparatus for providing moving picture using 3d user avatar
WO2018000267A1 (en) Method for generating robot interaction content, system, and robot
WO2018000259A1 (en) Method and system for generating robot interaction content, and robot
US20220044490A1 (en) Virtual reality presentation of layers of clothing on avatars
WO2018000268A1 (en) Method and system for generating robot interaction content, and robot
JP2016071247A (en) Interaction device
WO2018006374A1 (en) Function recommending method, system, and robot based on automatic wake-up
WO2018006373A1 (en) Method and system for controlling household appliance on basis of intent recognition, and robot
WO2018006372A1 (en) Method and system for controlling household appliance on basis of intent recognition, and robot
US11681372B2 (en) Touch enabling process, haptic accessory, and core haptic engine to enable creation and delivery of tactile-enabled experiences with virtual objects
US20210375067A1 (en) Virtual reality presentation of clothing fitted on avatars
WO2018000266A1 (en) Method and system for generating robot interaction content, and robot
CN104616336B (en) A kind of animation construction method and device
WO2018000258A1 (en) Method and system for generating robot interaction content, and robot
EP4053792A1 (en) Information processing device, information processing method, and artificial intelligence model manufacturing method
EP4275147A1 (en) Producing a digital image representation of a body
JPWO2018168247A1 (en) Information processing apparatus, information processing method and program
Seib et al. Enhancing human-robot interaction by a robot face with facial expressions and synchronized lip movements
JP6637000B2 (en) Robot for deceased possession
Gillies et al. Piavca: a framework for heterogeneous interactions with virtual characters

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 16907874
Country of ref document: EP
Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: DE
122 EP: PCT application non-entry in European phase
Ref document number: 16907874
Country of ref document: EP
Kind code of ref document: A1