WO2018006369A1 - Method and system for synchronizing speech and virtual actions, and robot - Google Patents

Method and system for synchronizing speech and virtual actions, and robot

Info

Publication number
WO2018006369A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
length
time
voice
robot
Prior art date
Application number
PCT/CN2016/089213
Other languages
French (fr)
Chinese (zh)
Inventor
邱楠 (Qiu Nan)
杨新宇 (Yang Xinyu)
王昊奋 (Wang Haofen)
Original Assignee
深圳狗尾草智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳狗尾草智能科技有限公司
Priority to PCT/CN2016/089213 priority Critical patent/WO2018006369A1/en
Priority to CN201680001720.7A priority patent/CN106471572B/en
Priority to JP2017133167A priority patent/JP6567609B2/en
Publication of WO2018006369A1 publication Critical patent/WO2018006369A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J 13/00 Controls for manipulators
    • B25J 13/003 Controls for manipulators by means of an audio-responsive input
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/04 Time compression or expansion
    • G10L 21/055 Time compression or expansion for synchronising with other signals, e.g. video signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/226 Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics

Definitions

  • The present invention relates to the field of robot interaction technologies, and in particular to a method, a system, and a robot for synchronizing speech and virtual actions.
  • Robots are used more and more widely; for example, elderly people and children can interact with robots through dialogue and entertainment.
  • The inventor has developed a virtual robot display device and imaging system that can form a 3D animated image. The virtual robot's host accepts human instructions, such as voice, to interact with humans, and the virtual 3D animated image responds with sound and actions according to those instructions. This makes the robot more anthropomorphic: it can interact with humans not only through sound and expressions but also through actions, improving the interaction experience.
  • A method for synchronizing speech and virtual actions, including:
  • acquiring multimodal information of a user; generating interactive content according to the multimodal information and variable parameters, the interactive content including at least voice information and action information;
  • adjusting the time length of the voice information and the time length of the action information to be the same.
  • The specific steps of adjusting the time length of the voice information and the time length of the action information to be the same include:
  • if the difference between the time length of the voice information and the time length of the action information is not greater than a threshold, then when the time length of the voice information is less than the time length of the action information, accelerating the playback speed of the action information so that the time length of the action information equals the time length of the voice information;
  • and when the time length of the voice information is greater than the time length of the action information, accelerating the playback speed of the voice information or/and slowing down the playback speed of the action information so that the time length of the action information equals the time length of the voice information.
  • The specific steps of adjusting the time length of the voice information and the time length of the action information to be the same further include:
  • if the difference between the time length of the voice information and the time length of the action information is greater than the threshold, then when the time length of the voice information is greater than the time length of the action information, sorting and combining at least two sets of action information so that the time length of the combined action information equals the time length of the voice information;
  • and when the time length of the voice information is less than the time length of the action information, selecting part of the action information so that the time length of the selected part equals the time length of the voice information.
  • The method for generating the robot's variable parameters includes: fitting the parameters of the robot's self-cognition with the parameters of the scenes in the variable parameters to generate the variable parameters of the robot.
  • The variable parameters include at least the user's original behavior, the changed behavior that replaces it, and parameter values representing the original behavior and the behavior after the change.
  • The step of generating the interactive content according to the multimodal information and the variable parameters specifically includes: generating the interactive content according to the multimodal information, the variable parameters, and the fitting curve of the parameter change probability.
  • The method for generating the fitting curve of the parameter change probability includes: using a probability algorithm, making probability estimates among the robot's parameters on the life time axis with a network, and, after the robot's scene parameters on the life time axis change, forming the fitting curve of the parameter change probability from the probability of each parameter change.
  • A system for synchronizing speech and virtual actions, including:
  • an obtaining module, configured to acquire multimodal information of a user;
  • an artificial intelligence module, configured to generate interactive content according to the multimodal information of the user and variable parameters, where the interactive content includes at least voice information and action information; and
  • a control module, configured to adjust the time length of the voice information and the time length of the action information to be the same.
  • The control module is specifically configured to:
  • if the difference between the time length of the voice information and the time length of the action information is not greater than a threshold, then when the time length of the voice information is less than the time length of the action information, accelerate the playback speed of the action information so that the time length of the action information equals the time length of the voice information;
  • and when the time length of the voice information is greater than the time length of the action information, accelerate the playback speed of the voice information or/and slow down the playback speed of the action information so that the time length of the action information equals the time length of the voice information.
  • The control module is further specifically configured to:
  • if the difference between the time length of the voice information and the time length of the action information is greater than the threshold, then when the time length of the voice information is greater than the time length of the action information, combine at least two sets of action information so that the time length of the combined action information equals the time length of the voice information;
  • and when the time length of the voice information is less than the time length of the action information, select part of the action information so that the time length of the selected part equals the time length of the voice information.
  • The system further comprises a processing module for fitting the robot's self-cognitive parameters with the parameters of the scenes in the variable parameters to generate the variable parameters.
  • The variable parameters include at least the user's original behavior, the changed behavior that replaces it, and parameter values representing the original behavior and the behavior after the change.
  • The artificial intelligence module is specifically configured to: generate the interactive content according to the multimodal information, the variable parameters, and the fitting curve of the parameter change probability.
  • The system further includes a fitting-curve generating module, configured to: using a probability algorithm, make probability estimates among the robot's parameters on the life time axis with a network, and, after the robot's scene parameters on the life time axis change, form the fitting curve of the parameter change probability from the probability of each parameter change.
  • The invention also discloses a robot, comprising a system for synchronizing speech and virtual actions as described above.
  • Compared with the prior art, the method for synchronizing speech and virtual actions of the present invention includes: acquiring multimodal information of a user; generating interactive content according to the multimodal information and variable parameters, the interactive content including at least voice information and action information; and adjusting the time length of the voice information and the time length of the action information to be the same. In this way, interactive content can be generated from one or more kinds of the user's multimodal information, such as the user's voice, expressions, and actions.
  • Because the interactive content includes at least voice information and action information, and the two time lengths are adjusted to be the same so that sound and action are synchronized when the robot plays them, the robot not only speaks during interaction but also presents matching actions and other forms of expression. The robot's forms of expression are thus more diverse, the robot is more anthropomorphic, and the user's experience of interacting with the robot is improved.
  • FIG. 1 is a flowchart of a method for synchronizing speech and virtual actions according to Embodiment 1 of the present invention;
  • FIG. 2 is a schematic diagram of a system for synchronizing voice and virtual actions according to Embodiment 2 of the present invention.
  • Computer devices include user devices and network devices.
  • the user equipment or the client includes but is not limited to a computer, a smart phone, a PDA, etc.;
  • The network device includes but is not limited to a single network server, a server group composed of multiple network servers, or a cloud, based on cloud computing, composed of a large number of computers or network servers.
  • the computer device can operate alone to carry out the invention, and can also access the network and implement the invention through interoperation with other computer devices in the network.
  • the network in which the computer device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a VPN network, and the like.
  • The terms "first," "second," and the like may be used herein to describe various elements, but the elements should not be limited by these terms; the terms are used only to distinguish one element from another.
  • the term “and/or” used herein includes any and all combinations of one or more of the associated listed items. When a unit is referred to as being “connected” or “coupled” to another unit, it can be directly connected or coupled to the other unit, or an intermediate unit can be present.
  • As shown in FIG. 1, this embodiment discloses a method for synchronizing speech and virtual actions, including:
  • S101: acquiring multimodal information of a user;
  • S102: generating interactive content according to the multimodal information of the user and the variable parameter 300, where the interactive content includes at least voice information and action information;
  • S103: adjusting the time length of the voice information and the time length of the action information to be the same.
  • In this embodiment, the method for synchronizing speech and virtual actions includes: acquiring multimodal information of a user; generating interactive content according to the multimodal information and variable parameters, the interactive content including at least voice information and action information; and adjusting the time length of the voice information and the time length of the action information to be the same.
  • The interactive content can thus be generated from one or more kinds of the user's multimodal information, such as the user's voice, expressions, and actions.
  • Because the interactive content includes at least voice information and action information, the two time lengths are adjusted to be the same so that sound and action match synchronously when the robot plays them; the robot then not only speaks during interaction but also presents matching actions and other forms of expression, making its expression more diverse, the robot more anthropomorphic, and the user's experience of interacting with the robot better.
  • The multimodal information in this embodiment may be one or more of user expression, voice information, gesture information, scene information, image information, video information, face information, pupil and iris information, light-sensing information, and fingerprint information.
  • The variable parameters specifically capture sudden changes affecting the person and the machine. For example, a day on the time axis may consist of eating, sleeping, interacting, running, eating, and sleeping. If the robot's scene suddenly changes, for example it is taken to the beach at the time it would normally be running, such human-initiated parameters act on the robot as variable parameters and cause the robot's self-cognition to change.
  • The life time axis and the variable parameters can be used to change attributes of the robot's self-cognition, such as a mood value or a fatigue value, and can also automatically add new self-cognition information. For example, if there was previously no anger value, a scene involving a variable factor on the life time axis can automatically add one to the robot's self-cognition, based on scenes that previously simulated human self-cognition.
  • The robot records such a change as one of the variable parameters.
  • For example, the robot will then generate interactive content from going out shopping at 12 noon instead of from eating at 12 noon as it did before.
  • When specifically generating the interactive content, the robot combines the acquired multimodal information of the user, such as voice information, video information, and picture information, with the variable parameters. In this way, some unexpected events in human life can be added to the robot's life axis, making the robot's interaction more anthropomorphic.
  • In this embodiment, the specific steps of adjusting the time length of the voice information and the time length of the action information to be the same include:
  • if the difference between the time length of the voice information and the time length of the action information is not greater than a threshold, then when the time length of the voice information is less than the time length of the action information, accelerating the playback speed of the action information so that the time length of the action information equals the time length of the voice information;
  • and when the time length of the voice information is greater than the time length of the action information, accelerating the playback speed of the voice information or/and slowing down the playback speed of the action information so that the time length of the action information equals the time length of the voice information.
  • Here, adjusting specifically means compressing or stretching the time length of the voice information or/and the time length of the action information, that is, speeding up or slowing down playback, for example multiplying the playback speed of the voice information by 2, or multiplying the playback time of the action information by 0.8, and so on.
  • For example, the threshold for the difference between the time length of the voice information and the time length of the action information is one minute.
  • Suppose the time length of the voice information is 1 minute and the time length of the action information is 2 minutes; the difference does not exceed the threshold.
  • The playback speed of the action information can then be doubled, so that after adjustment the action information plays for 1 minute and is synchronized with the voice information.
  • Alternatively, the playback speed of the voice information can be slowed to 0.5 times the original speed, so that after adjustment the voice information plays for 2 minutes and is synchronized with the action information.
  • Of course, both can be adjusted at once, for example slowing the voice information and accelerating the action information so that each plays for 1 minute 30 seconds; voice and action are then also synchronized.
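The speed-based adjustment described above can be sketched as a pair of playback-rate multipliers. This is only an illustration: the function name, the default 60-second threshold (taken from the example), and the choice to always match the shorter track are assumptions, since the text allows several strategies.

```python
def playback_rates(voice_len, action_len, threshold=60.0):
    """Return (voice_rate, action_rate) playback-speed multipliers that
    make voice and action finish together. A rate of 2.0 means the
    track plays twice as fast (half the wall-clock time).

    Strategy (one of several the text allows): keep the shorter track's
    speed unchanged and speed up the longer one to match it.
    """
    if abs(voice_len - action_len) > threshold:
        raise ValueError("difference exceeds threshold; combine or cut actions instead")
    if voice_len <= action_len:
        # voice is shorter: accelerate the action down to the voice's length
        return 1.0, action_len / voice_len
    # voice is longer: accelerate the voice down to the action's length
    return voice_len / action_len, 1.0
```

With the 1-minute voice and 2-minute action from the example, `playback_rates(60, 120)` yields `(1.0, 2.0)`: the action plays at double speed, matching the doubling described above.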
  • The specific steps of adjusting the time length of the voice information and the time length of the action information to be the same further include:
  • if the difference between the time length of the voice information and the time length of the action information is greater than the threshold, then when the time length of the voice information is greater than the time length of the action information, sorting and combining at least two sets of action information so that the time length of the combined action information equals the time length of the voice information;
  • and when the time length of the voice information is less than the time length of the action information, selecting part of the action information so that the time length of the selected part equals the time length of the voice information.
  • Here, adjusting means adding or deleting part of the action information so that the time length of the action information becomes the same as the time length of the voice information.
  • For example, the threshold for the difference between the time length of the voice information and the time length of the action information is 30 seconds.
  • Suppose the time length of the voice information is 3 minutes and the time length of the action information is 1 minute; the difference exceeds the threshold.
  • Other action information then needs to be added to the original action information, for example another set with a time length of 2 minutes; after the two sets of action information are sorted and combined, their total time length matches the 3-minute time length of the voice information.
  • Conversely, when the voice information is the shorter one, part of the action information is selected, for example a 2-minute part, so that its time length matches the time length of the voice information.
  • When selecting action information, the action information closest to the time length of the voice information may be chosen according to the time length of the voice information, or the voice information closest in time length may be chosen according to the time length of the action information.
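The combine-or-select branch can be sketched greedily. Everything here is illustrative: the function name, the greedy longest-first choice, and the truncation of the final clip are assumptions; the text only says action sets are sorted and combined, or partly selected.

```python
def fit_actions(voice_len, current_clips, clip_pool):
    """Make the total action duration match voice_len (seconds).

    If the actions are shorter than the voice, greedily append clips
    from clip_pool (longest first); if longer, keep only a leading
    part of the clips and truncate the last one.
    """
    total = sum(current_clips)
    clips = list(current_clips)
    if total < voice_len:
        for clip in sorted(clip_pool, reverse=True):
            if total + clip <= voice_len:
                clips.append(clip)
                total += clip
            if total == voice_len:
                break
    elif total > voice_len:
        kept, acc = [], 0
        for clip in clips:
            if acc + clip <= voice_len:
                kept.append(clip)
                acc += clip
            else:
                kept.append(voice_len - acc)  # truncate the final clip
                break
        clips = kept
    return clips
```

With the 3-minute voice and 1-minute action from the example, `fit_actions(180, [60], [120, 45])` appends the 2-minute clip, giving `[60, 120]` for a 3-minute total.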
  • In this way, the control module can conveniently adjust the time lengths of the voice information and the action information; adjusting them to be equal is easier, and the adjusted playback is more natural and smooth.
  • After the adjusting step, the method further includes: outputting the adjusted voice information and action information to a virtual image for presentation.
  • That is, once the voice information and action information have been made consistent, they are output on the virtual image, making the virtual robot more anthropomorphic and improving the user experience.
  • In this embodiment, the method for generating the robot's variable parameters includes: fitting the parameters of the robot's self-cognition with the parameters of the scenes in the variable parameters to generate the variable parameters of the robot.
  • In other words, the parameters of the robot's self-cognition are matched with the parameters of the scenes on the variable-parameter time axis, producing an anthropomorphic effect.
  • The variable parameters include at least the user's original behavior, the changed behavior that replaces it, and parameter values representing the original behavior and the behavior after the change.
  • According to the original plan, the user would be in one state; a sudden change puts the user in another state. A variable parameter represents this change of behavior or state, together with the user's state or behavior after the change. For example, the user originally planned to run at 5 p.m., but something else came up, such as going out to play; the change from running to playing is then a variable parameter, and the probability of such a change is also studied.
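One way to picture such a variable parameter is as a small record holding the planned behavior, the changed behavior, and the change probability. The field names and the 0.2 probability are purely illustrative; the patent does not define a data structure.

```python
from dataclasses import dataclass

@dataclass
class VariableParameter:
    """A learned deviation on the life time axis: the planned behavior,
    the behavior it changed into, and how likely that change is."""
    time_slot: str        # e.g. "17:00" on the life time axis
    original: str         # the user's originally planned behavior
    changed_to: str       # the behavior after the sudden change
    change_prob: float    # estimated probability of this change

# The running-to-playing example from the text (probability is made up):
p = VariableParameter("17:00", "running", "playing", 0.2)
```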
  • The step of generating the interactive content according to the multimodal information and the variable parameters specifically includes: generating the interactive content according to the multimodal information, the variable parameters, and the fitting curve of the parameter change probability.
  • The fitting curve can be generated in advance by probability training on the variable parameters, and the robot's interactive content is then generated accordingly.
  • The method for generating the fitting curve of the parameter change probability includes: using a probability algorithm, making probability estimates among the robot's parameters on the life time axis with a network, and, after the robot's scene parameters on the life time axis change, forming the fitting curve of the parameter change probability from the probability of each parameter change.
  • Preferably, the probability algorithm can be a Bayesian probability algorithm.
  • The parameters of the robot's self-cognition are matched with the parameters of the scenes on the variable-parameter time axis, producing an anthropomorphic effect.
  • For example, the robot knows its geographical location and changes the way it generates interactive content according to the geographical environment it is in.
  • Using the Bayesian probability algorithm, probability estimates are made among the robot's parameters with a Bayesian network, and the probability of each parameter change is calculated after the robot's own time-axis scene parameters change on the life time axis.
  • The resulting curve dynamically affects the robot's own self-cognition.
  • This innovative module gives the robot itself a human lifestyle; in its expression, the robot can change according to the location scene, for example.
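As a minimal stand-in for the probability estimate behind the fitting curve, the sketch below computes a Laplace-smoothed (Beta(1,1) posterior-mean) change probability per slot of the life time axis. The patent names a Bayesian network without specifying its structure, so this simple Bayesian estimate is an assumption, not the claimed method.

```python
def change_probability_curve(changes, trials):
    """For each slot on the life time axis, estimate the probability
    that the planned scene parameter changes, given `changes` observed
    changes out of `trials` observations per slot.

    Uses the posterior mean of a Beta(1,1) prior (Laplace smoothing);
    the returned list is a discrete form of the 'fitting curve of the
    parameter change probability'.
    """
    return [(c + 1) / (n + 2) for c, n in zip(changes, trials)]
```

For example, a slot where the plan changed 3 times in 8 observations gets probability (3+1)/(8+2) = 0.4, while an unchanged slot still gets a small nonzero probability rather than zero.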
  • As shown in FIG. 2, this embodiment discloses a system for synchronizing speech and virtual actions, including:
  • the obtaining module 201, configured to acquire multimodal information of a user;
  • the artificial intelligence module 202, configured to generate interactive content according to the multimodal information of the user and variable parameters, where the interactive content includes at least voice information and action information, and the variable parameters are generated by the variable parameter module 301; and
  • the control module 203, configured to adjust the time length of the voice information and the time length of the action information to be the same.
  • The interactive content can be generated from one or more kinds of the user's multimodal information, such as the user's voice, expressions, and actions.
  • Because the interactive content includes at least voice information and action information, the time length of the voice information and the time length of the action information are adjusted to be the same so that sound and action match synchronously when the robot plays them; the robot then not only speaks during interaction but also presents matching actions and other forms of expression, making its expression more diverse, the robot more anthropomorphic, and the user's experience of interacting with the robot better.
  • The multimodal information in this embodiment may be one or more of user expression, voice information, gesture information, scene information, image information, video information, face information, pupil and iris information, light-sensing information, and fingerprint information.
  • The variable parameters specifically capture sudden changes affecting the person and the machine. For example, a day on the time axis may consist of eating, sleeping, interacting, running, eating, and sleeping. If the robot's scene suddenly changes, for example it is taken to the beach at the time it would normally be running, such human-initiated parameters act on the robot as variable parameters and cause the robot's self-cognition to change.
  • The life time axis and the variable parameters can be used to change attributes of the robot's self-cognition, such as a mood value or a fatigue value, and can also automatically add new self-cognition information. For example, if there was previously no anger value, a scene involving a variable factor on the life time axis can automatically add one to the robot's self-cognition, based on scenes that previously simulated human self-cognition.
  • The robot records such a change as one of the variable parameters.
  • For example, the robot will then generate interactive content from going out shopping at 12 noon instead of from eating at 12 noon as it did before.
  • When specifically generating the interactive content, the robot combines the acquired multimodal information of the user, such as voice information, video information, and picture information, with the variable parameters. In this way, some unexpected events in human life can be added to the robot's life axis, making the robot's interaction more anthropomorphic.
  • The control module is specifically configured to:
  • if the difference between the time length of the voice information and the time length of the action information is not greater than the threshold, then when the time length of the voice information is less than the time length of the action information, accelerate the playback speed of the action information so that the time length of the action information equals the time length of the voice information;
  • and when the time length of the voice information is greater than the time length of the action information, accelerate the playback speed of the voice information or/and slow down the playback speed of the action information so that the time length of the action information equals the time length of the voice information.
  • Here, adjusting specifically means compressing or stretching the time length of the voice information or/and the time length of the action information, that is, speeding up or slowing down playback, for example multiplying the playback speed of the voice information by 2, or multiplying the playback time of the action information by 0.8, and so on.
  • For example, the threshold for the difference between the time length of the voice information and the time length of the action information is one minute.
  • Suppose the time length of the voice information is 1 minute and the time length of the action information is 2 minutes; the difference does not exceed the threshold.
  • The playback speed of the action information can then be doubled, so that after adjustment the action information plays for 1 minute and is synchronized with the voice information.
  • Alternatively, the playback speed of the voice information can be slowed to 0.5 times the original speed, so that after adjustment the voice information plays for 2 minutes and is synchronized with the action information.
  • Of course, both can be adjusted at once, for example slowing the voice information and accelerating the action information so that each plays for 1 minute 30 seconds; voice and action are then also synchronized.
  • The control module is further specifically configured to:
  • if the difference between the time length of the voice information and the time length of the action information is greater than the threshold, then when the time length of the voice information is greater than the time length of the action information, combine at least two sets of action information so that the time length of the combined action information equals the time length of the voice information;
  • and when the time length of the voice information is less than the time length of the action information, select part of the action information so that the time length of the selected part equals the time length of the voice information.
  • Here, adjusting means adding or deleting part of the action information so that the time length of the action information becomes the same as the time length of the voice information.
  • For example, the threshold for the difference between the time length of the voice information and the time length of the action information is 30 seconds.
  • Suppose the time length of the voice information is 3 minutes and the time length of the action information is 1 minute; the difference exceeds the threshold.
  • Other action information then needs to be added to the original action information, for example another set with a time length of 2 minutes; after the two sets of action information are sorted and combined, their total time length matches the 3-minute time length of the voice information.
  • Conversely, when the voice information is the shorter one, part of the action information is selected, for example a 2-minute part, so that its time length matches the time length of the voice information.
  • In this embodiment, the artificial intelligence module may be specifically configured to: select the action information closest to the time length of the voice information according to the time length of the voice information, or select the voice information closest in time length according to the time length of the action information.
  • in this way the control module can conveniently adjust the time lengths of the voice information and the motion information; it is easier to adjust them to be consistent, and the adjusted playback is more natural and smooth.
  • the system further includes an output module 204 for outputting the adjusted voice information and motion information to the virtual image for presentation.
  • output occurs after the adjustment makes them consistent, and can be presented on the virtual image, thereby making the virtual robot more anthropomorphic and improving the user experience.
  • the system further includes a processing module for fitting the self-cognitive parameters of the robot with the parameters of the scene in the variable parameters to generate variable parameters.
  • the variable parameter includes at least the user's original behavior and the behavior after the change, as well as parameter values representing the change from the user's original behavior to the behavior after the change.
  • a variable parameter captures the case where, according to the original plan, the user is in one state, and a sudden change puts the user in another state; the variable parameter represents this change of behavior or state, as well as the user's state or behavior after the change. For example, the user was originally running at 5 p.m. and something else suddenly comes up, such as going to play ball; the change from running to playing ball is then a variable parameter, and the probability of such a change is also studied.
  • the artificial intelligence module is specifically configured to generate interactive content according to the multimodal information, the variable parameter, and the fitting curve of parameter change probability.
  • the fitting curve can be generated by probability training on the variable parameters, and the robot's interactive content is then generated from it.
  • the system includes a fitting-curve generation module for using a probability algorithm to make a network-based probability estimate of the robot's parameters and to calculate, after the scene parameters of the robot on the life time axis change, the probability of each parameter changing, forming the fitting curve of parameter change probability.
  • the probability algorithm can adopt the Bayesian probability algorithm.
  • the parameters in the self-cognition are fitted with the parameters of the scenes used in the variable parameters, producing an anthropomorphic effect.
  • the robot will know its geographical location, and will change the way the interactive content is generated according to the geographical environment in which it is located.
  • a Bayesian probability algorithm is used to estimate the robot's parameters with a Bayesian network, and the probability of each parameter changing is calculated after the scene parameters of the robot itself on the life time axis change.
  • the curve dynamically affects the self-recognition of the robot itself.
  • This module gives the robot itself a human-like lifestyle; in terms of expression, the robot can change according to the scene of its location.
  • the invention discloses a robot comprising a system for synchronizing speech and virtual actions as described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Data Mining & Analysis (AREA)
  • Manipulator (AREA)

Abstract

A method for synchronizing speech and virtual actions, comprising: obtaining multimodal information of a user (S101); generating interactive content according to the multimodal information of the user and a variable parameter (300), the interactive content at least comprising speech information and action information (S102); and adjusting the length of time of the speech information and the length of time of the action information to be the same (S103). The interactive content is generated according to one or more types of the multimodal information of the user, such as user's speech, a user's expression, and a user's action. Moreover, to synchronize the speech information and the action information, the length of time of the speech information and the length of time of the action information are adjusted to be the same, so that sound and actions of a robot can be synchronized and matched during playing, and the robot can use not only speech but also multiple other expression forms, such as actions, for interaction. Therefore, the expression forms of the robot are further diversified, the robot is more humanized, and the user experience in interaction with the robot is also improved.

Description

Method, system and robot for synchronizing voice and virtual actions
Technical field
The present invention relates to the field of robot interaction technologies, and in particular to a method, system and robot for synchronizing voice and virtual actions.
Background art
As an interactive tool with humans, robots are used in more and more situations; for example, elderly people or children who feel lonely can interact with a robot through dialogue, entertainment, and so on. To make robots more anthropomorphic when interacting with humans, the inventors developed a display device and imaging system for a virtual robot that can form a 3D animated image. The host of the virtual robot receives human instructions, such as voice, to interact with humans, and the virtual 3D animated image then replies with sound and actions according to the host's instructions. This makes the robot more anthropomorphic: it can interact with humans not only through sound and expressions but also through actions, greatly improving the interaction experience.
However, how a virtual robot synchronizes the voice and the virtual actions in its reply is a rather complicated problem; if the voice and the actions do not match, the user's interaction experience is greatly affected.
Therefore, providing a method, system and robot for synchronizing voice and virtual actions that improve the human-computer interaction experience has become a technical problem in urgent need of a solution.
Summary of the invention
It is an object of the present invention to provide a method, system and robot for synchronizing voice and virtual actions that improve the human-computer interaction experience.
The object of the present invention is achieved by the following technical solutions:
A method for synchronizing voice and virtual actions, comprising:
acquiring multimodal information of a user;
generating interactive content according to the user's multimodal information and a variable parameter, the interactive content comprising at least voice information and action information;
adjusting the time length of the voice information and the time length of the action information to be the same.
Preferably, the specific steps of adjusting the time length of the voice information and the time length of the action information to be the same comprise:
if the difference between the time length of the voice information and the time length of the action information is not greater than a threshold, and the time length of the voice information is less than the time length of the action information, accelerating the playback speed of the action information so that the time length of the action information equals the time length of the voice information.
Preferably, when the time length of the voice information is greater than the time length of the action information, the playback speed of the voice information is accelerated and/or the playback speed of the action information is slowed, so that the time length of the action information equals the time length of the voice information.
Preferably, the specific steps of adjusting the time length of the voice information and the time length of the action information to be the same comprise:
if the difference between the time length of the voice information and the time length of the action information is greater than a threshold, and the time length of the voice information is greater than the time length of the action information, sorting and combining at least two sets of action information so that the time length of the combined action information equals the time length of the voice information.
Preferably, when the time length of the voice information is less than the time length of the action information, part of the actions in the action information are selected so that the time length of the selected part equals the time length of the voice information.
Preferably, the method for generating the robot's variable parameter comprises: fitting the parameters of the robot's self-cognition with the parameters of the scenes in the variable parameter to generate the robot's variable parameter.
Preferably, the variable parameter comprises at least the user's original behavior and the behavior after a change, as well as parameter values representing the change from the user's original behavior to the behavior after the change.
Preferably, the step of generating interactive content according to the multimodal information and the variable parameter specifically comprises: generating the interactive content according to the multimodal information, the variable parameter, and a fitting curve of parameter change probability.
Preferably, the method for generating the fitting curve of parameter change probability comprises: using a probability algorithm to make a network-based probability estimate of the robot's parameters, and calculating, after the scene parameters of the robot on the life time axis change, the probability of each parameter changing, thereby forming the fitting curve of parameter change probability.
A system for synchronizing voice and virtual actions, comprising:
an acquisition module, configured to acquire multimodal information of a user;
an artificial intelligence module, configured to generate interactive content according to the user's multimodal information and a variable parameter, the interactive content comprising at least voice information and action information;
a control module, configured to adjust the time length of the voice information and the time length of the action information to be the same.
Preferably, the control module is specifically configured to:
if the difference between the time length of the voice information and the time length of the action information is not greater than a threshold, and the time length of the voice information is less than the time length of the action information, accelerate the playback speed of the action information so that the time length of the action information equals the time length of the voice information.
Preferably, when the time length of the voice information is greater than the time length of the action information, the playback speed of the voice information is accelerated and/or the playback speed of the action information is slowed, so that the time length of the action information equals the time length of the voice information.
Preferably, the control module is specifically configured to:
if the difference between the time length of the voice information and the time length of the action information is greater than a threshold, and the time length of the voice information is greater than the time length of the action information, combine at least two sets of action information so that the time length of the combined action information equals the time length of the voice information.
Preferably, when the time length of the voice information is less than the time length of the action information, part of the actions in the action information are selected so that the time length of the selected part equals the time length of the voice information.
Preferably, the system further comprises a processing module, configured to fit the parameters of the robot's self-cognition with the parameters of the scenes in the variable parameter to generate the variable parameter.
Preferably, the variable parameter comprises at least the user's original behavior and the behavior after a change, as well as parameter values representing the change from the user's original behavior to the behavior after the change.
Preferably, the artificial intelligence module is specifically configured to generate the interactive content according to the multimodal information, the variable parameter, and a fitting curve of parameter change probability.
Preferably, the system comprises a fitting-curve generation module, configured to use a probability algorithm to make a network-based probability estimate of the robot's parameters and to calculate, after the scene parameters of the robot on the life time axis change, the probability of each parameter changing, thereby forming the fitting curve of parameter change probability.
The present invention discloses a robot comprising a system for synchronizing voice and virtual actions as described in any of the above.
Compared with the prior art, the present invention has the following advantages. The method for synchronizing voice and virtual actions of the present invention comprises: acquiring multimodal information of a user; generating interactive content according to the user's multimodal information and a variable parameter, the interactive content comprising at least voice information and action information; and adjusting the time length of the voice information and the time length of the action information to be the same. In this way, interactive content can be generated from one or more kinds of the user's multimodal information, such as the user's voice, expressions and actions, and comprises at least voice information and action information. To synchronize the voice information and the action information, their time lengths are adjusted to be the same, so that the robot's sound and actions are matched during playback. The robot thus interacts not only through voice but also through various other forms of expression such as actions, which makes its forms of expression more diverse, makes the robot more anthropomorphic, and improves the user's experience when interacting with the robot.
Brief description of the drawings
Figure 1 is a flowchart of a method for synchronizing voice and virtual actions according to Embodiment 1 of the present invention;
Figure 2 is a schematic diagram of a system for synchronizing voice and virtual actions according to Embodiment 2 of the present invention.
Detailed description
Although the flowcharts describe the operations as sequential processing, many of the operations can be implemented in parallel, concurrently, or simultaneously. The order of the operations can be rearranged. Processing may be terminated when its operations are completed, but it may also have additional steps not included in the figures. Processing may correspond to methods, functions, procedures, subroutines, subprograms, and the like.
Computer devices include user devices and network devices. User devices or clients include, but are not limited to, computers, smartphones, PDAs, and the like; network devices include, but are not limited to, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of computers or network servers based on cloud computing. A computer device can operate alone to carry out the invention, or it can access a network and carry out the invention through interoperation with other computer devices in the network. The network in which the computer device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a VPN network, and the like.
The terms "first", "second", and so on may be used here to describe various units, but these units should not be limited by these terms; the terms are used only to distinguish one unit from another. The term "and/or" as used here includes any and all combinations of one or more of the associated listed items. When a unit is described as being "connected" or "coupled" to another unit, it can be directly connected or coupled to the other unit, or an intermediate unit may be present.
The terminology used here is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments. Unless the context clearly indicates otherwise, the singular forms "a" and "an" as used here are also intended to include the plural. It should also be understood that the terms "comprising" and/or "including" as used here specify the presence of the stated features, integers, steps, operations, units and/or components, without excluding the presence or addition of one or more other features, integers, steps, operations, units, components and/or combinations thereof.
The invention is further described below with reference to the drawings and preferred embodiments.
Embodiment 1
As shown in Figure 1, this embodiment discloses a method for synchronizing voice and virtual actions, comprising:
S101: acquiring multimodal information of a user;
S102: generating interactive content according to the user's multimodal information and a variable parameter 300, the interactive content comprising at least voice information and action information;
S103: adjusting the time length of the voice information and the time length of the action information to be the same.
The method for synchronizing voice and virtual actions of the present invention comprises: acquiring multimodal information of a user; generating interactive content according to the user's multimodal information and a variable parameter, the interactive content comprising at least voice information and action information; and adjusting the time length of the voice information and the time length of the action information to be the same. In this way, interactive content can be generated from one or more kinds of the user's multimodal information, such as the user's voice, expressions and actions. The interactive content comprises at least voice information and action information, and to synchronize the two, their time lengths are adjusted to be the same, so that the robot's sound and actions are matched during playback. The robot thus interacts not only through voice but also through various other forms of expression such as actions, which makes its forms of expression more diverse, makes the robot more anthropomorphic, and improves the user's experience when interacting with the robot.
The multimodal information in this embodiment may be one or more of user expressions, voice information, gesture information, scene information, image information, video information, face information, pupil and iris information, light-sensing information, fingerprint information, and the like.
In this embodiment, the variable parameter specifically captures sudden changes involving the human and the machine. For example, a day on the time axis consists of eating, sleeping, interacting, running, eating, and sleeping. If the robot's scene is suddenly changed, for example it is taken to the seaside during the running time slot, such human-initiated changes to the robot's parameters serve as variable parameters, and these changes alter the robot's self-cognition. The life time axis and the variable parameters can modify attributes in the self-cognition, such as the mood value and the fatigue value, and can also automatically add new self-cognition information. For example, if there was previously no anger value, scenes based on the life time axis and the variable factors will automatically add to the robot's self-cognition according to scenes that previously simulated human self-cognition.
For example, according to the life time axis, 12 noon should be mealtime. If this scene is changed, for example the user goes out shopping at 12 noon, the robot writes this as one of the variable parameters. When the user interacts with the robot during this time period, the robot generates the interactive content in combination with going out shopping at 12 noon, rather than with the previous scene of eating at 12 noon. When specifically generating the interactive content, the robot combines the acquired multimodal information of the user, such as voice information, video information and picture information, with the variable parameter. In this way, sudden events from human life can be added to the robot's life axis, making the robot's interaction more anthropomorphic.
In this embodiment, the specific steps of adjusting the time length of the voice information and the time length of the action information to be the same include:
if the difference between the time length of the voice information and the time length of the action information is not greater than the threshold, and the time length of the voice information is less than the time length of the action information, accelerating the playback speed of the action information so that the time length of the action information equals the time length of the voice information.
When the time length of the voice information is greater than the time length of the action information, the playback speed of the voice information is accelerated and/or the playback speed of the action information is slowed, so that the time length of the action information equals the time length of the voice information.
Therefore, when the difference between the time length of the voice information and the time length of the action information is not greater than the threshold, adjustment specifically means compressing or stretching the time length of the voice information and/or the action information, or accelerating or slowing the playback speed, for example multiplying the playback speed of the voice information by 2, or multiplying the playback time of the action information by 0.8, and so on.
For example, suppose the threshold for the difference between the time length of the voice information and the time length of the action information is one minute, and in the interactive content generated by the robot according to the user's multimodal information, the time length of the voice information is 1 minute and the time length of the action information is 2 minutes. The playback speed of the action information can then be doubled, so that the adjusted playback time of the action information is 1 minute, synchronizing it with the voice information. Alternatively, the playback speed of the voice information can be slowed to 0.5 times the original speed, so that the voice information is stretched to 2 minutes and synchronized with the action information. Both can also be adjusted, for example slowing the voice information while accelerating the action information so that both become 1 minute and 30 seconds, which also synchronizes the voice and the action.
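The small-difference branch described above can be sketched as a simple retiming rule. This is a minimal illustration, not the patent's implementation: the function name `sync_by_speed` and the choice to retime only the action track are assumptions made for the example.

```python
def sync_by_speed(voice_len, action_len, threshold):
    """Return (voice_rate, action_rate) playback multipliers that make both
    tracks finish together, for the case |voice_len - action_len| <= threshold.

    A rate of 2.0 means "play twice as fast"; 0.5 means "play at half speed".
    In this sketch only the action track is retimed to match the voice track.
    """
    if abs(voice_len - action_len) > threshold:
        raise ValueError("difference exceeds threshold; combine or trim actions instead")
    # Playing the action at rate action_len / voice_len makes its
    # playback duration equal to the voice duration.
    return 1.0, action_len / voice_len

# Voice is 60 s, action is 120 s, threshold 60 s: the action plays at 2x speed.
print(sync_by_speed(60, 120, 60))   # (1.0, 2.0)
```

The same rule covers the opposite case from the description: with a 120 s voice and a 60 s action, the action rate becomes 0.5, i.e. the action is slowed to half speed, which matches "slowing the playback speed of the action information".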
In addition, in this embodiment, the specific steps of adjusting the time length of the voice information and the time length of the action information to be the same include:
if the difference between the time length of the voice information and the time length of the action information is greater than the threshold, and the time length of the voice information is greater than the time length of the action information, sorting and combining at least two sets of action information so that the time length of the combined action information equals the time length of the voice information.
When the time length of the voice information is less than the time length of the action information, part of the actions in the action information are selected so that the time length of the selected part equals the time length of the voice information.
Therefore, when the difference between the time length of the voice information and the time length of the action information is greater than the threshold, adjustment means adding or deleting part of the action information so that the time length of the action information is the same as the time length of the voice information.
For example, suppose the threshold for the difference between the time length of the voice information and the time length of the action information is 30 seconds, and in the interactive content generated by the robot according to the user's multimodal information, the time length of the voice information is 3 minutes and the time length of the action information is 1 minute. Other action information then needs to be added to the original action information; for example, a piece of action information with a length of 2 minutes is found, and the two sets of action information are sorted and combined so that they match the time length of the voice information. If no action information with a length of 2 minutes is found but a piece with a length of two and a half minutes is found, part of the actions in that two-and-a-half-minute action information (which may be part of its frames) can be selected so that the selected action information has a time length of 2 minutes, again matching the time length of the voice information.
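The large-difference branch, combining clips and trimming the last one when the pool overshoots, can be sketched as follows. The greedy take-in-order strategy and the function name are assumptions for illustration; the description only requires that the combined (or trimmed) actions match the voice duration.

```python
def fit_actions_to_voice(voice_len, clips):
    """Assemble a sequence of action clips (given as durations in seconds)
    whose total length equals voice_len.

    Clips are taken in order; if the pool would overshoot, the final clip is
    trimmed, mirroring the "select part of the actions" case in the text.
    """
    chosen, total = [], 0
    for clip in clips:
        if total >= voice_len:
            break
        take = min(clip, voice_len - total)  # trim the final clip if needed
        chosen.append(take)
        total += take
    if total < voice_len:
        raise ValueError("not enough action material to cover the voice")
    return chosen

# Voice is 180 s; a 60 s clip and a 150 s clip are combined,
# with the second clip trimmed down to 120 s.
print(fit_actions_to_voice(180, [60, 150]))   # [60, 120]
```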
In this embodiment, the action information closest to the time length of the voice information can be selected according to the time length of the voice information, or the closest voice information can be selected according to the time length of the action information.
Selecting according to the time length of the voice information in this way makes it easier for the control module to adjust the time lengths of the voice information and the action information to be consistent, and the adjusted playback is more natural and smooth.
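The "closest by duration" selection can be expressed in a few lines. The action names and the dictionary shape below are purely illustrative assumptions; only the nearest-duration criterion comes from the text.

```python
def closest_by_duration(target_len, candidates):
    """Pick the candidate whose duration (in seconds) is closest to target_len.

    candidates maps an action name to its duration; a smaller absolute
    difference from the target means a smaller required adjustment later.
    """
    return min(candidates, key=lambda name: abs(candidates[name] - target_len))

actions = {"wave": 55, "bow": 70, "dance": 130}
print(closest_by_duration(60, actions))   # wave
```

The same function works in the other direction described in the text: pass an action's duration as `target_len` and a pool of voice clips as `candidates`.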
根据其中一个示例,在将语音信息的时间长度和动作信息的时间长度调整到相同的步骤之后还包括:将调整后的语音信息和动作信息输出到虚拟影像进行展示。According to one example, after adjusting the time length of the voice information and the time length of the motion information to the same step, the method further includes: outputting the adjusted voice information and the motion information to the virtual image for display.
这样就可以在调整一致后进行输出,输出可以是在虚拟影像上进行输出,从而使虚拟机器人更加拟人化,提高用户体验度。In this way, the output can be output after the adjustment is consistent, and the output can be output on the virtual image, thereby making the virtual robot more anthropomorphic and improving the user experience.
According to one example, the method for generating the robot's variable parameters includes: fitting the parameters of the robot's self-cognition to the parameters of the scenes in the variable parameters to generate the robot's variable parameters. By extending the robot's self-cognition in scenes that incorporate the variable parameters, and fitting the parameters of its self-cognition to the parameters of the scenes used in the variable parameters, an anthropomorphic effect is produced.
According to one example, the variable parameters include at least the user's original behavior and the behavior after a change, together with parameter values representing the original behavior and the changed behavior.
A variable parameter captures a departure from the original plan: the user was in one state, and a sudden change puts the user in another. The variable parameter represents this change of behavior or state, as well as the user's state or behavior after the change. For example, the user originally planned to run at 5 p.m. but something else came up, such as going to play ball; the change from running to playing ball is a variable parameter, and the probability of such a change is also studied.
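The description above suggests a simple representation for one such change. The following sketch is illustrative only: the patent does not specify a data layout, and the field names and the example probability value are assumptions:

```python
from dataclasses import dataclass


@dataclass
class VariableParameter:
    """One sudden change on the life timeline: the originally planned
    behavior, the behavior after the change, and the studied probability
    of that change occurring."""
    original_behavior: str
    changed_behavior: str
    change_probability: float


# The 5 p.m. example: running is replaced by playing ball.
change = VariableParameter("running", "playing ball", 0.2)
print(change.changed_behavior)  # playing ball
```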
According to one example, the step of generating the interactive content according to the multimodal information and the variable parameters specifically includes: generating the interactive content according to the multimodal information, the variable parameters, and a fitted curve of parameter-change probabilities.
The fitted curve can thus be produced by probability training on the variable parameters, and the robot's interactive content generated from it.
According to one example, the method for generating the fitted curve of parameter-change probabilities includes: using a probability algorithm, estimating the parameters between robots with a network, and computing, for a robot on the life timeline, the probability that each parameter changes after the scene parameters on the life timeline change, thereby forming the fitted curve of parameter-change probabilities. The probability algorithm may be a Bayesian probability algorithm.
By extending the robot's self-cognition in scenes that incorporate the variable parameters, and fitting the parameters of its self-cognition to the parameters of the scenes used in the variable parameters, an anthropomorphic effect is produced. Combined with recognition of the location scene, the robot knows its geographical position and changes how it generates interactive content according to the geographical environment it is in. In addition, a Bayesian probability algorithm is used to estimate the parameters between robots with a Bayesian network and to compute the probability that each parameter changes after the scene parameters on the robot's own timeline change, forming a fitted curve that dynamically influences the robot's self-cognition. This module gives the robot a human-like way of life; its expressions, for instance, can change according to the location scene it is in.
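The patent names a Bayesian network for the probability estimation but gives no further detail. As an illustrative stand-in only, the following sketch fits a per-hour change probability from observed frequencies; the function name, data shape, and sample observations are all assumptions:

```python
from collections import Counter


def fit_change_curve(observations):
    """observations: iterable of (hour_on_life_timeline, changed) pairs.
    Returns {hour: estimated probability that the scene parameter changes},
    a crude frequency estimate standing in for the Bayesian-network
    estimation described in the text."""
    totals, changes = Counter(), Counter()
    for hour, changed in observations:
        totals[hour] += 1
        if changed:
            changes[hour] += 1
    return {h: changes[h] / totals[h] for h in totals}


# Three observations at 5 p.m. (two changes) and one unchanged one at noon.
curve = fit_change_curve([(17, True), (17, False), (17, True), (12, False)])
print(curve[17])  # 0.6666666666666666
```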
Embodiment 2
As shown in FIG. 2, this embodiment discloses a system for synchronizing speech and virtual actions, including:
an acquisition module 201, configured to acquire multimodal information of a user;
an artificial intelligence module 202, configured to generate interactive content according to the user's multimodal information and variable parameters, the interactive content including at least voice information and action information, where the variable parameters are generated by a variable parameter module 301; and
a control module 203, configured to adjust the duration of the voice information and the duration of the action information to be the same.
Interactive content can thus be generated from one or more kinds of the user's multimodal information, such as the user's speech, expressions, or actions. The interactive content includes at least voice information and action information, and to keep them synchronized, the duration of the voice information and the duration of the action information are adjusted to be the same. The robot can then keep sound and action matched during playback, so that its interaction is expressed not only through speech but also through actions and other forms. This makes the robot's forms of expression more varied and more anthropomorphic, and improves the user's experience when interacting with the robot.
The multimodal information in this embodiment may be one or more of user expressions, voice information, gesture information, scene information, image information, video information, face information, pupil and iris information, light-sensing information, and fingerprint information.
In this embodiment, the variable parameters are specifically sudden changes arising between the human and the machine. For example, a day on the timeline consists of eating, sleeping, interacting, running, eating, and sleeping. If the robot's scene is suddenly changed, for example it is taken to the seaside during the time slot for running, such human-initiated changes to the robot's parameters serve as variable parameters, and these changes alter the robot's self-cognition. The life timeline and the variable parameters can modify attributes in the self-cognition, such as the mood value or the fatigue value, and can also automatically add new self-cognition information: for instance, if there was previously no anger value, scenes based on the life timeline and the variable factors will automatically add one to the robot's self-cognition, following the scenes that previously simulated human self-cognition.
For example, according to the life timeline, 12 noon should be mealtime. If that scene changes, for example the user goes shopping at 12 noon, the robot writes this as one of the variable parameters. When the user interacts with the robot during that period, the robot generates interactive content based on going shopping at 12 noon rather than on the previous mealtime at 12 noon. When actually generating the interactive content, the robot combines the acquired multimodal information of the user, such as voice information, video information, and picture information, with the variable parameters. In this way, unexpected events from human life can be incorporated into the robot's life axis, making its interaction more anthropomorphic.
In this embodiment, the control module is specifically configured to:
if the difference between the duration of the voice information and the duration of the action information is not greater than a threshold, and the duration of the voice information is less than the duration of the action information, speed up the playback of the action information so that the duration of the action information equals the duration of the voice information; and
if the duration of the voice information is greater than the duration of the action information, speed up the playback of the voice information or/and slow down the playback of the action information so that the duration of the action information equals the duration of the voice information.
Therefore, when the difference between the duration of the voice information and the duration of the action information does not exceed the threshold, the adjustment specifically means compressing or stretching the duration of the voice information or/and the action information, or speeding up or slowing down playback, for example multiplying the playback speed of the voice information by 2, or multiplying the playback time of the action information by 0.8.
For example, suppose the threshold is one minute, and in the interactive content the robot generates from the user's multimodal information the voice lasts 1 minute and the action lasts 2 minutes. The playback of the action information can then be accelerated to twice its original speed, so that its adjusted playback time is 1 minute and it is synchronized with the voice information. Alternatively, the playback of the voice information can be slowed to 0.5 times its original speed, stretching it to 2 minutes and synchronizing it with the action information. Both can also be adjusted, for example slowing the voice while speeding up the action so that each lasts 1 minute 30 seconds, again synchronizing speech and action.
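A minimal sketch of this small-difference case, assuming durations in seconds. The rate convention (normalizing both tracks to the voice duration) and the names are illustrative, not prescribed by the patent:

```python
def scale_to_match(voice_len: float, action_len: float, threshold: float):
    """Return playback-rate multipliers (voice_rate, action_rate) that make
    both tracks finish in the same wall-clock time.

    Applies only when |voice_len - action_len| <= threshold; the patent's
    large-difference case (adding or dropping action segments) is handled
    separately.
    """
    if abs(voice_len - action_len) > threshold:
        raise ValueError("difference exceeds threshold; recombine actions instead")
    # Play both tracks over the voice track's duration: the voice keeps its
    # natural rate and the action track is sped up or slowed down to fit.
    voice_rate = 1.0
    action_rate = action_len / voice_len  # >1 speeds up, <1 slows down
    return voice_rate, action_rate


# Voice 60 s, action 120 s: the action plays at 2x so both last 60 s.
print(scale_to_match(60, 120, 60))  # (1.0, 2.0)
```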
In addition, in this embodiment, the control module is specifically configured to:
if the difference between the duration of the voice information and the duration of the action information is greater than the threshold, and the duration of the voice information is greater than the duration of the action information, combine at least two sets of action information so that the duration of the combined action information equals the duration of the voice information; and
if the duration of the voice information is less than the duration of the action information, select a subset of the actions in the action information so that the duration of the selected subset equals the duration of the voice information.
Therefore, when the difference between the duration of the voice information and the duration of the action information exceeds the threshold, the adjustment means adding or deleting part of the action information so that its duration matches that of the voice information.
For example, suppose the threshold is 30 seconds, and in the interactive content the robot generates from the user's multimodal information the voice lasts 3 minutes while the action lasts 1 minute. Additional action information must then be appended to the original action: for instance, find an action segment 2 minutes long and order and combine the two segments so that the combined action matches the duration of the voice. Of course, if no 2-minute segment is found but a 2.5-minute one is, a subset of its actions (for example, a subset of its frames) can be selected so that the selected portion lasts 2 minutes, again matching the duration of the voice.
In this embodiment, the artificial intelligence module may be specifically configured to select, according to the duration of the voice information, the action information whose duration is closest to it, or to select the closest voice information according to the duration of the action information.
Selecting by duration in this way makes it easier for the control module to adjust the durations of the voice information and the action information to match, and the adjusted playback is more natural and smooth.
According to one example, the system further includes an output module 204, configured to output the adjusted voice information and action information to a virtual image for presentation.
The output thus occurs only after the durations have been matched, and can be rendered on a virtual image, making the virtual robot more anthropomorphic and improving the user experience.
According to one example, the system further includes a processing module, configured to fit the parameters of the robot's self-cognition to the parameters of the scenes in the variable parameters to generate the variable parameters.
By extending the robot's self-cognition in scenes that incorporate the variable parameters, and fitting the parameters of its self-cognition to the parameters of the scenes used in the variable parameters, an anthropomorphic effect is produced.
According to one example, the variable parameters include at least the user's original behavior and the behavior after a change, together with parameter values representing the original behavior and the changed behavior.
A variable parameter captures a departure from the original plan: the user was in one state, and a sudden change puts the user in another. The variable parameter represents this change of behavior or state, as well as the user's state or behavior after the change. For example, the user originally planned to run at 5 p.m. but something else came up, such as going to play ball; the change from running to playing ball is a variable parameter, and the probability of such a change is also studied.
According to one example, the artificial intelligence module is specifically configured to generate the interactive content according to the multimodal information, the variable parameters, and a fitted curve of parameter-change probabilities.
The fitted curve can thus be produced by probability training on the variable parameters, and the robot's interactive content generated from it.
According to one example, the system includes a fitted-curve generation module, configured to use a probability algorithm to estimate the parameters between robots with a network and to compute, for a robot on the life timeline, the probability that each parameter changes after the scene parameters on the life timeline change, thereby forming the fitted curve of parameter-change probabilities. The probability algorithm may be a Bayesian probability algorithm.
By extending the robot's self-cognition in scenes that incorporate the variable parameters, and fitting the parameters of its self-cognition to the parameters of the scenes used in the variable parameters, an anthropomorphic effect is produced. Combined with recognition of the location scene, the robot knows its geographical position and changes how it generates interactive content according to the geographical environment it is in. In addition, a Bayesian probability algorithm is used to estimate the parameters between robots with a Bayesian network and to compute the probability that each parameter changes after the scene parameters on the robot's own timeline change, forming a fitted curve that dynamically influences the robot's self-cognition. This module gives the robot a human-like way of life; its expressions, for instance, can change according to the location scene it is in.
The present invention discloses a robot, including a system for synchronizing speech and virtual actions as described in any of the above.
The foregoing is a further detailed description of the present invention in connection with specific preferred embodiments, and the specific implementation of the present invention is not to be considered limited to these descriptions. A person of ordinary skill in the art to which the present invention belongs may make a number of simple deductions or substitutions without departing from the concept of the present invention, all of which shall be regarded as falling within the protection scope of the present invention.

Claims (19)

  1. A method for synchronizing speech and virtual actions, comprising:
    acquiring multimodal information of a user;
    generating interactive content according to the user's multimodal information and variable parameters, the interactive content including at least voice information and action information; and
    adjusting the duration of the voice information and the duration of the action information to be the same.
  2. The method according to claim 1, wherein the step of adjusting the duration of the voice information and the duration of the action information to be the same comprises:
    if the difference between the duration of the voice information and the duration of the action information is not greater than a threshold, and the duration of the voice information is less than the duration of the action information, speeding up the playback of the action information so that the duration of the action information equals the duration of the voice information.
  3. The method according to claim 2, wherein, when the duration of the voice information is greater than the duration of the action information, the playback of the voice information is sped up or/and the playback of the action information is slowed down so that the duration of the action information equals the duration of the voice information.
  4. The method according to claim 1, wherein the step of adjusting the duration of the voice information and the duration of the action information to be the same comprises:
    if the difference between the duration of the voice information and the duration of the action information is greater than a threshold, and the duration of the voice information is greater than the duration of the action information, ordering and combining at least two sets of action information so that the duration of the combined action information equals the duration of the voice information.
  5. The method according to claim 4, wherein, when the duration of the voice information is less than the duration of the action information, a subset of the actions in the action information is selected so that the duration of the selected subset equals the duration of the voice information.
  6. The method according to claim 1, wherein the method for generating the robot's variable parameters comprises: fitting the parameters of the robot's self-cognition to the parameters of the scenes in the variable parameters to generate the robot's variable parameters.
  7. The method according to claim 6, wherein the variable parameters include at least the user's original behavior and the behavior after a change, together with parameter values representing the original behavior and the changed behavior.
  8. The method according to claim 1, wherein the step of generating the interactive content according to the multimodal information and the variable parameters comprises: generating the interactive content according to the multimodal information, the variable parameters, and a fitted curve of parameter-change probabilities.
  9. The method according to claim 8, wherein the method for generating the fitted curve of parameter-change probabilities comprises: using a probability algorithm, estimating the parameters between robots with a network, and computing, for a robot on the life timeline, the probability that each parameter changes after the scene parameters on the life timeline change, thereby forming the fitted curve of parameter-change probabilities.
  10. A system for synchronizing speech and virtual actions, comprising:
    an acquisition module, configured to acquire multimodal information of a user;
    an artificial intelligence module, configured to generate interactive content according to the user's multimodal information and variable parameters, the interactive content including at least voice information and action information; and
    a control module, configured to adjust the duration of the voice information and the duration of the action information to be the same.
  11. The system according to claim 10, wherein the control module is specifically configured to:
    if the difference between the duration of the voice information and the duration of the action information is not greater than a threshold, and the duration of the voice information is less than the duration of the action information, speed up the playback of the action information so that the duration of the action information equals the duration of the voice information.
  12. The system according to claim 11, wherein, when the duration of the voice information is greater than the duration of the action information, the playback of the voice information is sped up or/and the playback of the action information is slowed down so that the duration of the action information equals the duration of the voice information.
  13. The system according to claim 10, wherein the control module is specifically configured to:
    if the difference between the duration of the voice information and the duration of the action information is greater than a threshold, and the duration of the voice information is greater than the duration of the action information, combine at least two sets of action information so that the duration of the combined action information equals the duration of the voice information.
  14. The system according to claim 13, wherein, when the duration of the voice information is less than the duration of the action information, a subset of the actions in the action information is selected so that the duration of the selected subset equals the duration of the voice information.
  15. The system according to claim 10, further comprising a processing module, configured to fit the parameters of the robot's self-cognition to the parameters of the scenes in the variable parameters to generate the variable parameters.
  16. The system according to claim 15, wherein the variable parameters include at least the user's original behavior and the behavior after a change, together with parameter values representing the original behavior and the changed behavior.
  17. The system according to claim 10, wherein the artificial intelligence module is specifically configured to generate the interactive content according to the multimodal information, the variable parameters, and a fitted curve of parameter-change probabilities.
  18. The system according to claim 17, wherein the system comprises a fitted-curve generation module, configured to use a probability algorithm to estimate the parameters between robots with a network and to compute, for a robot on the life timeline, the probability that each parameter changes after the scene parameters on the life timeline change, thereby forming the fitted curve of parameter-change probabilities.
  19. A robot, comprising a system for synchronizing speech and virtual actions according to any one of claims 10 to 18.
PCT/CN2016/089213 2016-07-07 2016-07-07 Method and system for synchronizing speech and virtual actions, and robot WO2018006369A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/CN2016/089213 WO2018006369A1 (en) 2016-07-07 2016-07-07 Method and system for synchronizing speech and virtual actions, and robot
CN201680001720.7A CN106471572B (en) 2016-07-07 2016-07-07 Method, system and the robot of a kind of simultaneous voice and virtual acting
JP2017133167A JP6567609B2 (en) 2016-07-07 2017-07-06 Synchronizing voice and virtual motion, system and robot body

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/089213 WO2018006369A1 (en) 2016-07-07 2016-07-07 Method and system for synchronizing speech and virtual actions, and robot

Publications (1)

Publication Number Publication Date
WO2018006369A1

Family

ID=58230946

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/089213 WO2018006369A1 (en) 2016-07-07 2016-07-07 Method and system for synchronizing speech and virtual actions, and robot

Country Status (3)

Country Link
JP (1) JP6567609B2 (en)
CN (1) CN106471572B (en)
WO (1) WO2018006369A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110610703A (en) * 2019-07-26 2019-12-24 深圳壹账通智能科技有限公司 Speech output method, device, robot and medium based on robot recognition

Families Citing this family (6)

Publication number Priority date Publication date Assignee Title
CN107457787B (en) * 2017-06-29 2020-12-08 杭州仁盈科技股份有限公司 Service robot interaction decision-making method and device
CN107577661B (en) * 2017-08-07 2020-12-11 北京光年无限科技有限公司 Interactive output method and system for virtual robot
CN107784355A (en) * 2017-10-26 2018-03-09 北京光年无限科技有限公司 The multi-modal interaction data processing method of visual human and system
CN109822587B (en) * 2019-03-05 2022-05-31 哈尔滨理工大学 Control method for head and neck device of voice diagnosis guide robot for factory and mine hospitals
WO2021085193A1 (en) * 2019-10-30 2021-05-06 ソニー株式会社 Information processing device and command processing method
CN115497499A (en) * 2022-08-30 2022-12-20 阿里巴巴(中国)有限公司 Method for synchronizing voice and action time

Citations (8)

Publication number Priority date Publication date Assignee Title
US5111409A (en) * 1989-07-21 1992-05-05 Elon Gasper Authoring and use systems for sound synchronized animation
CN101364309A (en) * 2008-10-09 2009-02-11 中国科学院计算技术研究所 Cartoon generating method for mouth shape of source virtual characters
US20090044112A1 (en) * 2007-08-09 2009-02-12 H-Care Srl Animated Digital Assistant
CN101968894A (en) * 2009-07-28 2011-02-09 上海冰动信息技术有限公司 Method for automatically realizing sound and lip synchronization through Chinese characters
CN103596051A (en) * 2012-08-14 2014-02-19 金运科技股份有限公司 A television apparatus and a virtual emcee display method thereof
CN104574478A (en) * 2014-12-30 2015-04-29 北京像素软件科技股份有限公司 Method and device for editing mouth shapes of animation figures
CN104866101A (en) * 2015-05-27 2015-08-26 世优(北京)科技有限公司 Real-time interactive control method and real-time interactive control device of virtual object
CN104883557A (en) * 2015-05-27 2015-09-02 世优(北京)科技有限公司 Real time holographic projection method, device and system

Family Cites Families (15)

Publication number Priority date Publication date Assignee Title
JPH10143351A (en) * 1996-11-13 1998-05-29 Sharp Corp Interface unit
EP2175665B1 (en) * 1996-12-04 2012-11-21 Panasonic Corporation Optical disk for high resolution and three-dimensional video recording, optical disk reproduction apparatus, and optical disk recording apparatus
JP3792882B2 (en) * 1998-03-17 2006-07-05 株式会社東芝 Emotion generation device and emotion generation method
JP4032273B2 (en) * 1999-12-28 2008-01-16 ソニー株式会社 Synchronization control apparatus and method, and recording medium
JP4670136B2 (en) * 2000-10-11 2011-04-13 ソニー株式会社 Authoring system, authoring method, and storage medium
JP3930389B2 (en) * 2002-07-08 2007-06-13 三菱重工業株式会社 Motion program generation device and robot during robot utterance
JP2005003926A (en) * 2003-06-11 2005-01-06 Sony Corp Information processor, method, and program
EP1845724A1 (en) * 2005-02-03 2007-10-17 Matsushita Electric Industrial Co., Ltd. Recording/reproduction device, recording/reproduction method, recording/reproduction apparatus and recording/reproduction method, and recording medium storing recording/reproduction program, and integrated circuit for use in recording/reproduction apparatus
JP2008040726A (en) * 2006-08-04 2008-02-21 Univ Of Electro-Communications User support system and user support method
JP5045519B2 (en) * 2008-03-26 2012-10-10 トヨタ自動車株式会社 Motion generation device, robot, and motion generation method
WO2010038063A2 (en) * 2008-10-03 2010-04-08 Bae Systems Plc Assisting with updating a model for diagnosing failures in a system
CN101604204B (en) * 2009-07-09 2011-01-05 北京科技大学 Distributed cognitive technology for intelligent emotional robot
JP2011054088A (en) * 2009-09-04 2011-03-17 National Institute Of Information & Communication Technology Information processor, information processing method, program, and interactive system
JP2012215645A (en) * 2011-03-31 2012-11-08 Speakglobal Ltd Foreign language conversation training system using computer
CN105598972B (en) * 2016-02-04 2017-08-08 北京光年无限科技有限公司 A kind of robot system and exchange method

Also Published As

Publication number Publication date
JP2018001403A (en) 2018-01-11
CN106471572B (en) 2019-09-03
JP6567609B2 (en) 2019-08-28
CN106471572A (en) 2017-03-01

Similar Documents

Publication Publication Date Title
WO2018006369A1 (en) Method and system for synchronizing speech and virtual actions, and robot
WO2018006370A1 (en) Interaction method and system for virtual 3d robot, and robot
WO2018006371A1 (en) Method and system for synchronizing speech and virtual actions, and robot
TWI778477B (en) Interaction methods, apparatuses thereof, electronic devices and computer readable storage media
JP7109408B2 (en) Wide range simultaneous remote digital presentation world
KR101306221B1 (en) Method and apparatus for providing moving picture using 3d user avatar
WO2018000267A1 (en) Method for generating robot interaction content, system, and robot
WO2018000259A1 (en) Method and system for generating robot interaction content, and robot
US20220044490A1 (en) Virtual reality presentation of layers of clothing on avatars
WO2018000268A1 (en) Method and system for generating robot interaction content, and robot
JP2016071247A (en) Interaction device
WO2018006374A1 (en) Function recommending method, system, and robot based on automatic wake-up
WO2018006373A1 (en) Method and system for controlling household appliance on basis of intent recognition, and robot
WO2018006372A1 (en) Method and system for controlling household appliance on basis of intent recognition, and robot
US11681372B2 (en) Touch enabling process, haptic accessory, and core haptic engine to enable creation and delivery of tactile-enabled experiences with virtual objects
US20210375067A1 (en) Virtual reality presentation of clothing fitted on avatars
WO2018000266A1 (en) Method and system for generating robot interaction content, and robot
CN104616336B (en) A kind of animation construction method and device
WO2018000258A1 (en) Method and system for generating robot interaction content, and robot
EP4053792A1 (en) Information processing device, information processing method, and artificial intelligence model manufacturing method
EP4275147A1 (en) Producing a digital image representation of a body
JPWO2018168247A1 (en) Information processing apparatus, information processing method and program
Seib et al. Enhancing human-robot interaction by a robot face with facial expressions and synchronized lip movements
JP6637000B2 (en) Robot for deceased possession
Gillies et al. Piavca: a framework for heterogeneous interactions with virtual characters

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 16907874
Country of ref document: EP
Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: DE
122 EP: PCT application non-entry in European phase
Ref document number: 16907874
Country of ref document: EP
Kind code of ref document: A1