CN107808191A - Output method and system for multi-modal interaction of a virtual human - Google Patents

Output method and system for multi-modal interaction of a virtual human

Info

Publication number
CN107808191A
CN107808191A (application CN201710822978.XA)
Authority
CN
China
Prior art keywords
mouth shape
modal
virtual human
voice
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710822978.XA
Other languages
Chinese (zh)
Inventor
尚小维
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Guangnian Wuxian Technology Co Ltd
Original Assignee
Beijing Guangnian Wuxian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Guangnian Wuxian Technology Co Ltd filed Critical Beijing Guangnian Wuxian Technology Co Ltd
Priority to CN201710822978.XA priority Critical patent/CN107808191A/en
Publication of CN107808191A publication Critical patent/CN107808191A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/004: Artificial life, i.e. computing arrangements simulating life
    • G06N3/008: Artificial life based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Robotics (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present invention provides an output method for multi-modal interaction of a virtual human, comprising the following steps: entering a wake-up state in response to a received instruction, and displaying the image in a preset display area; acquiring multi-modal interaction input data; calling a capability interface to parse the interaction input data, and generating corresponding multi-modal decision output data; matching the voice file in the multi-modal output data against a mouth-shape model, and outputting the voice through the mouth-shape file matched to the voice, the mouth-shape model including: a Pinyin model and its data fused with word-segmentation information. By conducting dialogue interaction with a virtual human that has an image output, the present invention can also perform complete mouth-shape expression, so that the voice output by the virtual human matches the mouth shape exactly, which strengthens the user's visual engagement and improves the interaction experience.

Description

Output method and system for multi-modal interaction of a virtual human
Technical field
The present invention relates to the field of artificial intelligence, and in particular to an output method and system for multi-modal interaction of a virtual human.
Background technology
The development of robot chat interaction systems has been directed at imitating human conversation. Early, widely known chatbot applications, including the Xiaoi chatbot and the Siri chatbot on the iPhone, process received input (text or voice) and respond to it, attempting to imitate human responses in context.
However, for fully imitating human conversation and enriching the user's interaction experience, these existing robot chat systems still fall far short.
Summary of the invention
To solve the above problems, the invention provides an output method for multi-modal interaction of a virtual human, the method comprising the following steps:
entering a wake-up state in response to a received instruction, and displaying the virtual human's image in a preset display area;
acquiring multi-modal interaction input data;
calling a capability interface to parse the interaction input data, and generating corresponding multi-modal decision output data;
matching a voice file in the multi-modal output data against a mouth-shape model, and outputting the voice through the mouth-shape file matched to the voice, the mouth-shape model including: a Pinyin model and its data fused with word-segmentation information.
According to one embodiment of the present invention, the Pinyin model is generated by the following steps:
performing speech recognition on the voice file and converting it to text;
dividing the text into pinyin syllables, matching the pinyin syllables to mouth-shape parameters, and generating the Pinyin model.
According to one embodiment of the present invention, the mouth-shape model is generated by the following steps: segmenting the collected voice file to generate structured words;
extracting the information of the structured words, including: the start time, end time, and strongest amplitude of each word within the voice file;
fusing the Pinyin model with the structured word information, and generating a mouth-shape model corresponding to the mouth-shape parameters.
According to one embodiment of the present invention, fusing the Pinyin model with the structured word information includes:
fusing the mouth-shape parameters corresponding to an initial and a final into the mouth-shape parameters corresponding to a character;
fusing the mouth-shape parameters of individual syllables into the mouth-shape parameters corresponding to a syllable combination;
further fusing the mouth-shape parameters corresponding to characters into the mouth-shape parameters corresponding to a character combination;
matching and combining the above mouth-shape parameters, according to the pinyin-expressed prosody, with the mouth-shape parameters corresponding to the ending character, to form the final mouth-shape model.
According to one embodiment of the present invention, the mouth-shape parameters include: mouth shape, mouth amplitude, and tongue form.
According to another aspect of the present invention, a storage medium is also provided, storing program code executable to perform the method steps described above.
According to another aspect of the present invention, an output device for multi-modal interaction of a virtual human is also provided, the device comprising:
a response module, configured to enter a wake-up state in response to a received instruction and display the virtual human's image in a preset display area;
an acquisition module, configured to acquire multi-modal interaction input data;
a calling module, configured to call a capability interface to parse the interaction input data and generate corresponding multi-modal decision output data;
a matching module, configured to match the voice file in the multi-modal output data against the mouth-shape model and output the voice through the mouth-shape file matched to the voice, the mouth-shape model including: a Pinyin model and its data fused with word-segmentation information.
According to one embodiment of the present invention, the matching module further includes the following units:
a conversion unit, configured to perform speech recognition on the voice file and convert it to text;
a division unit, configured to divide the text into pinyin syllables, match the pinyin syllables to mouth-shape parameters, and generate the Pinyin model.
According to one embodiment of the present invention, the device includes:
a segmentation unit, configured to segment the collected voice file to generate structured words;
an extraction unit, configured to extract the information of the structured words, including: the start time, end time, and strongest amplitude of each word within the voice file;
a fusion unit, configured to fuse the Pinyin model with the structured word information and generate a mouth-shape model corresponding to the mouth-shape parameters.
According to another aspect of the present invention, an output system for multi-modal interaction of a virtual human is also provided, the system including:
a hardware device, configured to display the virtual human's image and to handle the processing of data in the interaction between the user and the virtual human;
a cloud server, configured to cooperate with the hardware device to complete the following steps:
calling a capability interface to parse the interaction input data, and generating corresponding multi-modal decision output data;
matching the voice file in the multi-modal output data against the mouth-shape model, and outputting the voice through the mouth-shape file matched to the voice, the mouth-shape model including: a Pinyin model and its data fused with word-segmentation information.
The present invention conducts dialogue interaction with a virtual human. On the one hand, this enriches the dialogue partner: the virtual human's image is shown in the display area, so the user appears to be interacting multi-modally with a real person, which increases the fluency of the interaction between the user and the hardware device. On the other hand, the virtual human, having an image output, can also perform complete mouth-shape expression, so that the voice output by the virtual human's image matches the mouth shape exactly, which strengthens the user's visual engagement and improves the interaction experience.
Other features and advantages of the present invention will be set forth in the following description, and in part will become apparent from the description or be understood by practicing the invention. The objects and other advantages of the invention can be realized and obtained by the structures particularly pointed out in the description, the claims, and the accompanying drawings.
Brief description of the drawings
The accompanying drawings provide a further understanding of the present invention and constitute a part of the specification. Together with the embodiments of the present invention they serve to explain the invention; they do not limit the invention. In the drawings:
Fig. 1 shows an interaction diagram of the output system for multi-modal interaction of a virtual human according to an embodiment of the invention;
Fig. 2 shows a structural block diagram of the output system for multi-modal interaction of a virtual human according to an embodiment of the invention;
Fig. 3 shows a voice-file matching flowchart of the output system for multi-modal interaction of a virtual human according to an embodiment of the invention;
Fig. 4 shows a module block diagram of the output system for multi-modal interaction of a virtual human according to an embodiment of the invention;
Fig. 5 shows a flowchart of the output method for multi-modal interaction of a virtual human according to an embodiment of the invention;
Fig. 6 shows a voice-matching flowchart of the output method for multi-modal interaction of a virtual human according to an embodiment of the invention;
Fig. 7 shows a detailed voice-matching flowchart of the output method for multi-modal interaction of a virtual human according to an embodiment of the invention;
Fig. 8 shows another flowchart of the output method for multi-modal interaction of a virtual human according to an embodiment of the invention; and
Fig. 9 shows a flowchart of the communication among the user, the hardware device, and the cloud server according to one embodiment of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
For clarity, the following should be stated before the embodiments described below:
the virtual human mentioned in the present invention is an intelligent device equipped with input/output modules supporting perception, control, and the like;
it uses a highly realistic 3D virtual character image as its main user interface and possesses an appearance with distinct character features;
it supports multi-modal human-computer interaction and possesses AI capabilities such as natural language understanding, visual perception, touch perception, speech output, and emotional facial-expression and action output;
it offers configurable social attributes, personality attributes, character skills, and so on, enabling the user to enjoy a smooth, intelligent, and personalized interaction experience with the virtual character.
The cloud server mentioned is the terminal that provides the multi-modal interactive robot with the processing capabilities of semantic understanding (language semantic understanding, action semantic understanding, emotion computing, cognitive computing) for the user's interaction demands, realizing the interaction with the user and thereby helping the user make decisions.
Each embodiment of the present invention is described in detail below with reference to the accompanying drawings.
Fig. 1 shows an interaction diagram of the output system for multi-modal interaction of a virtual human according to an embodiment of the invention.
As shown in Fig. 1, the system includes a user 101, a hardware device (comprising a display area 1021 and hardware support equipment 1022), a virtual human 103, and a cloud server 104. The user 101 interacting with the virtual human 103 can be a real person, another virtual human, or an embodied virtual human; the interaction of another virtual human or of an embodied virtual human with the virtual human is similar to the interaction of a single person with the virtual human. Therefore, Fig. 1 only shows the multi-modal interaction process between a user (a person) and the virtual human.
In addition, the hardware device includes the display area 1021 and the hardware support equipment 1022 (essentially a core processor). The display area 1021 is used to display the image of the virtual human 103, and the hardware support equipment 1022 is used together with the cloud server 104 for the data processing in the decision process. The virtual human 103 needs a screen carrier to be presented. Therefore, the display area 1021 includes: a PC screen, a projector, a television set, a multimedia display screen, holographic projection, VR, and AR. The multi-modal interaction proposed by the present invention requires certain hardware performance as support; in general, a PC with a host is used as the hardware support equipment 1022. In Fig. 1 the selected display area 1021 is a PC screen.
The process of interaction between the virtual human 103 and the user 101 in Fig. 1 is as follows:
First, the virtual human 103 enters a wake-up state in response to a received instruction, and its image is displayed in the preset display area. Before the interaction object issues a wake-up instruction, the virtual human is in a dormant state, waiting for a wake-up instruction; after the wake-up instruction is issued, the virtual human 103 enters the wake-up state to receive the next instruction from the user 101.
The wake-up modes include, but are not limited to, the following: touch wake-up, voice wake-up, remote-control wake-up, face-recognition wake-up, and scheduled wake-up. For example, voice wake-up means waking the virtual human 103 by voice: the user 101 can wake the virtual human 103 by uttering a fixed voice fragment. In addition, the virtual human 103 can also enter the wake-up mode at a specific time, which can be set and changed by the user 101. In short, there are many ways to wake the virtual human 103, and any wake-up mode capable of waking the virtual human 103 can be applied to the wake-up step of the present invention; the present invention places no limitation on this.
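A minimal sketch of such a wake-up dispatcher is given below; the event names and the render_image callback are illustrative assumptions rather than part of the patent's disclosure.

```python
# Hypothetical wake-up dispatcher for the wake-up modes listed above.
WAKE_EVENTS = {"touch", "voice_fragment", "remote_control",
               "face_recognized", "scheduled_time"}

class VirtualHuman:
    def __init__(self, render_image):
        self.state = "dormant"
        self.render_image = render_image  # shows the image in the preset display area

    def on_event(self, event):
        """Any configured wake-up event moves the virtual human from the
        dormant state to the wake-up state and displays its image."""
        if self.state == "dormant" and event in WAKE_EVENTS:
            self.state = "awake"
            self.render_image()
```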
After the virtual human 103 is woken and enters the wake-up state, it acquires multi-modal interaction input data. The multi-modal interaction data can be issued by the user 101 or obtained by perceiving the environment, and can include information of multiple modalities such as text, voice, vision, and perception information. The receiving devices for acquiring the multi-modal interaction data are installed or configured on the hardware device; they include a text receiver for receiving text, a voice receiver for receiving voice, a camera for receiving vision, an infrared device for receiving perception information, and so on.
After the hardware device acquires the multi-modal interaction input data, it transfers these data to the virtual human 103. The virtual human 103 calls a capability interface to parse the interaction input data and generates the corresponding multi-modal decision output data. The virtual human 103 can parse the interaction input data by calling the capability interfaces on the cloud server 104; the robot capabilities include semantic understanding, action understanding, emotion computing, cognitive computing, and other abilities, which can comprehensively parse the multi-modal input data so as to clarify the interaction intention behind the interaction input data. According to the parsing result, the corresponding multi-modal decision output data can be generated. After output matching with the image of the virtual human 103, these multi-modal decision output data can be presented through the image output of the virtual human 103.
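As a sketch of what calling such a capability interface could look like, the snippet below posts the multi-modal input to a cloud endpoint; the URL, payload fields, and use of the requests library are illustrative assumptions, not the patent's actual interface.

```python
import requests

# Capability interfaces assumed to be exposed by the cloud server.
CAPABILITIES = ["semantic_understanding", "action_understanding",
                "emotion_computing", "cognitive_computing"]

def parse_multimodal_input(cloud_url, multimodal_input):
    """Send the multi-modal interaction input to the cloud capability
    interfaces and return the multi-modal decision output data."""
    payload = {"capabilities": CAPABILITIES, "input": multimodal_input}
    resp = requests.post(f"{cloud_url}/capability/parse", json=payload, timeout=5)
    resp.raise_for_status()
    return resp.json()  # e.g. {"text": ..., "voice_file": ..., "emotion": ...}
```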
Then, the voice file in the multi-modal output data is matched against the mouth-shape model, and the voice is output through the mouth-shape file matched to it. The mouth-shape model includes: a Pinyin model and its data fused with word-segmentation information. The matching process can be divided into a Pinyin model stage and a mouth-shape model stage.
First, the steps performed for the Pinyin model are:
the voice file is converted to text, the text is then divided into pinyin syllables, and the pinyin syllables are matched to the mouth-shape parameters.
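A minimal sketch of this Pinyin model stage is shown below, assuming the pypinyin library for grapheme-to-pinyin conversion and pure Chinese text; the viseme table and its parameter values are illustrative placeholders, not figures taken from the patent.

```python
from pypinyin import lazy_pinyin, Style

# Hypothetical mouth-shape parameters per initial/final:
# (mouth shape, opening amplitude 0-1, tongue form)
VISEME_PARAMS = {
    "b": ("closed", 0.1, "neutral"),
    "a": ("wide_open", 0.9, "low"),
    "i": ("spread", 0.3, "high_front"),
    # ... one entry per initial and per final
}

def pinyin_model(text):
    """Divide recognized text into pinyin syllables and attach
    mouth-shape parameters to each initial and final."""
    initials = lazy_pinyin(text, style=Style.INITIALS, strict=False)
    finals = lazy_pinyin(text, style=Style.FINALS, strict=False)
    entries = []
    for char, ini, fin in zip(text, initials, finals):
        entries.append({
            "char": char,
            "initial": (ini, VISEME_PARAMS.get(ini)),
            "final": (fin, VISEME_PARAMS.get(fin)),
        })
    return entries
```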
The mouth-shape model stage is the process of fusing the Pinyin model with the structured word information; its steps include:
fusing the mouth-shape parameters corresponding to an initial and a final into the mouth-shape parameters corresponding to a character;
fusing the mouth-shape parameters of individual syllables into the mouth-shape parameters corresponding to a syllable combination; further fusing the mouth-shape parameters corresponding to characters into the mouth-shape parameters corresponding to a character combination;
matching and combining the above mouth-shape parameters, according to the pinyin-expressed prosody, with the mouth-shape parameters corresponding to the ending character, to form the final mouth-shape model. The mouth-shape parameters include: mouth shape, mouth amplitude, and tongue form.
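The sketch below illustrates this fusion stage under the same assumptions as the earlier snippets: the blending rule and the timing fields of the structured words (start, end, strongest amplitude) are illustrative stand-ins for the patent's concrete fusion rules.

```python
def blend(params_list):
    """Placeholder fusion rule: average the opening amplitudes and keep the
    last mouth shape / tongue form among the fused parameters."""
    params_list = [p for p in params_list if p]
    if not params_list:
        return None
    amp = sum(p[1] for p in params_list) / len(params_list)
    return (params_list[-1][0], amp, params_list[-1][2])

def mouth_shape_model(pinyin_entries, structured_words):
    """Fuse per-character visemes into word-level keyframes aligned with the
    word timing extracted from the segmented voice file."""
    keyframes, idx = [], 0
    for word in structured_words:  # {"text", "start", "end", "peak_amp"}
        chars = pinyin_entries[idx: idx + len(word["text"])]
        idx += len(word["text"])
        char_visemes = [blend([c["initial"][1], c["final"][1]]) for c in chars]
        keyframes.append({
            "start": word["start"], "end": word["end"],
            "amplitude_scale": word["peak_amp"],
            "viseme": blend(char_visemes),
        })
    return keyframes
```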
It should be noted that the above steps can be performed on the hardware device or on the cloud server; this is not limited.
During the interaction, the virtual human 103 can change its mood while responding and while waiting for the other party to respond. Besides responding with facial expressions, the virtual human 103 can also express its current mood by lowering its voice or raising its intonation.
The virtual human 103 can judge the current mood of the interaction object by parsing the multi-modal interaction data, and adjust its expression, speech rate, and intonation according to the emotional changes of the interaction object.
It should be noted here that the image and outfit of the virtual human 103 are not limited to a single style. The virtual human 103 can have different images and outfits. The image of the virtual human 103 is generally a high-polygon 3D animated figure. Each image of the virtual human 103 can also correspond to a variety of different outfits, which can be classified by season or by occasion. These images and outfits may reside on the cloud server 104 or on the hardware device, and can be called at any time when needed. Operators can later upload new images and outfits to the interaction platform periodically, and users can select the image and outfit they like as needed.
In simple terms, the above interaction steps are: first, entering a wake-up state in response to a received instruction and displaying the image in the preset display area; then, acquiring multi-modal interaction input data; then, calling a capability interface to parse the interaction input data and generating the corresponding multi-modal decision output data; and finally, matching the voice file in the multi-modal output data against the mouth-shape model and outputting the voice through the mouth-shape file matched to the voice, the mouth-shape model including: a Pinyin model and its data fused with word-segmentation information.
Fig. 2 shows a structural block diagram of the output system for multi-modal interaction of a virtual human according to an embodiment of the invention. As shown in Fig. 2, the system includes a user 101, a hardware device with a display area 1021, a virtual human 103, and a cloud server 104. The user 101 may be a single person, an embodied virtual human, or another virtual human. The hardware device includes a receiving device 102A, a processing device 102B, and an external connection device 102C. The cloud server 104 includes a communication device 1041 for communicating with the hardware device.
In the output system for multi-modal interaction of a virtual human provided by the invention, communication connections need to be established among the parties, setting up unobstructed communication channels between the user 101, the hardware device, and the cloud server 104, so that the interaction between the user 101 and the virtual human 103 can be completed. To complete the interaction task, the hardware device and the cloud server 104 are provided with devices and components that support the interaction. The object interacting with the virtual human can be one party or multiple parties.
The hardware device includes the receiving device 102A, the processing device 102B, and the external connection device 102C. The receiving device 102A is used to receive the multi-modal interaction input data. Examples of the receiving device 102A include a keyboard, a cursor control device (mouse), a microphone for voice operation, a scanner, touch functionality (e.g., a capacitive sensor to detect physical touch), a camera (detecting, at visible or invisible wavelengths, actions that do not involve touch), and so on. The hardware device can acquire the multi-modal interaction input data through the input equipment mentioned above.
The processing device 102B is used for handling the data in the interaction, usually the messages exchanged with the virtual human 103. The external connection device 102C is used for contact with the cloud server 104: through call instructions sent by the virtual human 103, it can call the robot capabilities on the cloud server 104 to parse the multi-modal interaction input data.
The cloud server 104 includes the communication device 1041, which is used to handle the correspondence with the hardware device. The communication device 1041 keeps in contact with the external connection device 102C on the hardware device, receives the instructions of the hardware device, and sends out the instructions issued by the cloud server 104; it is the medium of communication between the hardware device and the cloud server 104.
Fig. 3 shows a voice-file matching flowchart of the output system for multi-modal interaction of a virtual human according to an embodiment of the invention. As shown in Fig. 3, in order to achieve a perfect match between the mouth shape of the virtual human 103 and the voice when the voice file is output, the voice-file matching process includes speech recognition, the Pinyin model, voice segmentation, and the mouth-shape model.
First, speech recognition is performed on the voice file in the multi-modal decision output data; the speech recognition process converts the voice file to text. Then, Pinyin model processing is performed on the converted text: the text is divided into pinyin syllables, the divided pinyin syllables are matched to the mouth-shape parameters, and the Pinyin model is generated.
In addition, voice segmentation is performed on the voice file at the same time: the voice file is segmented to generate structured words. Then the information of the structured words is extracted, including: the start time, end time, and strongest amplitude of each word within the voice file. Finally, the Pinyin model is fused with the structured word information to generate the mouth-shape model corresponding to the mouth-shape parameters. At this point, the matched data can be output through the image of the virtual human 103 in coordination with the mouth shape.
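A sketch of this voice-segmentation step is shown below under stated assumptions: the word-level timestamps come from an ASR or forced-alignment result (asr_words is a hypothetical input), and the strongest amplitude is read from the waveform with numpy; neither is specified as the patent's concrete tooling.

```python
import numpy as np

def structured_words(asr_words, samples, sample_rate):
    """asr_words: [{"text": "你好", "start": 0.12, "end": 0.58}, ...]
    samples: mono waveform as a float array."""
    words = []
    for w in asr_words:
        s = int(w["start"] * sample_rate)
        e = int(w["end"] * sample_rate)
        peak = float(np.max(np.abs(samples[s:e]))) if e > s else 0.0
        words.append({
            "text": w["text"],
            "start": w["start"],
            "end": w["end"],
            "peak_amp": peak,  # strongest amplitude within the word
        })
    return words
```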
The process of fusing the Pinyin model with the structured word information comprises: first, the mouth-shape parameters corresponding to an initial and a final are fused into the mouth-shape parameters corresponding to a character. Then, the mouth-shape parameters of individual syllables are fused into the mouth-shape parameters corresponding to a syllable combination. Then, the mouth-shape parameters corresponding to characters are further fused into the mouth-shape parameters corresponding to a character combination. Finally, the above mouth-shape parameters are matched and combined, according to the pinyin-expressed prosody, with the mouth-shape parameters corresponding to the ending character, forming the final mouth-shape model.
It should be noted that the above steps can be performed on the hardware device or on the cloud server; this is not limited.
Fig. 4 shows a module block diagram of the output system for multi-modal interaction of a virtual human according to an embodiment of the invention. As shown in Fig. 4, the system includes a response module 401, an acquisition module 402, a calling module 403, and a matching module 404.
The response module 401 includes a wake-up unit and a display unit. The acquisition module 402 includes a text acquisition unit 4021, an audio acquisition unit 4022, an image acquisition unit 4023, and a video acquisition unit 4024. The calling module 403 includes a semantic understanding unit 4031, a visual recognition unit 4032, a cognitive computing unit 4033, and an emotion computing unit 4034. The matching module 404 includes a Pinyin model unit 4041 and a mouth-shape model unit 4042.
In the interaction between the virtual human 103 and the user 101, the user 101 first needs to wake the virtual human 103. During wake-up, the wake-up unit receives the wake-up instruction sent by the user 101 and verifies its correctness; once the wake-up instruction passes verification, the virtual human 103 is woken, enters the wake-up state, and waits to receive interaction instructions from the user 101. After the virtual human 103 enters the wake-up state, the display unit displays the image of the virtual human 103 in the display area of the hardware device, so that the user 101 has a more intuitive perception of the virtual human 103.
After the user 101 sends multi-modal interaction input data, the acquisition module 402 can call one or several units among the text acquisition unit 4021, the audio acquisition unit 4022, the image acquisition unit 4023, and the video acquisition unit 4024 to collect the multi-modal interaction input data, and transfers the collected information to the virtual human 103 so that the virtual human 103 can further analyze and process it. In addition to the above acquisition units, the acquisition module 402 can also be configured with acquisition units for other types of information such as perception information; the present invention places no limitation on this.
After the virtual human 103 receives the transmitted multi-modal interaction input data, the calling module 403 calls the capability interface to parse the interaction input data and generates the corresponding multi-modal decision output data. The robot capabilities include the semantic understanding unit 4031, the visual recognition unit 4032, the cognitive computing unit 4033, and the emotion computing unit 4034. These robot capabilities can analyze and judge the input information, and the multi-modal decision output data corresponding to this interaction is generated according to the results of the analysis and judgment.
Finally, the matching module 404 performs mouth-shape matching on the voice file in the multi-modal decision output data. The Pinyin model unit 4041 and the mouth-shape model unit 4042 can match the mouth shape of the virtual human 103 exactly to the voice file, so that when the voice file is output the mouth shape and the voice are unified, avoiding situations where sound and picture are inconsistent.
Fig. 5 shows a flowchart of the output method for multi-modal interaction of a virtual human according to an embodiment of the invention.
The output method for multi-modal interaction of a virtual human provided by the invention includes four steps: entering a wake-up state in response to a received instruction and displaying the image in the preset display area; acquiring multi-modal interaction input data; calling a capability interface to parse the interaction input data and generating the corresponding multi-modal decision output data; and matching the voice file in the multi-modal output data against the mouth-shape model and outputting the voice through the mouth-shape file matched to the voice, the mouth-shape model including: a Pinyin model and its data fused with word-segmentation information.
Through the above steps, the virtual human 103 can interact with the user 101 with the voice matched to the mouth shape, making the interaction between the virtual human and the person richer and smoother, and bringing the interaction behavior of the virtual human 103 closer to that of a human being.
Fig. 6 shows a voice-matching flowchart of the output method for multi-modal interaction of a virtual human according to an embodiment of the invention. To introduce the details of the output method for multi-modal interaction of a virtual human provided by the invention in more depth, the explanation is expanded with the flowchart shown in Fig. 6.
In the process of matching the voice file to the mouth shape, speech recognition first needs to be performed on the voice file to convert it to text. Then the text is divided into pinyin syllables, the pinyin syllables are matched to the mouth-shape parameters, and the Pinyin model is generated.
While the voice file is being converted to text by speech recognition, the voice file is also segmented to generate structured words. Then the information of the structured words is extracted, including: the start time, end time, and strongest amplitude of each word within the voice file. Finally, the Pinyin model is fused with the structured word information to generate the mouth-shape model corresponding to the mouth-shape parameters.
The flowchart shown in Fig. 6 can be roughly divided into two steps: the first step generates the Pinyin model, and the second step generates the mouth-shape model. Through the matching process of these two steps, the voice file in the multi-modal decision output data can be output and presented to the user 101.
Fig. 7 shows a detailed voice-matching flowchart of the output method for multi-modal interaction of a virtual human according to an embodiment of the invention.
Fig. 7 mainly illustrates the steps of fusing the Pinyin model with the structured word information. In the fusion process, first, the mouth-shape parameters corresponding to an initial and a final are fused into the mouth-shape parameters corresponding to a character. Then, the mouth-shape parameters of individual syllables are fused into the mouth-shape parameters corresponding to a syllable combination. Then, the mouth-shape parameters corresponding to characters are further fused into the mouth-shape parameters corresponding to a character combination. Finally, the above mouth-shape parameters are matched and combined, according to the pinyin-expressed prosody, with the mouth-shape parameters corresponding to the ending character, forming the final mouth-shape model.
Through the flowchart shown in Fig. 7, the Pinyin model and the structured word information can be fused to ensure that during the interaction the voice output by the virtual human 103 stays synchronized with the mouth shape, improving the quality and fluency of the interaction.
Fig. 8 shows another flowchart of the output method for multi-modal interaction of a virtual human according to an embodiment of the invention.
As shown in the figure, in step S801 the hardware device sends the conversation content to the cloud server 104. Afterwards, the hardware device remains in a state of waiting for the reply of the cloud server 104. While waiting, the hardware device can time how long the returned data takes. If no reply data is returned for a long time, for example exceeding a predetermined time length of 5 s, the hardware device can choose to reply locally and generate local common reply data. The image plug-in of the virtual human 103 then outputs the animation coordinated with the local common reply, and the voice playback equipment is called to play the speech.
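The timeout fallback described for Fig. 8 could be sketched as follows, assuming an asynchronous cloud call; the 5-second limit comes from the text above, while the function names (ask_cloud, local_reply, render_virtual_human) are illustrative placeholders rather than the patent's actual interfaces.

```python
import concurrent.futures

CLOUD_TIMEOUT_S = 5.0
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)

def respond(conversation_content, ask_cloud, local_reply, render_virtual_human):
    future = _pool.submit(ask_cloud, conversation_content)
    try:
        # Wait for the cloud decision output within the predetermined limit.
        reply = future.result(timeout=CLOUD_TIMEOUT_S)
    except concurrent.futures.TimeoutError:
        # No reply in time: fall back to a locally generated common reply.
        reply = local_reply(conversation_content)
        future.cancel()  # best effort; a call already running finishes in the background
    # Output the coordinated animation and play the speech.
    render_virtual_human(reply)
```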
Fig. 9 shows a flowchart of the communication among the user, the hardware device, and the cloud server according to one embodiment of the present invention.
In order to realize the interaction between the user 101 and the virtual human 103, the user 101, the hardware device, and the cloud server 104 need to keep in contact in real time, transmitting data and information.
At the start of the interaction, the user 101 needs to send a wake-up instruction so that the virtual human 103 enters the wake-up state. At this point, the communicating parties are the user 101 and the hardware device; the hardware device enters the wake-up state in response to the received instruction and displays the image in the preset display area.
Then the virtual human 103 waits for further interaction information and, after the interaction information is issued, acquires the multi-modal interaction input data. Communication then unfolds between the hardware device and the cloud server 104: the hardware device calls the capability interface to parse the interaction input data and generates the corresponding multi-modal decision output data. The robot capabilities are deployed on the cloud server 104 and include semantic understanding, visual recognition, cognitive computing, and emotion computing.
Then the virtual human 103 matches the voice file in the multi-modal output data against the mouth-shape model and outputs the voice through the mouth-shape file matched to the voice; the mouth-shape model includes: a Pinyin model and its data fused with word-segmentation information. Finally, the matched multi-modal decision output data is output by the virtual human 103 in a multi-modal form.
The present invention conducts dialogue interaction with a virtual human whose image is shown in the display area, so that the user appears to be interacting multi-modally with a real person, which increases the fluency of the interaction between the user and the hardware device. In addition, the virtual human, having an image output, can also perform complete mouth-shape expression, so that the voice output by the virtual human's image matches the mouth shape exactly.
It should be understood that the disclosed embodiments of the invention are not limited to the specific structures, processing steps, or materials disclosed herein, but extend to their equivalents as understood by those of ordinary skill in the relevant art. It should also be understood that the terminology used herein serves only to describe specific embodiments and is not intended to be limiting.
"One embodiment" or "an embodiment" mentioned in the specification means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Therefore, the phrases "one embodiment" or "an embodiment" appearing in various places throughout the specification do not necessarily all refer to the same embodiment.
Although the embodiments of the invention are disclosed as above, the described content is only an implementation adopted to facilitate understanding of the invention and is not intended to limit it. Any person skilled in the art to which this invention pertains may make modifications and changes in the form and details of implementation without departing from the spirit and scope disclosed by the invention; however, the scope of patent protection of the invention shall still be subject to the scope defined by the appended claims.

Claims (10)

1. An output method for multi-modal interaction of a virtual human, characterized in that the method comprises the following steps:
entering a wake-up state in response to a received instruction, and displaying the virtual human's image in a preset display area;
acquiring multi-modal interaction input data;
calling a capability interface to parse the interaction input data, and generating corresponding multi-modal decision output data;
matching a voice file in the multi-modal output data against a mouth-shape model, and outputting the voice through the mouth-shape file matched to the voice, the mouth-shape model comprising: a Pinyin model and its data fused with word-segmentation information.
2. The output method for multi-modal interaction of a virtual human according to claim 1, wherein the Pinyin model is generated according to the following steps:
performing speech recognition on the voice file and converting it to text;
dividing the text into pinyin syllables, matching the pinyin syllables to mouth-shape parameters, and generating the Pinyin model.
3. The output method for multi-modal interaction of a virtual human according to claim 2, characterized in that the mouth-shape model is generated according to the following steps:
segmenting the collected voice file to generate structured words;
extracting the information of the structured words, including: the start time, end time, and strongest amplitude of each word within the voice file;
fusing the Pinyin model with the structured word information, and generating a mouth-shape model corresponding to the mouth-shape parameters.
4. The output method for multi-modal interaction of a virtual human according to claim 3, characterized in that fusing the Pinyin model with the structured word information comprises:
fusing the mouth-shape parameters corresponding to an initial and a final into the mouth-shape parameters corresponding to a character;
fusing the mouth-shape parameters of individual syllables into the mouth-shape parameters corresponding to a syllable combination;
further fusing the mouth-shape parameters corresponding to characters into the mouth-shape parameters corresponding to a character combination;
matching and combining the above mouth-shape parameters, according to the pinyin-expressed prosody, with the mouth-shape parameters corresponding to the ending character, to form the final mouth-shape model.
5. The output method for multi-modal interaction of a virtual human according to any one of claims 2 to 4, characterized in that the mouth-shape parameters include: mouth shape, mouth amplitude, and tongue form.
6. A storage medium storing program code executable to perform the method steps according to any one of claims 1 to 5.
7. An output device for multi-modal interaction of a virtual human, characterized in that the device comprises:
a response module, configured to enter a wake-up state in response to a received instruction and display the virtual human's image in a preset display area;
an acquisition module, configured to acquire multi-modal interaction input data;
a calling module, configured to call a capability interface to parse the interaction input data and generate corresponding multi-modal decision output data;
a matching module, configured to match the voice file in the multi-modal output data against a mouth-shape model and output the voice through the mouth-shape file matched to the voice, the mouth-shape model comprising: a Pinyin model and its data fused with word-segmentation information.
8. The output device for multi-modal interaction of a virtual human according to claim 7, characterized in that the matching module further comprises:
a conversion unit, configured to perform speech recognition on the voice file and convert it to text;
a division unit, configured to divide the text into pinyin syllables, match the pinyin syllables to mouth-shape parameters, and generate the Pinyin model.
9. The output device for multi-modal interaction of a virtual human according to claim 7, characterized in that the device comprises:
a segmentation unit, configured to segment the collected voice file to generate structured words;
an extraction unit, configured to extract the information of the structured words, including: the start time, end time, and strongest amplitude of each word within the voice file;
a fusion unit, configured to fuse the Pinyin model with the structured word information and generate a mouth-shape model corresponding to the mouth-shape parameters.
10. An output system for multi-modal interaction of a virtual human, characterized in that the system comprises:
a hardware device, configured to display the virtual human's image and to handle the processing of data in the interaction between the user and the virtual human;
a cloud server, configured to cooperate with the hardware device to complete the following steps:
calling a capability interface to parse the interaction input data, and generating corresponding multi-modal decision output data;
matching the voice file in the multi-modal output data against the mouth-shape model, and outputting the voice through the mouth-shape file matched to the voice, the mouth-shape model comprising: a Pinyin model and its data fused with word-segmentation information.
CN201710822978.XA 2017-09-13 2017-09-13 Output method and system for multi-modal interaction of a virtual human Pending CN107808191A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710822978.XA CN107808191A (en) Output method and system for multi-modal interaction of a virtual human


Publications (1)

Publication Number Publication Date
CN107808191A true CN107808191A (en) 2018-03-16

Family

ID=61591446

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710822978.XA Pending CN107808191A (en) Output method and system for multi-modal interaction of a virtual human

Country Status (1)

Country Link
CN (1) CN107808191A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101482976A (en) * 2009-01-19 2009-07-15 腾讯科技(深圳)有限公司 Method for driving change of lip shape by voice, method and apparatus for acquiring lip cartoon
CN101826216A (en) * 2010-03-31 2010-09-08 中国科学院自动化研究所 Automatic generating system for role Chinese mouth shape cartoon
CN202315292U (en) * 2011-11-11 2012-07-11 山东科技大学 Comprehensive greeting robot based on smart phone interaction
CN103218841A (en) * 2013-04-26 2013-07-24 中国科学技术大学 Three-dimensional vocal organ animation method combining physiological model and data driving model
CN104769645A (en) * 2013-07-10 2015-07-08 哲睿有限公司 Virtual companion
CN104361620A (en) * 2014-11-27 2015-02-18 韩慧健 Mouth shape animation synthesis method based on comprehensive weighted algorithm
CN105345818A (en) * 2015-11-04 2016-02-24 深圳好未来智能科技有限公司 3D video interaction robot with emotion module and expression module
CN105867633A (en) * 2016-04-26 2016-08-17 北京光年无限科技有限公司 Intelligent robot oriented information processing method and system
CN105975280A (en) * 2016-05-13 2016-09-28 苏州乐派特机器人有限公司 Multipurpose flexible materialization programming module and realizing method thereof
CN106875947A (en) * 2016-12-28 2017-06-20 北京光年无限科技有限公司 For the speech output method and device of intelligent robot

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109032340B (en) * 2018-06-29 2020-08-07 百度在线网络技术(北京)有限公司 Operation method and device for electronic equipment
CN110653815B (en) * 2018-06-29 2021-12-07 深圳市优必选科技有限公司 Robot control method, robot and computer storage medium
CN110653815A (en) * 2018-06-29 2020-01-07 深圳市优必选科技有限公司 Robot control method, robot and computer storage medium
CN109032340A (en) * 2018-06-29 2018-12-18 百度在线网络技术(北京)有限公司 Operating method for electronic equipment and device
CN108961431A (en) * 2018-07-03 2018-12-07 百度在线网络技术(北京)有限公司 Generation method, device and the terminal device of facial expression
CN110874137A (en) * 2018-08-31 2020-03-10 阿里巴巴集团控股有限公司 Interaction method and device
CN110874137B (en) * 2018-08-31 2023-06-13 阿里巴巴集团控股有限公司 Interaction method and device
CN109326151A (en) * 2018-11-01 2019-02-12 北京智能优学科技有限公司 Implementation method, client and server based on semantics-driven virtual image
CN110688911A (en) * 2019-09-05 2020-01-14 深圳追一科技有限公司 Video processing method, device, system, terminal equipment and storage medium
CN110688911B (en) * 2019-09-05 2021-04-02 深圳追一科技有限公司 Video processing method, device, system, terminal equipment and storage medium
CN111522929A (en) * 2020-04-22 2020-08-11 深圳创维-Rgb电子有限公司 Conductance decompression data processing method, display device and storage medium
CN112396182A (en) * 2021-01-19 2021-02-23 腾讯科技(深圳)有限公司 Method for training face driving model and generating face mouth shape animation
CN112396182B (en) * 2021-01-19 2021-04-16 腾讯科技(深圳)有限公司 Method for training face driving model and generating face mouth shape animation
CN113205797A (en) * 2021-04-30 2021-08-03 平安科技(深圳)有限公司 Virtual anchor generation method and device, computer equipment and readable storage medium
CN113205797B (en) * 2021-04-30 2024-03-05 平安科技(深圳)有限公司 Virtual anchor generation method, device, computer equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN107808191A (en) Output method and system for multi-modal interaction of a virtual human
CN108000526B (en) Dialogue interaction method and system for intelligent robot
CN107340859A (en) The multi-modal exchange method and system of multi-modal virtual robot
CN107797663A (en) Multi-modal interaction processing method and system based on visual human
CN107632706A (en) The application data processing method and system of multi-modal visual human
CN107765852A (en) Multi-modal interaction processing method and system based on visual human
CN107340865A (en) Multi-modal virtual robot exchange method and system
CN107704169B (en) Virtual human state management method and system
CN107894833A (en) Multi-modal interaction processing method and system based on visual human
CN107870977A (en) Chat robots output is formed based on User Status
CN110400251A (en) Method for processing video frequency, device, terminal device and storage medium
CN107294837A (en) Engaged in the dialogue interactive method and system using virtual robot
CN107728780A (en) A kind of man-machine interaction method and device based on virtual robot
CN107870994A (en) Man-machine interaction method and system for intelligent robot
CN110309254A (en) Intelligent robot and man-machine interaction method
CN107704612A (en) Dialogue exchange method and system for intelligent robot
CN109117952B (en) Robot emotion cognition method based on deep learning
CN105244042B (en) A kind of speech emotional interactive device and method based on finite-state automata
CN109324688A (en) Exchange method and system based on visual human's behavioral standard
CN107480766A (en) The method and system of the content generation of multi-modal virtual robot
CN108052250A (en) Virtual idol deductive data processing method and system based on multi-modal interaction
CN108416420A (en) Limbs exchange method based on visual human and system
CN115330911A (en) Method and system for driving mimicry expression by using audio
CN106653020A (en) Multi-business control method and system for smart sound and video equipment based on deep learning
CN108415561A (en) Gesture interaction method based on visual human and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180316