CN107944542A

CN107944542A - A kind of multi-modal interactive output method and system based on visual human

Info

Publication number: CN107944542A
Application number: CN201711162023.2A
Authority: CN
Inventors: 徐强; 尚小维
Original assignee: Beijing Guangnian Wuxian Technology Co Ltd
Current assignee: Beijing Guangnian Wuxian Technology Co Ltd
Priority date: 2017-11-21
Filing date: 2017-11-21
Publication date: 2018-04-20

Abstract

The application provides a kind of multi-modal interactive output method and system based on visual human, wherein, the described method includes：The visual human runs in smart machine, obtain multi-modal data, voice data is included at least in the multi-modal data, parse the multi-modal data, to obtain the semantic data and affection data in the voice data, the semantic data and the affection data are matched with the facial parameter of the visual human, facial bionic data is generated and exports；By carrying out semantic data and affection data parsing to the multi-modal data got, the face of visual human is set to carry out the imitation of face action and facial emotion according to analysis result, strengthen the viscosity of user's visual sense feeling, true to nature, smooth simulation interaction effect is presented, improves interactive experience.

Description

A kind of multi-modal interactive output method and system based on visual human

Technical field

This application involves field of artificial intelligence, more particularly to a kind of multi-modal interactive output method based on visual human And system, a kind of visual human, a kind of smart machine and a kind of computer-readable recording medium.

Background technology

With the continuous development of scientific technology, the introducing of information technology, computer technology and artificial intelligence technology, machine Industrial circle is progressively walked out in the research of people, gradually extend to the neck such as medical treatment, health care, family, amusement and service industry Domain.And requirement of the people for robot also conform to the principle of simplicity the multiple mechanical action of substance be promoted to anthropomorphic question and answer, independence and with The intelligent robot that other robot interacts, human-computer interaction also just become an important factor for determining intelligent robot development.

Robot includes the tangible machine people for possessing entity and the virtual robot being mounted on hardware device at present.It is existing Virtual robot in technology can not carry out multi-modal interaction, show changeless state always, will can not be imitated Mood, the emotion of person is come out by facial true smooth imitation, can not realize true to nature, smooth, anthropomorphic interaction effect.

Therefore, the interaction capabilities and presentation ability of virtual robot are lifted, are the major issues of present urgent need to resolve.

The content of the invention

In view of this, the application provides a kind of multi-modal interactive output method and system based on visual human, one kind virtually People, a kind of smart machine and a kind of computer-readable recording medium, to solve technological deficiency existing in the prior art.

On the one hand, the application provides a kind of multi-modal interactive output method based on visual human, and the visual human is in intelligence Equipment is run, including：

Multi-modal data is obtained, voice data is included at least in the multi-modal data；

The multi-modal data is parsed, to obtain the semantic data and affection data in the voice data；

The semantic data and the affection data are matched with the facial parameter of the visual human, it is imitative to generate face Raw data simultaneously export.

Alternatively, before obtaining multi-modal data, further include：

Visual human is waken up, the visual human is shown in default display area.

Alternatively, the semantic data and the affection data are matched with the facial parameter of the visual human, it is raw Include into facial bionic data and output：

Cutting word is carried out according to the semantic data, cutting word result is matched with the nozzle type model of the visual human, with Generation mouth bionic data simultaneously exports；

For the affection data, affective tag is set；

According to the corresponding facial parameter set of affective tag selection, to coordinate the bionical number of mouth of the nozzle type model According to.

Alternatively, the facial parameter of the visual human includes the facial skeleton, described a crease in the skin, the facial muscle groups And/or the blee.

Alternatively, the facial parameter set includes but not limited to：

The facial skeleton and the bionical synergistic data of facial muscle groups movement；

The facial skeleton and the bionical synergistic data of described a crease in the skin movement；

Described a crease in the skin and the bionical synergistic data of facial muscle groups movement；Or

The facial skeleton, described a crease in the skin, the bionical collaboration number of the facial muscle groups and/or the blee According to.

Alternatively, the visual human is built by 3D high moulds and generated, and possesses default image and technical ability；

The visual human is including operating in the application program on the smart machine, executable file or by the intelligence The hologram that equipment projects.

Alternatively, the system that the smart machine uses includes but not limited to WINDOWS systems, MACOS systems or holography Equipment built-in system.

Alternatively, the default display area includes the throwing of the display interface or the smart machine of the smart machine Penetrate region.

On the other hand, this application provides a kind of multi-modal interactive output system based on visual human, including smart machine And server, the visual human run in smart machine, wherein：

The smart machine obtains multi-modal data, and voice data is included at least in the multi-modal data；

The server parses the multi-modal data, to obtain the semantic data and emotion number in the voice data According to；

The server is matched the semantic data and the affection data with the facial parameter of the visual human, The facial bionic data of generation；

The smart machine receives the facial bionic data and exports.

Alternatively, the server parses the multi-modal data and is implemented as：

For the affection data, affective tag is set；

On the other hand, this application provides a kind of visual human, the visual human to run in smart machine, and the visual human holds The above-mentioned multi-modal interactive output method based on visual human of row.

On the other hand, this application provides a kind of smart machine, above-mentioned visual human is run on the smart machine.

On the other hand, this application provides a kind of computer-readable recording medium, it is stored with computer program, its feature It is, which realizes above-mentioned multi-modal interactive output method based on visual human when being executed by processor the step of.

A kind of multi-modal interactive output method and system, a kind of visual human, one kind based on visual human that the application provides Smart machine and a kind of computer-readable recording medium, by obtaining multi-modal data, include at least in the multi-modal data Voice data；Then the multi-modal data is parsed, to obtain the semantic data and affection data in the voice data；Finally The semantic data and the affection data are matched with the facial parameter of the visual human, generate facial bionic data simultaneously Output；By carrying out semantic data and affection data parsing to the multi-modal data got, allow the face root of visual human The imitation of face action and facial emotion is carried out according to analysis result, strengthens the viscosity of user's visual sense feeling, is presented true to nature, smooth Interaction effect is simulated, improves interactive experience.

Brief description of the drawings

Fig. 1 is a kind of structural representation for multi-modal interactive output system based on visual human that one embodiment of the application provides Figure；

Fig. 2 is a kind of multi-modal interactive output method flow chart based on visual human that one embodiment of the application provides；

Fig. 3 is a kind of multi-modal interactive output method flow chart based on visual human that one embodiment of the application provides；

Fig. 4 is a kind of multi-modal interactive output method flow chart based on visual human that one embodiment of the application provides；

Fig. 5 is a kind of multi-modal interactive output method flow chart based on visual human that one embodiment of the application provides；

Fig. 6 is a kind of structural representation for multi-modal interactive output system based on visual human that one embodiment of the application provides Figure.

Embodiment

Many details are elaborated in the following description in order to fully understand the application.But the application can be with Much implement different from other manner described here, those skilled in the art can be in the situation without prejudice to the application intension Under do similar popularization, therefore the application is from the limitation of following public specific implementation.

In this application, there is provided a kind of multi-modal interactive output method and system based on visual human, a kind of visual human, A kind of smart machine and a kind of computer-readable recording medium, are described in detail one by one in the following embodiments.

In the application, the virtual smart machine for being artificially equipped on the input/output modules such as support perception, control；

Using height emulation 3d virtual figure images as Main User Interface, possesses the appearance of notable character features；

Support multi-modal human-computer interaction, possess natural language understanding, visual perception, touch perception, language voice output, feelings Feel the AI abilities such as facial expressions and acts output；

Configurable social property, personality attribute, personage's technical ability etc., make user enjoy intelligent and personalized Flow Experience Virtual portrait.

Visual human operates in smart machine, the smart machine can be desktop PC, notebook, palm PC and The intellectual computing devices such as mobile terminal, it is even more important that can also be intelligent line holographic projections equipment etc., the mobile terminal It can include smart mobile phone, tablet, intelligent robot etc..

The attribute that the visual human possesses, can include：Visual human's mark, social property, personality attribute, personage's technical ability etc. Attribute.Specifically, social property can include：Appearance, name, gender, native place, age, family relationship, occupation, position, ancestor Teach the attribute fields such as faith, emotion state, educational background；Personality attribute can include：The attribute fields such as personality, makings；Personage's technical ability It can include：Sing and dance, the professional skill such as tell a story, train.

In this application, the attribute of visual human can enable the parsing of multi-modal interaction and the result of decision be more prone to or It is more suitable for the visual human, system can be by calling the attribute information to realize the wake-up of visual human, activity, going to wake up and nullify Etc. the control of state, belong to the adeditive attribute information that visual human distinguishes true people.

In the application, the intelligent holographic projector equipment can use hologram device built-in system, can also pass through certainly Other equipment and platform, other equipment and platform can configure WINDOWS systems or MACOS systems.

Therefore, the visual human can be the hologram or fortune come out by intelligent holographic projection Application program or executable file of the row on the smart machine.

Referring to Fig. 1, for the structure diagram of the multi-modal output system based on visual human of the embodiment of the present application.

The multi-modal output system based on visual human includes smart machine 120 and server, and the server can be High in the clouds brain 110.

The smart machine 120 can include：User interface 121, communication module 122, central processing unit 123 and man-machine Interactively enter output module 124.Wherein, the user interface 121, its shown in default display area be waken up it is virtual People.The human-computer interaction input/output module 124, it obtains multi-modal data and output visual human performs parameter, multi-modal Data include the data from surrounding environment and the multi-modal input data (including at least voice data) interacted with user. The communication module 122, it calls visual human's ability interface and receives parses multi-modal input number by visual human's ability interface According to the multi-modal output data of decision-making.The central processing unit 123, it utilizes the real speech data in multi-modal output data When calculating visual human to the speech imitation to the semantic understanding and affective comprehension of the voice data with visual human mouth movement and The execution parameter of facial expression.

The high in the clouds brain 110 possesses multi-modal data parsing module (also referred to as " visual human's ability interface "), it is to described The multi-modal data that smart machine 120 is sent is parsed, and the multi-modal output data of decision-making, the multi-modal output data packet Include real speech data and visual human semantic understanding and affective comprehension to the voice data.

As shown in Figure 1, corresponding logical process is called respectively in each ability interface of multi-modal data resolving.Below For the explanation of each interface：

Semantic understanding interface 111, it receives the voice messaging from the communication module 122 forwarding, voice knowledge is carried out to it The other and natural language processing based on a large amount of language materials.

Visual identity interface 112, can be directed to human body, face, scene according to computer vision algorithms make, deep learning algorithm Deng progress video content detection, identification, tracking etc..Image is identified according to predetermined algorithm, the detection of quantitative As a result.Possess image preprocessing function, feature extraction functions, decision making function and concrete application function.Image preprocessing can be Basic handling, including the conversion of color space conversion, edge extracting, image and image threshold are carried out to the vision collecting data of acquisition Change；Feature extraction can extract the characteristic information such as the colour of skin of target, color, texture, movement and coordinate in image；Decision-making can be with It is to characteristic information, the concrete application for needing this feature information is distributed to according to certain decision strategy；Concrete application function is real The functions such as existing Face datection, human limbs identification, motion detection.

Affection computation interface 114, it receives the multi-modal data from the communication module 122 forwarding, utilizes affection computation Logic (can be Emotion identification technology) calculates the current emotional state of user.Emotion identification technology is one of affection computation Important component, the content of Emotion identification research can include facial expression, voice, behavior, text and physiological signal identification Etc., the emotional state of user is may determine that by above content.Emotion identification technology can only be known by vision mood Other technology monitors the emotional state of user, can also monitor the emotional state of user using sound Emotion identification technology, and It is not limited thereto.In the present embodiment, mood is monitored by the way of sound Emotion identification technology.

Affection computation interface 114 collects voice when carrying out voice mood identification, by using voice capture device, then Detachable text is converted into, recycles sound Emotion identification technology technology to carry out expression mood analysis.Understand facial expression, lead to Often the delicate change to expression is needed to be detected, such as the change of cheek muscle, mouth and eye, the motion change etc. of eyebrow.

Cognition calculates interface 113, it receives the multi-modal data from the communication module 122 forwarding, and the cognition calculates Interface 113 carries out data acquisition, identification and study to handle multi-modal data, to obtain user's portrait, knowledge mapping etc., with Rational Decision is carried out to multi-modal output data.

A kind of schematical technical solution of the above-mentioned multi-modal output system based on visual human for the embodiment of the present application. For the ease of skilled artisan understands that the technical solution of the application, the description below by multiple embodiments to the application one Multi-modal interactive output method and system, a kind of visual human, a kind of smart machine and a kind of computer of the kind based on visual human can Storage medium is read to be further detailed.

Referring to Fig. 2, one embodiment of the application provides a kind of multi-modal interactive output method based on visual human, described virtual People runs in smart machine, including step 201 is to step 203.

Step 201：Multi-modal data is obtained, voice data is included at least in the multi-modal data.

In the embodiment of the present application, the multi-modal data can be real human, the class mankind or the primate of collection Natural language, visual perception, touch data, the multi-modal data such as perception, language voice, emotional facial expressions and action and may be used also With including the data from surrounding environment.

The voice data includes semantic data, affection data, pitch data, loudness of a sound data, duration of a sound data and sound Chromatic number according to etc..

Step 202：The multi-modal data is parsed, to obtain the semantic data and affection data in the voice data.

In the embodiment of the present application, the semantic data of acquisition and the affection data are needed according in the voice data Pitch data, loudness of a sound data, duration of a sound data and tamber data etc. be determined.

Such as the voice data got is " I am fine ", when pitch data, loudness of a sound data, duration of a sound data and sound Chromatic number according to it is overall it is relatively low in the case of, then cannot only understand from literal meaning " I am fine ", but to combine the expresser State in which, scene or itself occupation etc. understood, should when semantic understanding and emotional expression is carried out Analyze expresser true intention to be expressed " I is not fine ".

The affection data can include multiple affective tags, and the affective tag can be divided into positive affective tag and bear To affective tag, the forward direction affective tag can include unpleasant emotional label, trust affective tag, grateful affective tag and celebrating Good fortune affective tag etc., the negative sense affective tag can include painful affective tag, disdain affective tag, hatred affective tag and Envy affective tag etc., and can also these affective tags be carried out with the division of grade, such as unpleasant emotional label again can be with It is divided into happiness affective tag, pleasant affective tag or comfortable affective tag etc..

Step 203：The semantic data and the affection data are matched with the facial parameter of the visual human, it is raw Into facial bionic data and export.

In the embodiment of the present application, the facial parameter of the visual human includes facial skeleton, a crease in the skin or facial muscle groups, institute It can be facial face bone to state facial skeleton, and described a crease in the skin can be the line that facial epidermis produces when action Road, the facial muscle groups are distributed for the muscle of face, meanwhile, the facial muscle groups can drive a crease in the skin according to affective tag And blee change.

Referring to Fig. 3, in the embodiment of the present application, by the semantic data and the affection data and the face of the visual human Parameter is matched, and generates facial bionic data and output specifically includes step 301 to step 303.

Step 301：Cutting word is carried out according to the semantic data, the nozzle type model of cutting word result and the visual human are carried out Matching, to generate mouth bionic data and export.

In the embodiment of the present application, the voice data is converted into corresponding text data, then according to the semantic number Cutting word is carried out according to the text data, is then matched cutting word result with the nozzle type model of the visual human, to coordinate The nozzle type of the visual human.

Step 302：For the affection data, affective tag is set.

In the embodiment of the present application, pitch data, loudness of a sound data, duration of a sound data and tone color in the voice data The data such as data determine the affection data, and corresponding affective tag, such as the language got are set according to the affection data Sound is " I has enough ", can determine that the words is after then being judged by the pitch of the voice, loudness of a sound, the duration of a sound and tone color Told in the case where meeting happy affective state, then will be that the words sets a happiness affective tag.

In addition, the affection data can also calculate user using affection computation logic (can be Emotion identification technology) Current emotional state.Emotion identification technology is an important component of affection computation, the content bag of Emotion identification research Facial expression, voice, behavior, text and physiological signal identification etc. are included, the feelings of user are may determine that by above content Not-ready status.Emotion identification technology can only monitor the feelings of user by vision Emotion identification technology or sound Emotion identification technology Not-ready status, can also monitor the mood of user by the way of vision Emotion identification technology and sound Emotion identification technology combine State, wherein, the sound Emotion identification technology can pass through：The text Emotion identification realization of text is changed into according to sound, and It is not limited thereto.In the present embodiment, it is preferred to use sound Emotion identification technology monitors mood, calculates affection data.

Step 303：According to the corresponding facial parameter set of affective tag selection, to coordinate the mouth of the nozzle type model Portion's bionic data.

In the embodiment of the present application, the face parameter set includes but not limited to：The facial skeleton and the facial flesh The bionical synergistic data of group's movement；The facial skeleton and the bionical synergistic data of described a crease in the skin movement；The skin fold Wrinkle and the bionical synergistic data of facial muscle groups movement；Or the facial skeleton, described a crease in the skin, the facial muscle groups And/or the bionical synergistic data of the blee.

Each described affective tag corresponds to a suitable facial parameter set, such as when the affective tag is During happiness affective tag, the corresponding facial parameter set can be the one of the facial skeleton and facial muscle groups movement The bionical synergistic data of group, alternatively, can be the facial skeleton, described a crease in the skin, the facial muscle groups and/or the face One group of bionical synergistic data of the colour of skin is, it is necessary to which explanation, the facial muscle groups can drive a crease in the skin according to affective tag And the change of blee.

The embodiment of the present application, the voice data in the multi-modal data to getting carry out corresponding affection computation and obtain To the affection data of user, the affective tag exported according to current state, semantic context decision-making, and according to the voice data Corresponding text carries out cutting word to the voice data, then by the progress of the result of cutting word and the nozzle type model of visual human Match somebody with somebody, then affection data is matched in voice output and facial movement and is merged, the virtual human face coordinated is presented.

Referring to Fig. 4, one embodiment of the application provides a kind of multi-modal interactive output method based on visual human, described virtual People runs in smart machine, including step 401 is to step 406.

Step 401：Visual human is waken up, the visual human is shown in default display area.

In the embodiment of the present application, the visual human is built by 3D high moulds and generated, and possesses default image and technical ability, Such as visual human can be the image appearance of real human, primate either cartoon figure etc., possessing can be according to connecing Received voice carries out the function of nozzle type imitation and facial emotion behavior.

The default display area can include the projection of the display interface or intelligent holographic projector equipment of smart machine Region.

In the embodiment of the present application, the visual human can in standby, dormancy isotype, it is automatic when needing to be imitated or Visual human is waken up manually so that visual human shows that image, such as visual human are operated on smart mobile phone in smart machine One application program, the image of the visual human is the face image of the real human of display after which opens, can With the voice by obtaining one section of real human, cutting word after text is converted the speech into, the visual human is according to cutting word result The imitation of nozzle type is carried out, and the corresponding affection data of the voice data can be calculated, when carrying out nozzle type imitation by described in Affection data fusion is entered, and more vivid imitation is presented, when the application program without using when will enter on backstage it is temporary When resting state, it is necessary to when use manually from backstage switch, you can wake up the visual human run in the application program.

In addition, the visual human can also be the hologram of intelligent holographic projector equipment projection, the hologram Projected area is the display area of the visual human.

Step 402：Multi-modal data is obtained, voice data is included at least in the multi-modal data.

In the embodiment of the present application, the multi-modal data is acquired based on hardware, and the hardware can be smart machine Built-in or external microphone, camera or touch-screen etc..

Step 403：The multi-modal data is parsed, to obtain the semantic data and affection data in the voice data.

In the embodiment of the present application, the multi-modal data is that brain i.e. server is parsed beyond the clouds, then will parsing The facial parameter of obtained affection data, semantic data and visual human are matched.

Step 404：Cutting word is carried out according to the semantic data, the nozzle type model of cutting word result and the visual human are carried out Matching, to generate mouth bionic data and export.

In the embodiment of the present application, the voice data is converted into text, then basis is obtained from the voice data Semantic progress cutting word of the semantic data arrived to text based on context.

Step 405：For the affection data, affective tag is set.

Step 406：According to the corresponding facial parameter set of affective tag selection, to coordinate the mouth of the nozzle type model Portion's bionic data.

In the embodiment of the present application, when the visual human is imitated according to one section of voice, the corresponding text of voice data Determine cutting word as a result, the matching that the nozzle type model of the result of cutting word and the visual human are carried out, realizes the visual human The mouth of the voice is imitated, while the visual human carries out mouth imitation, the face of the visual human also can basis The affection data being calculated is changed accordingly, and mouth movement and face operation are combined realization pair by the visual human The imitation of the voice of real human or primate, the nozzle type of the visual human is imitated and facial emotion changes what can be imitated Incisively and vividly.

Referring to Fig. 5, exemplified by operating in the visual human on smart mobile phone and realize the speech imitation of real human, the application carries For a kind of multi-modal interactive output method based on visual human, including step 501 is to step 507.

Step 501：Visual human is waken up, the visual human is shown in the default display area on smart mobile phone.

The visual human is configured with visual human's wake-up module, when meeting to wake up the preset condition of visual human with judgement, by void Anthropomorphic status attribute transition are wake-up states, and wake-up condition for example can be that user sends the voice messaging for waking up some visual human Or user wake up visual human action message, also or user directly input biological characteristic instruction.Visual human's wake-up module When judging to meet to wake up the preset condition of visual human, then instructed according to wake-up and carry out wake operation.If the wake-up that user sends refers to Order is without specific visual human is referred to, then system default is the last visual human waken up.

In the embodiment of the present application, the visual human can be to build to generate by 3D high moulds, and possess default image And technical ability, such as the face image for the small C of real human face that nozzle type and facial emotions are imitated can be carried out, open in smart mobile phone The application program (APP) of installation, the small C of visual human are operated in the APP, wake up the small C of the visual human, the visual human is small C may be displayed on the default display area of smart mobile phone application program (APP), default display described in the embodiment of the present application Region can be the middle position of smart mobile phone display screen.

Step 502：Multi-modal data is obtained, voice data is included at least in the multi-modal data.

In the embodiment of the present application, the multi-modal data can be real human, the class mankind or the primate of collection Natural language, visual perception, the data for touching the generation such as perception, language voice, emotional facial expressions and action, the multi-modal number According to the data from surrounding environment can also be included.

In the embodiment of the present application, multi-modal data, such as the multi-modal data for the small D of real human of acquisition are obtained, and And one section of real speech comprising small D.

Below with the face image of the small C of virtual artificial real human face, the voice data got is the small D's of real human Illustrated exemplified by one section of real speech.

Step 503：The multi-modal data is parsed, to obtain the semantic data and affection data in the voice data.

In the embodiment of the present application, the multi-modal data of the small D got is parsed, to obtain in the voice data Semantic data and affection data, such as obtain the semantic data and affection data in one section of real speech of small D, and will obtain The real speech of the small D got is converted into text as " glad to meet you again next having time again about ".

Step 504：Cutting word is carried out according to the semantic data, the nozzle type model of cutting word result and the visual human are carried out Matching, to generate mouth bionic data and export.

In the embodiment of the present application, cutting word is carried out according to the semantic data, such as be converted into according to the real speech of small D The semantic of context in text " glad to meet you again next having time again about " carries out cutting word to text, and cutting word result is " again, see you, be very glad, next time, having time, again about ", then by the result of cutting word and the nozzle type model of the visual human Matched, be that the movement of true mouth and the visual human small C that carry out will be needed during word pronunciation after above-mentioned cutting word Mouth model is matched so that the small C of visual human can not only make a sound when above-mentioned voice is imitated, and face also can Corresponding movement is occurred according to the difference of word.

Step 505：For the affection data, affective tag is set.

In the embodiment of the present application, the corresponding affection data is calculated according to voice data, to obtain the voice data Corresponding mood, such as the affection data, the emotion that the affection data is correspondingly arranged are calculated by the real speech of small D Label is unpleasant emotional label.

Step 506：According to the corresponding facial parameter set of affective tag selection, to coordinate the mouth of the nozzle type model Portion's bionic data.

In the embodiment of the present application, according to the corresponding facial parameter set of affective tag selection, such as unpleasant emotional mark Sign corresponding facial parameter collection and be combined into the corners of the mouth and raise up into radian, eyes are crescent shape, and skin of face fold is more.

When the small C of the visual human is imitated in the real speech to small D, not only make a sound, face can be according to word Corresponding movement occurs for the difference of language, facial expression also can according to the corresponding facial parameter set driving of the affection data and Facial face, a crease in the skin and the facial muscle groups for mixing the small C of the visual human are changed, and are, for example, that the corners of the mouth raises up into radian, Eyes are crescent shape, the more smiling face's expression of skin of face fold, meanwhile, under facial muscle groups drive, and specific emotional Also there is color or light and shade change in small C blees under the driving of (shy, angry) etc., and smiling face's expression can also basis The effect of facial expression is also different caused by the either default occupation difference of current state, scene of the small C of visual human, Such as smiling face's facial expression of service worker should be it is professional, and smiling face's facial expression of ordinary people should be compare with Meaning.

Step 507：The face of the visual human carries out imitation and the face of mouth model according to the voice data received The synchronism output of portion's expression.

In the embodiment of the present application, the small C of visual human carries out the movement of mouth according to the real speech of the small D received Can be also changed at the same time according to the corresponding facial parameter set driving facial expression of affective tag, realize it is true to nature, smooth and Anthropomorphic interaction effect.

In the embodiment of the present application, by carrying out semantic data and affection data parsing to the multi-modal data got, make The face of visual human can carry out the imitation of face action and facial emotion according to analysis result, and enhancing user's visual sense feeling glues Degree, is presented true to nature, smooth simulation interaction effect, improves interactive experience.

Referring to Fig. 6, this application provides a kind of multi-modal interactive output system based on visual human, including smart machine 601 and server 602, the visual human 603 runs in smart machine, wherein：

The smart machine 601 obtains multi-modal data, and voice data is included at least in the multi-modal data；

The server 602 parses the multi-modal data, to obtain the semantic data and emotion in the voice data Data；

The server 602 carries out the facial parameter of the semantic data and the affection data and the visual human Match somebody with somebody, generate facial bionic data；

The smart machine 601 receives the facial bionic data and exports.

Alternatively, the server 602 parses the multi-modal data and is implemented as：

Cutting word is carried out according to the semantic data, cutting word result is matched with the nozzle type model of the visual human 603, To generate mouth bionic data and export；

For the affection data, affective tag is set；

The multi-modal interactive output system based on visual human of the embodiment of the present application, passes through the multi-modal data to getting Semantic data and affection data parsing are carried out, the face of visual human is carried out face action and facial feelings according to analysis result The imitation of sense, strengthens the viscosity of user's visual sense feeling, and true to nature, smooth simulation interaction effect is presented, improves interactive experience.

The exemplary scheme of the above-mentioned multi-modal interactive output system based on visual human for the present embodiment.Need what is illustrated Be, should multi-modal interactive output system based on visual human technical solution interacted with above-mentioned multi-modal based on visual human it is defeated The technical solution for going out method belongs to same design, and the technical solution of the multi-modal interactive output system based on visual human is not retouched in detail The detail content stated, may refer to the description of the technical solution of the above-mentioned multi-modal interactive output method based on visual human.

One embodiment of the application also provides a kind of visual human, and the visual human performs the above method.

The exemplary scheme of above-mentioned visual human for the present embodiment a kind of.It should be noted that the technical side of the visual human Case belongs to same design, the technical side of visual human with the above-mentioned multi-modal technical solution for interacting output method based on visual human The detail content that case is not described in detail, may refer to the technical solution of the above-mentioned multi-modal interactive output method based on visual human Description.

This application provides a kind of smart machine, above-mentioned visual human is run on the smart machine.

A kind of exemplary scheme of above-mentioned smart machine for the present embodiment.It should be noted that the skill of the smart machine Art scheme belongs to same design with the above-mentioned multi-modal technical solution for interacting output method based on visual human, smart machine The detail content that technical solution is not described in detail, may refer to the skill of the above-mentioned multi-modal interactive output method based on visual human The description of art scheme.

The smart machine of the application can include processor and memory, and the memory storage has computer instruction, institute Processor is stated to call the computer instruction and perform the foregoing multi-modal interactive output method based on visual human.

It should be noted that the smart machine can be desktop PC, notebook, palm PC and mobile terminal Deng computing device.

The processor can be central processing unit (Central Processing Unit, CPU), can also be it His general processor, digital signal processor (Digital Signal Processor, DSP), application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), ready-made programmable gate array (Field- Programmable Gate Array, FPGA) either other programmable logic device, discrete gate or transistor logic, Discrete hardware components etc..General processor can be microprocessor or the processor can also be any conventional processor Deng the processor is the control centre of the terminal, utilizes various interfaces and the various pieces of the whole terminal of connection.

The memory mainly includes storing program area and storage data field, wherein, storing program area can store operation system Application program (such as sound-playing function, image player function etc.) needed for system, at least one function etc.；Storage data field can Storage uses created data (such as voice data, phone directory etc.) etc. according to mobile phone.In addition, memory can include height Fast random access memory, can also include nonvolatile memory, such as hard disk, memory, plug-in type hard disk, intelligent memory card (Smart MediaCard, SMC), secure digital (Secure Digital, SD) card, flash card (Flash Card), at least one A disk memory, flush memory device or other volatile solid-state parts.

This application provides a kind of computer-readable recording medium, it is stored with computer program, it is characterised in that the journey The step of above-mentioned multi-modal interactive output method based on visual human is realized when sequence is executed by processor.

A kind of exemplary scheme of above-mentioned computer-readable recording medium for the present embodiment.It should be noted that the meter The technical solution of calculation machine readable storage medium storing program for executing and the above-mentioned multi-modal technical solution category for interacting output method based on visual human In same design, detail content that the technical solution of computer-readable recording medium is not described in detail may refer to above-mentioned base In the description of the technical solution of the multi-modal interactive output method of visual human.

The computer instruction includes computer program code, the computer program code can be source code form, Object identification code form, executable file or some intermediate forms etc..The computer-readable medium can include：Institute can be carried Any entity or device, recording medium, USB flash disk, mobile hard disk, magnetic disc, CD, the computer for stating computer program code store Device, read-only storage (ROM, Read-OnlyMemory), random access memory (RAM, Random Access Memory), Electric carrier signal, telecommunication signal and software distribution medium etc..It should be noted that the computer-readable medium include it is interior Appropriate increase and decrease can be carried out according to legislation in jurisdiction and the requirement of patent practice by holding, such as in some jurisdictions of courts Area, according to legislation and patent practice, computer-readable medium does not include electric carrier signal and telecommunication signal.

It should be noted that for foregoing each method embodiment, describe, therefore it is all expressed as a series of for simplicity Combination of actions, but those skilled in the art should know, the application and from the limitation of described sequence of movement because According to the application, some steps can use other orders or be carried out at the same time.Secondly, those skilled in the art should also know Know, embodiment described in this description belongs to preferred embodiment, and involved action and module might not all be this Shens Please be necessary.

In the above-described embodiments, the description to each embodiment all emphasizes particularly on different fields, and does not have the portion being described in detail in some embodiment Point, it may refer to the associated description of other embodiments.

The application preferred embodiment disclosed above is only intended to help and illustrates the application.Alternative embodiment is not detailed All details are described, are not limited the invention to the specific embodiments described.Obviously, according to the content of this specification, It can make many modifications and variations.This specification is chosen and specifically describes these embodiments, is in order to preferably explain the application Principle and practical application so that skilled artisan can be best understood by and utilize the application.The application is only Limited by claims and its four corner and equivalent.

Claims

A kind of 1. multi-modal interactive output method based on visual human, it is characterised in that the visual human runs in smart machine, Including：

Multi-modal data is obtained, voice data is included at least in the multi-modal data；

The multi-modal data is parsed, to obtain the semantic data and affection data in the voice data；

The semantic data and the affection data are matched with the facial parameter of the visual human, the bionical number of generation face According to and export.
2. according to the method described in claim 1, it is characterized in that, before obtaining multi-modal data, further include：

Visual human is waken up, the visual human is shown in default display area.
3. according to the method described in claim 1, it is characterized in that, by the semantic data and the affection data and the void Anthropomorphic facial parameter is matched, and generating facial bionic data and exporting includes：

Cutting word is carried out according to the semantic data, cutting word result is matched with the nozzle type model of the visual human, with generation Mouth bionic data simultaneously exports；

For the affection data, affective tag is set；

According to the corresponding facial parameter set of affective tag selection, to coordinate the mouth bionic data of the nozzle type model.
4. according to the method described in claims 1 to 3 any one, it is characterised in that the facial parameter of the visual human includes The facial skeleton, described a crease in the skin, the facial muscle groups and/or the blee.
5. according to the method described in claim 4, it is characterized in that, the face parameter set includes but not limited to：

The facial skeleton and the bionical synergistic data of facial muscle groups movement；

The facial skeleton and the bionical synergistic data of described a crease in the skin movement；

Described a crease in the skin and the bionical synergistic data of facial muscle groups movement；Or

The facial skeleton, described a crease in the skin, the bionical synergistic data of the facial muscle groups and/or the blee.
6. according to the method described in claim 1, it is characterized in that, the visual human builds generation by 3D high moulds, possess pre- If image and technical ability；

The visual human is including operating in the application program on the smart machine, executable file or by the smart machine The hologram projected.
7. according to the method described in claim 6, it is characterized in that, the system that the smart machine uses includes but not limited to WINDOWS systems, MAC OS systems or hologram device built-in system.
8. according to the method described in claim 2, it is characterized in that, the default display area includes the smart machine The projected area of display interface or the smart machine.
9. a kind of multi-modal interactive output system based on visual human, it is characterised in that described including smart machine and server Visual human runs in smart machine, wherein：

The smart machine obtains multi-modal data, and voice data is included at least in the multi-modal data；

The server parses the multi-modal data, to obtain the semantic data and affection data in the voice data；

The server is matched the semantic data and the affection data with the facial parameter of the visual human, generation Facial bionic data；

The smart machine receives the facial bionic data and exports.
10. system according to claim 9, it is characterised in that it is specifically real that the server parses the multi-modal data It is now：

Cutting word is carried out according to the semantic data, cutting word result is matched with the nozzle type model of the visual human, with generation Mouth bionic data simultaneously exports；

For the affection data, affective tag is set；

According to the corresponding facial parameter set of affective tag selection, to coordinate the mouth bionic data of the nozzle type model.
11. a kind of visual human, it is characterised in that the visual human runs in smart machine, visual human's perform claim requirement Method described in 1-8 any one.
12. a kind of smart machine, it is characterised in that the visual human described in claim 11 is run on the smart machine.
A kind of 13. computer-readable recording medium, it is characterised in that it is stored with computer program, it is characterised in that the program The step of claim 1-8 any one the methods are realized when being executed by processor.