CN108416420A - Body interaction method and system based on a virtual human - Google Patents

Body interaction method and system based on a virtual human

Info

Publication number
CN108416420A
CN108416420A (application CN201810142255.XA)
Authority
CN
China
Prior art keywords
data
virtual human
interaction
modal
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810142255.XA
Other languages
Chinese (zh)
Inventor
尚小维
俞志晨
李晓丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Guangnian Wuxian Technology Co Ltd
Original Assignee
Beijing Guangnian Wuxian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Guangnian Wuxian Technology Co Ltd
Priority to CN201810142255.XA
Publication of CN108416420A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/004: Artificial life, i.e. computing arrangements simulating life

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present invention provides a body interaction method based on a virtual human. The virtual human is displayed through a smart device and enables voice, emotion, vision and sensing capabilities when in an interactive state. The method comprises the following steps: outputting multi-modal data through the virtual human; receiving multi-modal interaction data provided by a user for the multi-modal data; parsing the multi-modal interaction data, wherein a waving action in the multi-modal interaction data is detected and extracted by a visual capability and taken as a negative interaction intention; and carrying out, through the virtual human, multi-modal interaction output according to the negative interaction intention. The body interaction method and system based on a virtual human provided by the invention supply a virtual human that has a preset image and preset attributes and can carry out multi-modal interaction with the user. Moreover, the invention can also judge the user's intention from the waving action of the limbs and interact with the user accordingly, so that the user and the virtual human can communicate smoothly and the user enjoys an anthropomorphic interactive experience.

Description

Body interaction method and system based on a virtual human
Technical field
The present invention relates to the field of artificial intelligence, and in particular to a body interaction method and system based on a virtual human.
Background technology
The exploitation of robot chat interactive system is dedicated to imitating human conversation.The relatively more extensive chat machine of early stage application People's application program includes the received input of the processing such as siri chat robots on small i chat robots or iPhone (including text or voice) and corresponding response is made according to input, to attempt to imitate the friendship between the mankind between context Mutually.
But at present for, for the relevant robot of visual human chat interactive system exploitation it is also less perfect, not yet There is the limbs interactive product based on visual human.
Therefore, the present invention provides a kind of limbs exchange method and system based on visual human.
Summary of the invention
To solve the above problems, the present invention provides a body interaction method based on a virtual human. The virtual human is displayed through a smart device and enables voice, emotion, vision and sensing capabilities when in an interactive state. The method comprises the following steps:
outputting multi-modal data through the virtual human;
receiving multi-modal interaction data provided by a user for the multi-modal data;
parsing the multi-modal interaction data, wherein a waving action in the multi-modal interaction data is detected and extracted by a visual capability and taken as a negative interaction intention;
carrying out, through the virtual human, multi-modal interaction output according to the negative interaction intention.
According to one embodiment of the present invention, when the waving action is detected by the visual capability, a hand action is recognized as a waving action if the arm drives the palm, facing outward, to swing with any amplitude in a vertical plane.
According to one embodiment of the present invention, the method further comprises:
identifying the negative interaction intention as a disapproval intention, that is, the user's disagreeing feedback on the multi-modal data output by the virtual human;
or,
identifying the negative interaction intention as a special intention based on the multi-modal data already output by the virtual human, wherein the special intention indicates that the user is greeting the virtual human or saying goodbye to it.
According to one embodiment of the present invention, the step of detecting and extracting, by the visual capability, the waving action in the multi-modal interaction data as a negative interaction intention further comprises: storing preference data for the user based on the disapproval intention.
According to one embodiment of the present invention, the virtual human receives multi-modal interaction data provided for the multi-modal data by multiple users, identifies the main user among the multiple users, and detects the body movements of the main user;
or,
collects the body movements of all or part of the current users and determines the users' interaction intention according to a preset ratio.
According to one embodiment of the present invention, when the multi-modal interaction data contains voice data or expression data, the step of taking the waving action as a negative interaction intention further comprises:
detecting and extracting the voice data or expression data in the multi-modal interaction data;
parsing the voice data or the expression data, and judging whether the voice data or the expression data agrees with the intention of the waving action;
if they agree, taking the waving action combined with the parsing result as the negative interaction intention;
if they do not agree, taking the waving action alone as the negative interaction intention.
According to one embodiment of the present invention, when a waving action and the user's face and head movements are both detected by the visual capability, the face and head movements are preferentially taken as the negative interaction intention.
According to another aspect of the present invention, a program product is also provided, which contains a series of instructions for executing the method steps of any one of the above.
According to another aspect of the present invention, a virtual human is also provided; the virtual human has a specific virtual image and preset attributes and carries out multi-modal interaction by using the method of any one of the above.
According to another aspect of the present invention, a body interaction system based on a virtual human is also provided, the system comprising:
a smart device on which the virtual human is mounted, used for obtaining the multi-modal interaction data and having the capabilities of natural language understanding, visual perception, touch perception, language and voice output, and emotional expression and action output;
a cloud brain, used for carrying out semantic understanding, visual recognition, cognitive computation and emotion computation on the multi-modal interaction data, so as to decide the multi-modal interaction data to be output by the virtual human.
The body interaction method and system based on a virtual human provided by the invention supply a virtual human that has a preset image and preset attributes and can carry out multi-modal interaction with the user. Moreover, the method and system can also judge the user's intention from the waving action of the limbs and interact with the user accordingly, so that the user and the virtual human can communicate smoothly and the user enjoys an anthropomorphic interactive experience.
Other features and advantages of the present invention will be set forth in the following description, and in part will become apparent from the description or be understood by implementing the invention. The objectives and other advantages of the present invention can be realized and obtained through the structures particularly pointed out in the description, the claims and the accompanying drawings.
Description of the drawings
The accompanying drawings are provided for a further understanding of the present invention and constitute a part of the specification. Together with the embodiments of the present invention, they serve to explain the invention and do not limit it. In the drawings:
Fig. 1 shows a schematic diagram of multi-modal interaction of the body interaction system based on a virtual human according to an embodiment of the invention;
Fig. 2 shows a structural diagram of the body interaction system based on a virtual human according to an embodiment of the invention;
Fig. 3 shows a module block diagram of the body interaction system based on a virtual human according to an embodiment of the invention;
Fig. 4 shows a structural block diagram of the body interaction system based on a virtual human according to another embodiment of the invention;
Fig. 5 shows a flowchart of the body interaction method based on a virtual human according to an embodiment of the invention;
Fig. 6 shows a flowchart of determining the interaction intention in the body interaction method based on a virtual human according to an embodiment of the invention;
Fig. 7 shows another flowchart of determining the interaction intention in the body interaction method based on a virtual human according to an embodiment of the invention;
Fig. 8 shows another flowchart of the body interaction method based on a virtual human according to an embodiment of the invention; and
Fig. 9 shows a flowchart of the communication among the user, the smart device and the cloud brain according to an embodiment of the invention.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings.
For clarity, the following explanations are needed before the embodiments are described:
The virtual human mentioned in the present invention is mounted on a smart device that supports input/output modules for perception, control and so on; it uses a high-poly 3D virtual character image as its main user interface and has an appearance with distinctive character features; it supports multi-modal human-computer interaction and has AI capabilities such as natural language understanding, visual perception, touch perception, language and voice output, and emotional expression and action output; its social attributes, personality attributes, character skills and the like are configurable, so that the user enjoys an intelligent and personalized flow experience with the virtual character.
The smart device on which the virtual human is mounted is a device that has non-touch, non-mouse-keyboard screen input (a holographic screen, TV screen, multimedia display screen, LED screen, etc.) and carries a camera; it may be a holographic device, a VR device or a PC. Other smart devices are not excluded, for example a hand-held tablet, a naked-eye 3D device, or even a smartphone.
The virtual human interacts with the user at the system level. An operating system runs on the system hardware, for example a built-in system in the case of a holographic device, or Windows or MAC OS in the case of a PC.
The virtual human is a system application or an executable file.
The virtual robot obtains the user's multi-modal interaction data based on the hardware of the smart device and, with the support of the capabilities of the cloud brain, performs semantic understanding, visual recognition, cognitive computation and emotion computation on the multi-modal interaction data to complete the decision-and-output process.
The cloud brain mentioned here is a terminal that provides the virtual human with the processing capability for semantic understanding (language semantic understanding, action semantic understanding, visual recognition, emotion computation, cognitive computation) of the user's interaction demands; it realizes the interaction with the user and decides the multi-modal interaction data output by the virtual human.
Each embodiment of the present invention is described in detail below with reference to the accompanying drawings.
Fig. 1 shows a schematic diagram of multi-modal interaction of the body interaction system based on a virtual human according to an embodiment of the invention. As shown in Fig. 1, the multi-modal interaction involves a user 101, a smart device 102, a virtual human 103 and a cloud brain 104. The user 101 who interacts with the virtual human may be a real person, another virtual human or a physical entity; the interaction process of another virtual human or a physical entity with the virtual human is similar to that of a single person with the virtual human. Therefore, Fig. 1 only shows the multi-modal interaction process between a user (a person) and the virtual human.
In addition, the smart device 102 includes a display area 1021 and hardware support equipment 1022 (essentially the core processor). The display area 1021 is used for displaying the image of the virtual human 103, and the hardware support equipment 1022 cooperates with the cloud brain 104 and is used for the data processing in the interaction process. The virtual human 103 needs a screen display carrier for presentation; therefore, the display area 1021 includes a holographic screen, a TV screen, a multimedia display screen, an LED screen and the like.
The process of interaction between the virtual human and the user 101 in Fig. 1 is as follows.
The preparations or conditions required before the interaction are: the virtual human is carried and runs on the smart device 102, and the virtual human has specific image features. The virtual human has AI capabilities such as natural language understanding, visual perception, touch perception, language output, and emotional expression and action output. In order to support the touch perception function of the virtual human, the smart device also needs to be equipped with a component having a touch perception function. According to one embodiment of the present invention, in order to improve the interactive experience, the virtual human is displayed in a preset area of the holographic device as soon as it is woken up, so that the user does not have to wait too long.
It should be noted that the image and dressing of the virtual human 103 are not limited to one mode. The virtual human 103 can have different images and dressings. The image of the virtual human 103 is generally a high-poly 3D animated image. The virtual human 103 can have different appearances and decorations. Each image of the virtual human 103 can also correspond to several different dressings, which can be classified by season or by occasion. These images and dressings can reside in the cloud brain 104 or in the smart device 102 and can be called whenever needed.
The social attributes, personality attributes and character skills of the virtual human 103 are likewise not limited to one kind. The virtual human 103 can have multiple social attributes, multiple personality attributes and multiple character skills. These social attributes, personality attributes and character skills can be combined freely rather than being fixed to one combination, and the user can select and combine them as needed.
Specifically, the social attributes may include attributes such as appearance, name, dress, decoration, gender, birthplace, age, family relationship, occupation, position, religious belief, emotional state and educational background; the personality attributes may include attributes such as character and temperament; the character skills may include professional skills such as singing, dancing, storytelling and training, and the display of a character skill is not limited to the limbs and may also be shown through expression, head and/or mouth skills.
In this application, the social attributes, personality attributes, character skills and so on of the virtual human can make the parsing and decision results of the multi-modal interaction more inclined to, or more suitable for, that virtual human.
The multi-modal interaction process is as follows. First, multi-modal data is output through the virtual human. When the virtual human 103 communicates with the user 101, the virtual human 103 first outputs multi-modal data and waits for the response of the user 101 to that data. In actual use, the virtual human 103 may say a sentence or a paragraph, which may be an inquiry about a certain question or an opinion delivered on a certain topic. For example, the virtual human asks whether the user likes a certain singer, a certain film, and so on.
Then, the multi-modal interaction data that the user provides for the multi-modal data is received. The multi-modal interaction data can contain information of multiple modalities such as text, voice, vision and perception information. The receiving devices for obtaining the multi-modal interaction data are installed or configured on the smart device 102; they include a text receiving device for receiving text, a voice receiving device for receiving voice, a camera for receiving vision, an infrared device for receiving perception information, and the like.
Then, the multi-modal interaction data is parsed, wherein a waving action in the multi-modal interaction data is detected and extracted by the visual capability and taken as a negative interaction intention. When the waving action is detected by the visual capability, a hand action is recognized as a waving action if the arm drives the palm, facing outward, to swing with any amplitude in a vertical plane.
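Purely as an illustration (not part of the patented disclosure), the following Python sketch shows one way such a rule-based wave detector could be written, assuming a pose estimator already supplies per-frame wrist and elbow keypoints and a palm-orientation flag; all names and thresholds here are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class HandFrame:
    wrist_x: float          # horizontal position of the wrist (pixels)
    wrist_y: float          # vertical position of the wrist (pixels)
    elbow_y: float          # vertical position of the elbow (pixels)
    palm_outward: bool      # True if the palm faces the camera/user

def is_waving(frames: List[HandFrame], min_direction_changes: int = 2) -> bool:
    """Heuristic wave detector: the forearm is raised (wrist above elbow),
    the palm faces outward, and the wrist swings back and forth in the
    vertical plane, with any amplitude."""
    raised = [f for f in frames if f.palm_outward and f.wrist_y < f.elbow_y]
    if len(raised) < 3:
        return False
    # Count sign changes of the horizontal wrist velocity (the back-and-forth swing).
    direction_changes = 0
    prev_dx = 0.0
    for a, b in zip(raised, raised[1:]):
        dx = b.wrist_x - a.wrist_x
        if dx * prev_dx < 0:
            direction_changes += 1
        if dx != 0:
            prev_dx = dx
    return direction_changes >= min_direction_changes
```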
Finally, multi-modal interaction output is carried out by the virtual human according to the negative interaction intention.
In addition, the virtual human 103 may also receive multi-modal interaction data provided for the multi-modal data by multiple users, identify the main user among the multiple users, and detect the body movements of that main user. Alternatively, the virtual human 103 collects the body movements of all or part of the current users and determines the users' interaction intention according to a preset ratio.
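A minimal sketch of these two multi-user strategies, assuming a hypothetical per-user detection result (face size as a proxy for the main user, plus a wave detector such as the one above); this is an illustration, not the patented implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class UserObservation:
    user_id: str
    face_area: float    # apparent face size, used here as a proxy for the main user
    is_waving: bool     # output of a wave detector such as is_waving() above

def main_user_wave(observations: List[UserObservation]) -> Optional[str]:
    """Strategy 1: pick the main user (largest face) and report whether they wave."""
    if not observations:
        return None
    main = max(observations, key=lambda o: o.face_area)
    return main.user_id if main.is_waving else None

def crowd_negative_intention(observations: List[UserObservation], ratio: float = 0.5) -> bool:
    """Strategy 2: treat the interaction as a negative intention when the share of
    waving users reaches a preset ratio."""
    if not observations:
        return False
    waving = sum(1 for o in observations if o.is_waving)
    return waving / len(observations) >= ratio
```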
The interaction process of the body interaction system based on a virtual human provided by the invention is illustrated below with an example from everyday life.
A dialogue between the virtual human 103 and the user 101 may be:
Virtual human 103: Hello!
User 101: (waving action) Hello.
Virtual human 103: There is a popular rock song out recently. Do you like rock music?
User 101: (waving action) No, I don't. I prefer quieter songs; rock is a bit noisy.
Virtual human 103: I see, got it. Goodbye.
User 101: (waving action) Goodbye.
In the example above, the user 101 issues the waving action three times. The first wave is a greeting to the virtual human and belongs to the special intention. The second is a negation of the question raised by the virtual human 103, "Do you like rock music?", indicating that the user 101 does not like rock songs. The third waving action is the user 101 saying goodbye to the virtual human 103. At the end of the interaction, the virtual human 103 can remember the preference of the user 101 for songs, namely that the user 101 does not like rock songs and likes quiet songs.
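To make the three readings of the same gesture concrete, here is a hypothetical sketch that classifies a detected wave by the dialogue context (what the virtual human last said); the categories and rules are illustrative assumptions, not the patent's decision logic.

```python
from enum import Enum, auto

class LastUtteranceType(Enum):
    GREETING = auto()     # e.g. "Hello!"
    QUESTION = auto()     # e.g. "Do you like rock music?"
    FAREWELL = auto()     # e.g. "Goodbye."
    STATEMENT = auto()

class WaveIntention(Enum):
    SPECIAL_GREETING = auto()   # special intention: the user greets the virtual human
    SPECIAL_FAREWELL = auto()   # special intention: the user says goodbye
    DISAPPROVAL = auto()        # disapproval intention: disagreeing feedback

def classify_wave(last_utterance: LastUtteranceType) -> WaveIntention:
    """Map a detected waving action to a negative interaction intention
    using the virtual human's most recent output as context."""
    if last_utterance == LastUtteranceType.GREETING:
        return WaveIntention.SPECIAL_GREETING
    if last_utterance == LastUtteranceType.FAREWELL:
        return WaveIntention.SPECIAL_FAREWELL
    # For questions and statements, the wave is read as disagreeing feedback.
    return WaveIntention.DISAPPROVAL
```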
According to another embodiment of the invention, a virtual human is provided; the virtual human has a specific virtual image and preset attributes and carries out multi-modal interaction by using the body interaction method based on a virtual human.
Fig. 2 shows a structural diagram of the body interaction system based on a virtual human according to an embodiment of the invention. As shown in Fig. 2, completing the multi-modal interaction through the system requires a user 101, a smart device 102 and a cloud brain 104. The smart device 102 includes a receiving device 102A, a processing device 102B, an output device 102C and a connecting device 102D. The cloud brain 104 includes a communication device 104A.
In the body interaction system based on a virtual human provided by the invention, an unobstructed communication channel needs to be established among the user 101, the smart device 102 and the cloud brain 104 so that the interaction between the user 101 and the virtual human can be completed. In order to complete the interaction task, the smart device 102 and the cloud brain 104 are provided with devices and components that support the interaction. The party interacting with the virtual human may be one party or multiple parties.
The smart device 102 includes the receiving device 102A, the processing device 102B, the output device 102C and the connecting device 102D. The receiving device 102A is used for receiving the multi-modal interaction data. Examples of the receiving device 102A include a microphone for voice operation, a scanner, and a camera (detecting, with visible or invisible wavelengths, actions that involve no touch), and so on. The smart device 102 can obtain the multi-modal interaction data through the above input devices. The output device 102C is used for outputting the multi-modal output data with which the virtual human interacts with the user 101; its configuration is essentially equivalent to that of the receiving device 102A and is not repeated here.
The processing device 102B is used for processing the interaction data transmitted by the cloud brain 104 in the interaction process. The connecting device 102D is used for the contact with the cloud brain 104; the processing device 102B processes the multi-modal interaction data pre-processed by the receiving device 102A or the data transmitted by the cloud brain 104. The connecting device 102D sends a call instruction to call the robot capabilities on the cloud brain 104.
The communication device 104A included in the cloud brain 104 is used for completing the correspondence with the smart device 102. The communication device 104A keeps in communication with the connecting device 102D on the smart device 102, receives the requests sent by the smart device 102, and sends out the processing results issued by the cloud brain 104; it is the medium of communication between the smart device 102 and the cloud brain 104.
Fig. 3 shows a module block diagram of the body interaction system based on a virtual human according to an embodiment of the invention. As shown in Fig. 3, the system includes an interaction module 301, a receiving module 302, a parsing module 303 and an output module 304. The receiving module 302 includes a text acquisition unit 3021, an audio acquisition unit 3022, a vision acquisition unit 3023 and a perception acquisition unit 3024.
The interaction module 301 is used for outputting multi-modal data through the virtual human. The virtual human 103 is displayed through the smart device 102 and enables voice, emotion, vision and sensing capabilities when in an interactive state. In one round of dialogue, the virtual human 103 first outputs multi-modal data and waits for the response of the user 101. According to one embodiment of the invention, the interaction module 301 includes an output unit 3011, which can output the multi-modal data.
The receiving module 302 is used for receiving the multi-modal interaction data. The text acquisition unit 3021 is used for acquiring text information, the audio acquisition unit 3022 for acquiring audio information, the vision acquisition unit 3023 for acquiring visual information, and the perception acquisition unit 3024 for acquiring perception information. Examples of the receiving module 302 include a microphone for voice operation, a scanner, a camera and sensing devices, for example devices using rays of visible or invisible wavelengths, signals or environmental data. The multi-modal interaction data can be obtained through the above input devices. The multi-modal interaction may contain one or several of text, audio, vision and perception data; the present invention places no restriction on this.
The parsing module 303 is used for parsing the multi-modal interaction data, wherein a waving action in the multi-modal interaction data is detected and extracted by the visual capability and taken as a negative interaction intention. The parsing module 303 includes a detection unit 3031 and an extraction unit 3032. The detection unit 3031 is used for detecting the waving action in the multi-modal interaction data by the visual capability. The detection process may be: first detecting whether the multi-modal interaction data contains a hand action; if so, continuing to detect whether the hand action contains a waving action issued by the user 101.
If the detection unit 3031 detects that there is a waving action in the multi-modal interaction data, the extraction unit 3032 extracts the waving action and takes it as a negative interaction intention. According to one embodiment of the invention, the interaction intention falls into two classes, a disapproval intention and a special intention. The process of judging the class of the interaction intention may be: identifying the negative interaction intention as a disapproval intention, that is, the user's disagreeing feedback on the multi-modal data output by the virtual human; or, based on the multi-modal data already output by the virtual human, identifying the negative interaction intention as a special intention, wherein the special intention indicates that the user is greeting or saying goodbye to the virtual human.
The output module 304 is used for carrying out, through the virtual human, multi-modal interaction output according to the negative interaction intention. After the parsing module 303 determines the interaction intention, the output module 304 can produce multi-modal interaction output that conforms to the negative interaction intention. The output module 304 includes an output data unit 3041, which can determine the multi-modal interaction output to be produced according to the negative interaction intention and present that output to the user 101 through the virtual human.
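Purely as an illustration of how these four modules could be wired together in code (the module names follow Fig. 3; everything else, including the toy replies, is a hypothetical sketch):

```python
class InteractionModule:            # 301: outputs multi-modal data
    def output_multimodal_data(self) -> str:
        return "Do you like rock music?"

class ReceivingModule:              # 302: collects text/audio/vision/perception data
    def collect(self) -> dict:
        # A real system would gather data from microphones, cameras and sensors.
        return {"wave_detected": True, "speech": "not really"}

class ParsingModule:                # 303: detects the wave and derives the intention
    def parse(self, data: dict, context: str) -> str:
        if data.get("wave_detected"):
            return "disapproval" if context.endswith("?") else "special"
        return "none"

class OutputModule:                 # 304: renders the response through the virtual human
    def respond(self, intention: str) -> str:
        replies = {
            "disapproval": "Got it, I'll pick something quieter.",
            "special": "Nice to see you!",
            "none": "Tell me more.",
        }
        return replies[intention]

# One round of dialogue through the four modules of Fig. 3.
prompt = InteractionModule().output_multimodal_data()
data = ReceivingModule().collect()
intention = ParsingModule().parse(data, context=prompt)
print(OutputModule().respond(intention))
```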
Fig. 4 shows a structural block diagram of the body interaction system based on a virtual human according to another embodiment of the invention. As shown in Fig. 4, completing the interaction requires a user 101, a smart device 102 and a cloud brain 104. The smart device 102 includes a human-machine interface 401, a data processing unit 402, an input/output device 403 and an interface unit 404. The cloud brain 104 contains a semantic understanding interface 1041, a visual recognition interface 1042, a cognitive computation interface 1043 and an emotion computation interface 1044.
The body interaction system based on a virtual human provided by the invention includes the smart device 102 and the cloud brain 104. The virtual human 103 runs on the smart device 102; the virtual human 103 has a preset image and preset attributes and can enable voice, emotion, vision and sensing capabilities when in an interactive state.
In one embodiment, the smart device 102 may include the human-machine interface 401, the data processing unit 402, the input/output device 403 and the interface unit 404. The human-machine interface 401 displays the running virtual human 103 in a preset area of the smart device 102.
The data processing unit 402 is used for processing the data generated in the multi-modal interaction between the user 101 and the virtual human 103. The processor used may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, discrete hardware components, and so on. The general-purpose processor may be a microprocessor, or any conventional processor; the processor is the control centre of the terminal and connects the various parts of the whole terminal through various interfaces.
The smart device 102 contains a memory. The memory mainly includes a program storage area and a data storage area: the program storage area can store the operating system and the application programs required by at least one function (such as a sound playing function or an image playing function), and the data storage area can store data created according to the use of the smart device 102 (such as audio data and browsing records). In addition, the memory may include a high-speed random access memory and may also include a non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device or another volatile solid-state storage device.
The input/output device 403 is used for obtaining the multi-modal interaction data and outputting the output data in the interaction process. The interface unit 404 is used for communicating with the cloud brain 104 and calling the virtual human capabilities in the cloud brain 104 through the interfaces in the cloud brain 104.
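A minimal sketch of how the interface unit 404 might call the cloud brain's capability interfaces, assuming, purely for illustration, that the cloud brain exposes HTTP endpoints; the URL and field names are invented for this example and are not part of the disclosure.

```python
import requests

CLOUD_BRAIN = "https://cloud-brain.example.com"   # hypothetical endpoint

def call_capability(capability: str, payload: dict, timeout_s: float = 5.0) -> dict:
    """Forward multi-modal interaction data to one cloud-brain capability interface
    (semantic understanding, visual recognition, cognitive or emotion computation)."""
    response = requests.post(f"{CLOUD_BRAIN}/{capability}", json=payload, timeout=timeout_s)
    response.raise_for_status()
    return response.json()

# Example: ask the visual recognition interface whether the frames contain a limb action.
result = call_capability("visual-recognition",
                         {"frames": ["<base64 frame>"], "task": "limb-action"})
```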
The cloud brain 104 includes the semantic understanding interface 1041, the visual recognition interface 1042, the cognitive computation interface 1043 and the emotion computation interface 1044. These interfaces communicate with the interface unit 404 in the smart device 102. The cloud brain 104 also contains semantic understanding logic corresponding to the semantic understanding interface 1041, visual recognition logic corresponding to the visual recognition interface 1042, cognitive computation logic corresponding to the cognitive computation interface 1043, and emotion computation logic corresponding to the emotion computation interface 1044.
As shown in Fig. 4, in the multi-modal data parsing process each capability interface calls its corresponding logic processing. The interfaces are explained below.
The semantic understanding interface 1041 receives the specific voice instruction forwarded from the interface unit 404 and performs voice recognition on it as well as natural language processing based on a large corpus.
The visual recognition interface 1042 can perform video content detection, recognition, tracking and so on for human bodies, faces and scenes according to computer vision algorithms, deep learning algorithms and the like; that is, an image is recognized according to a predetermined algorithm to give a quantitative detection result. It has an image pre-processing function, a feature extraction function, a decision function and specific application functions:
the image pre-processing function can perform basic processing on the acquired visual data, including colour space conversion, edge extraction, image transformation and image thresholding;
the feature extraction function can extract feature information such as the skin colour, colour, texture, motion and coordinates of the target in the image;
the decision function can distribute the feature information, according to a certain decision strategy, to the specific multi-modal output devices or multi-modal output applications that need it, for example to realize face detection, human limb recognition and motion detection functions.
The cognitive computation interface 1043 receives the multi-modal data forwarded from the interface unit 404 and performs data acquisition, recognition and learning to process the multi-modal data, so as to obtain a user portrait, a knowledge graph and the like and make rational decisions on the multi-modal output data.
The emotion computation interface 1044 receives the multi-modal data forwarded from the interface unit 404 and uses emotion computation logic (which may be emotion recognition technology) to calculate the user's current emotional state. Emotion recognition technology is an important part of emotion computation; emotion recognition research covers facial expression, voice, behaviour, text and physiological signal recognition, from which the user's emotional state can be judged. Emotion recognition may monitor the user's emotional state through visual emotion recognition alone, or through visual emotion recognition combined with acoustic emotion recognition, and is not limited to these. In this embodiment, the combination of the two is preferably used to monitor emotion.
When performing visual emotion recognition, the emotion computation interface 1044 collects images of human facial expressions by means of an image acquisition device, converts them into analysable data, and then uses techniques such as image processing to analyse the expression and mood. Understanding a facial expression usually requires detecting subtle changes in the expression, such as changes in the cheek muscles and the mouth, and raising of the eyebrows.
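The following sketch only illustrates the kind of visual-plus-acoustic fusion described above, assuming hypothetical per-modality classifiers that each return a score per emotion label; it is not the patented emotion computation logic.

```python
from typing import Dict

EMOTIONS = ("happy", "neutral", "annoyed")

def fuse_emotions(visual: Dict[str, float], acoustic: Dict[str, float],
                  visual_weight: float = 0.6) -> str:
    """Combine visual and acoustic emotion scores with a fixed weighting and
    return the most likely emotional state."""
    fused = {
        label: visual_weight * visual.get(label, 0.0)
               + (1.0 - visual_weight) * acoustic.get(label, 0.0)
        for label in EMOTIONS
    }
    return max(fused, key=fused.get)

# Example: the face looks neutral but the voice sounds annoyed.
print(fuse_emotions(
    visual={"happy": 0.2, "neutral": 0.6, "annoyed": 0.2},
    acoustic={"happy": 0.1, "neutral": 0.3, "annoyed": 0.6},
))
```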
Fig. 5 shows a flowchart of the body interaction method based on a virtual human according to an embodiment of the invention.
As shown in Fig. 5, in step S501 multi-modal data is output through the virtual human. In this step, the virtual human 103 in the smart device 102 outputs multi-modal data to the user 101 to open a dialogue with the user 101 in one round of interaction. The multi-modal data output by the virtual human 103 may be an inquiry put to the user 101 about a question, or a statement or viewpoint made by the virtual human 103 on a topic discussed with the user 101.
In step S502, the multi-modal interaction data that the user provides for the multi-modal data is received. In this step the smart device 102 obtains the multi-modal interaction data; the smart device 102 can be configured with the corresponding devices for obtaining it. The multi-modal interaction data can be input in forms such as text input, audio input and perception input.
In step S503, the multi-modal interaction data is parsed, wherein a waving action in the multi-modal interaction data is detected and extracted by the visual capability and taken as a negative interaction intention. The multi-modal interaction data may or may not contain a hand action; in order to determine the interaction intention, it is necessary to detect whether the multi-modal interaction data contains a hand action. When the waving action is detected by the visual capability, a hand action is recognized as a waving action if the arm drives the palm, facing outward, to swing with any amplitude in a vertical plane.
In this step, it is first detected whether the multi-modal interaction data contains a waving action. If it does, the waving action is taken as the interaction intention of this round of interaction. If it does not, other data in the multi-modal interaction data are used as the interaction intention. In addition, when a waving action and the user's face and head movements are both detected by the visual capability, the face and head movements are preferentially taken as the negative interaction intention.
In one embodiment of the invention, the negative interaction intention falls into a disapproval intention and a special intention. The disapproval intention indicates the disagreeing feedback of the user 101 on the multi-modal data output by the virtual human. The special intention indicates that the user 101 is greeting or saying goodbye to the virtual human.
Finally, in step S504, multi-modal interaction output is carried out by the virtual human according to the negative interaction intention. After the interaction intention is determined, the virtual human 103 can output the corresponding multi-modal interaction output according to the confirmed negative intention.
In addition, the body interaction system based on a virtual human provided by the invention can also cooperate with a program product that contains a series of instructions for executing the steps of the body interaction method of the virtual human. The program product can run computer instructions, which include computer program code; the computer program code may be in source code form, object code form, an executable file or some intermediate form.
The program product may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunication signal, a software distribution medium and so on.
It should be noted that the content contained in the program product can be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the program product does not include electric carrier signals and telecommunication signals.
Fig. 6 shows a flowchart of determining the interaction intention in the body interaction method based on a virtual human according to an embodiment of the invention.
In step S601, the multi-modal interaction data is parsed, wherein a waving action in the multi-modal interaction data is detected and extracted by the visual capability and taken as a negative interaction intention. In this step the multi-modal interaction data, which includes data in various forms, needs to be parsed. In order to know the interaction intention, it is necessary to detect whether the multi-modal interaction data contains a waving action. After a waving action is detected in the multi-modal interaction data, the detected waving action needs to be extracted and taken as a negative interaction intention.
According to one embodiment of the invention, the negative interaction intention falls into two classes, a disapproval intention and a special intention. In step S602, the negative interaction intention is identified as a disapproval intention, that is, the user's disagreeing feedback on the multi-modal data output by the virtual human. In the embodiment, if the multi-modal data output by the virtual human contains an inquiry intention, the waving action of the user 101 can be judged to be a disapproval intention.
Meanwhile, in step S603, the negative interaction intention is identified as a special intention based on the multi-modal data already output by the virtual human, wherein the special intention indicates that the user is greeting or saying goodbye to the virtual human. In the embodiment, if the virtual human says goodbye to the user and the user expresses himself by waving, the waving action of the user 101 can be judged to be a special intention: the virtual human understands that the user is saying goodbye and leaving, and will not keep trying to retain him.
If the negative interaction intention is judged to be a disapproval intention, then in step S604 the preference data for the user are confirmed based on the disapproval intention. Finally, in step S605, multi-modal interaction output is carried out by the virtual human according to the negative interaction intention.
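As a purely illustrative sketch of step S604, the snippet below records a preference whenever a disapproval intention is confirmed; the storage format, file name and keys are invented for this example.

```python
import json
from pathlib import Path

PREFERENCE_FILE = Path("user_preferences.json")   # hypothetical local store

def record_disapproval(user_id: str, topic: str) -> None:
    """Remember that this user gave disagreeing feedback on a topic
    (e.g. user 101 does not like rock songs)."""
    prefs = json.loads(PREFERENCE_FILE.read_text()) if PREFERENCE_FILE.exists() else {}
    prefs.setdefault(user_id, {}).setdefault("dislikes", [])
    if topic not in prefs[user_id]["dislikes"]:
        prefs[user_id]["dislikes"].append(topic)
    PREFERENCE_FILE.write_text(json.dumps(prefs, ensure_ascii=False, indent=2))

record_disapproval("user-101", "rock music")
```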
Fig. 7 shows another flowchart of determining the interaction intention in the body interaction method based on a virtual human according to an embodiment of the invention.
In step S701, the voice data or expression data in the multi-modal interaction data are detected and extracted. The multi-modal interaction data contains data in various forms, all of which may carry the current interaction wish of the user 101. In this step it is detected whether the multi-modal interaction data contains voice data or expression data, to serve as a reference for determining the interaction intention.
Then, in step S702, the voice data or expression data are parsed. If the multi-modal interaction data contains voice data or expression data, they are parsed in this step to learn the user's interaction wish and obtain an analysis result.
Then, in step S703, it is judged whether the voice data (for example, "no" or "I don't mean that") or the expression data (a negative expression) agrees with the intention of the waving action. If the voice data or the expression data agrees with the intention of the waving action, step S704 is entered, and the waving action combined with the parsing result is taken as the negative interaction intention. If the voice data or the expression data does not agree with the intention of the waving action, step S705 is entered, and the waving action alone is taken as the negative interaction intention.
In the method flowchart shown in Fig. 7, the waving action occupies the dominant position among all the multi-modal interaction data: when data of other forms are present in the multi-modal interaction data, the waving action is still used as the interaction intention of the current interaction.
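A hypothetical sketch of the S701 to S705 logic, assuming the speech and expression channels have already been reduced to a simple negative/non-negative label; the labels and helper names are assumptions made for this illustration.

```python
from typing import Optional

def derive_negative_intention(wave_detected: bool,
                              speech_negative: Optional[bool] = None,
                              expression_negative: Optional[bool] = None) -> Optional[dict]:
    """Fig. 7 logic: the waving action dominates; voice or expression data only
    reinforce it when their parsed meaning agrees with the wave."""
    if not wave_detected:
        return None
    evidence = ["wave"]
    if speech_negative:
        evidence.append("voice")        # S704: the voice agrees with the wave
    if expression_negative:
        evidence.append("expression")   # S704: the expression agrees with the wave
    # If neither agrees, S705: the wave alone carries the negative intention.
    return {"intention": "negative", "evidence": evidence}

print(derive_negative_intention(True, speech_negative=True))
print(derive_negative_intention(True, speech_negative=False))
```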
Fig. 8 shows another flowchart of the body interaction method based on a virtual human according to an embodiment of the invention.
As shown in Fig. 8, in step S801 the smart device 102 sends a request to the cloud brain 104. Afterwards, in step S802, the smart device 102 remains in the state of waiting for the cloud brain 104 to reply. During the waiting period, the smart device 102 can time how long returning the data takes.
In step S803, if no reply data is returned for a long time, for example beyond a predetermined time span of 5 s, the smart device 102 can choose to reply locally and generate local common reply data. Then, in step S804, the animation matching the local common reply is output, and the voice playing equipment is called to play the speech.
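The following is a minimal illustration of this timeout-and-fallback behaviour, reusing the hypothetical call_capability() helper sketched above; the 5-second threshold follows the example in the text, and everything else (the canned reply, the animation name) is assumed.

```python
import requests

LOCAL_REPLIES = ["Let me think about that for a moment."]   # hypothetical canned replies

def decide_reply(payload: dict) -> dict:
    """S801-S804: ask the cloud brain; if it does not answer within 5 s,
    fall back to a local common reply with a matching animation and speech."""
    try:
        return call_capability("decision", payload, timeout_s=5.0)
    except requests.exceptions.RequestException:
        return {"text": LOCAL_REPLIES[0], "animation": "idle_nod", "tts": True}
```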
Fig. 9 shows a flowchart of the communication among the user, the smart device and the cloud brain according to an embodiment of the invention.
In order to realize the multi-modal interaction between the smart device 102 and the user 101, a communication connection needs to be set up among the user 101, the smart device 102 and the cloud brain 104. This communication connection should be real-time and unobstructed and should ensure that the interaction is not affected.
In order to complete the interaction, some conditions or premises need to be satisfied: the virtual human is loaded and runs in the smart device 102, and the smart device 102 has hardware facilities with perception and control functions. The virtual human enables voice, emotion, vision and sensing capabilities when in an interactive state.
After the preparations are completed, the smart device 102 starts to interact with the user 101. First, the smart device 102 outputs multi-modal data through the virtual human 103. The multi-modal data is the inquiry or the opinion statement made by the virtual human, in one round of dialogue, on the topic of the current conversation with the user. The virtual human can thus raise a question and request an answer from the user 101, or state a position and wait for the user 101 to respond to it. At this point the two parties in communication are the smart device 102 and the user 101, and the direction of data transmission is from the smart device 102 to the user 101.
Then, the smart device 102 receives the multi-modal interaction data. The multi-modal interaction data is the response provided by the user for the multi-modal data. It can contain data in various forms, for example text data, voice data, perception data and action data. The smart device 102 is configured with the relevant equipment for receiving the multi-modal interaction data sent by the user 101. At this point the two parties in the data transmission are the user 101 and the smart device 102, and the direction of data transmission is from the user 101 to the smart device 102.
Then, the smart device 102 sends a request to the cloud brain 104, asking the cloud brain 104 to perform semantic understanding, visual recognition, cognitive computation and emotion computation on the multi-modal interaction data in order to help make the decision. At this point, the waving action in the multi-modal interaction data is detected and extracted by the visual capability and taken as a negative interaction intention. Then, the cloud brain 104 transmits the reply data to the smart device 102. At this point the two parties in communication are the smart device 102 and the cloud brain 104.
Finally, after the smart device 102 receives the data transmitted by the cloud brain 104, the smart device 102 can carry out, through the virtual human, multi-modal interaction output according to the negative interaction intention. At this point the two parties in communication are the smart device 102 and the user 101.
The body interaction method and system based on a virtual human provided by the invention supply a virtual human that has a preset image and preset attributes and can carry out multi-modal interaction with the user. Moreover, the method and system can also judge the user's intention from the waving action of the limbs and interact with the user accordingly, so that the user and the virtual human can communicate smoothly and the user enjoys an anthropomorphic interactive experience.
It should be understood that the disclosed embodiments of the present invention are not limited to the specific structures, processing steps or materials disclosed herein, but extend to the equivalents of these features as would be understood by those of ordinary skill in the relevant art. It should also be understood that the terms used herein are only for describing specific embodiments and are not intended to be limiting.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Therefore, the phrase "one embodiment" or "an embodiment" appearing in various places throughout the specification does not necessarily refer to the same embodiment.
Although the embodiments of the present invention are disclosed as above, the described contents are only intended to facilitate understanding of the present invention and are not intended to limit it. Any person skilled in the art to which this invention pertains may make modifications and changes in the form and details of implementation without departing from the spirit and scope disclosed by the present invention, but the scope of patent protection of the present invention shall still be subject to the scope defined by the appended claims.

Claims (10)

1. A body interaction method based on a virtual human, characterized in that the virtual human is displayed through a smart device and enables voice, emotion, vision and sensing capabilities when in an interactive state, the method comprising the following steps:
outputting multi-modal data through the virtual human;
receiving multi-modal interaction data provided by a user for the multi-modal data;
parsing the multi-modal interaction data, wherein a waving action in the multi-modal interaction data is detected and extracted by a visual capability and taken as a negative interaction intention;
carrying out, through the virtual human, multi-modal interaction output according to the negative interaction intention.
2. The body interaction method based on a virtual human according to claim 1, characterized in that, when the waving action is detected by the visual capability, a hand action is recognized as a waving action if the arm drives the palm, facing outward, to swing with any amplitude in a vertical plane.
3. The body interaction method based on a virtual human according to claim 1, characterized by further comprising:
identifying the negative interaction intention as a disapproval intention, that is, the user's disagreeing feedback on the multi-modal data output by the virtual human;
or,
identifying the negative interaction intention as a special intention based on the multi-modal data already output by the virtual human, wherein the special intention indicates that the user is greeting the virtual human or saying goodbye to it.
4. The body interaction method based on a virtual human according to claim 3, characterized in that the step of detecting and extracting, by the visual capability, the waving action in the multi-modal interaction data as a negative interaction intention further comprises: storing preference data for the user based on the disapproval intention.
5. The body interaction method based on a virtual human according to any one of claims 1-4, characterized in that the virtual human receives multi-modal interaction data provided for the multi-modal data by multiple users, identifies the main user among the multiple users, and detects the body movements of the main user;
or,
collects the body movements of all or part of the current users and determines the users' interaction intention according to a preset ratio.
6. The body interaction method based on a virtual human according to any one of claims 1-4, characterized in that, when the multi-modal interaction data contains voice data or expression data, the step of taking the waving action as a negative interaction intention further comprises:
detecting and extracting the voice data or expression data in the multi-modal interaction data;
parsing the voice data or the expression data, and judging whether the voice data or the expression data agrees with the intention of the waving action;
if they agree, taking the waving action combined with the parsing result as the negative interaction intention;
if they do not agree, taking the waving action alone as the negative interaction intention.
7. The body interaction method based on a virtual human according to any one of claims 1-4, characterized in that, when a waving action and the user's face and head movements are both detected by the visual capability, the face and head movements are preferentially taken as the negative interaction intention.
8. A program product containing a series of instructions for executing the method steps according to any one of claims 1-7.
9. A virtual human, characterized in that the virtual human has a specific virtual image and preset attributes and carries out multi-modal interaction by using the method according to any one of claims 1-7.
10. A body interaction system based on a virtual human, characterized in that the system comprises:
a smart device on which the virtual human according to claim 9 is mounted, used for obtaining multi-modal interaction data and having the capabilities of natural language understanding, visual perception, touch perception, language and voice output, and emotional expression and action output;
a cloud brain, used for carrying out semantic understanding, visual recognition, cognitive computation and emotion computation on the multi-modal interaction data, so as to decide the multi-modal interaction data to be output by the virtual human.
CN201810142255.XA 2018-02-11 2018-02-11 Body interaction method and system based on a virtual human Pending CN108416420A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810142255.XA CN108416420A (en) Body interaction method and system based on a virtual human

Publications (1)

Publication Number Publication Date
CN108416420A true CN108416420A (en) 2018-08-17

Family

ID=63128616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810142255.XA Pending CN108416420A (en) Body interaction method and system based on a virtual human

Country Status (1)

Country Link
CN (1) CN108416420A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102426480A (en) * 2011-11-03 2012-04-25 康佳集团股份有限公司 Man-machine interactive system and real-time gesture tracking processing method for same
CN102789313A (en) * 2012-03-19 2012-11-21 乾行讯科(北京)科技有限公司 User interaction system and method
US20140201674A1 (en) * 2013-01-15 2014-07-17 Leap Motion, Inc. Dynamic user interactions for display control and identifying dominant gestures
WO2017079910A1 (en) * 2015-11-11 2017-05-18 周谆 Gesture-based virtual reality human-machine interaction method and system
CN106648071A (en) * 2016-11-21 2017-05-10 捷开通讯科技(上海)有限公司 Social implementation system for virtual reality
CN107340859A (en) * 2017-06-14 2017-11-10 北京光年无限科技有限公司 The multi-modal exchange method and system of multi-modal virtual robot

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YANG Minghao et al.: "Multi-channel human-machine dialogue system oriented to natural interaction", Computer Science (《计算机科学》) *
Royal Anthropological Institute (英国皇家人类学会): "Handbook of Field Investigation Techniques" (《田野调查技术手册》), 31 March 2016, Fudan University Press *
FAN Yongtao: "Somatosensory-based human-machine interaction technology for a space robotic arm", China Masters' Theses Full-text Database, Information Science and Technology *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109271018A (en) * 2018-08-21 2019-01-25 北京光年无限科技有限公司 Exchange method and system based on visual human's behavioral standard
CN109324688A (en) * 2018-08-21 2019-02-12 北京光年无限科技有限公司 Exchange method and system based on visual human's behavioral standard
CN109343695A (en) * 2018-08-21 2019-02-15 北京光年无限科技有限公司 Exchange method and system based on visual human's behavioral standard
CN111291151A (en) * 2018-12-06 2020-06-16 阿里巴巴集团控股有限公司 Interaction method and device and computer equipment
CN114020153A (en) * 2021-11-04 2022-02-08 上海元梦智能科技有限公司 Multi-mode man-machine interaction method and device
CN114020153B (en) * 2021-11-04 2024-05-31 上海元梦智能科技有限公司 Multi-mode human-computer interaction method and device
WO2023216765A1 (en) * 2022-05-09 2023-11-16 阿里巴巴(中国)有限公司 Multi-modal interaction method and apparatus

Similar Documents

Publication Publication Date Title
CN108416420A (en) Body interaction method and system based on a virtual human
CN110390704B (en) Image processing method, image processing device, terminal equipment and storage medium
WO2021036644A1 (en) Voice-driven animation method and apparatus based on artificial intelligence
CN109271018A (en) Exchange method and system based on visual human's behavioral standard
CN110286756A (en) Method for processing video frequency, device, system, terminal device and storage medium
CN109324688A (en) Exchange method and system based on visual human's behavioral standard
CN112162628A (en) Multi-mode interaction method, device and system based on virtual role, storage medium and terminal
CN107870977A (en) Chat robots output is formed based on User Status
CN107797663A (en) Multi-modal interaction processing method and system based on visual human
CN110427472A (en) The matched method, apparatus of intelligent customer service, terminal device and storage medium
CN106710590A (en) Voice interaction system with emotional function based on virtual reality environment and method
CN109522835A (en) Children's book based on intelligent robot is read and exchange method and system
CN110400251A (en) Method for processing video frequency, device, terminal device and storage medium
CN107294837A (en) Engaged in the dialogue interactive method and system using virtual robot
CN107704169B (en) Virtual human state management method and system
CN107340865A (en) Multi-modal virtual robot exchange method and system
CN110413841A (en) Polymorphic exchange method, device, system, electronic equipment and storage medium
CN109343695A (en) Exchange method and system based on visual human's behavioral standard
CN109086860B (en) Interaction method and system based on virtual human
CN108595012A (en) Visual interactive method and system based on visual human
CN108942919A (en) A kind of exchange method and system based on visual human
CN109871450A (en) Based on the multi-modal exchange method and system for drawing this reading
CN108415561A (en) Gesture interaction method based on visual human and system
CN108681398A (en) Visual interactive method and system based on visual human
CN107679519A (en) A kind of multi-modal interaction processing method and system based on visual human

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180817