CN108595012A - Visual interaction method and system based on a virtual human - Google Patents

Visual interaction method and system based on a virtual human

Info

Publication number
CN108595012A
CN108595012A (application CN201810442264.0A)
Authority
CN
China
Prior art keywords
visual
interaction
data
modal
virtual human
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810442264.0A
Other languages
Chinese (zh)
Inventor
尚小维
李晓丹
俞志晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Guangnian Wuxian Technology Co Ltd
Original Assignee
Beijing Guangnian Wuxian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Guangnian Wuxian Technology Co Ltd filed Critical Beijing Guangnian Wuxian Technology Co Ltd
Priority to CN201810442264.0A priority Critical patent/CN108595012A/en
Publication of CN108595012A publication Critical patent/CN108595012A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0484 Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 Sound input; Sound output

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present invention provides a visual interaction method based on a virtual human. The virtual human is displayed on a smart device and, when in an interactive state, has its speech, emotion, vision and sensing capabilities enabled. The method comprises the following steps: outputting multi-modal data through the virtual human; receiving multi-modal interaction data provided by a user in response to the multi-modal data; parsing the multi-modal interaction data, wherein a hand frame gesture in the multi-modal interaction data is detected and extracted by the vision capability and taken as the interaction intention; and performing multi-modal interaction output through the virtual human according to the interaction intention. The visual interaction method and system based on a virtual human provided by the invention supply a virtual human that has a preset appearance and preset attributes and can carry out multi-modal interaction with the user. Moreover, the invention can infer the user's intention from the hand frame gesture and interact with the user accordingly, so that the user and the virtual human can communicate smoothly and the user enjoys a human-like interactive experience.

Description

Visual interaction method and system based on a virtual human
Technical field
The present invention relates to the field of artificial intelligence, and in particular to a visual interaction method and system based on a virtual human.
Background technology
The development of multi-modal robot interaction systems is dedicated to imitating human conversation, attempting to reproduce the interaction between humans in context. At present, however, the development of multi-modal interaction systems related to virtual humans is still immature: no virtual human capable of multi-modal interaction has yet appeared, and in particular there is no virtual-human-based visual interaction product that responds to limb movements, especially hand gestures.
Therefore, the present invention provides a visual interaction method and system based on a virtual human.
Summary of the invention
To solve the above problems, the present invention provides a visual interaction method based on a virtual human. The virtual human is displayed on a smart device and, when in an interactive state, has its speech, emotion, vision and sensing capabilities enabled. The method comprises the following steps:
outputting multi-modal data through the virtual human;
receiving multi-modal interaction data provided by the user in response to the multi-modal data;
parsing the multi-modal interaction data, wherein a hand frame gesture in the multi-modal interaction data is detected and extracted by the vision capability and taken as the interaction intention;
performing multi-modal interaction output through the virtual human according to the interaction intention.
According to one embodiment of the present invention, when the hand frame gesture is detected by the vision capability, within a judgment period, if the user's hand posture satisfies the following conditions: the pad of the left thumb touches the pad of the right index finger, the pad of the left index finger touches the pad of the right thumb, these four fingers form a closed quadrilateral, and the remaining fingers are naturally curled, then the hand posture is recognized as a hand frame gesture.
According to one embodiment of the present invention, the step of detecting and extracting the hand frame gesture in the multi-modal interaction data as the interaction intention by the vision capability further comprises:
when the hand frame gesture is recognized as occurring only once within a first recognition period, identifying the interaction intention as a photographing intention, and then turning on the camera of the smart device and starting to take a picture according to the intention;
or,
when the hand frame gesture is recognized as occurring more than once within the first recognition period, with the interval between adjacent occurrences not exceeding a second preset time, identifying the interaction intention as a recording intention, and then turning on the camera of the smart device and starting to record video according to the intention.
According to one embodiment of the present invention, the virtual human receives multi-modal interaction data provided for the multi-modal data from multiple users, identifies a primary user among the multiple users, and detects the hand movements of the primary user;
or,
collects the hand movements of all or some of the current users, and determines the interaction intention of the collected users according to a preset user-sampling ratio.
According to one embodiment of the present invention, when the multi-modal interaction data contains voice data or expression data, in addition to taking the hand frame gesture as the interaction intention, the above steps further comprise:
detecting and extracting the voice data or expression data in the multi-modal interaction data;
parsing the voice data or the expression data, and judging whether the voice data or the expression data is consistent with the intention of the hand frame gesture;
if consistent, combining the parsing result with the hand frame gesture as the interaction intention;
if not consistent, taking the hand frame gesture alone as the interaction intention.
According to one embodiment of the present invention, performing multi-modal interaction output through the virtual human according to the interaction intention comprises: starting the hardware of the smart device through the virtual human according to the interaction intention corresponding to the hand frame gesture, and presenting the multi-modal interaction output, which includes response result data corresponding to the photographing intention and/or the recording intention.
According to another aspect of the present invention, a program product is also provided, which contains a series of instructions for executing the method steps of any of the above.
According to another aspect of the present invention, a virtual human is also provided. The virtual human has a specific virtual appearance and preset attributes, and carries out multi-modal interaction using the method of any of the above.
According to another aspect of the present invention, a visual interaction system based on a virtual human is also provided. The system comprises:
a smart device, on which the virtual human described above is mounted, for obtaining multi-modal interaction data and having the capability of outputting speech, emotion, expression and action;
a cloud brain, for performing natural language understanding, visual recognition, cognitive computation and affective computation on the multi-modal interaction data, so as to decide the multi-modal interaction data output by the virtual human.
The visual interaction method and system based on a virtual human provided by the present invention supply a virtual human that has a preset appearance and preset attributes and can carry out multi-modal interaction with the user. Moreover, the method and system can infer the user's intention from the hand frame gesture and interact with the user accordingly, so that the user and the virtual human can communicate smoothly and the user enjoys a human-like interactive experience.
Other features and advantages of the present invention will be set forth in the following description, and will in part become apparent from the description or be understood by practicing the invention. The objects and other advantages of the present invention can be realized and obtained by the structures particularly pointed out in the description, the claims and the accompanying drawings.
Description of the drawings
The accompanying drawings are provided for a further understanding of the present invention and constitute a part of the description. Together with the embodiments of the present invention, they serve to explain the invention and are not to be construed as limiting the invention. In the drawings:
Fig. 1 shows a structural diagram of a visual interaction system based on a virtual human according to an embodiment of the present invention;
Fig. 2 shows a structural diagram of a visual interaction system based on a virtual human according to an embodiment of the present invention;
Fig. 3 shows a module block diagram of a visual interaction system based on a virtual human according to another embodiment of the present invention;
Fig. 4 shows a structural block diagram of a visual interaction system based on a virtual human according to another embodiment of the present invention;
Fig. 5 shows a schematic diagram of visual interaction carried out by a visual interaction system based on a virtual human according to an embodiment of the present invention;
Fig. 6 shows a flowchart of a visual interaction method based on a virtual human according to an embodiment of the present invention;
Fig. 7 shows a flowchart of determining the interaction intention in a visual interaction method based on a virtual human according to an embodiment of the present invention;
Fig. 8 shows a flowchart of determining the interaction intention in a visual interaction method based on a virtual human according to another embodiment of the present invention;
Fig. 9 shows another flowchart of a visual interaction method based on a virtual human according to an embodiment of the present invention; and
Fig. 10 shows a flowchart of the communication among the user, the smart device and the cloud brain according to an embodiment of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
For clarity, the following explanations are required before the embodiments are described:
The virtual human mentioned in the present invention is mounted on a smart device that supports input/output modules for perception, control and the like; it uses a highly realistic 3D virtual character image as its main user interface and has a distinctive appearance; it supports multi-modal human-computer interaction and possesses AI capabilities such as natural language understanding, visual perception, touch perception, speech output and emotional expression and action output; and it can be configured with social attributes, personality attributes, character skills and the like, so that the user enjoys an intelligent and personalized virtual character with a smooth experience.
The smart device on which the virtual human is mounted is a device with a non-touch, non-mouse-keyboard screen (holographic screen, TV screen, multimedia display screen, LED screen, etc.) and a camera; it may be a holographic device, a VR device or a PC. Other smart devices, such as handheld tablets, naked-eye 3D devices and even smartphones, are not excluded.
The virtual human interacts with the user at the system level. An operating system runs on the system hardware, such as the built-in system of a holographic device, or Windows or macOS in the case of a PC.
The virtual human exists as a system application or an executable file.
The virtual robot obtains the user's multi-modal interaction data based on the hardware of the smart device and, supported by the capabilities of the cloud brain, performs semantic understanding, visual recognition, cognitive computation and affective computation on the multi-modal interaction data, so as to complete the decision-output process.
The cloud brain mentioned is a terminal that provides the processing capability for the virtual human to perform semantic understanding (language semantic understanding, action semantic understanding, visual recognition, affective computation, cognitive computation) of the user's interaction demands, realizes the interaction with the user, and decides the multi-modal interaction data output by the virtual human.
The embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Fig. 1 shows a structural diagram of a visual interaction system based on a virtual human according to an embodiment of the present invention. As shown in Fig. 1, carrying out multi-modal interaction requires a user 101, a smart device 102, a virtual human 103 and a cloud brain 104. The user 101 who interacts with the virtual human may be a real person, another virtual human or a physically embodied virtual human; the interaction process of another virtual human or a physically embodied virtual human with the virtual human is similar to that of a single person with the virtual human. Therefore, Fig. 1 only shows the multi-modal interaction process between the user (a person) and the virtual human.
In addition, the smart device 102 includes a display area 1021 and hardware support devices 1022 (essentially the core processor). The display area 1021 is used for presenting the image of the virtual human 103, and the hardware support devices 1022 cooperate with the cloud brain 104 for data processing during the interaction. The virtual human 103 needs a screen carrier for presentation. Therefore, the display area 1021 includes a holographic screen, a TV screen, a multimedia display screen, an LED screen and the like.
The process of interaction between the virtual human and the user 101 in Fig. 1 is as follows:
The preparations or conditions required before the interaction are as follows: the virtual human is mounted and runs on the smart device 102, and the virtual human has specific image characteristics. The virtual human possesses AI capabilities such as natural language understanding, visual perception, touch perception, speech output and emotional expression and action output. In order to support the touch perception function of the virtual human, the smart device also needs to be equipped with a component having a touch perception function. According to one embodiment of the present invention, in order to improve the interaction experience, the virtual human is displayed in a preset area as soon as it is woken up, so as to avoid keeping the user waiting too long.
It should be noted that the image and dress of the virtual human 103 are not limited to one mode. The virtual human 103 may have different images and different dress. The image of the virtual human 103 is generally a high-poly 3D animated image. The virtual human 103 may have different appearances and decorations. Each image of the virtual human 103 may also correspond to a variety of different dress, which can be classified by season or by occasion. These images and dress may reside in the cloud brain 104 or in the smart device 102, and can be called up at any time when needed.
The social attributes, personality attributes and character skills of the virtual human 103 are likewise not limited to one kind. The virtual human 103 may have multiple social attributes, multiple personality attributes and multiple character skills. These social attributes, personality attributes and character skills can be combined freely and are not fixed to one combination; the user can select and combine them as needed.
Specifically, the social attributes may include attributes such as appearance, name, dress, decoration, gender, birthplace, age, family relationship, occupation, position, religious belief, relationship status and educational background; the personality attributes may include attributes such as character and temperament; the character skills may include professional skills such as singing, dancing, storytelling and training, and the display of character skills is not limited to skills shown with the limbs, expressions, head and/or mouth.
In the present application, the social attributes, personality attributes, character skills and the like of the virtual human can make the parsing and decision results of the multi-modal interaction more inclined toward, or better suited to, that virtual human.
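The configurable persona described above can be thought of as structured data. The following is a minimal sketch of one possible grouping, written in Python; the type names, field names and example values are illustrative assumptions and are not specified by the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SocialAttributes:
    name: str = "Xiaomei"              # example value, not from the patent
    gender: str = "female"
    age: int = 20
    occupation: str = "host"
    dress_by_occasion: Dict[str, str] = field(default_factory=dict)  # occasion -> outfit id

@dataclass
class PersonalityAttributes:
    character: str = "cheerful"
    temperament: str = "lively"

@dataclass
class VirtualHumanPersona:
    social: SocialAttributes = field(default_factory=SocialAttributes)
    personality: PersonalityAttributes = field(default_factory=PersonalityAttributes)
    skills: List[str] = field(default_factory=lambda: ["sing", "dance", "tell_story"])
```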
The multi-modal interaction process begins with the virtual human outputting multi-modal data. When the virtual human 103 communicates with the user 101, the virtual human 103 first outputs multi-modal data and waits for the user 101 to respond to it. In actual use, the virtual human 103 may output a piece of speech, a piece of music or a piece of video.
Then, the multi-modal interaction data provided by the user in response to the multi-modal data is received. The multi-modal interaction data may contain information of multiple modalities, such as text, speech, vision and perception information. The receiving devices for obtaining the multi-modal interaction data are installed or configured on the smart device 102 and include a text receiving device for receiving text, a speech receiving device for receiving speech, a camera for receiving vision, an infrared device for receiving perception information, and the like.
Then, the multi-modal interaction data is parsed, wherein the hand frame gesture in the multi-modal interaction data is detected and extracted by the vision capability and taken as the interaction intention. When the hand frame gesture is detected by the vision capability, within the judgment period, if the user's hand posture satisfies the following conditions: the pad of the left thumb touches the pad of the right index finger, the pad of the left index finger touches the pad of the right thumb, these four fingers form a closed quadrilateral, and the remaining fingers may be naturally curled, then the hand posture is recognized as a hand frame gesture.
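As an illustration of the geometric criteria above, the following sketch checks them against two-dimensional fingertip-pad coordinates such as those produced by a generic hand-tracking model. The landmark layout, the touch threshold and the helper names are assumptions; the patent does not prescribe a particular detection algorithm.

```python
import math

TOUCH_THRESHOLD = 0.05  # assumed normalized distance below which two finger pads "touch"

def is_hand_frame_gesture(left_hand, right_hand):
    """left_hand / right_hand: dicts mapping finger name -> (x, y) fingertip-pad coordinates.

    Checks the criteria stated in the text: the left thumb pad touches the right
    index-finger pad, the left index-finger pad touches the right thumb pad, and
    the resulting frame has a real opening between its two contact corners.
    """
    lt, li = left_hand["thumb"], left_hand["index"]
    rt, ri = right_hand["thumb"], right_hand["index"]

    # Conditions 1 and 2: the two pairs of finger pads are closed (touching).
    if math.dist(lt, ri) > TOUCH_THRESHOLD or math.dist(li, rt) > TOUCH_THRESHOLD:
        return False

    # Condition 3 (approximated): the two contact corners of the quadrilateral
    # are clearly separated, i.e. the frame encloses a usable viewing area.
    corner_a = ((lt[0] + ri[0]) / 2, (lt[1] + ri[1]) / 2)
    corner_b = ((li[0] + rt[0]) / 2, (li[1] + rt[1]) / 2)
    return math.dist(corner_a, corner_b) > 4 * TOUCH_THRESHOLD
```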
Finally, multi-modal interaction output is carried out through the virtual human according to the interaction intention.
In addition, the virtual human 103 may also receive multi-modal interaction data provided for the multi-modal data from multiple users; in that case it identifies the primary user among the multiple users and detects the hand movements of the primary user. Alternatively, the virtual human 103 collects the hand movements of all or some of the current users, and determines the interaction intention of the collected users according to a preset user-sampling ratio.
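The two multi-user strategies just described might be sketched as follows. Using the largest detected face as the primary-user heuristic and a majority vote over a sampled subset are illustrative assumptions only.

```python
import random

def pick_primary_user(detections):
    """detections: list of dicts like {"user_id": ..., "face_area": ..., "gesture": ...}.
    Assumes the primary user is the one whose detected face occupies the largest area."""
    return max(detections, key=lambda d: d["face_area"])

def sampled_intention(detections, sample_ratio=0.5):
    """Collect gestures from a preset ratio of the current users and take the
    most common gesture among them as the group's interaction intention."""
    k = max(1, int(len(detections) * sample_ratio))
    sampled = random.sample(detections, k)
    gestures = [d["gesture"] for d in sampled if d["gesture"] is not None]
    return max(set(gestures), key=gestures.count) if gestures else None
```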
According to another embodiment of the present invention, a virtual human is provided. The virtual human has a specific virtual appearance and preset attributes, and carries out multi-modal interaction using the visual interaction method based on a virtual human.
Fig. 2 shows a structural diagram of a visual interaction system based on a virtual human according to an embodiment of the present invention. As shown in Fig. 2, completing the multi-modal interaction through the system requires a user 101, a smart device 102 and a cloud brain 104. The smart device 102 includes a receiving device 102A, a processing device 102B, an output device 102C and a connecting device 102D. The cloud brain 104 includes a communication device 104A.
In the visual interaction system based on a virtual human provided by the present invention, an unobstructed communication channel needs to be established among the user 101, the smart device 102 and the cloud brain 104, so that the interaction between the user 101 and the virtual human can be completed. In order to accomplish the interaction task, the smart device 102 and the cloud brain 104 are provided with devices and components that support the interaction. The party interacting with the virtual human may be one party or multiple parties.
The smart device 102 includes the receiving device 102A, the processing device 102B, the output device 102C and the connecting device 102D. The receiving device 102A is used for receiving multi-modal interaction data. Examples of the receiving device 102A include a microphone for speech operation, a scanner, and a camera (detecting non-touch movements using visible or invisible wavelengths). The smart device 102 can obtain multi-modal interaction data through the above input devices. The output device 102C is used for outputting the multi-modal output data with which the virtual human interacts with the user 101; its configuration is substantially equivalent to that of the receiving device 102A and is not repeated here.
The processing device 102B is used for processing the interaction data transmitted by the cloud brain 104 during the interaction. The connecting device 102D is used for communicating with the cloud brain 104. The processing device 102B processes the multi-modal interaction data pre-processed by the receiving device 102A or the data transmitted by the cloud brain 104. The connecting device 102D sends call instructions to call the robot capabilities on the cloud brain 104.
The communication device 104A included in the cloud brain 104 is used for completing the correspondence with the smart device 102. The communication device 104A keeps in communication with the connecting device 102D on the smart device 102, receives the requests sent by the smart device 102, and sends out the processing results issued by the cloud brain 104; it is the medium of communication between the smart device 102 and the cloud brain 104.
Fig. 3 shows a module block diagram of a visual interaction system based on a virtual human according to another embodiment of the present invention. As shown in Fig. 3, the system includes an interaction module 301, a receiving module 302, a parsing module 303 and a decision module 304 (referred to below as the output module 304). The receiving module 302 includes a text acquisition unit 3021, an audio acquisition unit 3022, a vision acquisition unit 3023 and a perception acquisition unit 3024.
The interaction module 301 is used for outputting multi-modal data through the virtual human. The virtual human 103 is displayed on the smart device 102 and, when in the interactive state, has its speech, emotion, vision and sensing capabilities enabled. In one round of interaction, the virtual human 103 first outputs multi-modal data and waits for the user 101 to respond to it. According to one embodiment of the present invention, the interaction module 301 includes an output unit 3011, which can output the multi-modal data.
The receiving module 302 is used for receiving the multi-modal interaction data. The text acquisition unit 3021 is used for acquiring text information; the audio acquisition unit 3022 is used for acquiring audio information; the vision acquisition unit 3023 is used for acquiring visual information; and the perception acquisition unit 3024 is used for acquiring perception information. Examples of the receiving module 302 include a microphone for speech operation, a scanner, a camera and sensing devices, for example those using visible or invisible wavelengths, signals or environmental data. The multi-modal interaction data can be obtained through the above input devices. The multi-modal interaction may contain one or more of text, audio, vision and perception data, and the present invention imposes no restriction on this.
The parsing module 303 is used for parsing the multi-modal interaction data, wherein the hand frame gesture in the multi-modal interaction data is detected and extracted by the vision capability and taken as the interaction intention. The parsing module 303 includes a detection unit 3031 and an extraction unit 3032. The detection unit 3031 is used for detecting the hand frame gesture in the multi-modal interaction data by the vision capability. The detection process may first detect whether the multi-modal interaction data contains hand movements; if so, it continues to detect whether the hand movements contain the hand frame gesture made by the user 101.
If the detection unit 3031 detects that a hand frame gesture exists in the multi-modal interaction data, the extraction unit 3032 extracts the hand frame gesture and takes it as the interaction intention. According to one embodiment of the present invention, the interaction intention falls into two classes: a photographing intention and a recording intention. The process of judging the class of the interaction intention may be as follows: when the hand frame gesture is recognized as occurring only once within the first recognition period, the interaction intention is identified as a photographing intention, and the camera of the smart device is then turned on to take a picture according to the intention. Alternatively, when the hand frame gesture is recognized as occurring more than once within the first recognition period, with the interval between adjacent occurrences not exceeding the second preset time, the interaction intention is identified as a recording intention, and the camera of the smart device is then turned on to record video according to the intention.
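A sketch of this photographing/recording decision over time-stamped gesture detections is given below. The 10 s recognition period and 1 s maximum interval are the example values given later in the description; the function and constant names are illustrative.

```python
FIRST_RECOGNITION_PERIOD = 10.0  # seconds, example value from the description
SECOND_PRESET_TIME = 1.0         # max interval between adjacent gestures, example value

def classify_intention(gesture_times):
    """gesture_times: sorted timestamps (in seconds) at which a hand frame
    gesture was recognized. Returns 'photo', 'record' or None."""
    if not gesture_times:
        return None
    # Keep only the occurrences inside the first recognition period.
    window = [t for t in gesture_times
              if t >= gesture_times[-1] - FIRST_RECOGNITION_PERIOD]
    if len(window) == 1:
        return "photo"
    intervals = [b - a for a, b in zip(window, window[1:])]
    if all(i <= SECOND_PRESET_TIME for i in intervals):
        return "record"
    return None  # the text does not specify the case of widely spaced repeats
```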
The output module 304 is used for carrying out multi-modal interaction output through the virtual human according to the interaction intention. After the parsing module 303 determines the interaction intention, the output module 304 can produce the multi-modal interaction output that satisfies the interaction intention. The output module 304 contains an output data unit 3041, which can determine the multi-modal interaction output to be produced according to the interaction intention and present it to the user 101 through the virtual human.
Fig. 4 shows a structural block diagram of a visual interaction system based on a virtual human according to another embodiment of the present invention. As shown in Fig. 4, completing the interaction requires a user 101, a smart device 102 and a cloud brain 104. The smart device 102 includes a human-machine interface 401, a data processing unit 402, an input/output device 403 and an interface unit 404. The cloud brain 104 contains a semantic understanding interface 1041, a visual recognition interface 1042, a cognitive computation interface 1043 and an affective computation interface 1044.
The visual interaction system based on a virtual human provided by the present invention includes the smart device 102 and the cloud brain 104. The virtual human 103 runs on the smart device 102 and has a preset appearance and preset attributes; when in the interactive state, its speech, emotion, vision and sensing capabilities can be enabled.
In one embodiment, the smart device 102 may include the human-machine interface 401, the data processing unit 402, the input/output device 403 and the interface unit 404. The human-machine interface 401 displays the running virtual human 103 in a preset area of the smart device 102.
The data processing unit 402 is used for processing the data generated during the multi-modal interaction between the user 101 and the virtual human 103. The processor used may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor; the processor is the control center of the terminal and connects the various parts of the entire terminal through various interfaces.
The smart device 102 includes a memory. The memory mainly includes a program storage area and a data storage area, wherein the program storage area can store the operating system and the application programs required by at least one function (such as a sound playing function, an image playing function, etc.), and the data storage area can store data created according to the use of the smart device 102 (such as audio data, browsing records, etc.). In addition, the memory may include a high-speed random access memory, and may also include a non-volatile memory such as a hard disk, a memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device or another volatile solid-state storage device.
The input/output device 403 is used for obtaining the multi-modal interaction data and outputting the output data during the interaction. The interface unit 404 is used for communicating with the cloud brain 104, and calls the virtual human capabilities in the cloud brain 104 through the interfaces in the cloud brain 104.
The cloud brain 104 includes the semantic understanding interface 1041, the visual recognition interface 1042, the cognitive computation interface 1043 and the affective computation interface 1044. These interfaces communicate with the interface unit 404 in the smart device 102. The cloud brain 104 also contains semantic understanding logic corresponding to the semantic understanding interface 1041, visual recognition logic corresponding to the visual recognition interface 1042, cognitive computation logic corresponding to the cognitive computation interface 1043, and affective computation logic corresponding to the affective computation interface 1044.
As shown in Fig. 4, each capability interface calls its corresponding logic processing during the parsing of the multi-modal data. The interfaces are explained below:
The semantic understanding interface 1041 receives the specific speech instruction forwarded from the interface unit 404, and performs speech recognition on it together with natural language processing based on a large corpus.
The visual recognition interface 1042 can perform video content detection, recognition, tracking and the like for human bodies, faces and scenes according to computer vision algorithms, deep learning algorithms, etc. That is, images are recognized according to predetermined algorithms and quantitative detection results are given. It has an image pre-processing function, a feature extraction function, a decision function and specific application functions;
wherein the image pre-processing function can perform basic processing on the acquired visual data, including color space conversion, edge extraction, image transformation and image thresholding;
the feature extraction function can extract feature information such as the skin color, color, texture, motion and coordinates of the target in the image;
and the decision function can distribute the feature information, according to a certain decision strategy, to the specific multi-modal output devices or multi-modal output applications that need it, so as to realize functions such as face detection, human limb recognition and motion detection.
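Read as a pipeline, the three functions above could be arranged roughly as in the following sketch. The OpenCV calls are one plausible realization and the routing table is an assumption; the patent does not name a specific library or algorithm.

```python
import cv2

def preprocess(frame):
    """Basic processing named in the text: color-space conversion, edge
    extraction and image thresholding."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return gray, edges, mask

def extract_features(frame, edges):
    """Toy feature extraction: mean color of the frame and the bounding box
    of the largest contour (a stand-in for target coordinates)."""
    mean_color = cv2.mean(frame)[:3]
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    bbox = cv2.boundingRect(max(contours, key=cv2.contourArea)) if contours else None
    return {"mean_color": mean_color, "bbox": bbox}

def decide(features, routes):
    """Hand the features to whichever registered consumers (e.g. a gesture
    detector or face detector) need them, per a simple decision strategy."""
    for needed_keys, consumer in routes:
        if all(features.get(k) is not None for k in needed_keys):
            consumer(features)
```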
The cognitive computation interface 1043 receives the multi-modal data forwarded from the interface unit 404, and performs data acquisition, recognition and learning to process the multi-modal data, so as to obtain a user portrait, a knowledge graph and the like, and thereby make rational decisions about the multi-modal output data.
The affective computation interface 1044 receives the multi-modal data forwarded from the interface unit 404, and calculates the user's current emotional state using affective computation logic (which may be emotion recognition technology). Emotion recognition technology is an important part of affective computation; the content of emotion recognition research includes facial expression, speech, behavior, text and physiological signal recognition, through which the user's emotional state can be judged. Emotion recognition technology may monitor the user's emotional state through visual emotion recognition alone, or through a combination of visual emotion recognition and acoustic emotion recognition, and is not limited thereto. In this embodiment, the combination of the two is preferably used to monitor emotion.
When performing visual emotion recognition, the affective computation interface 1044 collects images of human facial expressions by means of an image acquisition device, converts them into analyzable data, and then uses image processing and other techniques to analyze the expressed emotion. Understanding facial expressions usually requires detecting subtle changes in the expression, such as changes in the cheek muscles and the mouth, and raising of the eyebrows.
Fig. 5 shows a schematic diagram of visual interaction carried out by a visual interaction system based on a virtual human according to an embodiment of the present invention. As shown in Fig. 5, the multi-modal interaction data is parsed, wherein the hand frame gesture in the multi-modal interaction data is detected and extracted by the vision capability and taken as the interaction intention. Hardware devices capable of vision-based detection are configured on the smart device 102 to monitor the user's hand movements and detect in real time whether the user's gesture is a hand frame gesture.
In one embodiment of the present invention, when the hand frame gesture is detected by the vision capability, within the judgment period, if the user's hand posture satisfies the following conditions: the pad of the left thumb touches the pad of the right index finger, the pad of the left index finger touches the pad of the right thumb, these four fingers form a closed quadrilateral, and the remaining fingers are naturally curled, then the hand posture is recognized as a hand frame gesture.
When the hand frame gesture is recognized as occurring only once within the first recognition period, the interaction intention is identified as a photographing intention, and the camera of the smart device is then turned on to take a picture according to the intention. When the hand frame gesture is recognized as occurring more than once within the first recognition period, with the interval between adjacent occurrences not exceeding the second preset time, the interaction intention is identified as a recording intention, and the camera of the smart device is then turned on to record video according to the intention.
For example, suppose the smart device 102 is playing a piece of music and the virtual human is waving in time with the music. At this point, if the user makes a single hand frame gesture within the first recognition period, it indicates that the user wants the camera of the smart device to be turned on and a picture to be taken. Correspondingly, if the user makes the hand frame gesture more than once within the first recognition period, with the interval between adjacent occurrences not exceeding the second preset time, it indicates that the user wants the camera of the smart device to be turned on and video recording to be started. According to one embodiment of the present invention, the first recognition period may be 10 s and the second preset time may be 1 s.
In addition, according to one embodiment of the present invention, in photographing mode the hand frame gesture can also serve as the trigger for the smart device to take a picture: in photographing mode, when the user makes the frame gesture with the hands, the smart device recognizes the frame gesture and triggers the photographing operation, taking a picture for the user.
Correspondingly, according to one embodiment of the present invention, in recording mode the hand frame gesture can also serve as the trigger for the smart device to start or stop recording: in recording mode, when the user makes two frame gestures within 2 s, the smart device recognizes the frame gestures and triggers the start or the end of recording, capturing video for the user. It should be noted that the 2 s value is not fixed and can be adjusted according to actual conditions; the present invention imposes no restriction on this.
Carrying out multi-modal interaction output through the virtual human according to the interaction intention includes: starting the hardware of the smart device through the virtual human according to the interaction intention corresponding to the hand frame gesture, and presenting the multi-modal interaction output, which includes response result data corresponding to the photographing intention and/or the recording intention. For example, under the photographing and recording intentions, the user can choose to pose for a photo or video together with the virtual human, or to be photographed or recorded alone. In that case the camera and image processing functions of the smart device need to be started. The pose and the picture effects for photographing or recording can also be selected by the user.
It should be noted that the spatial position of the hand frame gesture is not limited, as long as it is within the area captured by the smart device. In addition, whether the palms face inward or outward is also not limited; the gesture can be recognized as a hand frame gesture with the palms facing either inward or outward, and the present invention imposes no restriction on this.
Fig. 6 shows a flowchart of a visual interaction method based on a virtual human according to an embodiment of the present invention.
As shown in Fig. 6, in step S601, multi-modal data is output through the virtual human. In this step, the virtual human 103 on the smart device 102 outputs multi-modal data to the user 101, so as to start a dialogue or other interaction with the user 101 in one round of interaction. The multi-modal data output by the virtual human 103 may be a piece of speech, a piece of music or a piece of video.
In step S602, the multi-modal interaction data provided by the user in response to the multi-modal data is received. In this step, the smart device 102 obtains the multi-modal interaction data; the smart device 102 can be configured with corresponding devices for obtaining it. The multi-modal interaction data may be input in the form of text input, audio input, perception input and the like.
In step S603, the multi-modal interaction data is parsed, wherein the hand frame gesture in the multi-modal interaction data is detected and extracted by the vision capability and taken as the interaction intention. The multi-modal interaction data may or may not contain hand movements; in order to determine the interaction intention, it is necessary to detect whether the multi-modal interaction data contains hand movements. When the hand frame gesture is detected by the vision capability, within the judgment period, if the user's hand posture satisfies the following conditions: the pad of the left thumb touches the pad of the right index finger, the pad of the left index finger touches the pad of the right thumb, these four fingers form a closed quadrilateral, and the remaining fingers are naturally curled, then the hand posture is recognized as a hand frame gesture.
In this step, it is first detected whether the multi-modal interaction data contains a hand frame gesture. If it does, the hand frame gesture is taken as the interaction intention of this round of interaction. If the multi-modal interaction data does not contain a hand frame gesture, the interaction intention is determined from the other data in the multi-modal interaction data.
In one embodiment of the present invention, the interaction intention is divided into a photographing intention and a recording intention. When the hand frame gesture is recognized as occurring only once within the first recognition period, the interaction intention is identified as a photographing intention, and the camera of the smart device is then turned on to take a picture according to the intention. When the hand frame gesture is recognized as occurring more than once within the first recognition period, with the interval between adjacent occurrences not exceeding the second preset time, the interaction intention is identified as a recording intention, and the camera of the smart device is then turned on to record video according to the intention.
Finally, in step S604, multi-modal interaction output is carried out through the virtual human according to the interaction intention. After the interaction intention is determined, the virtual human 103 can produce the corresponding multi-modal interaction output according to the confirmed interaction intention.
In addition, the visual interaction system based on a virtual human provided by the present invention can also be accompanied by a program product, which contains a series of instructions for executing the steps of the visual interaction method for the virtual human. The program product can run computer instructions; the computer instructions include computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc.
The program product may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunication signal, a software distribution medium, and the like.
It should be noted that the content included in the program product may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the program product does not include electric carrier signals and telecommunication signals.
Fig. 7 shows a flowchart of determining the interaction intention in a visual interaction method based on a virtual human according to an embodiment of the present invention.
In step S701, the multi-modal interaction data is parsed, wherein the hand frame gesture in the multi-modal interaction data is detected and extracted by the vision capability and taken as the interaction intention. In this step, the multi-modal interaction data, which includes data in various forms, needs to be parsed. In order to know the interaction intention, it is necessary to detect whether the multi-modal interaction data contains a hand frame gesture. When a hand frame gesture is detected in the multi-modal interaction data, the detected gesture needs to be extracted and taken as the interaction intention.
According to one embodiment of the present invention, the interaction intention falls into two classes: a photographing intention and a recording intention. In step S702, when the hand frame gesture is recognized as occurring only once within the first recognition period, the interaction intention is identified as a photographing intention, and the camera of the smart device is then turned on to take a picture according to the intention.
Meanwhile, in step S703, when the hand frame gesture is recognized as occurring more than once within the first recognition period, with the interval between adjacent occurrences not exceeding the second preset time, the interaction intention is identified as a recording intention, and the camera of the smart device is then turned on to record video according to the intention. Finally, in step S704, multi-modal interaction output is carried out through the virtual human according to the interaction intention.
Fig. 8 shows a flowchart of determining the interaction intention in a visual interaction method based on a virtual human according to another embodiment of the present invention.
In step S801, the voice data or expression data in the multi-modal interaction data is detected and extracted. The multi-modal interaction data contains data in various forms, any of which may reflect the user's current interaction wishes. In this step, it is detected whether the multi-modal interaction data contains voice data or expression data, as a reference for determining the interaction intention.
Then, in step S802, the voice data or expression data is parsed. If the multi-modal interaction data contains voice data or expression data, it is parsed in this step to learn the user's interaction wishes from it and obtain a parsing result.
Then, in step S803, it is judged whether the voice data or expression data is consistent with the intention of the hand frame gesture. If it is consistent, step S804 is entered, and the parsing result is combined with the hand frame gesture as the interaction intention. If it is not consistent, step S805 is entered, and the hand frame gesture alone is taken as the interaction intention.
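One way to sketch this consistency check is shown below. The mapping from spoken keywords to the photographing/recording intentions is an illustrative assumption, since the patent does not enumerate which utterances count as consistent.

```python
# Hypothetical keyword mapping used only for this sketch.
VOICE_HINTS = {"photo": {"take a picture", "photo", "cheese"},
               "record": {"record", "video", "start filming"}}

def fuse_intention(gesture_intent, voice_text=None):
    """gesture_intent: 'photo' or 'record' derived from the hand frame gesture.
    Returns the final interaction intention per steps S803-S805."""
    voice_intent = None
    if voice_text:
        lowered = voice_text.lower()
        for intent, phrases in VOICE_HINTS.items():
            if any(p in lowered for p in phrases):
                voice_intent = intent
    if voice_intent == gesture_intent:
        # Consistent (S804): keep the gesture intent, enriched by the parsed speech.
        return {"intent": gesture_intent, "voice_context": voice_text}
    # Inconsistent or no usable speech (S805): the gesture alone decides.
    return {"intent": gesture_intent, "voice_context": None}
```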
Fig. 9 shows another flowchart of a visual interaction method based on a virtual human according to an embodiment of the present invention.
As shown in Fig. 9, in step S901 the smart device 102 sends a request to the cloud brain 104. Afterwards, in step S902, the smart device 102 remains in a state of waiting for the cloud brain 104 to reply. While waiting, the smart device 102 may time how long the returned data is taking.
In step S903, if no reply data is returned for a long time, for example longer than a predetermined time span of 5 s, the smart device 102 can choose to make a local reply and generate local common reply data. Then, in step S904, an animation matching the local common reply is output, and the speech playing device is called to play the speech.
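A minimal sketch of this request-with-timeout fallback, assuming an HTTP-style cloud endpoint (the URL, payload fields and fallback phrasing are all illustrative), follows.

```python
import requests  # assumed transport; the patent does not name one

CLOUD_BRAIN_URL = "https://cloud-brain.example.com/parse"  # hypothetical endpoint
LOCAL_REPLIES = {"default": "Sorry, let me think about that for a moment."}

def ask_cloud_brain(interaction_data, timeout_s=5.0):
    """Steps S901-S904: send the request, wait up to timeout_s, otherwise
    fall back to a locally generated common reply."""
    try:
        resp = requests.post(CLOUD_BRAIN_URL, json=interaction_data, timeout=timeout_s)
        resp.raise_for_status()
        return {"source": "cloud", "reply": resp.json()}
    except requests.RequestException:
        # No reply within the predetermined span: reply locally (S903/S904).
        return {"source": "local", "reply": LOCAL_REPLIES["default"]}
```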
Fig. 10 shows a flowchart of the communication among the user, the smart device and the cloud brain according to an embodiment of the present invention.
In order to realize the multi-modal interaction between the smart device 102 and the user 101, a communication connection needs to be set up among the user 101, the smart device 102 and the cloud brain 104. This communication connection should be real-time and unobstructed, so as to ensure that the interaction is not impaired.
In order to complete the interaction, certain conditions or premises need to be satisfied. These include that the virtual human is loaded and runs on the smart device 102, and that the smart device 102 has hardware facilities with perception and control functions. The virtual human's speech, emotion, vision and sensing capabilities are enabled when it is in the interactive state.
After the preparations are completed, the smart device 102 starts to interact with the user 101. First, the smart device 102 outputs multi-modal data through the virtual human 103. The multi-modal data may be a piece of speech, a piece of music or a piece of video output by the virtual human in one round of interaction. At this point, the two communicating parties are the smart device 102 and the user 101, and the direction of data transfer is from the smart device 102 to the user 101.
Then, the smart device 102 receives the multi-modal interaction data. The multi-modal interaction data is the response provided by the user to the multi-modal data, and may contain data in various forms, for example text data, voice data, perception data and action data. The smart device 102 is configured with corresponding devices for receiving the multi-modal interaction data sent by the user 101. At this point, the two parties of the data transfer are the user 101 and the smart device 102, and the direction of data transfer is from the user 101 to the smart device 102.
Then, the smart device 102 sends a request to the cloud brain 104, asking the cloud brain 104 to perform semantic understanding, visual recognition, cognitive computation and affective computation on the multi-modal interaction data so as to help make a decision. At this point, the hand frame gesture in the multi-modal interaction data is detected and extracted by the vision capability and taken as the interaction intention. The cloud brain 104 then transmits the reply data to the smart device 102. At this point, the two communicating parties are the smart device 102 and the cloud brain 104.
Finally, after the smart device 102 receives the data transmitted by the cloud brain 104, the smart device 102 carries out multi-modal interaction output through the virtual human according to the interaction intention. At this point, the two communicating parties are the smart device 102 and the user 101.
The visual interaction method and system based on a virtual human provided by the present invention supply a virtual human that has a preset appearance and preset attributes and can carry out multi-modal interaction with the user. Moreover, the method and system can infer the user's intention from the hand frame gesture and interact with the user accordingly, so that the user and the virtual human can communicate smoothly and the user enjoys a human-like interactive experience.
It should be understood that the disclosed embodiments of the present invention are not limited to the specific structures, processing steps or materials disclosed herein, but extend to equivalents of these features as would be understood by a person of ordinary skill in the relevant art. It should also be understood that the terms used herein are only for describing specific embodiments and are not intended to be limiting.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Therefore, the phrases "one embodiment" or "an embodiment" appearing in various places throughout the specification do not necessarily all refer to the same embodiment.
Although the embodiments of the present invention are disclosed as above, the content described is only an embodiment adopted to facilitate understanding of the present invention and is not intended to limit the present invention. Any person skilled in the art to which the present invention belongs may make modifications and changes in the form and details of implementation without departing from the spirit and scope disclosed by the present invention, but the scope of patent protection of the present invention shall still be subject to the scope defined by the appended claims.

Claims (9)

1. A visual interaction method based on a virtual human, wherein the virtual human is displayed on a smart device and, when in an interactive state, has its speech, emotion, vision and sensing capabilities enabled, the method comprising the following steps:
outputting multi-modal data through the virtual human;
receiving multi-modal interaction data provided by a user in response to the multi-modal data;
parsing the multi-modal interaction data, wherein a hand frame gesture in the multi-modal interaction data is detected and extracted by the vision capability and taken as an interaction intention;
performing multi-modal interaction output through the virtual human according to the interaction intention.
2. The visual interaction method based on a virtual human according to claim 1, wherein when the hand frame gesture is detected by the vision capability, within a judgment period, if the user's hand posture satisfies the following conditions: the pad of the left thumb touches the pad of the right index finger, the pad of the left index finger touches the pad of the right thumb, and these four fingers form a closed quadrilateral, then the hand posture is recognized as a hand frame gesture.
3. The visual interaction method based on a visual human according to any one of claims 1-2, characterized in that the step of detecting and extracting, by the visual capability, the hand frame-type action in the multi-modal interaction data as the interaction intention further includes:
when it is identified that the hand frame-type action occurs only once within a first recognition cycle, recognizing the interaction intention as a photographing intention, and then turning on the camera of the smart device according to the intention and starting to take a picture;
or,
when it is identified that the hand frame-type action occurs more than once within the first recognition cycle and the time interval between two adjacent occurrences does not exceed a second preset time, recognizing the interaction intention as a recording intention, and then turning on the camera of the smart device according to the intention and starting to record video.
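Purely as an illustration of the timing rule in claim 3, the following sketch classifies a list of detection timestamps (in seconds) into a photographing or recording intention; the concrete values of the first recognition cycle and of the second preset time are assumptions, since the claim does not fix them:

```python
# Illustrative sketch; the numeric values of the recognition cycle and of the
# maximum gap between occurrences are assumed, not taken from the claim.

def classify_frame_gesture(timestamps, recognition_cycle=3.0, max_gap=1.0):
    """timestamps: times at which the hand frame-type action was detected,
    measured from the start of the first recognition cycle."""
    in_cycle = sorted(t for t in timestamps if 0.0 <= t <= recognition_cycle)
    if len(in_cycle) == 1:
        return "take_photo"        # occurred only once -> photographing intention
    if len(in_cycle) > 1:
        gaps = [b - a for a, b in zip(in_cycle, in_cycle[1:])]
        if all(gap <= max_gap for gap in gaps):
            return "record_video"  # repeated within the preset interval -> recording intention
    return "no_intention"

print(classify_frame_gesture([0.5]))             # take_photo
print(classify_frame_gesture([0.5, 1.2, 1.9]))   # record_video
```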
4. The visual interaction method based on a visual human according to any one of claims 1-3, characterized in that the visual human receives the multi-modal interaction data provided for the multi-modal data from multiple users, identifies a main user among the multiple users, and detects the hand action of the main user;
or,
collects the hand actions of all or some of the current users, and determines the interaction intention of the collected users according to a preset user-collection ratio.
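As a rough illustration of the two options in claim 4, the sketch below either picks a main user or samples a fraction of the users; the scoring rule for the main user and the collection ratio are assumptions, since the claim does not prescribe how either is chosen:

```python
# Illustrative sketch; the "largest, most central face" heuristic and the
# collection ratio are assumptions made for explanation only.

def pick_main_user(users):
    """Choose one main user among several, e.g. the one with the largest
    detected face that is closest to the centre of the frame."""
    return max(users, key=lambda u: u["face_area"] - abs(u["offset_from_center"]))

def sample_users(users, collection_ratio=0.5):
    """Collect the hand actions of only a preset fraction of the current users."""
    count = max(1, int(len(users) * collection_ratio))
    return users[:count]

crowd = [
    {"id": "A", "face_area": 5000, "offset_from_center": 40},
    {"id": "B", "face_area": 9000, "offset_from_center": 10},
    {"id": "C", "face_area": 3000, "offset_from_center": 80},
]
print(pick_main_user(crowd)["id"])              # B
print([u["id"] for u in sample_users(crowd)])   # ['A', 'B']
```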
5. The visual interaction method based on a visual human according to any one of claims 1-3, characterized in that, when the multi-modal interaction data includes voice data or expression data, before the hand frame-type action is taken as the interaction intention, the above steps further include:
detecting and extracting the voice data or expression data in the multi-modal interaction data;
parsing the voice data or the expression data, and judging whether the voice data or the expression data is consistent with the intention of the hand frame-type action;
if consistent, taking the parsing result combined with the hand frame-type action as the interaction intention;
if not consistent, taking the hand frame-type action as the interaction intention.
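A minimal sketch of the consistency check in claim 5, assuming the voice or expression data has already been reduced to an intent label; the label names and the consistency table are assumptions introduced only for illustration:

```python
# Illustrative sketch; the intent labels and the consistency rule are assumed,
# not part of the claim.

def fuse_intention(gesture_intent, voice_or_expression_intent=None):
    """Combine the hand frame-type intention with a parsed voice/expression
    intention when they are consistent; otherwise keep the gesture alone."""
    if voice_or_expression_intent is None:
        return gesture_intent
    consistent_pairs = {
        ("take_photo", "take_photo"),
        ("record_video", "record_video"),
    }
    if (gesture_intent, voice_or_expression_intent) in consistent_pairs:
        # Consistent: the parsing result is used together with the gesture.
        return voice_or_expression_intent
    # Not consistent: the hand frame-type action alone decides the intention.
    return gesture_intent

print(fuse_intention("take_photo", "take_photo"))    # take_photo
print(fuse_intention("take_photo", "record_video"))  # take_photo (gesture wins)
```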
6. The visual interaction method based on a visual human according to any one of claims 1-5, characterized in that performing multi-modal interaction output through the visual human according to the interaction intention includes: starting, by the visual human, the hardware of the smart device according to the interaction intention corresponding to the hand frame-type action, and presenting multi-modal interaction output, the multi-modal interaction output including: response result data corresponding to the photographing intention and/or the recording intention.
7. A program product comprising a series of instructions for executing the method steps according to any one of claims 1-6.
8. A visual human, characterized in that the visual human has a specific virtual image and preset attributes, and carries out multi-modal interaction using the method according to any one of claims 1-6.
9. A visual interaction system based on a visual human, characterized in that the system includes:
a smart device, on which the visual human according to claim 8 is mounted, configured to acquire multi-modal interaction data and having the capability of voice, emotion, expression and action output;
a cloud brain, configured to perform natural language understanding, visual recognition, cognitive computation and affective computation on the multi-modal interaction data, so as to decide the multi-modal interaction data output by the visual human.
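For illustration only (not part of the claimed system), the split of work described in claim 9 might look like the following sketch; the function names and the returned values are assumptions made to show where each of the four cloud-side capabilities fits:

```python
# Illustrative sketch of the claim-9 division of work between the smart device
# and the cloud brain; all names and return values are assumptions.

def cloud_brain_decide(interaction_data: dict) -> dict:
    """Run the four cloud-side capabilities and decide the visual human's output."""
    semantics = {"text": interaction_data.get("speech", "")}           # natural language understanding
    vision = {"gesture": interaction_data.get("gesture")}              # visual recognition
    emotion = {"mood": interaction_data.get("expression", "neutral")}  # affective computation
    # Cognitive computation: turn the partial results into a multi-modal output plan.
    return {"speech": f"understood: {semantics['text']}",
            "action": vision["gesture"],
            "emotion": emotion["mood"]}

def smart_device_round(interaction_data: dict) -> None:
    """The smart device collects the input and renders the decided output."""
    plan = cloud_brain_decide(interaction_data)
    print("visual human output:", plan)

smart_device_round({"speech": "take a picture", "gesture": "hand_frame"})
```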
CN201810442264.0A 2018-05-10 2018-05-10 Visual interactive method and system based on visual human Pending CN108595012A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810442264.0A CN108595012A (en) 2018-05-10 2018-05-10 Visual interactive method and system based on visual human

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810442264.0A CN108595012A (en) 2018-05-10 2018-05-10 Visual interactive method and system based on visual human

Publications (1)

Publication Number Publication Date
CN108595012A true CN108595012A (en) 2018-09-28

Family

ID=63636766

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810442264.0A Pending CN108595012A (en) 2018-05-10 2018-05-10 Visual interactive method and system based on visual human

Country Status (1)

Country Link
CN (1) CN108595012A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103259978A (en) * 2013-05-20 2013-08-21 邱笑难 Method for photographing by utilizing gesture
CN105468145A (en) * 2015-11-18 2016-04-06 北京航空航天大学 Robot man-machine interaction method and device based on gesture and voice recognition
CN105690385A (en) * 2016-03-18 2016-06-22 北京光年无限科技有限公司 Application calling method and device based on intelligent robot
CN106648078A (en) * 2016-12-05 2017-05-10 北京光年无限科技有限公司 Multimode interaction method and system applied to intelligent robot
CN107632706A (en) * 2017-09-08 2018-01-26 北京光年无限科技有限公司 The application data processing method and system of multi-modal visual human
CN107704169A (en) * 2017-09-26 2018-02-16 北京光年无限科技有限公司 The method of state management and system of visual human
CN107765856A (en) * 2017-10-26 2018-03-06 北京光年无限科技有限公司 Visual human's visual processing method and system based on multi-modal interaction

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109377797A (en) * 2018-11-08 2019-02-22 北京葡萄智学科技有限公司 Virtual portrait teaching method and device
CN110363273A (en) * 2018-11-19 2019-10-22 莆田学院 A kind of interaction feature modeling method based on high-grade intelligent object
CN110363273B (en) * 2018-11-19 2022-07-22 莆田学院 Interactive characteristic modeling method based on advanced intelligent object
CN110457108A (en) * 2019-08-12 2019-11-15 吕元喜 A kind of intelligent float command frame system based on mobile terminal or the end PC
CN111638798A (en) * 2020-06-07 2020-09-08 上海商汤智能科技有限公司 AR group photo method, AR group photo device, computer equipment and storage medium
CN111966212A (en) * 2020-06-29 2020-11-20 百度在线网络技术(北京)有限公司 Multi-mode-based interaction method and device, storage medium and smart screen device

Similar Documents

Publication Publication Date Title
CN109271018A (en) Exchange method and system based on visual human's behavioral standard
CN109522835A (en) Children's book based on intelligent robot is read and exchange method and system
CN108595012A (en) Visual interactive method and system based on visual human
CN108665492B (en) Dance teaching data processing method and system based on virtual human
CN109176535B (en) Interaction method and system based on intelligent robot
US20230128505A1 (en) Avatar generation method, apparatus and device, and medium
CN109871450A (en) Based on the multi-modal exchange method and system for drawing this reading
CN109343695A (en) Exchange method and system based on visual human's behavioral standard
CN108942919B (en) Interaction method and system based on virtual human
CN109324688A (en) Exchange method and system based on visual human's behavioral standard
CN108416420A (en) Limbs exchange method based on visual human and system
CN107797663A (en) Multi-modal interaction processing method and system based on visual human
CN107679519A (en) A kind of multi-modal interaction processing method and system based on visual human
CN107831905A (en) A kind of virtual image exchange method and system based on line holographic projections equipment
CN109086860A (en) A kind of exchange method and system based on visual human
CN108681398A (en) Visual interactive method and system based on visual human
Yu et al. Artificial intelligence-generated virtual influencer: Examining the effects of emotional display on user engagement
CN109032328A (en) A kind of exchange method and system based on visual human
CN108052250A (en) Virtual idol deductive data processing method and system based on multi-modal interaction
CN109278051A (en) Exchange method and system based on intelligent robot
CN108415561A (en) Gesture interaction method based on visual human and system
CN108874114A (en) Realize method, apparatus, computer equipment and the storage medium of virtual objects emotion expression service
CN109542389A (en) Sound effect control method and system for the output of multi-modal story content
CN108037825A (en) The method and system that a kind of virtual idol technical ability is opened and deduced
CN112190921A (en) Game interaction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination