CN108363706A

CN108363706A - The method and apparatus of human-computer dialogue interaction, the device interacted for human-computer dialogue

Info

Publication number: CN108363706A
Application number: CN201710056801.3A
Authority: CN
Inventors: 赵海舟; 许静芳
Original assignee: Beijing Sogou Technology Development Co Ltd
Current assignee: Beijing Sogou Technology Development Co Ltd
Priority date: 2017-01-25
Filing date: 2017-01-25
Publication date: 2018-08-03
Anticipated expiration: 2037-01-25
Also published as: CN108363706B

Abstract

An embodiment of the present invention provides a kind of method and apparatus of human-computer dialogue interaction, wherein the method includes：Obtain voice data, image data and the contextual data of interaction side；Corresponding scene characteristic model is obtained according to the contextual data；The voice data and image data are input to the scene characteristic model and obtain target person characteristic attribute；Target dialogue strategy is determined using the target person characteristic attribute and contextual data；Expression, voice based on target dialogue policy control robot and/or action output.The embodiment of the present invention so that during human-computer interaction, machine can coordinate the feature of interaction side's current session, the dialogue to personalize with the side of interaction, to improve interaction side's interactive experience according to target dialogue strategy.

Description

The method and apparatus of human-computer dialogue interaction, the device interacted for human-computer dialogue

Technical field

The present invention relates to technical field of data processing, man-machine more particularly to the method and one kind of a kind of human-computer dialogue interaction Talk with the device of interaction and the device of user's human-computer dialogue interaction.

Background technology

Man-machine interaction is exactly the process that people carries out information exchange with machine.Information flowrate is that measurement is man-machine The most important index of dialogue interaction.Man-machine interaction will follow the Evolution Paths of person to person's interaction, and dialogue interaction is highest The Health For All mode of effect will also become most efficient human-computer dialogue interactive mode.

The voice messaging of interaction side can only be changed into text information by existing interactive, cannot be identified more Information causes machine that can only be gone to generate reply according to single parameter, model when replying interaction side in this way.In addition, existing Interactive system is generally synthetic when replying based on voice messaging, and dialogue form is dull, and interaction cube is tested ineffective.

Invention content

In view of the above problems, it is proposed that the embodiment of the present invention overcoming the above problem or at least partly in order to provide one kind The device of a kind of method of human-computer dialogue interaction to solve the above problems and a kind of corresponding human-computer dialogue interaction.

To solve the above-mentioned problems, the embodiment of the invention discloses a kind of methods of human-computer dialogue interaction, including：

Obtain voice data, image data and the contextual data of interaction side；

Corresponding scene characteristic model is obtained according to the contextual data；

The voice data and image data are input to the scene characteristic model and obtain target person characteristic attribute；

Target dialogue strategy is determined using the target person characteristic attribute and contextual data；

Expression, voice based on target dialogue policy control robot and/or action output.

Optionally, the step of voice data for obtaining interaction side, image data and contextual data includes：

The voice data of the interaction side is acquired based on microphone；

The image data of the interaction side is acquired based on camera；

And the contextual data is acquired based on sensor.

Show interactive interface；

Based on interactive interface prompt interaction side's input voice data, image data and contextual data.

Optionally, described the step of obtaining corresponding scene characteristic model according to the contextual data, includes：

Scene characteristic attribute is extracted from the contextual data；

Obtain the corresponding scene characteristic model of the scene characteristic attribute.

Optionally, the scene characteristic model is trained in the following way：

Obtain the training sample and the corresponding character features attribute of each training sample under each scene characteristic model；Institute It includes training voice data and training image data to state training sample；

Trained intonation characteristic and training patterned feature data are extracted from the trained voice data；

Go out trained expressive features data and training action characteristic from the training image extracting data；

The intonation characteristic that includes using the training sample under each scene characteristic, training patterned feature data, Training expressive features data and/or training action characteristic and corresponding character features attribute, it is special that training obtains each scene Levy model.

Optionally, described the voice data and image data are input to the scene characteristic model to obtain target person Characteristic attribute, including：

Intonation characteristic and patterned feature data are extracted from the voice data；

Go out expressive features data and motion characteristic data from described image extracting data；

Based on the intonation characteristic, patterned feature data, expressive features data and/or motion characteristic data, in conjunction with The scene characteristic model obtains the target person characteristic attribute.

Optionally, expression, voice and/or the action output based on target dialogue policy control robot, packet It includes：

Obtain text information corresponding with the target dialogue strategy, expression instruction, phonetic order, and/or action command；

The robot, which is controlled, based on expression instruction, phonetic order and/or action command exports the text information.

The embodiment of the invention also discloses a kind of devices of human-computer dialogue interaction, including：

Interaction side's data acquisition module, voice data, image data and contextual data for obtaining interaction side；

Scene characteristic model acquisition module, for obtaining corresponding scene characteristic model according to the contextual data；

Character features attribute obtains module, for the voice data and image data to be input to the scene characteristic mould Type obtains target person characteristic attribute；

Target dialogue strategy determining module, for determining target pair using the target person characteristic attribute and contextual data Words strategy；

Human-computer interaction session module is used for the expression based on target dialogue policy control robot, voice and/or moves It exports.

Optionally, interaction side's data acquisition module includes：

First interaction side's data-acquisition submodule, the voice data for acquiring the interaction side based on microphone；It is based on Camera acquires the image data of the interaction side；And the contextual data is acquired based on sensor.

Optionally, interaction side's data acquisition module includes：

Interactive interface shows submodule, for showing interactive interface；

Second interaction side's data-acquisition submodule, for based on the interactive interface prompt interaction side input voice data, Image data and contextual data.

Optionally, the scene characteristic model acquisition module includes：

Scene characteristic attributes extraction submodule, for extracting scene characteristic attribute from the contextual data；

Scene characteristic model determination sub-module, for obtaining the corresponding scene characteristic model of the scene characteristic attribute.

Optionally, described device further includes：

Training sample acquisition module, for obtaining training sample and each training sample under each scene characteristic model Corresponding character features attribute；The training sample includes training voice data and training image data；

First training characteristics data extraction module, for extracting trained intonation characteristic from the trained voice data According to training patterned feature data；

Second training characteristics data extraction module, for going out trained expressive features number from the training image extracting data According to training action characteristic；

Scene characteristic model training module, the intonation for including using the training sample under each scene characteristic are special Levy data, training patterned feature data, training expressive features data and/or training action characteristic and corresponding personage Characteristic attribute, training obtain each scene characteristic model.

Optionally, the character features attribute acquisition module includes：

Fisrt feature data extracting sub-module, it is special for extracting intonation characteristic and lines from the voice data Levy data；

Second feature data extracting sub-module, for going out expressive features data and action spy from described image extracting data Levy data；

Target person characteristic attribute obtains submodule, for based on the intonation characteristic, patterned feature data, expression Characteristic and/or motion characteristic data obtain the target person characteristic attribute in conjunction with the scene characteristic model.

Optionally, the target dialogue strategy setting has corresponding word, expression, action, the human-computer interaction to talk with mould Block includes：

Acquisition submodule is instructed, for obtaining text information corresponding with the target dialogue strategy, expression instruction, voice Instruction, and/or action command；

Command executing sub module, for controlling the machine based on expression instruction, phonetic order and/or action command People exports the text information.

The embodiment of the present invention also provides a kind of device for human-computer dialogue interaction, include memory and one or The more than one program of person, one of them either more than one program be stored in memory and be configured to by one or It includes the instruction for being operated below that more than one processor, which executes the one or more programs,：

Obtain voice data, image data and the contextual data of interaction side；

The embodiment of the present invention includes following advantages：

The embodiment of the present invention obtains voice data, image data and the scene number of interaction side during human-computer interaction According to, and obtained based on contextual data and meet the scene characteristic models of session operational scenarios instantly, then by the voice data of interaction side with/ Or image data is input to scene characteristic model and obtains character features attribute, and corresponding mesh is formulated based on personage's characteristic attribute Mark dialog strategy so that during human-computer interaction, machine can coordinate interaction side's current session according to target dialogue strategy Feature, the dialogue to personalize with the side of interaction.

The embodiment of the present invention can meet the scene characteristic mould under respective fields scape according to the contextual data selection collected Type determines the character features category to match with the phonetic feature of the current side of interaction and/characteristics of image according to the scene characteristic model Property, wherein character features attribute can reflect the intention for the side of speaking, mood, and character features attribute is not had under different scenes The side of speaking phonetic feature and characteristics of image can be slightly different.Therefore, scene characteristic corresponding with current scene data is selected Model determines character features attribute, can make the intention for the side of speaking and the emotion expression service more accurate.The embodiment of the present invention into One step can formulate the target dialogue strategy interacted with the side of interaction according to character features attribute, can make machine and the side of interaction Interaction it is more personalized, being capable of preferably service interaction side.

Description of the drawings

Fig. 1 is a kind of step flow chart of the embodiment of the method one of human-computer dialogue interaction of the present invention；

Fig. 2 is a kind of step flow chart of the embodiment of the method two of human-computer dialogue interaction of the present invention；

Fig. 3 is a kind of structure diagram of the device embodiment of human-computer dialogue interaction of the present invention；

Fig. 4 is a kind of block diagram of the device of human-computer interaction shown according to an exemplary embodiment；

Fig. 5 be shown according to an exemplary embodiment it is a kind of for human-computer dialogue interaction device as server when Block diagram.

Specific implementation mode

In order to make the foregoing objectives, features and advantages of the present invention clearer and more comprehensible, below in conjunction with the accompanying drawings and specific real Applying mode, the present invention is described in further detail.

Referring to Fig.1, a kind of step flow chart of the embodiment of the method one of human-computer dialogue interaction of the present invention is shown, specifically It may include steps of：

Step 101, the voice data of acquisition interaction side and image data and contextual data；

It engages in the dialogue with robot in interaction side during interact, obtains the associated data of interaction side in real time.Wherein, should Associated data may include the voice data of interaction side, image data, and, the contextual data etc. of interaction scenarios where dialogue Deng.Wherein, which can be obtained by data acquisition equipment, can also be by being manually entered.

In embodiments of the present invention, above-mentioned associated data can actively be acquired.As, once it is determined that current robot In human-computer interaction state, then automatically by robot built-in or the data acquisition equipment of external connection to currently interactive object (as interaction side) carries out the acquisition of voice data and/image data and contextual data.

Specifically, the step 101 may include following sub-step：

Sub-step S11 acquires the voice data of interaction side based on microphone；The picture number of interaction side is acquired based on camera According to；And the contextual data of current session is acquired based on sensor.

Data acquisition equipment there are many being installed in the robot interacted with the side of interaction, including robot interior are pacified It is filled with and external connection.Different interaction side's associated datas can be collected based on different data acquisition equipments, such as interactive Voice data when side speaks, the facial expression data of interaction side, the gesture motion data of interaction side, what interaction side was presently in Scene (environment) data etc..

In one preferred embodiment of the invention, the step 101 may include following sub-step：

Sub-step S12 shows interactive interface；

Sub-step S13, based on interactive interface prompt interaction side's input voice data, image data and scene number According to.Optionally, the embodiment of the present invention can also input the association by inquiry interaction side or by interactive interface guided interaction side Data can be specifically that interaction side shows interactive interface, and the input of prompt interaction side is corresponding item by item on the interactive interface Data, such as prompt alignment microphone input voice data, alignment cameras input image data are currently located field according to interaction side Scape determines contextual data.Further, it includes the data such as age, gender information that can also prompt the input of interaction side.

Certainly, in practice, the mode that opposed robots voluntarily acquire in such a way that interaction side inputs more not in time, And the data that can be obtained are less, can not cope with the variation of interaction side, therefore in interactive process, should be with machine certainly Based on the data of dynamic acquisition, supplemented by the data that interaction side inputs.

It should be noted that the embodiment of the present invention can also acquire the Human Physiology of interaction side by wearable device Data, such as heartbeat, breathing, digestion, body temperature etc. are based on these data, and robot can analyze and identifying processing, to sentence Break and the emotional characteristics of interaction side.

Step 102, corresponding scene characteristic model is obtained according to the contextual data；

In embodiments of the present invention, it is contemplated that under different scenes, the character features attribute of interaction side will be different, So that the associated data of the interaction side collected also slightly has difference.Such as personage's performance indoors may be more overcautious, It is outdoor then may more decontrol, then may be more dull in driving, therefore, the embodiment of the present invention is according to different scenes Corresponding scene characteristic model is arranged in corresponding character features attribute.For example, corresponding and indoor environment can be arranged indoors Corresponding outdoor character features attribute can be arranged in outdoor in the corresponding indoor characteristic model of more matched character features attribute Driving corresponding with the more matched character features attribute of environment when driving a vehicle can be arranged in driving in corresponding outdoor characteristics model Characteristic model.

In one preferred embodiment of the invention, described 202 may include following sub-step：

Sub-step S21 extracts scene characteristic attribute from the contextual data；

Sub-step S22 obtains the corresponding scene characteristic model of the scene characteristic attribute.

The embodiment of the present invention can acquire the specific environment information of human-computer dialogue occurrence scene by sensor, that is, collect Contextual data.Specifically, temperature, humidity can be identified by Temperature Humidity Sensor, is to move by velocity sensor identification It is dynamic or static, it is daytime or evening by optical sensor identification, is indoor or outdoor etc. by environmental sensor identification. For the data that sensor recognizes, corresponding scene characteristic attribute will be therefrom extracted, is based on these scene characteristic attributes, it can Gone out whether indoors, whether in driving etc. with simple analysis.For example, the speed data that velocity sensor identifies, Ke Yicong In extract the current velocity information in interaction side, if speed reaches preset Vehicle Speed, so that it may to think interaction side Under driving scene.

Certainly, when the embodiments of the present invention are specifically implemented, can also be identified using other sensors or other modes Go out the scene that interaction side is presently in, the embodiment of the present invention does not limit this.

In one preferred embodiment of the invention, the scene characteristic model can be trained in the following way：

Obtain the training sample and the corresponding character features attribute of each training sample under each scene characteristic；The instruction It includes training voice data and training image data to practice sample；

The embodiment of the present invention can utilize magnanimity training sample and training sample under each scene characteristic collected in advance This corresponding character features attribute is as training data.Training sample may include trained voice data and training image data. For example, voice collecting and Image Acquisition are carried out to the different personages under some scene, and according to a certain personage's collected Voice data and image data establish the correspondence between the character features attribute of the personage, constitute a training data.Example Such as, under some scene, the voice data and image data of a child are obtained, and is extracted from the voice data of the child small The intonation characteristic and patterned feature data of child, and extract from the image data of child the facial expression feature number of child It establishes the character features attribute with child using the partial data as training sample according to gesture feature data and works as front court The association of scape feature is stored in as a training sample in training sample database.Based on this so that preserved in training sample database The training sample for different scenes of magnanimity.

By deep learning, according to the training sample of magnanimity under each scene characteristic, can train to obtain under each scene Scene characteristic model.Wherein, the corresponding character features attribute of training sample may include age attribute (child/youth/old People), personality attribute (optimistic/containing/shy) and mood attribute (sad/tranquil/excited) etc..It can be to language after the completion of training Sound data, image data make corresponding classification.

Scene characteristic model can determine corresponding character features attribute based on the voice data of input, image data.

In the concrete application of the present invention, the scene of the mood of interaction side can be trained by training intonation characteristic Characteristic model further, can also add other by training characteristics data come the scene characteristic model of sport career age Data are trained so that model accuracy more increases.

Step 103, the voice data and image data are input to the scene characteristic model and obtain target person spy Levy attribute.

Specifically, step 103 of the embodiment of the present invention may include：

For collected interaction side's associated data, for example, voice data and/or image data will be input to it is corresponding Scene characteristic model is obtained special with current scene and the more matched personage of corresponding interaction side's feature based on the scene characteristic model Levy attribute.Specifically, character features attribute may include essential characteristic, emotional characteristics, character trait etc., wherein mood Feature may include excitement, tranquil, sad, and essential characteristic may include old man, children, male, women etc., and character trait can be with Including optimistic, active, containing, shy etc..

It should be noted that division and judgement for character features attribute, can be arranged above-mentioned one according to actual demand Kind is a variety of, or the other character features attributes of addition, and the embodiment of the present invention is to this without limiting.

Step 104, target dialogue strategy is determined using the target person characteristic attribute and contextual data；

Different dialog strategies is arranged based on different character features attributes and scene characteristic for the embodiment of the present invention, for example, If identifying the atmosphere that current scene is relatively more oppressive, the mood of interaction side is sad, can be used in dialog procedure The dialog strategy consoled.

Step 105, the expression, voice based on target dialogue policy control robot and/or action output.

During human-computer dialogue, the control expression of robot that can be based on target dialogue strategy, voice, and/or dynamic It exports, engages in the dialogue with the side of interaction.Wherein, expression can be facial expression, such as the performance of the performance characteristic, face of eyes Feature etc.；Voice can be that intonation, the height of sound that robot voice exports are low；Action can be gesture, the head of robot Portion acts and the action etc. of other limbs.

The patterned feature data in voice data can be acquired in the embodiment of the present invention by Application on Voiceprint Recognition.Application on Voiceprint Recognition, Also referred to as Speaker Identification has two classes, i.e. speaker's identification and speaker verification.The former is judging that certain section of voice is several people Which of described in, be " multiselect one " problem, and the latter is confirming whether certain section of voice is described in specified someone 's.

The embodiment of the present invention can using patterned feature data as the identity of interaction side, when recognize there are two or Can be the language which interaction side generates based on the judgement of patterned feature data when the more than two interaction sides of person carry out human-computer dialogue Sound data, and the voice data that is generated based on the interaction side and/or picture number are come the target dialogue for the interaction side that determines Strategy engages in the dialogue with the interaction side, rather than uses identical dialog strategy for multiple interaction sides of two interaction sides, thus The individual demand of interaction side can be met so that dialogue is more interesting.

The embodiment of the present invention is during human-computer interaction, the voice data and image data of acquisition interaction side and field Scape data, and obtained based on contextual data and meet the scene characteristic model of scene instantly, then by the voice data of interaction side with Image data is input to scene characteristic model and obtains the character features attribute to match with current scene and the side's of interaction feature, and base Corresponding target dialogue strategy is formulated in personage's characteristic attribute, the expression and action for controlling robot export so that interaction side During human-computer interaction, robot can coordinate according to target dialogue strategy to engage in the dialogue with the side of interaction.

The embodiment of the present invention can select to meet the scene characteristic model under respective fields scape, to determine machine under current scene People needs the character features attribute showed, wherein character features attribute can reflect personality, intention, mood of the mankind etc., not The voice data and image data that the mankind of different personage's characteristic attributes generate under same scene can be slightly different, and therefore, be passed through To determine, robot needs the character features attribute shown to scene characteristic model under corresponding scene under current scene, can make It is more accurate to obtain personalizing for robot.The embodiment of the present invention is further according to the target person characteristic attribute of acquisition come control machine The expression of device people and action export, and the interaction of machine and the mankind can be made more personalized, being capable of preferably service user.

With reference to Fig. 2, a kind of step flow chart of the embodiment of the method two of human-computer dialogue interaction of the present invention is shown, specifically It may include steps of：

Step 201, voice data, image data and the contextual data of interaction side are obtained；

Step 202, corresponding scene characteristic model is obtained according to the contextual data；

Step 203, the voice data and image data are input to the scene characteristic model and obtain target person spy Levy attribute；

Step 204, target dialogue strategy is determined using the target person characteristic attribute and contextual data；

Step 205, text information corresponding with the target dialogue strategy, expression instruction, phonetic order are obtained and/or is moved It instructs；

Since the specific implementation mode of the step 201- steps 205 in embodiment of the method two essentially corresponds to method above-mentioned The specific implementation mode of embodiment one, therefore the present embodiment is for not detailed place in the description of step 201- steps 205, Ke Yican As soon as seeing the related description in previous embodiment, do not repeat herein.

Step 206, expression instruction, phonetic order and/or action command is based on to control described in the robot output Text information.

In embodiments of the present invention, target dialogue strategy is determined based on character features attribute, be arranged in target dialogue strategy Have corresponding text information, expression instruction, phonetic order and action command, to guidance machine people under current scene how Personage's characteristic attribute is showed by certain dialogue and expression, action.

For example, if identifying that the mood of interaction side is sad, the possible character features attribute obtained according to model Corresponding target dialogue strategy is that type of pacifying and the target dialogue strategy obtain corresponding text information, expression instruction, voice Instruction and/or action command, for example, text information is comfort property language, and expression instruction is sad, face is soft etc., voice refers to It is sound is relatively low, intonation is soft etc. to enable, and action command is interactive square toes portion etc. of stroking, and based on expression instruction, phonetic order And/or action command control robot exports the text information.

It is possible thereby to which so that human-computer interaction promotes user experience effect more close to the current actual conditions in interaction side.

The voice of interaction side can only be changed into word by existing technology, cannot be identified more information, be caused machine in this way People can only go to generate reply according to single parameter, model in dialogue.

In addition it is generally synthetic based on voice when existing robot dialogue, and lacks the intonation change for different inputs Change, combination of embodiment of the present invention voice, intonation, expression, action improve interactive experience.In other words, the embodiment of the present invention is real The man-machine interaction method of multimode input and multimode output is showed so that the dialogue of machine is no longer single.

In summary, the embodiment of the present invention is by state (voice data, the figure of the who object for being presently in scene and facing As data and contextual data) etc. as reference, obtain robot and need the expression showed and action etc. more under current scene Mode exports, and realizes the multi-modal human-computer interaction of machine.That is, a kind of comprehensive utilization speech recognition of proposition of the embodiment of the present invention, feelings Perception not, recognition of face, the technological maheups multimode input such as scene Recognition；So that robot voice cooperation expression, action composition are more Mould exports, to promote conversational system experience.Wherein, the realization process of the embodiment of the present invention can be divided into two kinds：

1, off-line procedure

Collection data that off-line procedure is mentioned before, the process of training, the embodiment of the present invention is according to people in each scene The information such as object characteristic attribute and conversation content, for different types of personage, under different scenes, the expression of generation, action Etc. for statistical analysis, scene characteristic model is established.Also, it is arranged in each scene and the corresponding dialogue plan of character features attribute Slightly.For example, under certain scene, action corresponding with current chat content, expression, intonation etc.；Either for old man, small Child, under certain scene, action corresponding with current chat content, expression, intonation etc..

2, in line process

According to current scene data, and the chat content of interaction side and image data etc., the situation of presence is sentenced It is disconnected, it determines scene characteristic model, interaction side's characteristic under the scene is determined based on scene characteristic model and determines target pair Words strategy, is then based on target dialogue strategy, obtains expression, the action etc. to match with current scene and chat content, real The multi-modal output of existing robot.

It should be noted that for embodiment of the method, for simple description, therefore it is all expressed as a series of action group It closes, but those skilled in the art should understand that, the embodiment of the present invention is not limited by the described action sequence, because according to According to the embodiment of the present invention, certain steps can be performed in other orders or simultaneously.Secondly, those skilled in the art also should Know, embodiment described in this description belongs to preferred embodiment, and the involved action not necessarily present invention is implemented Necessary to example.

With reference to Fig. 3, a kind of structure diagram of the device embodiment of human-computer dialogue interaction of the present invention is shown, it specifically can be with Including following module：

Interaction side's data acquisition module 301, voice data, image data and contextual data for obtaining interaction side；

Scene characteristic model acquisition module 302, for obtaining corresponding scene characteristic model according to the contextual data；

Character features attribute obtains module 303, special for the voice data and image data to be input to the scene Sign model obtains target person characteristic attribute；

Target dialogue strategy determining module 304, for determining mesh using the target person characteristic attribute and contextual data Mark dialog strategy；

Human-computer interaction session module 305, for based on target dialogue policy control robot expression, voice and/ Or action output.

In one preferred embodiment of the invention, interaction side's data acquisition module 301 may include：

Interactive interface shows submodule, for showing interactive interface；

In one preferred embodiment of the invention, the scene characteristic model acquisition module 302 may include：

In one preferred embodiment of the invention, described device can also include：

In one preferred embodiment of the invention, the character features attribute acquisition module 303 may include：

In one preferred embodiment of the invention, the target dialogue strategy setting has corresponding word, expression, moves Make, the human-computer interaction session module 305 may include：

For device embodiments, since it is basically similar to the method embodiment, so fairly simple, the correlation of description Place illustrates referring to the part of embodiment of the method.

Fig. 4 is a kind of block diagram of the device 500 of human-computer interaction shown according to an exemplary embodiment.For example, device 500 Can be mobile phone, computer, digital broadcast terminal, messaging devices, game console, tablet device, Medical Devices, Body-building equipment, personal digital assistant etc..

With reference to Fig. 4, device 500 may include following one or more components：Processing component 502, memory 504, power supply Component 506, multimedia component 508, audio component 510, the interface 512 of input/output (I/O), sensor module 514, and Communication component 516.

The integrated operation of 502 usual control device 500 of processing component, such as with display, call, data communication, phase Machine operates and record operates associated operation.Processing element 502 may include that one or more processors 520 refer to execute It enables, to perform all or part of the steps of the methods described above.In addition, processing component 502 may include one or more modules, just Interaction between processing component 502 and other assemblies.For example, processing component 502 may include multi-media module, it is more to facilitate Interaction between media component 508 and processing component 502.

Memory 504 is configured as storing various types of data to support the operation in device 500.These data are shown Example includes instruction for any application program or method that operate on device 500, contact data, and telephone book data disappears Breath, picture, video etc..Memory 504 can be by any kind of volatibility or non-volatile memory device or their group It closes and realizes, such as static RAM (SRAM), electrically erasable programmable read-only memory (EEPROM) is erasable to compile Journey read-only memory (EPROM), programmable read only memory (PROM), read-only memory (ROM), magnetic memory, flash Device, disk or CD.

Power supply module 506 provides electric power for the various assemblies of device 500.Power supply module 506 may include power management system System, one or more power supplys and other generated with for device 500, management and the associated component of distribution electric power.

Multimedia component 508 is included in the screen of one output interface of offer between described device 500 and user.One In a little embodiments, screen may include liquid crystal display (LCD) and touch panel (TP).If screen includes touch panel, screen Curtain may be implemented as touch screen, to receive input signal from the user.Touch panel includes one or more touch sensings Device is to sense the gesture on touch, slide, and touch panel.The touch sensor can not only sense touch or sliding action Boundary, but also detect duration and pressure associated with the touch or slide operation.In some embodiments, more matchmakers Body component 508 includes a front camera and/or rear camera.When equipment 500 is in operation mode, such as screening-mode or When video mode, front camera and/or rear camera can receive external multi-medium data.Each front camera and Rear camera can be a fixed optical lens system or have focusing and optical zoom capabilities.

Audio component 510 is configured as output and/or input audio signal.For example, audio component 510 includes a Mike Wind (MIC), when device 500 is in operation mode, when such as call model, logging mode and speech recognition mode, microphone by with It is set to reception external audio signal.The received audio signal can be further stored in memory 504 or via communication set Part 516 is sent.In some embodiments, audio component 510 further includes a loud speaker, is used for exports audio signal.

I/O interfaces 512 provide interface between processing component 502 and peripheral interface module, and above-mentioned peripheral interface module can To be keyboard, click wheel, button etc..These buttons may include but be not limited to：Home button, volume button, start button and lock Determine button.

Sensor module 514 includes one or more sensors, and the state for providing various aspects for device 500 is commented Estimate.For example, sensor module 514 can detect the state that opens/closes of equipment 500, and the relative positioning of component, for example, it is described Component is the display and keypad of device 500, and sensor module 514 can be with 500 1 components of detection device 500 or device Position change, the existence or non-existence that user contacts with device 500,500 orientation of device or acceleration/deceleration and device 500 Temperature change.Sensor module 514 may include proximity sensor, be configured to detect without any physical contact Presence of nearby objects.Sensor module 514 can also include optical sensor, such as CMOS or ccd image sensor, at As being used in application.In some embodiments, which can also include acceleration transducer, gyro sensors Device, Magnetic Sensor, pressure sensor or temperature sensor.

Communication component 516 is configured to facilitate the communication of wired or wireless way between device 500 and other equipment.Device 500 can access the wireless network based on communication standard, such as WiFi, 2G or 3G or combination thereof.In an exemplary implementation In example, communication component 514 receives broadcast singal or broadcast related information from external broadcasting management system via broadcast channel. In one exemplary embodiment, the communication component 514 further includes near-field communication (NFC) module, to promote short range communication.Example Such as, NFC module can be based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra wide band (UWB) technology, Bluetooth (BT) technology and other technologies are realized.

In the exemplary embodiment, device 500 can be believed by one or more application application-specific integrated circuit (ASIC), number Number processor (DSP), digital signal processing appts (DSPD), programmable logic device (PLD), field programmable gate array (FPGA), controller, microcontroller, microprocessor or other electronic components are realized, for executing the above method.

In the exemplary embodiment, it includes the non-transitorycomputer readable storage medium instructed, example to additionally provide a kind of Such as include the memory 504 of instruction, above-metioned instruction can be executed by the processor 520 of device 500 to complete the above method.For example, The non-transitorycomputer readable storage medium can be ROM, random access memory (RAM), CD-ROM, tape, floppy disk With optical data storage devices etc..

Fig. 5 be shown according to an exemplary embodiment it is a kind of for human-computer dialogue interaction device as server when Block diagram.The server 1900 can generate bigger difference because configuration or performance are different, may include one or more Central processing unit (central processing units, CPU) 1922 (for example, one or more processors) and storage The storage medium 1930 (such as one or one of device 1932, one or more storage application programs 1942 or data 1944 The above mass memory unit).Wherein, memory 1932 and storage medium 1930 can be of short duration storage or persistent storage.Storage May include one or more modules (diagram does not mark) in the program of storage medium 1930, each module may include pair Series of instructions operation in server.Further, central processing unit 1922 could be provided as logical with storage medium 1930 Letter executes the series of instructions operation in storage medium 1930 on server 1900.

Server 1900 can also include one or more power supplys 1926, one or more wired or wireless nets Network interface 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or, one or More than one operating system 1941, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM Etc..

A kind of non-transitorycomputer readable storage medium, when the instruction in the storage medium is by the processing of mobile terminal When device executes so that mobile terminal is able to carry out a kind of method of human-computer dialogue interaction, the method includes：

Obtain voice data, image data and the contextual data of interaction side；

The voice data of the interaction side is acquired based on microphone；

The image data of the interaction side is acquired based on camera；

And the contextual data is acquired based on sensor.

Show interactive interface；

Scene characteristic attribute is extracted from the contextual data；

Optionally, the scene characteristic model is trained in the following way：

Those skilled in the art after considering the specification and implementing the invention disclosed here, will readily occur to its of the present invention Its embodiment.The present invention is directed to cover the present invention any variations, uses, or adaptations, these modifications, purposes or Person's adaptive change follows the general principle of the present invention and includes the undocumented common knowledge in the art of the disclosure Or conventional techniques.The description and examples are only to be considered as illustrative, and true scope and spirit of the invention are by following Claim is pointed out.

It should be understood that the invention is not limited in the precision architectures for being described above and being shown in the accompanying drawings, and And various modifications and changes may be made without departing from the scope thereof.The scope of the present invention is limited only by the attached claims.

The foregoing is merely presently preferred embodiments of the present invention, is not intended to limit the invention, it is all the present invention spirit and Within principle, any modification, equivalent replacement, improvement and so on should all be included in the protection scope of the present invention.

Claims

1. a kind of method of human-computer dialogue interaction, which is characterized in that including：

Obtain voice data, image data and the contextual data of interaction side；

2. according to the method described in claim 1, it is characterized in that, it is described obtain the voice data of interaction side, image data, with And the step of contextual data, includes：

The voice data of the interaction side is acquired based on microphone；

The image data of the interaction side is acquired based on camera；

And the contextual data is acquired based on sensor.

3. according to the method described in claim 2, it is characterized in that, it is described obtain the voice data of interaction side, image data, with And the step of contextual data, includes：

Show interactive interface；

4. according to the method described in claim 1, it is characterized in that, described obtain corresponding scene spy according to the contextual data Levy model the step of include：

Scene characteristic attribute is extracted from the contextual data；

5. according to the method described in claim 1, it is characterized in that, the scene characteristic model is trained in the following way：

Obtain the training sample and the corresponding character features attribute of each training sample under each scene characteristic model；The instruction It includes training voice data and training image data to practice sample；

The intonation characteristic that includes using the training sample under each scene characteristic, training patterned feature data, training Expressive features data and/or training action characteristic and corresponding character features attribute, training obtain each scene characteristic mould Type.

6. according to the method described in claim 5, it is characterized in that, described be input to institute by the voice data and image data It states scene characteristic model and obtains target person characteristic attribute, including：

Based on the intonation characteristic, patterned feature data, expressive features data and/or motion characteristic data, in conjunction with described Scene characteristic model obtains the target person characteristic attribute.

7. according to any methods of claim 1-6, which is characterized in that described to be based on the target dialogue policy control machine Expression, voice and/or the action output of device people, including：

8. a kind of device of human-computer dialogue interaction, which is characterized in that including：

Character features attribute obtains module, is obtained for the voice data and image data to be input to the scene characteristic model To target person characteristic attribute；

Target dialogue strategy determining module, for determining target dialogue plan using the target person characteristic attribute and contextual data Slightly；

Human-computer interaction session module is used for the expression based on target dialogue policy control robot, voice and/or acts defeated Go out.

9. device according to claim 8, which is characterized in that interaction side's data acquisition module includes：

First interaction side's data-acquisition submodule, the voice data for acquiring the interaction side based on microphone；Based on camera shooting Head acquires the image data of the interaction side；And the contextual data is acquired based on sensor.

10. a kind of device for human-computer dialogue interaction, which is characterized in that include memory and one or one with On program, one of them either more than one program be stored in memory and be configured to by one or more than one It includes the instruction for being operated below that processor, which executes the one or more programs,：

Obtain voice data, image data and the contextual data of interaction side；