CN102339129B - Multichannel human-computer interaction method based on voice and gestures - Google Patents

Multichannel human-computer interaction method based on voice and gestures

Info

Publication number
CN102339129B
CN102339129B (application CN201110278390A)
Authority
CN
China
Prior art keywords
referent
gesture
information
voice
constraint information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN 201110278390
Other languages
Chinese (zh)
Other versions
CN102339129A (en)
Inventor
赵沁平
陈小武
蒋恺
许楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN 201110278390
Publication of CN102339129A
Application granted
Publication of CN102339129B
Legal status: Active

Landscapes

  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses a multichannel human-computer interaction method based on voice and gestures. The method extracts voice referent constraint information from the voice input and gesture referent constraint information from the gesture input, where the gesture referent constraint information comprises a distance statistic (the distance from any point in the pointing region delimited by the current pointing gesture to the pointing center of that gesture) and a time statistic (how long the pointing gesture is maintained). By acquiring both the distance statistic and the time statistic when analyzing the gesture referent constraint information, the ambiguity of pointing in three-dimensional interaction is reduced. During referent determination, the model objects in the virtual environment are divided into four classes, and the referring expression is compared against each class in the order of the likelihood that the referent belongs to it, which narrows the search range for the referent and further reduces the influence of pointing ambiguity.

Description

Multichannel human-computer interaction method based on voice and gestures
Technical field
The present invention relates to the field of human-computer interaction, and in particular to a multichannel human-computer interaction method based on voice and gestures.
Background technology
Multichannel human-computer interaction can effectively enlarge the bandwidth of information exchange between the user and the computer, thereby improving interaction efficiency; it can also exploit the complementary cognitive strengths of human and machine and reduce the user's cognitive load. The user can combine several interaction channels to complete an interactive task cooperatively, which compensates for the restrictions and burden that a single interaction mode imposes on the user. In multichannel human-computer interaction, reference resolution is defined as finding the common referent of the inputs arriving on multiple channels. A referring expression mainly comprises the pronouns, locative adverbs, demonstratives and restrictive nouns of natural language, such as "it", "here", "this" or "that house"; the referent is the object the user refers to, such as a model in three-dimensional space. In a traditional single-channel user interface the referring technique is simple and usually precise, and the boundary between targets is clear. In a multichannel user interface, by contrast, referring is composite and usually fuzzy, and the boundaries are unclear.
Current multichannel research is no longer limited to integrating speech with the conventional mouse and keyboard; multichannel systems based on voice and pen, voice and lip movement, and voice and three-dimensional gesture have attracted growing attention. Typical representatives include QuickSet, an agent-based cooperative multichannel system supporting voice and pen, and the XWand system, which integrates a "magic wand" (a new six-degree-of-freedom device) with voice. The W3C has established a "Multimodal Interaction" working group to develop a new class of multichannel protocol standards supporting mobile devices, including the Multimodal Interaction Framework, multimodal interaction requirements, multimodal interaction use cases, extensible multimodal annotation language requirements, digital ink requirements, and the extensible multimodal annotation markup language. The formulation of these standards reflects that multichannel technology has begun to mature.
Regarding reference resolution in multichannel human-computer interaction, Kehler applied theories from cognitive science and computational linguistics to study and verify the correspondence between referring expressions and cognitive states in a multichannel environment, and proposed a method that encodes cognitive states and combines them with a group of simple judgment rules to obtain the referent; the method achieved very high accuracy in a two-dimensional pen-and-voice travel application. Kehler's method handles single referring expressions accompanied by precise pointing gestures very effectively, but its rules assume that every object can be selected deterministically, so it cannot support fuzzy gestures.
In joint research on three-dimensional multichannel interaction in augmented reality and virtual reality environments, Columbia University and Oregon Health & Science University, among others, proposed solving the reference resolution problem with perceptual shapes. A perceptual shape is a solid controlled by the user, through which the user interacts with the augmented or virtual reality environment; during this interaction the perceptual shape produces various statistics that assist target selection. The method mainly addresses pointing ambiguity in reference resolution, but pays no attention to the inference of unspecified information or to multichannel alignment. Pfeiffer and colleagues at Bielefeld University in Germany noted the complexity of referring types and utterances, consistent context and uncertainty in multichannel reference resolution, and designed a reference resolution engine for immersive virtual environments. The engine is an expert system with a three-layer structure: a core layer, a domain layer and an application layer. The core layer is a constraint-satisfaction manager; the domain layer provides access to the knowledge base; the application layer is the interface between external programs and the resolution engine, responsible for converting the referring expressions in the speech input into queries to the resolution engine. The engine treats reference resolution as a constraint-satisfaction problem and mainly focuses on extracting effective constraints from complex natural language, but it lacks corresponding handling of under-constrained situations and of pointing ambiguity.
Summary of the invention
The present invention designs and develops a multichannel human-computer interaction method based on voice and gestures.
One object of the present invention is to solve the pointing ambiguity problem in a voice-and-gesture multichannel human-computer interaction method. During three-dimensional interaction in a virtual environment, a pointing gesture (from the moment it is recognized until it ends) expresses not only spatial information but also temporal information: the longer an object stays inside the pointing region, the more likely it is to be the selected one. Therefore, when analyzing the gesture referent constraint information, the method acquires not only the distance statistic but also the time statistic, thereby reducing the pointing ambiguity in three-dimensional interaction. Furthermore, during referent determination the model objects in the virtual environment are divided into four classes and the referring expression is compared against one class of model objects at a time, according to the likelihood that the referent belongs to that class, which helps to narrow the search range of the referent and reduce the influence of pointing ambiguity.
Another object of the present invention is to solve the inference of unspecified information in a voice-and-gesture multichannel human-computer interaction method. The model objects in the virtual environment are divided into four classes, among which the focal object is the referent determined in the previous interaction. That is, if the demonstrative pronoun "it" appears in the utterance of the current interaction, the referent of this interaction can be taken to be the focal object, which solves the inference of unspecified information.
A further object of the present invention is to provide a multichannel human-computer interaction method based on voice and gestures. A layered multichannel integration model with four layers is built: a physical layer, a lexical layer, a syntactic layer and a semantic layer. The command information and the referent required by the interaction are finally loaded into a task slot. Both the goal of this integration process and the criterion of successful integration are based on the completeness of the interaction task structure; the ultimate purpose is to generate a task structure that can be submitted to the system for execution, thereby guaranteeing that the interaction is carried out effectively.
The technical solution provided by the invention is as follows:
A multichannel human-computer interaction method based on voice and gestures, characterized by comprising the following steps:
Step 1: construct a voice channel and a gesture channel, and input the voice information and gesture information about the referent of the interaction through the voice channel and the gesture channel respectively;
Step 2: extract voice referent constraint information from the voice information and gesture referent constraint information from the gesture information, where the gesture referent constraint information comprises a distance statistic (the distance from any point in the pointing region delimited by the current pointing gesture to the pointing center of that gesture) and a time statistic (how long the pointing gesture is maintained);
Step 3: compare the voice referent constraint information and the gesture referent constraint information with the feature information of the model objects in the virtual environment to determine the referent of the interaction; extract the command information about the referent from the voice referent constraint information; apply the command information to the referent, completing one interaction.
Preferably, in the multichannel human-computer interaction method based on voice and gestures, the model objects in the virtual environment are divided into four classes: pointed objects, the focal object, activated objects and quiet objects. A pointed object is an object located in the pointing region delimited by the current pointing gesture; the focal object is the referent determined in the previous interaction; an activated object is a model object within the visible range other than the pointed objects and the focal object; a quiet object is a model object outside the visible range. In step 3, the voice referent constraint information and the gesture referent constraint information are compared, in order, with the feature information of the pointed objects, the focal object, the activated objects and the quiet objects one class at a time, to determine the referent of the interaction.
Preferably, in the multichannel human-computer interaction method based on voice and gestures, in step 2, extracting the voice referent constraint information from the voice information and the gesture referent constraint information from the gesture information is achieved as follows:
A layered multichannel integration model is built, comprising four layers: a physical layer, a lexical layer, a syntactic layer and a semantic layer. The physical layer receives the voice information and gesture information input through the voice channel and the gesture channel respectively. The lexical layer comprises a speech recognition and parsing module and a gesture recognition and parsing module; the speech recognition and parsing module parses the voice information from the physical layer into the voice referent constraint information, and the gesture recognition and parsing module parses the gesture information from the physical layer into the gesture referent constraint information.
Preferably, in the multichannel human-computer interaction method based on voice and gestures, in step 3, the comparison of the voice referent constraint information and the gesture referent constraint information with the feature information of the model objects in the virtual environment, and thus the determination of the referent of the interaction, is performed on the syntactic layer.
Extracting the command information about the referent from the voice referent constraint information is achieved as follows: the syntactic layer extracts the command information from the voice referent constraint information.
Applying the command information to the referent is achieved as follows: the semantic layer applies the command information extracted by the syntactic layer to the referent.
Preferably, in the multichannel human-computer interaction method based on voice and gestures, the layered multichannel integration model further comprises a task slot, and the task slot comprises a command entry and a referent entry.
The semantic layer applies the command information extracted by the syntactic layer to the referent in the following manner: the semantic layer fills the command information extracted by the syntactic layer into the command entry and fills the referent into the referent entry; once the task slot is completely filled, the layered multichannel integration model generates a system-executable command.
Preferably, in the multichannel human-computer interaction method based on voice and gestures, when the task slot is not completely filled, a waiting period is set: if the task slot is completely filled within the waiting period, the interaction continues; if it is not, the interaction is abandoned.
Preferably, in the multichannel human-computer interaction method based on voice and gestures, the command entry comprises an action entry and a parameter entry, and when the command information about the referent is extracted from the voice referent constraint information, the command information comprises action information and parameter information.
Preferably, in the multichannel human-computer interaction method based on voice and gestures, in step 1, an interaction process starts when the voice channel receives the first utterance.
Preferably, in the multichannel human-computer interaction method based on voice and gestures, in step 1, when the voice channel receives an utterance, a timeout is set for receiving the gesture information on the gesture channel; if the gesture input exceeds the set timeout, the interaction process is abandoned.
The multichannel human-computer interaction method based on voice and gestures of the present invention has the following beneficial effects:
(1) It solves the pointing ambiguity problem in a voice-and-gesture multichannel human-computer interaction method. During three-dimensional interaction in a virtual environment, a pointing gesture (from the moment it is recognized until it ends) expresses not only spatial information but also temporal information: the longer an object stays inside the pointing region, the more likely it is to be the selected one. Therefore, when analyzing the gesture referent constraint information, the method acquires not only the distance statistic but also the time statistic, thereby reducing the pointing ambiguity in three-dimensional interaction. Furthermore, during referent determination the model objects in the virtual environment are divided into four classes and the referring expression is compared against one class of model objects at a time, which helps to narrow the search range of the referent and reduce the influence of pointing ambiguity.
(2) It solves the inference of unspecified information in a voice-and-gesture multichannel human-computer interaction method. The model objects in the virtual environment are divided into four classes, among which the focal object is the referent determined in the previous interaction. That is, if the demonstrative pronoun "it" appears in the utterance of the current interaction, the referent of this interaction can be taken to be the focal object, which solves the inference of unspecified information.
(3) It provides a multichannel human-computer interaction method based on voice and gestures. A layered multichannel integration model with four layers is built: a physical layer, a lexical layer, a syntactic layer and a semantic layer. The command information and the referent required by the interaction are finally loaded into a task slot. Both the goal of this integration process and the criterion of successful integration are based on the completeness of the interaction task structure; the ultimate purpose is to generate a task structure that can be submitted to the system for execution, which guarantees that the interaction is carried out effectively and improves the reliability of human-computer interaction.
Brief description of the drawings
Fig. 1 is a schematic diagram of the interaction process of the multichannel human-computer interaction method based on voice and gestures of the present invention.
Fig. 2 is the overall framework diagram of reference resolution in the multichannel human-computer interaction method based on voice and gestures of the present invention.
Fig. 3 is the overall flowchart of the multichannel human-computer interaction method based on voice and gestures of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings, so that those skilled in the art can implement it with reference to the specification.
As shown in Fig. 1, Fig. 2 and Fig. 3, the invention provides a multichannel human-computer interaction method based on voice and gestures, comprising the following steps:
Step 1: construct a voice channel and a gesture channel, and input the voice information and gesture information about the referent of the interaction through the voice channel and the gesture channel respectively;
Step 2: extract voice referent constraint information from the voice information and gesture referent constraint information from the gesture information, where the gesture referent constraint information comprises a distance statistic (the distance from any point in the pointing region delimited by the current pointing gesture to the pointing center of that gesture) and a time statistic (how long the pointing gesture is maintained);
Step 3: compare the voice referent constraint information and the gesture referent constraint information with the feature information of the model objects in the virtual environment to determine the referent of the interaction; extract the command information about the referent from the voice referent constraint information; apply the command information to the referent, completing one interaction.
As shown in Fig. 1, the above multichannel human-computer interaction method based on voice and gestures first supports two interaction channels, voice and gesture. The speech recognition module adopts the Microsoft speech recognition engine and maps the user's spoken command to timestamped text, from which the voice parsing module extracts the voice referent constraint information. The gesture channel uses a data glove to obtain joint and position information for gesture recognition; the gesture parsing module accepts pointing gestures and produces a pointed-object vector. The multichannel integration module integrates the information from the voice and gesture channels, resolves the reference during integration, and finally produces a system-executable command or a corresponding prompt.
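To make the two-channel pipeline above concrete, the following Python sketch wires a speech parser, a gesture parser and an integration step together. It is only an illustration: the function names (parse_speech, parse_gesture, integrate), the data shapes and the sample values are assumptions made for this example, not part of the patented system.

```python
# Minimal sketch of the two-channel pipeline described above (names are illustrative only).

def parse_speech(utterance_text, timestamp):
    """Map a recognized, timestamped utterance to voice referent constraint information."""
    tokens = utterance_text.split()
    return {"action": tokens[0] if tokens else None,
            "referring_expr": tokens[1:],
            "time": timestamp}

def parse_gesture(glove_samples):
    """Turn accumulated data-glove samples into a pointed-object vector with priorities."""
    # Each sample is assumed to be (object_id, pointing_priority) gathered while the
    # pointing cone intersected that object.
    return sorted(glove_samples, key=lambda s: s[1], reverse=True)

def integrate(voice_constraints, pointed_objects):
    """Combine both channels: pick the highest-priority pointed object as the referent."""
    referent = pointed_objects[0][0] if pointed_objects else None
    return {"action": voice_constraints["action"], "referent": referent}

if __name__ == "__main__":
    voice = parse_speech("rotate it", timestamp=12.4)
    gesture = parse_gesture([("chair_03", 0.82), ("table_01", 0.35)])
    print(integrate(voice, gesture))   # {'action': 'rotate', 'referent': 'chair_03'}
```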
The present invention adopts a layered multichannel integration model to realize multichannel integration. The integration process is task-guided: both the goal of integration and the criterion of successful integration are based on the completeness of the interaction task structure, and the ultimate purpose is to generate a task structure that the system can execute, including the action of the task, the object the action applies to, the relevant parameters and other information. The invention therefore defines a task slot, which is part of the layered multichannel integration model. The structure of the task slot is divided into three parts: an action entry, a referent entry and a parameter entry, also called the action slot, the referent slot and the parameter slot. The action entry and the parameter entry together constitute the command entry. The referent slot may hold more than one referent, and at present the parameter slot can only be filled with position information. Different commands correspond to task slots with different structures; for example, the task slot of a select command has only two entries, action and referent. The integration process thus becomes the process of filling the task slot; once the task slot is full, a complete task executable by the system has been formed.
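A minimal sketch of such a task slot as a data structure follows, assuming a per-command table of required entries; the class name TaskSlot, the field names and the command table are illustrative, not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Optional

# Which entries each command requires; a "select" task slot needs only action and referent.
REQUIRED_ENTRIES = {
    "select": ("action", "referents"),
    "rotate": ("action", "referents"),
    "move":   ("action", "referents", "parameters"),
}

@dataclass
class TaskSlot:
    action: Optional[str] = None                      # action entry (part of the command entry)
    referents: list = field(default_factory=list)     # referent entry, may hold several objects
    parameters: Optional[tuple] = None                # parameter entry, currently position info only

    def is_complete(self) -> bool:
        """The slot is full when every entry required by its command is filled."""
        if self.action is None:
            return False
        for name in REQUIRED_ENTRIES.get(self.action, ()):
            value = getattr(self, name)
            if value is None or value == []:
                return False
        return True
```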
For example, if the user only says "rotate it" by voice and makes no pointing gesture, the referent cannot be determined. When the task slot is filled, "rotate" is put into the action slot while the referent slot remains empty. Since a waiting period is set, if the task slot is completed within the waiting period, that is, a pointing gesture is made within the waiting period so that the referent is determined, the interaction continues and the layered multichannel integration model generates a system-executable command; if the task slot is not completed within the waiting period, the interaction is abandoned.
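Continuing the sketch above, the "rotate it" scenario could be driven roughly as follows; the polling loop, the 3-second waiting period and the get_pointed_object callback are illustrative assumptions (this reuses the TaskSlot class sketched earlier).

```python
import time

def run_interaction(slot: TaskSlot, get_pointed_object, waiting_period=3.0):
    """Wait up to `waiting_period` seconds for the missing referent, then commit or abandon."""
    deadline = time.monotonic() + waiting_period
    while not slot.is_complete() and time.monotonic() < deadline:
        pointed = get_pointed_object()        # e.g. polls the gesture channel
        if pointed is not None:
            slot.referents.append(pointed)
        time.sleep(0.05)
    return slot.is_complete()                 # True -> generate a system-executable command

slot = TaskSlot(action="rotate")              # "rotate it": action filled, referent still empty
ok = run_interaction(slot, get_pointed_object=lambda: "chair_03")
print("execute" if ok else "abandon", slot)
```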
The layered multichannel integration model defined by the invention is, as its name suggests, based on the idea of layering: channel information is abstracted from concrete device information up to the semantics that will finally fill the task slot, through four layers: the physical layer, the lexical layer, the syntactic layer and the semantic layer. Physical-layer information is the raw information input from the interaction devices; its form is diverse and directly tied to the concrete device, for example character strings from the voice input and sensor readings from the data glove. The lexical layer is a key layer: it unifies the processing of the raw information from the physical layer, so that inputs carrying the same meaning in different forms are mapped to the same representation, thus providing device-independent information to the syntactic layer. On the lexical layer, the voice information of the voice channel is abstracted by the speech recognition module and the voice parsing module to generate the voice referent constraint information; at the same time, the gesture information of the gesture channel is abstracted by the gesture recognition module and the gesture parsing module to generate the gesture referent constraint information. The syntactic layer mainly decomposes the information from the lexical layer according to the grammar specification of the interaction into forms that match the entries of the task slot, preparing for the subsequent semantic fusion; reference resolution is mainly carried out on the syntactic layer, which also extracts the command information from the voice referent constraint information. On the semantic layer, the task-guided mechanism is used to fill and complete the task slot; although the task is related to the concrete application, the filling and completion of the task slot are application-independent.
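A highly simplified sketch of this four-layer flow, reusing the TaskSlot structure above; the function names and the way each layer is modelled here are assumptions for illustration only.

```python
def lexical_layer(raw_voice: str, raw_glove_samples):
    """Unify device-specific raw input into device-independent constraint information."""
    voice_constraints = {"tokens": raw_voice.lower().split(), "time": 0.0}
    gesture_constraints = sorted(raw_glove_samples, key=lambda s: s[1], reverse=True)
    return voice_constraints, gesture_constraints

def syntactic_layer(voice_constraints, gesture_constraints):
    """Decompose lexical information into task-slot entries and resolve the reference."""
    action = voice_constraints["tokens"][0] if voice_constraints["tokens"] else None
    referent = gesture_constraints[0][0] if gesture_constraints else None
    return action, referent

def semantic_layer(action, referent):
    """Fill the task slot; a complete slot yields a system-executable command."""
    slot = TaskSlot(action=action, referents=[referent] if referent else [])
    return ("EXECUTE", slot) if slot.is_complete() else ("WAIT", slot)

print(semantic_layer(*syntactic_layer(*lexical_layer("rotate it", [("chair_03", 0.82)]))))
```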
In practice, the integration process can follow one of two strategies, an "eager" strategy or a "lazy" strategy. The eager strategy starts processing as soon as the multichannel input supports integration to some degree; this can be regarded as event-driven. The lazy strategy waits until all input, or sufficiently complete input, has arrived before it starts processing. For example, under the eager strategy the layered multichannel integration model starts working and begins processing as soon as the user says "rotate it"; under the lazy strategy the model does not start until the user says "rotate it" and at the same time makes a pointing gesture at some object, so that the referent can already be determined, that is, the lazy strategy requires all the information of one interaction to be provided at once. Because the user's speech input is frequently discontinuous, a complete move-object command may contain large time gaps. Constrained by the speech recognition engine, the present invention uses the eager strategy with voice-driven integration: an interaction process starts as soon as the voice channel receives the first utterance.
The process of confirming the referent is the process of reference resolution. In the present invention, reference resolution relies on both the voice referent constraint information and the gesture referent constraint information. The invention is based on two assumptions: (1) the semantics of the speech input are unambiguous; the invention mainly addresses pointing ambiguity in multichannel reference resolution, so it assumes that the speech contains no fuzzy expressions such as "upper-left corner", "middle" or "in front"; (2) all referring expressions are egocentric; references can be classified as egocentric, relative to a reference object, or relative to another person, and in the present invention all references are egocentric, so situations such as "select the object on his left", centered on another viewpoint, do not occur.
The present invention adopts a voice-driven integration strategy: the multichannel integration process is triggered after an utterance has been recognized. In the layered multichannel integration model, the voice referent constraint information is first loaded into the voice constraint set. According to the gesture referent constraint information, an identity is assigned to every model object in the virtual environment, dividing all model objects into four classes: pointed objects, the focal object, activated objects and quiet objects. A pointed object is an object located in the pointing region delimited by the current pointing gesture; the focal object is the referent determined in the previous interaction; an activated object is a model object within the visible range other than the pointed objects and the focal object; a quiet object is a model object outside the visible range. Each class of model objects corresponds to an initialized matching matrix: the pointing matrix, the focus matrix, the activated matrix and the quiet matrix.
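A simple sketch of this classification step, assuming each model object record carries its identifier, its visibility and whether it lies inside the current pointing cone; the priority between overlapping classes (pointed before focal) is also an assumption made for the example.

```python
def classify_objects(objects, previous_referent_id):
    """Split model objects into the pointed / focal / activated / quiet matrices."""
    pointed, focal, activated, quiet = [], [], [], []
    for obj in objects:                    # obj: dict with "id", "visible", "in_pointing_cone"
        if obj["in_pointing_cone"]:
            pointed.append(obj)
        elif obj["id"] == previous_referent_id:
            focal.append(obj)
        elif obj["visible"]:
            activated.append(obj)
        else:
            quiet.append(obj)
    return pointed, focal, activated, quiet
```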
In the reference resolution process the present invention adopts the perceptual-shape method: a perceptual shape is a solid controlled by the user that can provide relevant information about the interaction objects. When the system recognizes the current gesture as a pointing gesture, it generates a cone attached to the fingertip of the virtual hand's index finger (this cone is the pointing region delimited by the pointing gesture), records the interaction between model objects and the cone through collision detection, and generates various statistics. The statistics are then combined by weighted averaging into a pointing priority. After one pointing interaction is completed, a vector of 2-tuples corresponding to this pointing gesture is obtained, where the first element of each tuple is a pointed object and the second element is its pointing priority.
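The per-gesture statistics gathering could look roughly like the following sketch, where the frame loop, the in_cone test and the data shapes are assumptions for illustration rather than the patent's implementation.

```python
import math

def collect_cone_statistics(frames, cone_apex, in_cone):
    """Accumulate, per object, the time spent inside the pointing cone and the closest
    observed distance to the pointing center (the fingertip at the cone apex)."""
    time_in_cone = {}     # object id -> number of frames inside the cone
    min_distance = {}     # object id -> closest observed distance to the pointing center
    for frame in frames:                      # frame: list of (object_id, position) samples
        for obj_id, pos in frame:
            if not in_cone(pos, cone_apex):   # collision test supplied by the caller
                continue
            time_in_cone[obj_id] = time_in_cone.get(obj_id, 0) + 1
            d = math.dist(pos, cone_apex)
            min_distance[obj_id] = min(min_distance.get(obj_id, d), d)
    return time_in_cone, min_distance
```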
The present invention defines two statistics, a time statistic T_rank and a distance statistic D_rank. The longer an object stays inside the perceptual shape, and the closer it is to the pointing center (the fingertip of the virtual hand's index finger), the higher the priority of that model object.
T_rank is computed as follows, where T_object is the time a model object spends inside the cone and T_period is the lifetime of the cone during the interaction (i.e., the duration of the pointing gesture):
T_rank = T_object / T_period,  0 < T_rank ≤ 1
D_rank is computed as follows, where D_object is the distance from the center of the model object to the pointing center and D_max is the maximum distance from any model object inside the cone to the pointing center:
D_rank = 1 - D_object / D_max,  0 < D_rank ≤ 1
The pointing priority P_rank is obtained as the weighted average of these two statistics:
P_rank = T_rank * λ + D_rank * (1 - λ),  0 ≤ λ ≤ 1
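Given statistics of this kind, the pointing priority of every object seen inside the cone follows directly from the formulas above; the weight λ = 0.5 and the sample numbers in this sketch are arbitrary example values.

```python
def pointing_priorities(time_in_cone, min_distance, gesture_frames, lam=0.5):
    """P_rank = lambda * T_rank + (1 - lambda) * D_rank for every object seen in the cone."""
    d_max = max(min_distance.values())        # farthest in-cone object from the pointing center
    priorities = {}
    for obj_id, frames_inside in time_in_cone.items():
        t_rank = frames_inside / gesture_frames                 # T_object / T_period
        d_rank = 1.0 - min_distance[obj_id] / d_max             # 1 - D_object / D_max
        priorities[obj_id] = lam * t_rank + (1.0 - lam) * d_rank
    return priorities

# Example: two objects observed during a 120-frame pointing gesture.
print(pointing_priorities({"chair_03": 90, "table_01": 30},
                          {"chair_03": 0.12, "table_01": 0.40},
                          gesture_frames=120))
```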
Because the interaction devices are not designed to work cooperatively, cross-channel integration must rely on temporal correlation. Therefore, after the pointing priority P_rank has been computed from the perceptual shape, the current time should be recorded for use in the later stage of multichannel integration. Since a waiting period is set on the task slot for further input, the value of this waiting period must take into account the time needed for further gesture input and for completing the reference resolution together with the voice information.
After the pointing priorities and the pointed-object vector have been obtained as above, the search proceeds by comparing the referring expression against the pointing matrix, the focus matrix, the activated matrix and the quiet matrix one by one; the model objects in the four matrices have corresponding states. At each stage, when resolving the reference against the model objects in the same matrix, the state of each model object is quantified by the matching function Match(o, e).
The matching function is constructed as follows:
Match(o, e) = [ Σ_{S ∈ {P, F, A, E}} P(o|S) * P(S|e) ] * Semantic(o, e) * Temp(o, e)
where o denotes a model object and e denotes the referring expression; P denotes the pointed state, F the focus state, A the activated state and E the quiet state, and S denotes the state of the current object. The components of Match(o, e) are described below:
(1) P(o|S) and P(S|e)
P(o|S) denotes the probability that object o is selected given the cognitive state S, and weighs the influence of the gesture channel on reference resolution. It is computed as follows:
P(o|P) = P_rank;
P(o|F) = 1/M (M is the number of focal objects);
P(o|A) = 1/N (N is the number of activated objects);
P(o|E) = 1/L (L is the number of all model objects in the virtual environment).
P(S|e) is the probability that the referent is in state S given the referring expression e.
(2) Semantic(o, e)
Semantic(o, e) denotes the semantic compatibility between model object o and referring expression e, and weighs the influence of the voice channel on reference resolution. It is constructed as follows:
Semantic(o, e) = ( Σ_k Attr_k(o, e) ) / K
The present invention treats both the identifier and the semantic type as attributes Attr_k. Attr_k(o, e) is 0 when both o and e have attribute k but their values differ, and 1 in all other cases. K is the total number of attributes of the referring expression.
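A small sketch of this attribute-compatibility score, assuming each object and each referring expression is represented as a dictionary of attribute values (a hypothetical representation, not the patent's).

```python
def semantic_compat(obj_attrs: dict, expr_attrs: dict) -> float:
    """Semantic(o, e): fraction of the expression's attributes that do not conflict with o."""
    if not expr_attrs:
        return 1.0
    compatible = 0
    for attr, expr_value in expr_attrs.items():
        # Attr_k is 0 only when both sides specify attribute k with different values.
        if attr in obj_attrs and obj_attrs[attr] != expr_value:
            continue
        compatible += 1
    return compatible / len(expr_attrs)

print(semantic_compat({"type": "chair", "id": "chair_03"}, {"type": "chair"}))  # 1.0
print(semantic_compat({"type": "table"}, {"type": "chair"}))                    # 0.0
```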
(3) Temp(o, e)
Temp(o, e) denotes the temporal compatibility between model object o and referring expression e, and weighs the influence of time on reference resolution. It is a piecewise function.
When o and e belong to the same interaction, Temp(o, e) is computed as:
Temp(o, e) = exp(-|Time(o) - Time(e)|)
When o and e belong to different interactions, Temp(o, e) is computed as:
Temp(o, e) = exp(-|OrderIndex(o) - OrderIndex(e)|)
where Time(o) is the time at which the pointing gesture occurred and Time(e) is the time at which the referring expression occurred, both in seconds; OrderIndex(o) is the position of o in the sequence of pointing gestures and OrderIndex(e) is the position of e in the sequence of referring expressions. For objects in the focus, activated or quiet state, Temp(o, e) = 1.
When the referring expression has been matched against the model objects in some state (that is, in some matrix), the referent is confirmed.
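Putting the components together, the following sketch evaluates the matching function for one object; it reuses semantic_compat from the sketch above, and the dictionaries supplying P(o|S), P(S|e) and the time stamps are illustrative assumptions about how the data might be organized.

```python
import math

STATES = ("P", "F", "A", "E")   # pointed, focal, activated, quiet

def match(obj, expr, p_obj_given_state, p_state_given_expr):
    """Match(o, e) = [ sum_S P(o|S) * P(S|e) ] * Semantic(o, e) * Temp(o, e).

    p_obj_given_state: per-object dict state -> P(o|S) (P_rank for "P",
                       uniform 1/M, 1/N, 1/L for the other states).
    p_state_given_expr: dict state -> P(S|e) for the referring expression e.
    """
    weighted = sum(p_obj_given_state.get(s, 0.0) * p_state_given_expr.get(s, 0.0)
                   for s in STATES)

    # Semantic(o, e): attribute compatibility between object and expression.
    semantic = semantic_compat(obj["attrs"], expr["attrs"])

    # Temp(o, e): time compatibility; equals 1 for focus, activated and quiet objects.
    if obj["state"] == "P":
        temp = math.exp(-abs(obj["gesture_time"] - expr["utterance_time"]))
    else:
        temp = 1.0

    return weighted * semantic * temp
```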
In the multichannel human-computer interaction method based on voice and gestures, the model objects in the virtual environment are divided into four classes: pointed objects, the focal object, activated objects and quiet objects. A pointed object is an object located in the pointing region delimited by the current pointing gesture; the focal object is the referent determined in the previous interaction; an activated object is a model object within the visible range other than the pointed objects and the focal object; a quiet object is a model object outside the visible range. In step 3, the voice referent constraint information and the gesture referent constraint information are compared, in order, with the feature information of the pointed objects, the focal object, the activated objects and the quiet objects one class at a time, to determine the referent of the interaction.
In the multichannel human-computer interaction method based on voice and gestures, in step 2, extracting the voice referent constraint information from the voice information and the gesture referent constraint information from the gesture information is achieved as follows: a layered multichannel integration model is built, comprising four layers: a physical layer, a lexical layer, a syntactic layer and a semantic layer; the physical layer receives the voice information and gesture information input through the voice channel and the gesture channel respectively; the lexical layer comprises a speech recognition and parsing module and a gesture recognition and parsing module; the speech recognition and parsing module parses the voice information from the physical layer into the voice referent constraint information, and the gesture recognition and parsing module parses the gesture information from the physical layer into the gesture referent constraint information.
In the multichannel human-computer interaction method based on voice and gestures, in step 3, the comparison of the voice referent constraint information and the gesture referent constraint information with the feature information of the model objects in the virtual environment, and thus the determination of the referent, is performed on the syntactic layer; extracting the command information about the referent from the voice referent constraint information is achieved as follows: the syntactic layer extracts the command information from the voice referent constraint information; applying the command information to the referent is achieved as follows: the semantic layer applies the command information extracted by the syntactic layer to the referent.
In the multichannel human-computer interaction method based on voice and gestures, the layered multichannel integration model further comprises a task slot, and the task slot comprises a command entry and a referent entry; the semantic layer applies the command information extracted by the syntactic layer to the referent in the following manner: the semantic layer fills the command information extracted by the syntactic layer into the command entry and fills the referent into the referent entry; once the task slot is completely filled, the layered multichannel integration model generates a system-executable command.
In the multichannel human-computer interaction method based on voice and gestures, when the task slot is not completely filled, a waiting period is set: if the task slot is completely filled within the waiting period, the interaction continues; if it is not, the interaction is abandoned.
In the multichannel human-computer interaction method based on voice and gestures, the command entry comprises an action entry and a parameter entry, and when the command information about the referent is extracted from the voice referent constraint information, the command information comprises action information and parameter information.
In the multichannel human-computer interaction method based on voice and gestures, in step 1, an interaction process starts when the voice channel receives the first utterance.
In the multichannel human-computer interaction method based on voice and gestures, in step 1, when the voice channel receives an utterance, a timeout is set for receiving the gesture information on the gesture channel; if the gesture input exceeds the set timeout, the interaction process is abandoned.
Although the embodiments of the present invention are disclosed above, they are not limited to the uses listed in the specification and the embodiments; they can be applied to any field suited to the present invention, and those skilled in the art can easily realize further modifications. Therefore, without departing from the general concept defined by the claims and their equivalents, the present invention is not limited to the specific details or to the examples shown and described here.

Claims (1)

1. A multichannel human-computer interaction method based on voice and gestures, characterized by comprising the following steps:
Step 1: construct a voice channel and a gesture channel, and input the voice information and gesture information about the referent of the interaction through the voice channel and the gesture channel respectively;
an interaction process starts when the voice channel receives the first utterance;
when the voice channel receives an utterance, a timeout is set; if the gesture information input exceeds the set timeout, the interaction process is abandoned;
Step 2: extract voice referent constraint information from the voice information and gesture referent constraint information from the gesture information, where the gesture referent constraint information comprises a distance statistic (the distance from any point in the pointing region delimited by the current pointing gesture to the pointing center of that gesture) and a time statistic (how long the pointing gesture is maintained);
extracting the voice referent constraint information from the voice information and the gesture referent constraint information from the gesture information is achieved as follows:
a layered multichannel integration model is built, comprising four layers: a physical layer, a lexical layer, a syntactic layer and a semantic layer; the physical layer receives the voice information and gesture information input through the voice channel and the gesture channel respectively; the lexical layer comprises a speech recognition and parsing module and a gesture recognition and parsing module; the speech recognition and parsing module parses the voice information from the physical layer into the voice referent constraint information, and the gesture recognition and parsing module parses the gesture information from the physical layer into the gesture referent constraint information;
Step 3: divide the model objects in the virtual environment into four classes: pointed objects, the focal object, activated objects and quiet objects, where a pointed object is an object located in the pointing region delimited by the current pointing gesture, the focal object is the referent determined in the previous interaction, an activated object is a model object within the visible range other than the pointed objects and the focal object, and a quiet object is a model object outside the visible range; compare the voice referent constraint information and the gesture referent constraint information, in order, with the feature information of the pointed objects, the focal object, the activated objects and the quiet objects one class at a time, to determine the referent of the interaction; extract the command information about the referent from the voice referent constraint information; apply the command information to the referent, completing one interaction;
the comparison of the voice referent constraint information and the gesture referent constraint information with the feature information of the model objects in the virtual environment, and thus the determination of the referent, is performed on the syntactic layer;
extracting the command information about the referent from the voice referent constraint information is achieved as follows:
the syntactic layer extracts the command information from the voice referent constraint information;
applying the command information to the referent is achieved as follows:
the semantic layer applies the command information extracted by the syntactic layer to the referent;
the layered multichannel integration model further comprises a task slot, and the task slot comprises a command entry and a referent entry;
the semantic layer applies the command information extracted by the syntactic layer to the referent in the following manner:
the semantic layer fills the command information extracted by the syntactic layer into the command entry and fills the referent into the referent entry; once the task slot is completely filled, the layered multichannel integration model generates a system-executable command;
when the task slot is not completely filled, a waiting period is set: if the task slot is completely filled within the waiting period, the interaction continues; if it is not, the interaction is abandoned;
the command entry comprises an action entry and a parameter entry, and when the command information about the referent is extracted from the voice referent constraint information, the command information comprises action information and parameter information.
CN 201110278390 2011-09-19 2011-09-19 Multichannel human-computer interaction method based on voice and gestures Active CN102339129B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110278390 CN102339129B (en) 2011-09-19 2011-09-19 Multichannel human-computer interaction method based on voice and gestures

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201110278390 CN102339129B (en) 2011-09-19 2011-09-19 Multichannel human-computer interaction method based on voice and gestures

Publications (2)

Publication Number Publication Date
CN102339129A CN102339129A (en) 2012-02-01
CN102339129B (en) 2013-12-25

Family

ID=45514896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110278390 Active CN102339129B (en) 2011-09-19 2011-09-19 Multichannel human-computer interaction method based on voice and gestures

Country Status (1)

Country Link
CN (1) CN102339129B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9256711B2 (en) * 2011-07-05 2016-02-09 Saudi Arabian Oil Company Systems, computer medium and computer-implemented methods for providing health information to employees via augmented reality display
CN102824092A (en) * 2012-08-31 2012-12-19 华南理工大学 Intelligent gesture and voice control system of curtain and control method thereof
US8994681B2 (en) * 2012-10-19 2015-03-31 Google Inc. Decoding imprecise gestures for gesture-keyboards
CN103422764A (en) * 2013-08-20 2013-12-04 华南理工大学 Door control system and control method thereof
CN104423543A (en) * 2013-08-26 2015-03-18 联想(北京)有限公司 Information processing method and device
CN103987169B (en) * 2014-05-13 2016-04-06 广西大学 A kind of based on gesture and voice-operated intelligent LED desk lamp and control method thereof
CN104615243A (en) * 2015-01-15 2015-05-13 深圳市掌网立体时代视讯技术有限公司 Head-wearable type multi-channel interaction system and multi-channel interaction method
CN105867595A (en) * 2015-01-21 2016-08-17 武汉明科智慧科技有限公司 Human-machine interaction mode combing voice information with gesture information and implementation device thereof
CN104965592A (en) * 2015-07-08 2015-10-07 苏州思必驰信息科技有限公司 Voice and gesture recognition based multimodal non-touch human-machine interaction method and system
CN105511612A (en) * 2015-12-02 2016-04-20 上海航空电器有限公司 Multi-channel fusion method based on voice/gestures
CN106933585B (en) * 2017-03-07 2020-02-21 吉林大学 Self-adaptive multi-channel interface selection method under distributed cloud environment
CN107122109A (en) * 2017-05-31 2017-09-01 吉林大学 A kind of multi-channel adaptive operating method towards three-dimensional pen-based interaction platform
CN109992095A (en) * 2017-12-29 2019-07-09 青岛有屋科技有限公司 The control method and control device that the voice and gesture of a kind of intelligent kitchen combine
CN108399427A (en) * 2018-02-09 2018-08-14 华南理工大学 Natural interactive method based on multimodal information fusion
CN108334199A (en) * 2018-02-12 2018-07-27 华南理工大学 The multi-modal exchange method of movable type based on augmented reality and device
CN111968470B (en) * 2020-09-02 2022-05-17 济南大学 Pass-through interactive experimental method and system for virtual-real fusion
CN112069834A (en) * 2020-09-02 2020-12-11 中国航空无线电电子研究所 Integration method of multi-channel control instruction
CN112462940A (en) * 2020-11-25 2021-03-09 苏州科技大学 Intelligent home multi-mode man-machine natural interaction system and method thereof
CN115268623A (en) * 2022-04-13 2022-11-01 北京航空航天大学 Contact processing method and system for virtual hand-force sense interaction

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100281435A1 (en) * 2009-04-30 2010-11-04 At&T Intellectual Property I, L.P. System and method for multimodal interaction using robust gesture processing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
张国华, 老松杨, 凌云翔, 叶挺. Research on multi-user multichannel human-computer interaction in command and control (指挥控制中的多人多通道人机交互研究). 《国防科技大学学报》, 2010, Vol. 32, No. 5, pp. 153-159. *
马翠霞, 戴国忠. Research on sketch techniques based on gesture and speech (基于手势和语音的草图技术研究). 《第五届中国计算机图形学大会》, 2004, pp. 302-305. *

Also Published As

Publication number Publication date
CN102339129A (en) 2012-02-01

Similar Documents

Publication Publication Date Title
CN102339129B (en) Multichannel human-computer interaction method based on voice and gestures
CN106997236B (en) Based on the multi-modal method and apparatus for inputting and interacting
CN110019752B (en) Multi-directional dialog
Oviatt et al. Perceptual user interfaces: multimodal interfaces that process what comes naturally
CN109328381A (en) Detect the triggering of digital assistants
CN105930785B (en) Intelligent concealed-type interaction system
US20060072738A1 (en) Dialoguing rational agent, intelligent dialoguing system using this agent, method of controlling an intelligent dialogue, and program for using it
EP4243013A2 (en) Method, apparatus and computer-readable media for touch and speech interface with audio location
CN106407666A (en) Method, apparatus and system for generating electronic medical record information
CN106796789A (en) Interacted with the speech that cooperates with of speech reference point
CN110444199A (en) A kind of voice keyword recognition method, device, terminal and server
CN104090652A (en) Voice input method and device
CN103955267A (en) Double-hand man-machine interaction method in x-ray fluoroscopy augmented reality system
Duy Khuat et al. Vietnamese sign language detection using Mediapipe
CN105677716A (en) Computer data acquisition, processing and analysis system
Gu et al. Shape grammars: A key generative design algorithm
CN106502382A (en) Active exchange method and system for intelligent robot
CN115344119A (en) Digital assistant for health requests
CN107193853A (en) A kind of social scenario building method and system based on linguistic context
CN107122109A (en) A kind of multi-channel adaptive operating method towards three-dimensional pen-based interaction platform
Awada et al. Multimodal interface for elderly people
CN103297389B (en) Interactive method and device
Kumar et al. Intelligent assistant for exploring data visualizations
CN105511612A (en) Multi-channel fusion method based on voice/gestures
CN105404449B (en) Can level expansion more pie body-sensing menus and its grammar-guided recognition methods

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant