CN102339129A - Multichannel human-computer interaction method based on voice and gestures - Google Patents

Multichannel human-computer interaction method based on voice and gestures

Info

Publication number
CN102339129A
Authority
CN
China
Prior art keywords
referent
gesture
voice
information
constraint information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011102783905A
Other languages
Chinese (zh)
Other versions
CN102339129B (en)
Inventor
赵沁平
陈小武
蒋恺
许楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN 201110278390 (granted as CN102339129B)
Publication of CN102339129A
Application granted
Publication of CN102339129B
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a multi-channel human-computer interaction method based on voice and gestures, comprising the following steps: extracting voice referent constraint information from voice information and gesture referent constraint information from gesture information, the gesture referent constraint information comprising a distance statistic from any point in the pointing region delimited by the current pointing gesture to the pointing center of that gesture, and a time statistic of how long the pointing gesture is held; when analyzing the gesture referent constraint information, acquiring both the distance statistic and the time statistic so as to reduce pointing ambiguity in three-dimensional interaction; and, in the process of determining the referent, dividing the model objects in the virtual environment into four categories and comparing the referring expression with one category of model objects according to the likelihood that the referent belongs to that category, which helps narrow the search range for the referent and reduces the influence of pointing ambiguity.

Description

Multi-channel human-computer interaction method based on voice and gesture
Technical field
The present invention relates to the field of human-computer interaction, and in particular to a multi-channel human-computer interaction method based on voice and gesture.
Background technology
Multi-channel human-computer interaction can effectively enlarge the bandwidth of information exchange between human and computer and thus improve interaction efficiency; it can also exploit the complementary cognitive strengths of human and machine and reduce the user's cognitive load. The user can accomplish an interactive task through various interaction channels and their combination and cooperation, which compensates for the restrictions and burden that a single interaction mode imposes on the user. In multi-channel human-computer interaction, reference resolution is defined as obtaining the common referent of the inputs from multiple channels. A referring expression mainly includes pronouns, locative adverbs, demonstratives and qualified nouns in natural language, for example "it", "here", "this", "that house"; the referent is the objective entity the user refers to, for example a model in three-dimensional space. In a traditional single-channel user interface, referring techniques are simple and usually precise, and the boundaries between targets are clear. In a multi-channel user interface, referring techniques are composite and usually fuzzy, and the boundaries are unclear.
Current multi-channel research is no longer limited to integrating speech with the conventional mouse and keyboard; multi-channel systems based on voice and pen, voice and lip movement, and voice and three-dimensional gesture have received growing attention. Typical representatives include QuickSet, an agent-based multi-channel collaborative system supporting voice and pen, and the XWand system, which integrates a "magic wand" (a new six-degree-of-freedom device) with voice. The W3C has set up a "Multimodal Interaction" working group to develop a new class of W3C multi-channel protocol standards supporting mobile devices, including a multimodal interaction framework, multimodal interaction requirements, multimodal interaction use cases, extensible multimodal annotation language requirements, digital ink requirements, and an extensible multimodal annotation markup language. The formulation of these standards reflects that multi-channel technology has begun to mature.
Regarding reference resolution in multi-channel human-computer interaction, Kehler applied theory from cognitive science and computational linguistics, studied and verified the correspondence between references and cognitive states in a multi-channel environment, and proposed a method that encodes cognitive states and applies a small set of simple judgment rules to obtain the referent, achieving very high accuracy in a two-dimensional travel-planning application based on pen and voice. Kehler's method is effective when handling a single referring expression combined with a precise pointing gesture, but its rules assume that every object can be selected deterministically and cannot support fuzzy gestures.
Columbia University, the Oregon Health & Science University and others jointly studied three-dimensional multi-channel interaction in augmented reality and virtual reality environments and proposed solving the reference resolution problem with perceptual shapes. A perceptual shape is a solid controlled by the user, through which the user interacts with the augmented or virtual environment; during the interaction the perceptual shape produces various statistics that assist target selection. This approach mainly addresses pointing ambiguity in reference resolution, but does not address the inference of unspecified information or multi-channel alignment. Pfeiffer and colleagues at Bielefeld University, Germany, considered aspects of multi-channel reference resolution such as the type of reference, the complexity of utterances, shared context, and uncertainty, and designed a reference resolution engine for immersive virtual environments. This engine is an expert system with a three-layer structure: a core layer, a domain layer, and an application layer. The core layer is a constraint-satisfaction manager; the domain layer provides access to the knowledge base; the application layer is the interface between external programs and the resolution engine and converts references in the speech input into queries to the engine. The engine treats reference resolution as a constraint-satisfaction problem and focuses mainly on extracting effective constraints from complex natural language, but it lacks corresponding handling for under-constrained situations and pointing ambiguity.
Summary of the invention
The present invention designs and develops a multi-channel human-computer interaction method based on voice and gesture.
One object of the present invention is to solve the pointing ambiguity problem in multi-channel human-computer interaction based on voice and gesture. During three-dimensional interaction in a virtual environment, a pointing gesture (from the moment the pointing is recognized until it ends) expresses not only spatial information but also temporal information: the longer an object stays inside the pointing region, the more likely it is to be the selected object. Therefore, when analyzing the gesture referent constraint information, not only the distance statistic but also the time statistic is acquired, thereby reducing pointing ambiguity in three-dimensional interaction. Moreover, in the process of determining the referent, the model objects in the virtual environment are divided into four categories and the referring expression is compared against one category of model objects; this also helps narrow the search range for the referent and reduces the influence of pointing ambiguity.
Another object of the present invention is to solve the problem of inferring unspecified information in multi-channel human-computer interaction based on voice and gesture. The model objects in the virtual environment are divided into four categories, among which the focus object is the referent determined in the previous interaction. That is, if the demonstrative pronoun "it" appears in the spoken utterance of the current interaction, the referent of this interaction can be taken to be the focus object, which solves the problem of inferring unspecified information.
A further object of the present invention is to provide a multi-channel human-computer interaction method based on voice and gesture. A multi-channel hierarchical integration model is constructed with four layers: a physical layer, a lexical layer, a syntax layer and a semantic layer. The command information and referent required by the interaction are eventually filled into a task slot. The goal of the integration process and the criterion of its success are both based on the completeness of the interaction's task structure; the final purpose is to generate a task structure that can be submitted to the system for execution, ensuring that the interaction proceeds effectively.
The technical solution provided by the present invention is as follows:
A multi-channel human-computer interaction method based on voice and gesture, characterized by comprising the following steps:
Step 1: constructing a voice channel and a gesture channel, and inputting voice information and gesture information about the referent of the interaction through the voice channel and the gesture channel respectively;
Step 2: extracting voice referent constraint information from the voice information and gesture referent constraint information from the gesture information, wherein the gesture referent constraint information comprises a distance statistic from any point in the pointing region delimited by the current pointing gesture to the pointing center of that gesture, and a time statistic of how long the pointing gesture is held;
Step 3: comparing the voice referent constraint information and the gesture referent constraint information with the feature information of the model objects in the virtual environment to determine the referent of the interaction, extracting the command information for the referent from the voice referent constraint information, and applying the command information to the referent, thereby completing one interaction.
Preferably, in the described multi-channel human-computer interaction method based on voice and gesture, the model objects in the virtual environment are divided into four categories: pointed objects, focus objects, activated objects and silent objects, wherein a pointed object is an object located inside the pointing region delimited by the current pointing gesture, the focus object is the referent determined in the previous interaction, an activated object is a model object within the visible range other than the pointed objects and the focus object, and a silent object is a model object outside the visible range and not belonging to any of the other categories. In step 3, the voice referent constraint information and the gesture referent constraint information are compared in order, one by one, with the feature information of the pointed objects, focus objects, activated objects and silent objects to determine the referent of the interaction.
Preferably, in the described multi-channel human-computer interaction method based on voice and gesture, in step 2,
extracting the voice referent constraint information from the voice information and the gesture referent constraint information from the gesture information is realized in the following manner:
a multi-channel hierarchical integration model is constructed, comprising four layers, namely a physical layer, a lexical layer, a syntax layer and a semantic layer, wherein the physical layer receives the voice information and the gesture information input by the voice channel and the gesture channel respectively, and the lexical layer comprises a speech recognition and parsing module and a gesture recognition and parsing module; the speech recognition and parsing module parses the voice information from the physical layer into voice referent constraint information, and the gesture recognition and parsing module parses the gesture information from the physical layer into gesture referent constraint information.
Preferably, in the described multi-channel human-computer interaction method based on voice and gesture, in step 3,
comparing the voice referent constraint information and the gesture referent constraint information with the feature information of the model objects in the virtual environment to determine the referent of the interaction is realized on the syntax layer;
extracting the command information for the referent from the voice referent constraint information is realized in the following manner:
the syntax layer extracts the command information from the voice referent constraint information;
and applying the command information to the referent is realized in the following manner:
the semantic layer applies the command information extracted by the syntax layer to the referent.
Preferably, in the described multi-channel human-computer interaction method based on voice and gesture, the multi-channel hierarchical integration model further comprises a task slot, the task slot comprising a command entry and a referent entry,
wherein the semantic layer applies the command information extracted by the syntax layer to the referent in the following manner:
the semantic layer fills the command information extracted by the syntax layer into the command entry and fills the referent into the referent entry; when the task slot is completely filled, the multi-channel hierarchical integration model produces a command executable by the system.
Preferably, in the described multi-channel human-computer interaction method based on voice and gesture, when the task slot is not completely filled, a waiting time is set; if the task slot is completely filled within the waiting time, the interaction continues, and if the task slot is not completely filled within the waiting time, the interaction is abandoned.
Preferably, in the described multi-channel human-computer interaction method based on voice and gesture, the command entry comprises an action entry and a parameter entry, and when the command information for the referent is extracted from the voice referent constraint information, the command information comprises action information and parameter information.
Preferably, in the described multi-channel human-computer interaction method based on voice and gesture, in step 1, one interaction process begins when the voice channel receives the first utterance.
Preferably, in the described multi-channel human-computer interaction method based on voice and gesture, in step 1, when the voice channel receives an utterance, a timeout is set for receiving the gesture information from the gesture channel; if the input of the gesture information exceeds the set timeout, the interaction process is abandoned.
The multi-channel human-computer interaction method based on voice and gesture of the present invention has the following beneficial effects:
(1) It solves the pointing ambiguity problem in multi-channel human-computer interaction based on voice and gesture. During three-dimensional interaction in a virtual environment, a pointing gesture (from the moment the pointing is recognized until it ends) expresses not only spatial information but also temporal information: the longer an object stays inside the pointing region, the more likely it is to be the selected object. Therefore, when analyzing the gesture referent constraint information, both the distance statistic and the time statistic are acquired, thereby reducing pointing ambiguity in three-dimensional interaction. Moreover, in the process of determining the referent, the model objects in the virtual environment are divided into four categories and the referring expression is compared against one category of model objects, which helps narrow the search range for the referent and reduces the influence of pointing ambiguity.
(2) It solves the problem of inferring unspecified information in multi-channel human-computer interaction based on voice and gesture. The model objects in the virtual environment are divided into four categories, among which the focus object is the referent determined in the previous interaction. That is, if the demonstrative pronoun "it" appears in the spoken utterance of the current interaction, the referent of this interaction can be taken to be the focus object, which solves the problem of inferring unspecified information.
(3) It provides a multi-channel human-computer interaction method based on voice and gesture. A multi-channel hierarchical integration model is constructed with four layers (a physical layer, a lexical layer, a syntax layer and a semantic layer), and the command information and referent required by the interaction are eventually filled into a task slot. The goal of the integration process and the criterion of its success are both based on the completeness of the interaction's task structure; the final purpose is to generate a task structure that can be submitted to the system for execution, ensuring that the interaction proceeds effectively and improving its reliability.
Description of drawings
Fig. 1 is a schematic diagram of the interaction process of the multi-channel human-computer interaction method based on voice and gesture of the present invention.
Fig. 2 is the overall framework diagram of reference resolution in the multi-channel human-computer interaction method based on voice and gesture of the present invention.
Fig. 3 is the overall flow chart of the multi-channel human-computer interaction method based on voice and gesture of the present invention.
Embodiment
The present invention is described in further detail below with reference to the accompanying drawings, so that those skilled in the art can implement it by referring to the text of the description.
As shown in Fig. 1, Fig. 2 and Fig. 3, the present invention provides a multi-channel human-computer interaction method based on voice and gesture, comprising the following steps:
Step 1: constructing a voice channel and a gesture channel, and inputting voice information and gesture information about the referent of the interaction through the voice channel and the gesture channel respectively;
Step 2: extracting voice referent constraint information from the voice information and gesture referent constraint information from the gesture information, wherein the gesture referent constraint information comprises a distance statistic from any point in the pointing region delimited by the current pointing gesture to the pointing center of that gesture, and a time statistic of how long the pointing gesture is held;
Step 3: comparing the voice referent constraint information and the gesture referent constraint information with the feature information of the model objects in the virtual environment to determine the referent of the interaction, extracting the command information for the referent from the voice referent constraint information, and applying the command information to the referent, thereby completing one interaction.
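Purely as an illustration of steps 1-3, the following minimal Python sketch shows how the two channels' constraint information could be combined to pick a referent and apply a command; the class names, field names and the toy scene are assumptions for illustration, not the patented implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class VoiceConstraints:          # parsed from the recognized utterance (step 2, voice channel)
    action: Optional[str] = None
    semantic_type: Optional[str] = None   # e.g. "house" from "that house"

@dataclass
class GestureConstraints:        # pointing statistics per object (step 2, gesture channel)
    pointed: dict = field(default_factory=dict)   # object id -> pointing priority

def resolve_referent(v: VoiceConstraints, g: GestureConstraints,
                     objects: List[dict]) -> Optional[dict]:
    # Step 3 (simplified): compare channel constraints with model-object features,
    # preferring the pointed object whose semantic type matches the utterance.
    candidates = [o for o in objects if o["id"] in g.pointed
                  and (v.semantic_type is None or o["type"] == v.semantic_type)]
    return max(candidates, key=lambda o: g.pointed[o["id"]], default=None)

# Step 1: inputs from the two channels (hard-coded here for illustration).
voice = VoiceConstraints(action="rotate", semantic_type="house")
gesture = GestureConstraints(pointed={"house_1": 0.8, "tree_2": 0.3})
scene = [{"id": "house_1", "type": "house"}, {"id": "tree_2", "type": "tree"}]

referent = resolve_referent(voice, gesture, scene)
print(voice.action, "->", referent["id"] if referent else "no referent")
```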
As shown in Fig. 1, the above multi-channel human-computer interaction method based on voice and gesture first supports two interaction channels, voice and gesture. The speech recognition module adopts the Microsoft speech recognition engine and maps the user's voice command to time-stamped text, from which the voice parsing module extracts the voice referent constraint information. The gesture channel uses a data glove to obtain joint and position information for gesture recognition; the gesture parsing module accepts pointing gestures and produces a pointed-object vector. The multi-channel integration module integrates the information from the voice and gesture channels, performs reference resolution during the integration, and finally produces a command executable by the system or a corresponding prompt.
The present invention realizes multi-channel integration with a multi-channel hierarchical integration model. The integration process is task-directed: the goal of the integration and the criterion of its success are both based on the completeness of the interactive task structure, and the final purpose is to generate a task structure that can be submitted to the system for execution, including the action of the task, the object the task acts on, and the relevant parameters. The present invention therefore defines a task slot, which belongs to the multi-channel hierarchical integration model. The task slot is divided into three parts: an action entry, a referent entry and a parameter entry, which may also be called the action slot, referent slot and parameter slot. In fact, the action entry and the parameter entry both belong to the command entry. The referent slot may contain more than one referent, and the parameter slot can currently only be filled with position information. Different commands may correspond to task slots with different structures; for example, the task slot of a select command has only the action and referent entries. The integration process thus becomes the process of filling the task slot; once the task slot is full, a complete task executable by the system has been formed.
For example, if only the voice input "rotate it" is made and no pointing gesture is made, the referent cannot be determined. When the task slot is filled, "rotate" is put into the action slot while the referent slot remains empty. Since a waiting time is set, if the task slot is completely filled within the waiting time, that is, a pointing gesture is made within the waiting time so that the referent is determined, the interaction continues and the multi-channel hierarchical integration model generates a command executable by the system; if the task slot is not completely filled within the waiting time, the interaction is abandoned.
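To make the task-slot idea concrete, the sketch below models a task slot with action, referent and parameter entries plus a waiting time, and replays the "rotate it" example; the class name, field names and the two-second waiting time are assumptions for illustration, not the patent's data structures.

```python
import time
from typing import Any, List, Optional

class TaskSlot:
    """Illustrative task slot: an action entry, referent entries and optional parameters."""
    def __init__(self, needs_parameters: bool = False, wait_seconds: float = 2.0):
        self.action: Optional[str] = None
        self.referents: List[Any] = []
        self.parameters: Optional[Any] = None        # currently position information only
        self.needs_parameters = needs_parameters
        self.deadline = time.time() + wait_seconds   # waiting time for missing entries

    def complete(self) -> bool:
        return (self.action is not None and bool(self.referents)
                and (self.parameters is not None or not self.needs_parameters))

    def try_execute(self) -> str:
        if self.complete():
            return f"execute {self.action} on {self.referents} params={self.parameters}"
        if time.time() > self.deadline:
            return "abandon interaction (waiting time exceeded)"
        return "waiting for further input"

# "rotate it" arrives first: only the action slot is filled.
slot = TaskSlot()                 # a rotate/select-style slot: action + referent only
slot.action = "rotate"
print(slot.try_execute())         # -> waiting for further input
slot.referents.append("house_1")  # a pointing gesture resolves the referent within the waiting time
print(slot.try_execute())         # -> execute rotate on ['house_1'] params=None
```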
The multi-channel hierarchical integration model defined by the present invention is, as its name suggests, based on the idea of layering: channel information is abstracted from concrete device information to the semantics that will finally be filled into the task slot, through four layers, namely the physical layer, lexical layer, syntax layer and semantic layer. Physical-layer information is the raw information input from the interaction devices; its form is diverse and directly related to the concrete devices, for example character strings from the voice input and sensor data from the data glove. The lexical layer is a key layer: it normalizes the raw information from the physical layer so that inputs with the same meaning but different forms are unified into the same representation, thereby providing device-independent information to the syntax layer. In the lexical layer, the voice information of the voice channel is abstracted by the speech recognition module and the voice parsing module to generate the voice referent constraint information; meanwhile the gesture information of the gesture channel is abstracted by the gesture recognition module and the gesture parsing module to generate the gesture referent constraint information. The syntax layer mainly decomposes the information from the lexical layer, according to the syntactic specification of the interaction, into forms matching the entries of the task slot, preparing for the subsequent semantic merging; reference resolution is mainly carried out at the syntax layer, which also extracts the command information from the voice referent constraint information. At the semantic layer, the task-directed mechanism is used to fill and complete the task slot; although the task is related to the concrete application, the filling and completion of the task slot are application-independent.
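The four-layer flow can be pictured as a simple pipeline. The sketch below is a schematic rendering only: every function is an assumed stand-in for the corresponding module (it does not reproduce the actual recognition or parsing components), and the hard-coded constraint values are invented.

```python
# Schematic four-layer pipeline: physical -> lexical -> syntax -> semantic.

def physical_layer(raw_voice: str, raw_glove: dict) -> dict:
    # Raw, device-specific input: recognized text and data-glove samples.
    return {"voice_raw": raw_voice, "gesture_raw": raw_glove}

def lexical_layer(physical: dict) -> dict:
    # Normalize both channels into device-independent constraint information.
    voice_constraints = {"action": "rotate", "referring_expression": "it"}
    gesture_constraints = {"pointed": {"house_1": 0.8}}
    return {"voice": voice_constraints, "gesture": gesture_constraints}

def syntax_layer(lexical: dict) -> dict:
    # Reference resolution + decomposition into task-slot-shaped pieces.
    referent = max(lexical["gesture"]["pointed"], key=lexical["gesture"]["pointed"].get)
    return {"action": lexical["voice"]["action"], "referent": referent}

def semantic_layer(syntax: dict) -> str:
    # Fill the task slot and emit an executable command.
    return f"{syntax['action']}({syntax['referent']})"

command = semantic_layer(syntax_layer(lexical_layer(
    physical_layer("rotate it", {"finger_tip": (0.1, 0.2, 0.3)}))))
print(command)   # -> rotate(house_1)
```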
In fact, the integration process can follow one of two strategies, an "eager" one and a "lazy" one. Eager integration begins processing as soon as the multi-channel input supports integration to some degree; this process can be regarded as event-driven. Lazy integration begins processing only after all, or a relatively complete set of, inputs have arrived. For example, under the eager strategy, as soon as the voice input "rotate it" arrives, the multi-channel hierarchical integration model starts working and begins processing the information; under the lazy strategy, the model starts only when the voice input "rotate it" is accompanied by a pointing gesture at some object so that the referent can be determined, that is, the full information of one interaction is provided at once. Because the user's voice input is often discontinuous, with large time intervals in the middle of a complete command for moving an object, and because of the limitations of the speech recognition engine, the present invention uses the eager strategy and adopts voice-driven integration: an interaction process begins as soon as the voice channel receives the first utterance.
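As a sketch of the voice-driven (eager) strategy only, the loop below starts an interaction as soon as the first utterance arrives and then waits, up to a timeout, for the gesture channel; the queue-based structure, function names and the two-second timeout are assumptions for illustration.

```python
import queue

def eager_integration(voice_events: "queue.Queue[str]",
                      gesture_events: "queue.Queue[str]",
                      gesture_timeout: float = 2.0) -> str:
    # Voice-driven: the interaction starts when the first utterance is recognized.
    utterance = voice_events.get()               # blocks until speech is recognized
    try:
        # Wait (up to the timeout) for the pointing gesture that resolves "it".
        pointed_object = gesture_events.get(timeout=gesture_timeout)
    except queue.Empty:
        return f"abandon interaction: no gesture within {gesture_timeout}s for '{utterance}'"
    return f"integrate '{utterance}' with pointed object {pointed_object}"

voice_q: "queue.Queue[str]" = queue.Queue()
gesture_q: "queue.Queue[str]" = queue.Queue()
voice_q.put("rotate it")        # speech recognized first (eager trigger)
gesture_q.put("house_1")        # pointing gesture arrives within the timeout
print(eager_integration(voice_q, gesture_q))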
The process of determining the referent is the process of reference resolution. In the present invention, reference resolution is based on both the voice referent constraint information and the gesture referent constraint information, under the following two assumptions: (1) the semantics of the voice input is clear; the present invention mainly focuses on resolving pointing ambiguity in multi-channel reference, so the voice input is assumed to contain no fuzzy wording such as "upper-left corner", "middle" or "in front"; (2) all references are centered on the user; references can be divided into three types, namely centered on oneself, centered on a reference object, and centered on another person, and in the present invention all references are centered on the user, so a case such as "select the object to his left", which takes another viewpoint as its center, does not occur.
The present invention adopts a voice-driven integration strategy: after an utterance is recognized, the multi-channel integration process is triggered. In the multi-channel hierarchical integration model, the voice referent constraint information is first loaded into the voice constraint set. Then, according to the gesture referent constraint information, identities are assigned to all model objects in the virtual environment and all model objects are divided into four categories: pointed objects, focus objects, activated objects and silent objects. A pointed object is an object located inside the pointing region delimited by the current pointing gesture; the focus object is the referent determined in the previous interaction; an activated object is a model object within the visible range other than the pointed objects and the focus object; a silent object is a model object outside the visible range and not belonging to any of the other categories. Each category of model objects corresponds to an initialized matching matrix, namely the pointing matrix, the focus matrix, the activated matrix and the silent matrix respectively.
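A minimal sketch of the four-way classification follows, assuming each model object can be tested for being pointed at, being the previous interaction's referent, and being visible; the field names and the toy scene are illustrative only.

```python
from typing import Dict, List

def classify_objects(objects: List[dict], pointed_ids: set,
                     focus_id: str, visible_ids: set) -> Dict[str, List[str]]:
    """Split model objects into pointed / focus / activated / silent categories."""
    categories = {"pointed": [], "focus": [], "activated": [], "silent": []}
    for obj in objects:
        oid = obj["id"]
        if oid in pointed_ids:                 # inside the pointing region
            categories["pointed"].append(oid)
        elif oid == focus_id:                  # referent of the previous interaction
            categories["focus"].append(oid)
        elif oid in visible_ids:               # visible, but neither pointed nor focus
            categories["activated"].append(oid)
        else:                                  # everything else
            categories["silent"].append(oid)
    return categories

scene = [{"id": k} for k in ("house_1", "tree_2", "car_3", "bridge_4")]
print(classify_objects(scene, pointed_ids={"house_1"}, focus_id="car_3",
                       visible_ids={"house_1", "tree_2", "car_3"}))
# {'pointed': ['house_1'], 'focus': ['car_3'], 'activated': ['tree_2'], 'silent': ['bridge_4']}
```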
The present invention uses the perceptual-shape method in the reference resolution process; a perceptual shape is a solid controlled by the user that can provide information about the interacting objects. When the system recognizes that the current gesture is a pointing gesture, it generates a cone attached to the fingertip of the virtual hand's index finger (which is exactly the pointing region delimited by the pointing gesture), records the interaction between model objects and the cone through collision detection, and generates various statistics. The statistics are then combined by a weighted mean to produce the pointing priority. After one pointing interaction is completed, a two-tuple corresponding to this pointing gesture is obtained, whose first element is the pointed-object vector and whose second element is the pointing priority.
The present invention defines two statistics, a time statistic T_rank and a distance statistic D_rank. The longer a model object stays inside the perceptual shape, and the closer it is to the pointing center (the fingertip of the virtual hand's index finger), the higher the priority of that model object.
T_rank is computed as shown below, where T_object is the time a model object spends inside the cone and T_period is the lifetime of the cone during the interaction (i.e. the duration of the pointing gesture):
T_rank = T_object / T_period, 0 < T_rank ≤ 1
D_rank is computed as shown below, where D_object is the distance from the center of a model object to the pointing center and D_max is the maximum distance from a model object inside the cone to the pointing center:
D_rank = 1 - D_object / D_max, 0 < D_rank ≤ 1
The pointing priority P_rank is obtained as the weighted mean of the two statistics, computed as follows:
P_rank = T_rank * λ + D_rank * (1 - λ), 0 ≤ λ ≤ 1
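The sketch below computes T_rank, D_rank and P_rank exactly as in the formulas above; the per-object timing and distance values passed in would come from the cone's collision detection, which is not reproduced here, and the numbers used are made up.

```python
def pointing_priority(t_object: float, t_period: float,
                      d_object: float, d_max: float, lam: float = 0.5) -> float:
    """P_rank = T_rank * lambda + D_rank * (1 - lambda), per the formulas above."""
    t_rank = t_object / t_period          # fraction of the gesture spent inside the cone
    d_rank = 1.0 - d_object / d_max       # closer to the pointing center -> larger D_rank
    return t_rank * lam + d_rank * (1.0 - lam)

# Two objects intersected by the cone during a 2.0 s pointing gesture (example values):
print(pointing_priority(t_object=1.8, t_period=2.0, d_object=0.05, d_max=0.40))  # ~0.89
print(pointing_priority(t_object=0.6, t_period=2.0, d_object=0.30, d_max=0.40))  # ~0.28
```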
Because the interaction devices are not designed to work in a coordinated way, cross-channel integration must rely on temporal correlation. Therefore, after the pointing priority P_rank has been computed through the perceptual shape, the current time should be recorded for use by the multi-channel integration at a later stage. Since the task slot sets a waiting time for further input, the value of this waiting time should take into account the time needed for further gesture input and for completing the reference resolution process together with the voice information.
After the pointing priority and the pointed-object vector have been obtained as above, candidates are compared and looked up one by one in the pointing matrix, the focus matrix, the activated matrix and the silent matrix; the model objects in the four matrices have the corresponding states. At each stage, reference resolution quantifies the state of the model objects located in the same matrix through the matching function Match(o, e).
The matching function is constructed as follows:
Match(o, e) = [ Σ_{S ∈ {P, F, A, E}} P(o|S) * P(S|e) ] * Semantic(o, e) * Temp(o, e)
Here o denotes a model object and e denotes a referring expression; P denotes the pointing state, F the focus state, A the activated state and E the silent state, and S denotes the state of the current object. The components of Match(o, e) are as follows:
(1) P(o|S) and P(S|e)
P(o|S) denotes the probability that object o is selected given the cognitive state S, and measures the influence of the gesture channel on reference resolution. It is computed as: P(o|P) = P_rank,
P(o|F) = 1/M (M is the number of focus objects),
P(o|A) = 1/N (N is the number of activated objects),
P(o|E) = 1/L (L is the number of all model objects in the virtual environment).
P(S|e) is the probability that the referent is in state S given the referring expression e.
(2) Semantic(o, e)
Semantic(o, e) denotes the semantic compatibility between model object o and referring expression e, and measures the influence of the voice channel on reference resolution. It is constructed as follows:
Semantic(o, e) = ( Σ_k Attr_k(o, e) ) / K
The present invention treats identifiers and semantic types as attributes Attr_k; Attr_k(o, e) is 0 when o and e both have attribute k but their values differ, and 1 otherwise. K is the total number of attributes of the referent.
(3) Temp(o, e)
Temp(o, e) denotes the temporal compatibility between model object o and referring expression e, and measures the influence of time on reference resolution. It is a piecewise function:
When o and e belong to the same interaction, Temp(o, e) is computed as:
Temp(o, e) = exp(-|Time(o) - Time(e)|)
When o and e belong to different interactions, Temp(o, e) is computed as:
Temp(o, e) = exp(-|OrderIndex(o) - OrderIndex(e)|)
where Time(o) is the time at which the pointing gesture occurred and Time(e) is the time at which the referring expression occurred, in seconds; OrderIndex(o) is the position of o in the pointing-gesture sequence and OrderIndex(e) is the position of e in the referring-expression sequence. For objects in the focus, activated or silent state, Temp(o, e) = 1.
Once the referring expression has been matched against the model objects in a given state (i.e. located in a given matrix), the referent is confirmed.
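To make the matching concrete, the sketch below evaluates Match(o, e) for an object in the pointing state, following the formulas above; P(P|e), the attribute sets and the time stamps are invented example values, and the single-state simplification is an assumption, not something prescribed by the invention.

```python
import math

def semantic_compat(obj_attrs: dict, ref_attrs: dict) -> float:
    # Semantic(o, e) = (sum_k Attr_k(o, e)) / K, where K is the number of referent attributes.
    total = 0.0
    for k, v in ref_attrs.items():
        total += 0.0 if (k in obj_attrs and obj_attrs[k] != v) else 1.0
    return total / len(ref_attrs)

def temporal_compat(t_gesture: float, t_reference: float) -> float:
    # Same-interaction case: Temp(o, e) = exp(-|Time(o) - Time(e)|).
    return math.exp(-abs(t_gesture - t_reference))

def match(p_rank: float, p_state_given_e: float,
          obj_attrs: dict, ref_attrs: dict,
          t_gesture: float, t_reference: float) -> float:
    # Single-state version of Match(o, e) for an object in the pointing state:
    # [P(o|P) * P(P|e)] * Semantic(o, e) * Temp(o, e).
    return (p_rank * p_state_given_e
            * semantic_compat(obj_attrs, ref_attrs)
            * temporal_compat(t_gesture, t_reference))

# "rotate that house" while pointing at house_1 (example values only):
score = match(p_rank=0.89, p_state_given_e=0.7,
              obj_attrs={"type": "house", "id": "house_1"},
              ref_attrs={"type": "house"},
              t_gesture=12.1, t_reference=12.4)
print(round(score, 3))   # ~0.46
```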
In the described multi-channel human-computer interaction method based on voice and gesture, the model objects in the virtual environment are divided into four categories: pointed objects, focus objects, activated objects and silent objects. A pointed object is an object located inside the pointing region delimited by the current pointing gesture; the focus object is the referent determined in the previous interaction; an activated object is a model object within the visible range other than the pointed objects and the focus object; a silent object is a model object outside the visible range and not belonging to any of the other categories. In step 3, the voice referent constraint information and the gesture referent constraint information are compared in order, one by one, with the feature information of the pointed objects, focus objects, activated objects and silent objects to determine the referent of the interaction.
In the described method, in step 2, extracting the voice referent constraint information from the voice information and the gesture referent constraint information from the gesture information is realized as follows: a multi-channel hierarchical integration model is constructed, comprising four layers, namely a physical layer, a lexical layer, a syntax layer and a semantic layer; the physical layer receives the voice information and the gesture information input by the voice channel and the gesture channel respectively; the lexical layer comprises a speech recognition and parsing module and a gesture recognition and parsing module; the speech recognition and parsing module parses the voice information from the physical layer into voice referent constraint information, and the gesture recognition and parsing module parses the gesture information from the physical layer into gesture referent constraint information.
In the described method, in step 3, comparing the voice referent constraint information and the gesture referent constraint information with the feature information of the model objects in the virtual environment to determine the referent of the interaction is realized on the syntax layer; the syntax layer also extracts the command information from the voice referent constraint information, and the semantic layer applies the command information extracted by the syntax layer to the referent.
In the described method, the multi-channel hierarchical integration model further comprises a task slot, the task slot comprising a command entry and a referent entry; the semantic layer fills the command information extracted by the syntax layer into the command entry and fills the referent into the referent entry; when the task slot is completely filled, the multi-channel hierarchical integration model produces a command executable by the system.
In the described method, when the task slot is not completely filled, a waiting time is set; if the task slot is completely filled within the waiting time, the interaction continues; if it is not completely filled within the waiting time, the interaction is abandoned.
In the described method, the command entry comprises an action entry and a parameter entry, and when the command information for the referent is extracted from the voice referent constraint information, the command information comprises action information and parameter information.
In the described method, in step 1, one interaction process begins when the voice channel receives the first utterance.
In the described method, in step 1, when the voice channel receives an utterance, a timeout is set for receiving the gesture information from the gesture channel; if the input of the gesture information exceeds the set timeout, the interaction process is abandoned.
Although embodiments of the present invention are disclosed above, they are not limited to the applications listed in the description and the embodiments; the invention can be applied to any field suitable for it, and those skilled in the art can easily realize further modifications. Therefore, without departing from the general concept defined by the claims and their equivalents, the present invention is not limited to the specific details or to the illustrations shown and described herein.

Claims (9)

1. A multi-channel human-computer interaction method based on voice and gesture, characterized by comprising the following steps:
Step 1: constructing a voice channel and a gesture channel, and inputting voice information and gesture information about the referent of the interaction through the voice channel and the gesture channel respectively;
Step 2: extracting voice referent constraint information from the voice information and gesture referent constraint information from the gesture information, wherein the gesture referent constraint information comprises a distance statistic from any point in the pointing region delimited by the current pointing gesture to the pointing center of that gesture, and a time statistic of how long the pointing gesture is held;
Step 3: comparing the voice referent constraint information and the gesture referent constraint information with the feature information of the model objects in the virtual environment to determine the referent of the interaction, extracting the command information for the referent from the voice referent constraint information, and applying the command information to the referent, thereby completing one interaction.
2. The multi-channel human-computer interaction method based on voice and gesture according to claim 1, characterized in that the model objects in the virtual environment are divided into four categories: pointed objects, focus objects, activated objects and silent objects, wherein a pointed object is an object located inside the pointing region delimited by the current pointing gesture, the focus object is the referent determined in the previous interaction, an activated object is a model object within the visible range other than the pointed objects and the focus object, and a silent object is a model object outside the visible range and not belonging to any of the other categories; and in step 3, the voice referent constraint information and the gesture referent constraint information are compared in order, one by one, with the feature information of the pointed objects, focus objects, activated objects and silent objects to determine the referent of the interaction.
3. The multi-channel human-computer interaction method based on voice and gesture according to claim 1, characterized in that in step 2,
extracting the voice referent constraint information from the voice information and the gesture referent constraint information from the gesture information is realized in the following manner:
a multi-channel hierarchical integration model is constructed, comprising four layers, namely a physical layer, a lexical layer, a syntax layer and a semantic layer, wherein the physical layer receives the voice information and the gesture information input by the voice channel and the gesture channel respectively, and the lexical layer comprises a speech recognition and parsing module and a gesture recognition and parsing module; the speech recognition and parsing module parses the voice information from the physical layer into voice referent constraint information, and the gesture recognition and parsing module parses the gesture information from the physical layer into gesture referent constraint information.
4. The multi-channel human-computer interaction method based on voice and gesture according to claim 3, characterized in that in step 3,
comparing the voice referent constraint information and the gesture referent constraint information with the feature information of the model objects in the virtual environment to determine the referent of the interaction is realized on the syntax layer;
extracting the command information for the referent from the voice referent constraint information is realized in the following manner:
the syntax layer extracts the command information from the voice referent constraint information;
and applying the command information to the referent is realized in the following manner:
the semantic layer applies the command information extracted by the syntax layer to the referent.
5. The multi-channel human-computer interaction method based on voice and gesture according to claim 4, characterized in that the multi-channel hierarchical integration model further comprises a task slot, the task slot comprising a command entry and a referent entry,
wherein the semantic layer applies the command information extracted by the syntax layer to the referent in the following manner:
the semantic layer fills the command information extracted by the syntax layer into the command entry and fills the referent into the referent entry; when the task slot is completely filled, the multi-channel hierarchical integration model produces a command executable by the system.
6. The multi-channel human-computer interaction method based on voice and gesture according to claim 5, characterized in that when the task slot is not completely filled, a waiting time is set; if the task slot is completely filled within the waiting time, the interaction continues, and if the task slot is not completely filled within the waiting time, the interaction is abandoned.
7. The multi-channel human-computer interaction method based on voice and gesture according to claim 5, characterized in that the command entry comprises an action entry and a parameter entry, and when the command information for the referent is extracted from the voice referent constraint information, the command information comprises action information and parameter information.
8. The multi-channel human-computer interaction method based on voice and gesture according to claim 1, characterized in that in step 1, one interaction process begins when the voice channel receives the first utterance.
9. The multi-channel human-computer interaction method based on voice and gesture according to claim 1, characterized in that in step 1, when the voice channel receives an utterance, a timeout is set for receiving the gesture information from the gesture channel; if the input of the gesture information exceeds the set timeout, the interaction process is abandoned.
CN 201110278390 2011-09-19 2011-09-19 Multichannel human-computer interaction method based on voice and gestures Active CN102339129B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110278390 CN102339129B (en) 2011-09-19 2011-09-19 Multichannel human-computer interaction method based on voice and gestures

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201110278390 CN102339129B (en) 2011-09-19 2011-09-19 Multichannel human-computer interaction method based on voice and gestures

Publications (2)

Publication Number Publication Date
CN102339129A true CN102339129A (en) 2012-02-01
CN102339129B CN102339129B (en) 2013-12-25

Family

ID=45514896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110278390 Active CN102339129B (en) 2011-09-19 2011-09-19 Multichannel human-computer interaction method based on voice and gestures

Country Status (1)

Country Link
CN (1) CN102339129B (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102824092A (en) * 2012-08-31 2012-12-19 华南理工大学 Intelligent gesture and voice control system of curtain and control method thereof
CN103422764A (en) * 2013-08-20 2013-12-04 华南理工大学 Door control system and control method thereof
CN103987169A (en) * 2014-05-13 2014-08-13 广西大学 Intelligent LED table lamp based on gesture and voice control and control method thereof
CN104423543A (en) * 2013-08-26 2015-03-18 联想(北京)有限公司 Information processing method and device
CN104615243A (en) * 2015-01-15 2015-05-13 深圳市掌网立体时代视讯技术有限公司 Head-wearable type multi-channel interaction system and multi-channel interaction method
CN104965592A (en) * 2015-07-08 2015-10-07 苏州思必驰信息科技有限公司 Voice and gesture recognition based multimodal non-touch human-machine interaction method and system
CN105511612A (en) * 2015-12-02 2016-04-20 上海航空电器有限公司 Multi-channel fusion method based on voice/gestures
CN105867595A (en) * 2015-01-21 2016-08-17 武汉明科智慧科技有限公司 Human-machine interaction mode combing voice information with gesture information and implementation device thereof
CN106933585A (en) * 2017-03-07 2017-07-07 吉林大学 A kind of self-adapting multi-channel interface system of selection under distributed cloud environment
CN107122109A (en) * 2017-05-31 2017-09-01 吉林大学 A kind of multi-channel adaptive operating method towards three-dimensional pen-based interaction platform
CN107967112A (en) * 2012-10-19 2018-04-27 谷歌有限责任公司 Inaccurate gesture of the decoding for graphic keyboard
CN108334199A (en) * 2018-02-12 2018-07-27 华南理工大学 The multi-modal exchange method of movable type based on augmented reality and device
CN108399427A (en) * 2018-02-09 2018-08-14 华南理工大学 Natural interactive method based on multimodal information fusion
CN109147928A (en) * 2011-07-05 2019-01-04 沙特阿拉伯石油公司 Employee is shown as by augmented reality, and system, computer media and the computer implemented method of health and fitness information are provided
CN109992095A (en) * 2017-12-29 2019-07-09 青岛有屋科技有限公司 The control method and control device that the voice and gesture of a kind of intelligent kitchen combine
CN111968470A (en) * 2020-09-02 2020-11-20 济南大学 Pass-through interactive experimental method and system for virtual-real fusion
CN112069834A (en) * 2020-09-02 2020-12-11 中国航空无线电电子研究所 Integration method of multi-channel control instruction
WO2022110564A1 (en) * 2020-11-25 2022-06-02 苏州科技大学 Smart home multi-modal human-machine natural interaction system and method thereof
WO2023197485A1 (en) * 2022-04-13 2023-10-19 北京航空航天大学 Contact processing method and system for virtual hand force sense interaction

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100281435A1 (en) * 2009-04-30 2010-11-04 At&T Intellectual Property I, L.P. System and method for multimodal interaction using robust gesture processing

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100281435A1 (en) * 2009-04-30 2010-11-04 At&T Intellectual Property I, L.P. System and method for multimodal interaction using robust gesture processing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
XIAOWU CHEN, NAN XU: "A Multimodal Reference Resolution Approach in Virtual Environment", Springer-Verlag Berlin Heidelberg, 31 December 2006 (2006-12-31) *
张国华, 老松杨, 凌云翔, 叶挺: "Research on multi-user multi-channel human-computer interaction in command and control", Journal of National University of Defense Technology, vol. 32, no. 5, 31 December 2010 (2010-12-31), pages 153-159 *
马翠霞, 戴国忠: "Research on sketch technology based on gesture and speech", Proceedings of the 5th China Computer Graphics Conference, 26 September 2004 (2004-09-26), pages 302-305 *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109147928A (en) * 2011-07-05 2019-01-04 沙特阿拉伯石油公司 Employee is shown as by augmented reality, and system, computer media and the computer implemented method of health and fitness information are provided
CN102824092A (en) * 2012-08-31 2012-12-19 华南理工大学 Intelligent gesture and voice control system of curtain and control method thereof
CN107967112A (en) * 2012-10-19 2018-04-27 谷歌有限责任公司 Inaccurate gesture of the decoding for graphic keyboard
CN103422764A (en) * 2013-08-20 2013-12-04 华南理工大学 Door control system and control method thereof
CN104423543A (en) * 2013-08-26 2015-03-18 联想(北京)有限公司 Information processing method and device
CN103987169B (en) * 2014-05-13 2016-04-06 广西大学 A kind of based on gesture and voice-operated intelligent LED desk lamp and control method thereof
CN103987169A (en) * 2014-05-13 2014-08-13 广西大学 Intelligent LED table lamp based on gesture and voice control and control method thereof
CN104615243A (en) * 2015-01-15 2015-05-13 深圳市掌网立体时代视讯技术有限公司 Head-wearable type multi-channel interaction system and multi-channel interaction method
CN105867595A (en) * 2015-01-21 2016-08-17 武汉明科智慧科技有限公司 Human-machine interaction mode combing voice information with gesture information and implementation device thereof
CN104965592A (en) * 2015-07-08 2015-10-07 苏州思必驰信息科技有限公司 Voice and gesture recognition based multimodal non-touch human-machine interaction method and system
CN105511612A (en) * 2015-12-02 2016-04-20 上海航空电器有限公司 Multi-channel fusion method based on voice/gestures
CN106933585A (en) * 2017-03-07 2017-07-07 吉林大学 A kind of self-adapting multi-channel interface system of selection under distributed cloud environment
CN106933585B (en) * 2017-03-07 2020-02-21 吉林大学 Self-adaptive multi-channel interface selection method under distributed cloud environment
CN107122109A (en) * 2017-05-31 2017-09-01 吉林大学 A kind of multi-channel adaptive operating method towards three-dimensional pen-based interaction platform
CN109992095A (en) * 2017-12-29 2019-07-09 青岛有屋科技有限公司 The control method and control device that the voice and gesture of a kind of intelligent kitchen combine
CN108399427A (en) * 2018-02-09 2018-08-14 华南理工大学 Natural interactive method based on multimodal information fusion
CN108334199A (en) * 2018-02-12 2018-07-27 华南理工大学 The multi-modal exchange method of movable type based on augmented reality and device
CN111968470A (en) * 2020-09-02 2020-11-20 济南大学 Pass-through interactive experimental method and system for virtual-real fusion
CN112069834A (en) * 2020-09-02 2020-12-11 中国航空无线电电子研究所 Integration method of multi-channel control instruction
CN111968470B (en) * 2020-09-02 2022-05-17 济南大学 Pass-through interactive experimental method and system for virtual-real fusion
WO2022110564A1 (en) * 2020-11-25 2022-06-02 苏州科技大学 Smart home multi-modal human-machine natural interaction system and method thereof
WO2023197485A1 (en) * 2022-04-13 2023-10-19 北京航空航天大学 Contact processing method and system for virtual hand force sense interaction

Also Published As

Publication number Publication date
CN102339129B (en) 2013-12-25

Similar Documents

Publication Publication Date Title
CN102339129B (en) Multichannel human-computer interaction method based on voice and gestures
Nigay et al. Multifeature systems: The care properties and their impact on software design
Marsic et al. Natural communication with information systems
CN105512228A (en) Bidirectional question-answer data processing method and system based on intelligent robot
WO2010006087A9 (en) Process for providing and editing instructions, data, data structures, and algorithms in a computer system
CN106796789A (en) Interacted with the speech that cooperates with of speech reference point
CN110147544A (en) A kind of instruction generation method, device and relevant device based on natural language
CN104267922B (en) A kind of information processing method and electronic equipment
TW201234213A (en) Multimedia input method
Kim et al. Vocal shortcuts for creative experts
CN108419123A (en) A kind of virtual sliced sheet method of instructional video
CN100390794C (en) Method for organizing command set of telecommunciation apparatus by navigation tree mode
Gu et al. Shape grammars: A key generative design algorithm
CN105677716A (en) Computer data acquisition, processing and analysis system
Duy Khuat et al. Vietnamese sign language detection using Mediapipe
CN111967334B (en) Human body intention identification method, system and storage medium
CN107122109A (en) A kind of multi-channel adaptive operating method towards three-dimensional pen-based interaction platform
CN103903618A (en) Voice input method and electronic device
CN102446309B (en) Process pattern based dynamic workflow planning system and method
CN104899042B (en) A kind of embedded machine vision detection program developing method and system
CN106705974A (en) Semantic role tagging and semantic extracting method of unrestricted path natural language
CN116188618A (en) Image generation method and device based on structured semantic graph
CN109491651A (en) Data preprocessing method, device, storage medium and electronic equipment
CN106155668B (en) A kind of graphic representation method of macrolanguage
CN105404449B (en) Can level expansion more pie body-sensing menus and its grammar-guided recognition methods

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant