CN102708377B - Method for planning combined tasks for virtual human - Google Patents


Info

Publication number: CN102708377B
Application number: CN201210125122.4A
Authority: CN (China)
Prior art keywords: state, virtual human, subtask, motion
Other languages: Chinese (zh)
Other versions: CN102708377A
Inventors: 李淳芃, 宗丹, 夏时洪, 王兆其
Assignee (current and original): Institute of Computing Technology of CAS
Application filed by Institute of Computing Technology of CAS; priority to CN201210125122.4A
Publication of CN102708377A (application), CN102708377B (grant)
Legal status: Active


Abstract

The invention provides a method for planning combined tasks for a virtual human. The method includes: step 1, building a behavior graph of the virtual human from motion capture data; step 2, finding key states and decomposing the combined task into subtasks based on the key states; step 3, learning the optimal control strategy of each subtask; and step 4, computing the optimal action sequence of the combined task from the initial state of the virtual human in the environment. With this method, the computing time and storage space required by the motion planning process are reduced, and the planning algorithm converges to the optimal control strategy with probability one without making any assumption about the shape of the controller's value function.

Description

Combined task planning method for a virtual human
Technical field
The present invention relates to the field of virtual human motion planning, and in particular to a combined task planning method for a virtual human.
Background art
In recent years, virtual human motion synthesis has been a research hotspot in character animation and computer games, with wide applications in entertainment, film and television animation, computer-aided decision making, virtual assembly and other fields. However, how to plan the virtual human's motion so that it has a certain degree of autonomy remains a challenging problem.
Researchers have explored many methods for planning virtual human motion, for example graph-based motion planning and motion planning based on reinforcement learning. (1) In graph-based motion planning, the captured motion clips are organized as a motion graph, in which nodes represent motion clips and directed edges represent transitions between clips. By searching the motion graph, a virtual human motion sequence satisfying the user's requirements is synthesized. Graph-based methods preserve a large amount of real human motion detail and provide an effective means for creating lifelike character animation, so they are widely used in computer animation. (2) Motion planning based on reinforcement learning (RL) requires no external supervisory signal and is therefore widely applicable. The virtual human interacts with the environment directly through trial-and-error search, trains an optimal control strategy from the reinforcement signal fed back by the environment, and obtains the optimal action sequence that meets the user's requirements, thereby converting the motion synthesis problem into a control strategy learning problem.
However, graph-based motion planning needs an external supervisory signal and therefore cannot generate virtual human motion with autonomy, while existing reinforcement-learning-based motion planning requires the sample motions to share the same constraint frames in order to prevent the foot-sliding problem that may occur during motion blending. A compound motion or task of a virtual human contains multiple motion types and therefore multiple kinds of constraint frames; the state space grows greatly, and the planning process suffers from the curse of dimensionality.
In summary, with existing motion planning methods either virtual human motion with autonomy cannot be generated, or, when planning complex tasks, the number of states is too large and the computing time too long (the curse of dimensionality).
Summary of the invention
The object of the present invention is to provide a combined task planning method for a virtual human that reduces the computing time and storage space required by the motion planning process.
According to one aspect of the present invention, a combined task planning method for a virtual human is provided, comprising:
Step 1, building the virtual human's behavior graph from motion capture data;
Step 2, finding key states, and decomposing the combined task into subtasks based on the key states;
Step 3, learning the optimal control strategy of each subtask; and
Step 4, computing the optimal action sequence of the combined task from the initial state of the virtual human in the environment.
Optionally, the motion capture data is represented as:
C = {c_1, ..., c_M}
where M is the total number of motion clips, and each motion clip c_i (i = 1, ..., M) consists of a group of poses:
c_i = {p_1, ..., p_T}
where T is the number of frames of the clip, and each pose is represented as:
p_t = {R, q_0, ..., q_N} (t = 1, ..., T)
where R ∈ R^3 denotes the position of the virtual human's root joint in the current pose; q_0 denotes the orientation of the root joint, represented by a unit quaternion (w, x, y, z); q_n (n = 1, ..., N) denotes the orientation of each non-root joint relative to its parent joint; and N is the number of joints of the human model.
Optionally, step 1 further comprises:
Step 1.1, dividing the motion capture data into motion units;
Step 1.2, clustering the motion units and defining each class of motion units as a behavior;
Step 1.3, calibrating the constraint relations among the behaviors;
Step 1.4, building the virtual human's behavior graph according to the calibrated constraint relations.
Optionally, step 2 further comprises:
Step 2.1, sparsely sampling the state space and randomly drawing n_st two-tuples (s_init, s_goal);
Step 2.2, for each two-tuple, taking s_init as the initial state and s_goal as the final state:
training N_train times with trial-and-error search to find successful paths from s_init to s_goal;
counting, for each state s, the accumulated number of times n(s) it is visited on these paths;
Step 2.3, repeating the following steps until the specified number of subtasks is obtained:
finding the key state s_max satisfying s_max = arg max_s n(s), which serves as the final state of the subtask;
counting, for each state s, the number of times n(s, s_max) it is visited via the key state s_max;
computing n̄(s_max) = avg_s n(s, s_max);
selecting the states s satisfying n(s, s_max) ≥ n̄(s_max) and adding them to the state set of this subtask.
Optionally, step 2.1 further comprises:
defining a state as:
s = (B_s, x_i, y_i, z_i, θ)
where B_s denotes a node of the behavior graph; (x_i, y_i, z_i) denotes the relative position between the virtual human and the other objects in Euclidean space, i = 1, ..., n, with n the number of objects in the environment; and θ denotes the angle between the positive x direction and the projection onto the x-z plane of the orientation vector of the virtual human's root joint;
defining an action as:
a = (B_a, x_mid, z_mid)
where B_a denotes the current action; and (x_mid, z_mid) denotes the mid-clip touchdown displacement, i.e. the displacement of the middle frame of the motion clip relative to its initial frame.
Optionally, step 3 further comprises:
Step 3.1, defining the learning rate and discount factor of the learning model;
Step 3.2, defining the one-step return function and initializing the cumulative return function;
Step 3.3, choosing an arbitrary initial state, choosing an optimal action for this state according to the current value function, updating the state to the next state, and revising the expected cumulative return function;
Step 3.4, judging whether the expected cumulative return has converged, and if not, repeating step 3.3.
Optionally, step 3.2 comprises:
defining the one-step return matrix R:
R(s, a) = min R, if s_1 = null; max R, if s_1 = s_goal; −ω_T · T(s, a) + ω_P · P(a), otherwise. (formula 4)
where state s_1 denotes the next state after the virtual human selects action a in state s; T(s, a) describes the physical difference from state s to s_1, expressed by the change in the virtual human's position and orientation, and the smaller its value the smoother the transition; P(a) describes the virtual human's preference for action a, and the larger its value the more this action tends to be selected; ω_T and ω_P are weighting coefficients; and max R and min R denote the upper and lower bounds of R.
Optionally, step 4 further comprises:
Step 4.1, taking the given initial state as the input of the first subtask's control strategy and obtaining the optimal action sequence of this subtask;
Step 4.2, taking the final state of the first subtask controller as the initial state of the subsequent subtask controller, and obtaining the optimal action sequences of the subsequent subtasks in turn;
Step 4.3, splicing the optimal action sequences of all subtasks in order to obtain the optimal action sequence of the original combined task.
Optionally, step 4.3 comprises:
denoting by M_1 = {p^1_1, ..., p^1_i} and M_2 = {p^2_1, ..., p^2_j} the two motion clips to be spliced, where i and j are the total numbers of frames of the two clips; applying linear interpolation to the position of the virtual human's root joint and quaternion spherical linear interpolation to the joint orientations, the synthesized motion clip is:
M̃ = M_1 ⊕ M_2 = {p^1_1, ..., p^1_(i−k), p_1, ..., p_k, p^2_(k+1), ..., p^2_j} (formula 7)
where
R(p_t) = α(t) · R(p^1_(i−k+t)) + [1 − α(t)] · R(p^2_t) (formula 8)
q(p_t) = slerp(q(p^1_(i−k+t)), q(p^2_t), α(t)) (formula 9)
where R(p_t) denotes the root joint position of pose p_t and q(p_t) denotes the orientation of each joint of pose p_t; in addition,
α(t) = 2((t − 1)/(k − 1))^3 − 3((t − 1)/(k − 1))^2 + 1, 1 ≤ t ≤ k (formula 10)
where the fusion coefficient α(t) satisfies: α(t) = 1 when t ≤ 1; α(t) = 0 when t ≥ k; and α(t) is C^1 continuous everywhere.
Compared with the prior art, the key-state-based combined task planning method for a virtual human provided by the invention has the following advantages: (1) the combined task is decomposed into multiple subtasks and solved in small-scale sub-state spaces, which greatly reduces the required computing time and storage space; (2) for a virtual human's combined task, because each subtask is planned by divide and conquer, a more accurate controller can be obtained, guiding the virtual human to reach the given goal more quickly; (3) the algorithm makes no assumption about the shape of the controller's value function; it only needs to ensure that each state-action pair can be visited repeatedly and frequently to guarantee convergence to the optimal control strategy with probability one.
Brief description of the drawings
Fig. 1 shows a flow chart of a combined task planning method for a virtual human according to an embodiment of the invention;
Fig. 2 shows the data flow diagram corresponding to Fig. 1;
Fig. 3 shows a behavior graph according to an embodiment of the invention;
Fig. 4 shows schematic diagrams of the principles of different motion planning methods;
Fig. 5 shows a schematic diagram of finding key states according to an embodiment of the invention;
Fig. 6 shows a schematic diagram of the subtask selection process according to an embodiment of the invention;
Fig. 7 shows a schematic diagram of the motion clip synthesis process according to an embodiment of the invention.
Embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in more detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here are only intended to explain the invention and are not intended to limit it.
A virtual human's tasks are complex and varied, and to complete a given task the virtual human usually needs to execute several steps in order. For example, in virtual assembly, to complete the task of fixing a prototype, the virtual human needs to first go to one place to fetch a screw, then go to another place to fetch a screwdriver, and finally go to the target location to complete the assembly operation. The present invention defines an atomic task as a motion with complete semantics that cannot be subdivided further. For example, "the virtual human picks up the screwdriver from the workbench" and "the virtual human installs the screw on the prototype" can be regarded as two different atomic tasks. A combined task is defined as a task composed of several subtasks, where each subtask may be an atomic task or another combined task. The above "the virtual human fixes the prototype" is a combined task.
In the present invention, task planning refers to the process of decomposing a given task of the virtual human into a sequence of motions to be executed. The given input task is usually represented by a goal state, and the output motion sequence is required to satisfy certain constraints.
The inventors found through research that planning combined tasks with the existing reinforcement-learning-based virtual human motion planning methods is difficult. The reason is that a combined task contains multiple subtasks, each with different motion features. When building the motion model, existing methods treat each motion feature as one dimension of the state, so the computing time and storage space grow exponentially with the number of states, i.e. the curse of dimensionality. The inventors also found that if the combined task is decomposed into several subtasks and each subtask is planned by divide and conquer, the scale of the problem to be solved can be reduced, which effectively alleviates the curse of dimensionality.
The present invention proposes a key-state-based hierarchical planning method for virtual human combined tasks. The method has two layers: an upper-layer hierarchical reinforcement learning model and a lower-layer reinforcement learning model. The upper layer samples the state space sparsely, searches for successful paths of some local tasks, takes the most frequently visited states as key states, and decomposes the combined task into several subtasks. The lower layer abstracts motion clips as behaviors and environment information as states, and plans each subtask by divide and conquer using trial-and-error search. When synthesizing motion, one only needs to follow the control strategy of each subtask and splice the selected motion clips in order. Details are given below.
According to an embodiment of the present invention, a key-state-based combined task planning method for a virtual human is provided, as shown in Fig. 1; the corresponding data flow diagram is shown in Fig. 2. In Fig. 2, the operation of the combined task decomposition part corresponds to step S30 in Fig. 1, the operation of the subtask planning part corresponds to step S40 in Fig. 1, and the operation of the motion splicing and synthesis part corresponds to step S50 in Fig. 1.
With reference to Fig. 1, the combined task planning method for a virtual human provided by this embodiment comprises:
S10, capturing motion and obtaining the virtual character's motion capture data;
S20, building the virtual human's behavior graph;
S30, finding key states and decomposing the original combined task into several subtasks;
S40, learning the optimal control strategy of each subtask;
S50, given the initial state of the virtual human in the environment, computing the optimal action sequence of the combined task.
The data preprocessing stage (i.e. the controller model building stage) comprises steps S10 and S20; the controller training stage comprises steps S30 and S40; the motion synthesis stage comprises S50. In the controller model building stage, the user only needs to define a one-step return function with intuitive meaning to control the virtual human's motion at a high level. In the controller training stage, the complex combined task is decomposed into multiple subtasks by finding key states and is solved in small-scale sub-state spaces, which greatly reduces the required computing time and storage space. In the motion synthesis stage, the user only needs to select the optimal action (the action with the greatest expected cumulative return in the current state) step by step according to the control strategy to obtain the optimal action sequence of the combined task. This process involves no time-consuming computation and can therefore meet the demands of real-time applications. Each step is introduced in detail below.
S10: Acquiring character motion data
Virtual character motion data samples are collected with commercially available optical or electromagnetic motion capture equipment, e.g. the VICON8 capture system produced by VICON.
The collected motion data sequence is represented as:
C = {c_1, ..., c_M}
where M is the total number of motion clips, and each motion clip c_i (i = 1, ..., M) consists of a group of poses:
c_i = {p_1, ..., p_T}
where T is the number of frames of the clip.
Each pose can be represented as:
p_t = {R, q_0, ..., q_N} (t = 1, ..., T)
where R denotes the position of the virtual human's root joint in the current pose; q_0 denotes the orientation of the root joint, represented by a unit quaternion (w, x, y, z) whose four components satisfy the constraint that their squared sum is 1; q_n (n = 1, ..., N) denotes the orientation of each non-root joint relative to its parent joint; and N is the number of joints of the human model.
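As a concrete illustration, the following minimal Python sketch mirrors this data layout; the class and field names are illustrative assumptions, not terms from the patent:

from dataclasses import dataclass
from typing import List, Tuple

Quaternion = Tuple[float, float, float, float]   # (w, x, y, z), unit length

@dataclass
class Pose:
    root_position: Tuple[float, float, float]     # R: position of the root joint
    joint_orientations: List[Quaternion]           # q_0 ... q_N: root orientation plus N joints

@dataclass
class MotionClip:
    poses: List[Pose]                              # p_1 ... p_T, one pose per captured frame

# The motion database C = {c_1, ..., c_M} is then simply a list of clips.
motion_database: List[MotionClip] = []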
The collected motion data samples contain multiple motion types. Clips of the same motion class are required to have similar starting and ending poses. If the motion data does not satisfy this condition, an offset mapping can be applied to it. The offset mapping of a motion modifies part of a motion clip while preserving as far as possible the original motion detail and the continuity of the motion.
For example, suppose motion clip c_i = {p_1, ..., p_T} is to be processed so that the first frame of the processed clip is the initial target pose p_start.
First, record the root joint position R_t and the joint orientations q^t_n of each pose in clip c, where R_t denotes the root joint position of frame t and q^t_n denotes the orientation of joint n in frame t.
Suppose R_start is the root joint position of p_start, and denote by q^start_n the orientation of joint n in p_start.
Because the first frame of the processed clip is required to be p_start, the differences between the first-frame pose of the clip and p_start in root joint position (ΔR) and joint orientation (Δq_n) are computed as:
ΔR = R_start − R_1
Δq_n = q^start_n − q^1_n (formula 1)
If the first-frame pose of the clip were simply replaced by p_start, the synthesized clip would contain a jump and the motion would not be smooth. To make the synthesized motion transition naturally, the pose difference caused by the adjustment is distributed evenly over the first H frames of the clip. Taking the h-th frame of clip c as an example (h = 1, ..., H), the root joint position and joint orientations of the modified h-th frame are:
R_h = R_h + α·ΔR
q^h_n = q^h_n + α·Δq_n (formula 2)
where the weight α decreases from 1 at the first frame to 0 at frame H, so that the first frame of the processed clip equals p_start while later frames are disturbed less and less. If the last frame of the clip is required to be the ending target pose p_end, the procedure is analogous and is not repeated here.
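A minimal sketch of this offset mapping follows. It assumes each frame is stored as a flat array of the root position followed by stacked joint quaternions, and that the weight α fades linearly over the first H frames (the patent states only that the difference is distributed evenly); the function name is illustrative:

import numpy as np

def offset_map_to_start(clip, p_start, H):
    """Adjust the first H frames of `clip` so its first frame equals p_start (formulas 1-2).

    clip:    array of shape (T, D); each row is the root position (3 values) followed by
             the joint quaternions, flattened.
    p_start: array of shape (D,), the initial target pose.
    """
    clip = clip.copy()
    delta = p_start - clip[0]                          # formula 1: offset to the target first frame
    H = min(H, len(clip))
    for h in range(H):
        alpha = 1.0 if H == 1 else 1.0 - h / (H - 1)   # assumed linear fade from 1 to 0
        clip[h] = clip[h] + alpha * delta              # formula 2: add the faded offset
        quats = clip[h, 3:].reshape(-1, 4)             # keep orientations at unit length
        quats /= np.linalg.norm(quats, axis=1, keepdims=True)
        clip[h, 3:] = quats.reshape(-1)
    return clip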
S20: Building the virtual human's behavior graph
A behavior usually carries independent and complete semantics and can be regarded as a particular skill of the virtual human. To define the virtual human's capabilities at a higher level, we abstract the motion capture data into behaviors.
A behavior is composed of a group of motion clips that contain the same constraint frames. A constraint frame is a frame containing a particular constraint; constraints can be specified by the user, e.g. the virtual human's foot touching the ground is one kind of constraint, and the virtual human's hand being raised is another. For example, the constraint frames of a full-cycle running clip comprise the following 3 frames: (1) both feet on the ground with the left foot in front; (2) both feet on the ground with the right foot in front; (3) both feet on the ground with the left foot in front. As another example, the constraint frames of a grasping clip comprise the following 2 frames: (1) the hand is raised; (2) the hand is lowered.
Different behaviors contain different constraint frames. For example, walking and running are two behaviors. The constraint frames of walking comprise the following 3 frames: (1) both feet on the ground with the left foot in front, both arms swinging freely with the right hand in front; (2) both feet on the ground with the right foot in front, both arms swinging freely with the left hand in front; (3) both feet on the ground with the left foot in front, both arms swinging freely with the right hand in front. The constraint frames of running comprise 3 frames: (1) both feet on the ground with the left foot in front, both elbows raised with fists clenched and the right hand in front; (2) both feet on the ground with the right foot in front, both elbows raised with fists clenched and the left hand in front; (3) both feet on the ground with the left foot in front, both elbows raised with fists clenched and the right hand in front.
To define the constraint relations among the virtual human's behaviors, we build a behavior graph, with reference to Fig. 3. The behavior graph is a way of organizing behaviors based on a directed graph: each node represents a class of behaviors, and each directed edge represents a transition clip connecting two behavior classes. Solid edges represent transition clips between different behaviors, dashed edges represent transition clips within the same behavior, and if there is no edge between two nodes, no transition is possible between those two behavior classes. For any two motion clips that need to be spliced, because the ending pose of the first clip and the starting pose of the second clip cannot be identical, splicing them directly would produce a jump and visible jitter, so the two clips must be blended to some extent.
Continuing with reference to Fig. 3, which defines a behavior graph of the virtual human: the node "walking" means the virtual human walks normally; "walking && left hand" means the virtual human walks while holding a screw in the left hand; "walking && right hand" means the virtual human walks while holding a screwdriver in the right hand; "walking && both hands" means the virtual human walks while holding the screw in the left hand and the screwdriver in the right hand; and "installation" means the virtual human installs the screw.
The behavior graph describes the virtual human's capabilities in the environment and the transition constraints among behaviors. A "transition constraint" specifies whether two behavior classes can be executed in sequence without a jump. The virtual human's motion planning can then be regarded as a traversal of the behavior graph.
It should be noted that the behavior graph in this embodiment treats motion clips with the same motion constraint frames as the same behavior. For example, striding and trotting are the same behavior class with different motion parameters, and walking clips with different stride lengths belong to the same node in the behavior graph. The advantage of this is that the behavior graph is small, so the time required to train the optimal control strategy is shorter.
In summary, building the virtual human's behavior graph can comprise: step 1, inputting the motion capture clips; step 2, dividing the capture clips into motion units; step 3, clustering the motion units and defining each class of motion units as a behavior; step 4, calibrating the constraint relations among the behaviors; step 5, building the virtual human's behavior graph according to the calibrated constraint relations.
The constraint relation between behaviors can be defined in the form of a two-tuple: for example, [B1, B2] means that after executing behavior B1, the virtual human may then execute behavior B2.
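A minimal sketch of such a behavior graph, stored as an adjacency map of calibrated [B1, B2] pairs; the behavior names follow Fig. 3, and the class and method names are assumptions:

from collections import defaultdict

class BehaviorGraph:
    """Directed graph: nodes are behaviors, edges are calibrated transitions [B1, B2]."""

    def __init__(self):
        self.edges = defaultdict(set)

    def add_transition(self, b1: str, b2: str) -> None:
        # Calibrated constraint pair [B1, B2]: behavior B2 may follow behavior B1.
        self.edges[b1].add(b2)

    def can_follow(self, b1: str, b2: str) -> bool:
        return b2 in self.edges[b1]

graph = BehaviorGraph()
graph.add_transition("walking", "walking && left hand")                  # pick up the screw
graph.add_transition("walking && left hand", "walking && both hands")    # pick up the screwdriver
graph.add_transition("walking && both hands", "installation")            # install the screw
print(graph.can_follow("walking", "installation"))                       # False: no direct transition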
S30: Decomposing the original combined task into several subtasks
The essence of hierarchical reinforcement learning (HRL) is to add an "abstraction" mechanism on top of reinforcement learning: the overall task is decomposed into subtasks at different levels, and each subtask is solved in a smaller subproblem space, which greatly reduces the scale of the problem to be solved.
Fig. 4 shows schematic diagrams of the principles of different motion planning methods. Fig. 4(a) represents motion planning based on reinforcement learning, which plans in units of motion clips; Fig. 4(b) represents motion planning based on hierarchical reinforcement learning, which plans in units of subtasks; Fig. 4(c) represents the key-state-based hierarchical planning method for combined tasks, in which the upper layer uses hierarchical reinforcement learning to find key states and decompose the combined task into subtasks, and the lower layer plans each subtask by divide and conquer using reinforcement learning. The horizontal axis represents time, the vertical axis represents the state, and each point on a curve (solid or hollow) represents the state at the corresponding moment.
The inventors found through research that hierarchical reinforcement learning, as a semi-Markov process, can act over a continuous time step (i.e. a period of time). If the motion sequence completed within such a continuous time step is defined as a subtask, then hierarchical reinforcement learning plans the virtual human's motion in units of subtasks. Each subtask can be either an atomic task or another subtask; the calls from upper-layer subtasks to lower-layer subtasks or atomic tasks (i.e. elementary actions) form a hierarchical architecture, as shown in Fig. 4(b).
To decompose the combined task (i.e. the goal task) into several subtasks, the present invention defines the state as the situation of the virtual human in the virtual environment, characterized by a group of physical quantities. For example, a state can represent the virtual human's position and orientation; it can also represent the interaction features between the virtual human and objects in the virtual environment. The state here is different from a pose: the notion of a state is broader, e.g. it can also include the interaction features between the virtual human and objects in the virtual environment.
Each physical quantity characterizing the state can be one dimension: for example, position information is the first dimension, orientation information is the second dimension, the interaction feature between the virtual human and the screwdriver is the third dimension, and the interaction feature between the virtual human and the prototype is the fourth dimension.
The division into subtasks is not unique, but there are always states that frequently appear on all motion sequences that successfully complete the goal task, and these states can divide the original combined task into several subtasks spliced in order. The state space of each subtask involves only the dimensions relevant to that subtask and is a subset of the original state space. For example, if a subtask is "the virtual human picks up the screwdriver", the dimension "interaction feature between the virtual human and the screwdriver" is relevant to this subtask, while the dimension "interaction feature between the virtual human and the prototype" is irrelevant to it.
The original task is decomposed into several subtasks, each level of subtask having its own state space, action set, value function and control strategy; by learning from the bottom up, the optimal control strategy of each subtask is obtained, and finally the optimal control strategy of the whole combined task is obtained. Based on the above analysis, the inventors propose a two-layer planning model based on key states for planning the virtual human's combined tasks, as shown in Fig. 4(c) and described in detail below.
S301: Sparse sampling in the state space
A hierarchical reinforcement learning model usually consists of the following components: a state set S, an action set A, a one-step return function R and a control strategy π.
A state is defined as:
s = (B_s, x_i, y_i, z_i, θ)
where B_s denotes a node of the behavior graph; (x_i, y_i, z_i) denotes the relative position between the virtual human and the other objects in Euclidean space, i = 1, ..., n, with n the number of objects in the environment; and θ denotes the angle between the positive x direction and the projection onto the x-z plane of the orientation vector of the virtual human's root joint.
An action is defined as:
a = (B_a, x_mid, z_mid)
where B_a denotes the current motion clip; (x_mid, z_mid) denotes the mid-clip touchdown displacement, i.e. the displacement of the middle frame of the motion clip relative to the initial frame, for example the change of the current root joint position relative to the root joint position at the initial frame, used to prevent the virtual human from colliding with the environment during motion synthesis. By recording the mid-clip touchdown displacement and checking whether the virtual human collides with the environment while executing an action in a given state, the synthesized motion can be made to satisfy the environmental constraints.
When sparsely sampling the state space formed by the virtual environment, the sampling interval is specified by the user. In this embodiment, the environment is 20 × 20 × 5 (m^3), the spatial sampling interval is Δx = Δy = Δz = 1 m, the angular range is 2π (radians), and the angular sampling interval is Δθ = π/6.
The sampled state set is S (s ∈ S) with M states, and the sampled action set is A (a ∈ A) with N actions.
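As an illustration, the state and action tuples and the sparse sampling grid described above could be written as follows; the dimensions reflect the example parameters, only one object's relative position is kept for brevity, and all names are assumptions:

import itertools
import math
from typing import NamedTuple, Tuple

class State(NamedTuple):
    behavior: str                              # B_s: node of the behavior graph
    rel_position: Tuple[float, float, float]   # (x_i, y_i, z_i): position relative to an object
    theta: float                               # orientation angle in the x-z plane

class Action(NamedTuple):
    behavior: str                              # B_a: current motion clip / behavior
    mid_touchdown: Tuple[float, float]         # (x_mid, z_mid): mid-clip touchdown displacement

def sample_states(behaviors, extent=(20.0, 20.0, 5.0), step=1.0, dtheta=math.pi / 6):
    """Enumerate the sparse grid (example: 20 x 20 x 5 m at 1 m steps, angle step pi/6)."""
    xs = [i * step for i in range(int(extent[0] / step))]
    ys = [i * step for i in range(int(extent[1] / step))]
    zs = [i * step for i in range(int(extent[2] / step))]
    thetas = [k * dtheta for k in range(int(2 * math.pi / dtheta))]
    for b, x, y, z, t in itertools.product(behaviors, xs, ys, zs, thetas):
        yield State(b, (x, y, z), t)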
S302: Finding key states and dividing the combined task and the state space based on them
Each subtask is defined as a triple o = <I, μ, β>, where:
I ⊆ S is the set of initial states from which the virtual human may execute this subtask;
μ: S × ∪A_s → [0, 1] is the internal policy of the subtask, where ∪A_s denotes the set of all optional actions in state s; it gives the probability P ∈ [0, 1] with which the virtual human selects an action from ∪A_s in state s;
β: S → [0, 1] gives, within the subtask, the probability P ∈ [0, 1] that state s is a terminal state.
A subtask can be selected if and only if the virtual human's current state belongs to the initial state set I; while executing the subtask, the virtual human selects actions according to the internal policy μ; when the current state is one of the subtask's terminal states, the execution of the whole subtask ends.
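A minimal sketch of the subtask triple o = <I, μ, β> as a data structure (an option in reinforcement-learning terms); the field names are assumptions:

import random
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Subtask:
    """Option-style subtask o = <I, mu, beta>."""
    init_states: Set                          # I: states from which this subtask may be started
    policy: Callable[[object], object]        # mu: maps a state to the action to execute
    termination: Callable[[object], float]    # beta: probability that a state ends the subtask

    def can_start(self, state) -> bool:
        return state in self.init_states

    def is_done(self, state) -> bool:
        return random.random() < self.termination(state)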
Therefore, the virtual human's combined task planning problem can be regarded as a process of sequentially selecting subtasks. The quality of a subtask policy can be measured with a long-term return function, e.g. the expected cumulative return. Unlike traditional reinforcement learning, which selects an optimal action at each step, this method selects an optimal subtask each time; here an optimal action is a single motion clip, whereas an optimal subtask is a series of motion clips.
V^π(s, o) is defined as the expected cumulative return the virtual human obtains by selecting and executing subtask o in the current state s:
V^π(s, o) = E{r_t + γ·r_(t+1) + ... | ε(oπ, s, t)} (formula 3)
where r_t denotes the one-step return obtained by executing o in state s at time t; γ is the discount factor, 0 ≤ γ ≤ 1, expressing the influence of future returns on the present: the smaller the discount factor, the more the virtual human cares about the effect of recent actions, and the larger the discount factor, the more it cares about actions over a longer period; oπ means that the virtual human follows the internal policy of o until a terminal state is reached and then selects the next action according to policy π; and ε(oπ, s, t) denotes the event that the virtual human executes o in state s at time t.
The existing motion planning methods based on reinforcement learning treat the motion features of the task as dimensions of the state, so when planning a combined task the state space is very large and suffers from the curse of dimensionality. Decomposing the combined task into several subtasks and planning each subtask separately therefore greatly reduces the computing time and storage space.
The inventors found through research that, during the repeated trials of completing some local tasks, the virtual human visits certain states frequently, and these states can be regarded as the key states of the original combined task. For example, as shown in Fig. 5, nodes represent states and edges represent a successful path from a given initial state to a final state; the states through which many successful paths pass are the key states of the problem. The virtual human's combined task can be decomposed by extracting such key states.
A feasible way to decompose the original combined task with key states is to randomly specify initial and final states in the state space, find successful paths from the given initial states to the final states, count for each state the number of times it appears on these successful paths, and take the state with the maximum accumulated visit count as a key state; the original combined task is then divided into several subtasks, and the original state space is simultaneously divided into the state spaces of the subtasks.
The pseudocode corresponding to the above step S30 is as follows:
Step 1, sparsely sample the state space and randomly draw n_st two-tuples (s_init, s_goal);
Step 2, for each two-tuple, take s_init as the initial state and s_goal as the final state:
Step 2.1, train N_train times with trial-and-error search to find successful paths from s_init to s_goal;
Step 2.2, count, for each state s, the accumulated number of times n(s) it is visited on these paths;
Step 3, repeat the following steps until the number of subtasks meets the requirement (for example, if the user specifies that the original combined task is to be decomposed into n subtasks, the number of subtasks is n):
Step 3.1, find the key state s_max satisfying s_max = arg max_s n(s), which serves as the final state of the subtask;
Step 3.2, compute n(s, s_max), the number of times each state s is visited via the key state s_max;
Step 3.3, compute n̄(s_max) = avg_s n(s, s_max);
Step 3.4, select the states s satisfying n(s, s_max) ≥ n̄(s_max) and add them to the state set (i.e. state space) of this subtask.
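A compact sketch of the key-state search in the pseudocode above; it assumes a find_successful_path(s_init, s_goal) routine implementing the trial-and-error search is available and returns the list of visited states (or None on failure), and the way later key states are kept distinct is a simplifying assumption of this sketch:

import random
from collections import Counter

def decompose_by_key_states(states, find_successful_path, n_st=50, n_train=100, n_subtasks=3):
    """Return up to n_subtasks (key_state, state_subset) pairs, following steps 1-3 of S30."""
    paths = []
    for _ in range(n_st):                                     # step 1: sparse pairs (s_init, s_goal)
        s_init, s_goal = random.sample(states, 2)
        for _ in range(n_train):                              # step 2.1: trial-and-error searches
            path = find_successful_path(s_init, s_goal)
            if path:
                paths.append(path)

    subtasks = []
    remaining = paths
    for _ in range(n_subtasks):                               # step 3: peel off one subtask at a time
        visits = Counter(s for path in remaining for s in path)          # step 2.2: n(s)
        if not visits:
            break
        s_max, _ = visits.most_common(1)[0]                   # step 3.1: key state = most visited
        via = Counter(s for path in remaining if s_max in path for s in path)   # step 3.2: n(s, s_max)
        avg = sum(via.values()) / len(via)                    # step 3.3: average of n(s, s_max)
        subset = {s for s, count in via.items() if count >= avg}         # step 3.4: above-average states
        subtasks.append((s_max, subset))
        remaining = [p for p in remaining if s_max not in p]  # simplification so later key states differ
    return subtasks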
S40: Learning the optimal control strategy of each subtask
After the subtasks are obtained, reinforcement learning is used to train the optimal control strategy of each subtask (the optimal control strategy corresponds to the Q matrix below).
Reinforcement learning requires no supervisory signal from the user; the virtual human learns the optimal control strategy from the feedback of its interaction with the environment. Its basic idea is: if an action obtains a positive return from the environment, the tendency of the system to select that action later is strengthened; otherwise, the tendency to select that action is weakened. The goal of reinforcement learning is to maximize the expected return (or minimize the expected cost).
S401: Defining the learning rate, discount factor and other parameters of the learning model
Reinforcement learning is an incremental kind of machine learning. The learning rate α controls the speed of learning, 0 ≤ α ≤ 1; the larger the learning rate, the faster the convergence but the more easily oscillation occurs, and the smaller the learning rate, the slower the convergence.
The meaning of the discount factor γ is the same as in step S302 and is not repeated.
Besides the learning rate and discount factor, the maximum number of learning episodes K, the maximum number of iteration steps E, and the maximum number of non-update rounds φ also need to be defined. Iterating from a given initial state until a sequence reaching the goal state is found is called one learning episode.
S402: Defining the one-step return function (the one-step return obtained by the virtual human for taking each action in each state) and initializing the cumulative return function Q
The one-step return matrix R is defined below. Element R(s, a) of the matrix defines the one-step return of the virtual human when it executes action a in state s (the state set here is a subset of the original state space). The larger this value, the larger the immediate return the virtual human obtains, and vice versa; if the value is negative, it is in fact a cost. The upper and lower bounds of the one-step return are specified by the user.
R is defined as follows:
R(s, a) = min R, if s_1 = null; max R, if s_1 = s_goal; −ω_T · T(s, a) + ω_P · P(a), otherwise. (formula 4)
where state s_1 denotes the next state after the virtual human selects action a in state s; T(s, a) describes the physical difference from state s to s_1, expressed by the change in the virtual human's position and orientation, and the smaller its value the smoother the transition; P(a) describes the virtual human's preference for action a, and the larger its value the more this action tends to be selected; ω_T and ω_P are weighting coefficients; and max R and min R denote the upper and lower bounds of R. Formula 4 means:
If the virtual human cannot execute action a in state s, the one-step return is min R. The cases in which a cannot be executed in state s include: state s or state s_1 is unreasonable, or the behavior constraint relations in the behavior graph are violated (the behavior node of state s cannot transition to action a).
If the virtual human can execute action a in state s and s_1 is the goal state, the one-step return is max R; max R is usually set large enough to guide the virtual human toward the goal state more quickly.
If the virtual human can execute action a in state s but s_1 is not the goal state, the one-step return is the weighted sum of the state transition return T(s, a) and the preference return P(a). Since the state transition return T(s, a) describes the smoothness of the motion transition and is better when smaller, ω_T carries a negative sign.
In hierarchical reinforcement learning, the return function of each level is generally chosen according to the task objective of that level. For example, for the subtask of fetching the screwdriver, the goal is to grasp the distant screwdriver. However, since the high-level goal is achieved only rarely during the learning of the task strategy, if achieving the goal were the only way to obtain a return, the controller would learn very poorly. Therefore, the one-step return function defined in this embodiment rewards not only the achievement of the goal but also the absence of collisions with the virtual environment and the smoothness of the motion clip splicing: if the virtual human collides with the environment or the clip splicing is not smooth, the return is smaller; if the virtual human grasps the screwdriver, the return is larger.
The cumulative return function Q is initialized as a zero matrix with the same numbers of rows and columns as matrix R.
In addition, when planning the virtual human's combined task, the user only needs to define the state space and the action space and build a one-step return function to obtain the optimal control strategy of the combined task, and can thus control the virtual human's motion synthesis at a high level.
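A sketch of the one-step return of formula 4; the weights, bounds and helper names are illustrative parameters, not values from the patent:

def one_step_return(s, a, next_state, s_goal, transition_cost, action_preference,
                    w_T=1.0, w_P=0.5, min_R=-100.0, max_R=100.0):
    """Formula 4: return for executing action a in state s and landing in next_state.

    transition_cost(s, a)  -> T(s, a): change of position/orientation (smaller = smoother)
    action_preference(a)   -> P(a):    preference for action a (larger = more preferred)
    """
    if next_state is None:       # a cannot be executed in s (collision, or the behavior
        return min_R             # graph forbids the transition)
    if next_state == s_goal:     # goal state reached: large positive return
        return max_R
    return -w_T * transition_cost(s, a) + w_P * action_preference(a)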
S403: Choosing an arbitrary initial state, choosing an optimal action for it according to the current value function, updating the state to the next state, and revising the expected cumulative return function
Let Q(s, a) denote the expected cumulative return the virtual human can obtain by taking action a in state s, and let s_1 denote the next state. The expected cumulative return matrix Q is updated iteratively:
Q(s, a) = (1 − α)·Q(s, a) + α·{R(s, a) + γ·max Q(s_1, ∪A_(s_1))} (formula 5)
where the discount factor γ and the learning rate α are defined as in step S401, and max Q(s_1, ∪A_(s_1)) denotes the expected cumulative return of the optimal action in state s_1.
However, if the optimal action were chosen every time, the strategy would easily get stuck in a local optimum. Therefore an ε-greedy search strategy is introduced: at each selection, the action with the greatest expected cumulative return is chosen with probability ε, and another action is chosen with probability (1 − ε).
S404: Judging whether the expected cumulative return has converged, or whether the number of iterations exceeds the given maximum
The pseudocode corresponding to the above step S40 is as follows:
Step 0, define the learning rate, discount factor and other parameters of the learning model (corresponding to step S401);
Step 1, define the R matrix and initialize Q as a zero matrix (corresponding to step S402);
Step 2, repeat the following process (the k-th learning episode):
Step 2.1, select an arbitrary initial state s (corresponding to step S403);
Step 2.2, repeat (the e-th iteration):
Step 2.2.1, choose action a according to the current Q matrix, obtain the one-step return R(s, a), and obtain the next state s_1 (corresponding to step S403);
Step 2.2.2, update Q(s, a) according to formula 5 (corresponding to step S403);
Step 2.2.3, update the current state: s = s_1 (corresponding to step S403);
Step 2.2.4, if s = s_goal or the iteration step e ≥ E, stop; otherwise go to step 2.2.1 (corresponding to step S404);
Step 3, if k ≥ K, or the Q matrix has not been updated for more than φ rounds, stop; otherwise go to step 2 (corresponding to step S404).
In theory, as long as each state-action pair (s, a) can be visited repeatedly and frequently, the algorithm is guaranteed to converge to the optimal expected cumulative value function Q* with probability one.
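The pseudocode above is essentially tabular Q-learning; a sketch under that reading follows. The transition function, array shapes and default parameters are assumptions, and the greedy action is selected with probability eps as described in step S403 (the φ non-update stopping criterion is omitted for brevity):

import numpy as np

def train_subtask_q(R, valid, step_fn, goal_state,
                    alpha=0.1, gamma=0.9, eps=0.8, K=500, E=200, rng=None):
    """Tabular Q-learning for one subtask (steps 0-3 of the S40 pseudocode).

    R:        (M, N) one-step return matrix, formula 4
    valid:    (M, N) boolean mask of actions executable in each state
    step_fn:  step_fn(s, a) -> index of the next state s_1
    """
    rng = rng or np.random.default_rng()
    M, N = R.shape
    Q = np.zeros((M, N))                                   # step 1: Q initialized to zero
    for _ in range(K):                                     # step 2: the k-th learning episode
        s = int(rng.integers(M))                           # step 2.1: arbitrary initial state
        for _ in range(E):                                 # step 2.2: the e-th iteration
            actions = np.flatnonzero(valid[s])
            if actions.size == 0:
                break
            if rng.random() < eps:                         # greedy action with probability eps
                a = int(actions[np.argmax(Q[s, actions])])
            else:                                          # otherwise explore another action
                a = int(rng.choice(actions))
            s1 = step_fn(s, a)                             # step 2.2.1: next state
            target = R[s, a] + gamma * Q[s1].max()         # formula 5 (bracketed target)
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
            s = s1                                         # step 2.2.3: advance the state
            if s == goal_state:                            # step 2.2.4: goal reached
                break
    return Q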
S50: Given the initial state of the virtual human in the environment, computing the optimal action sequence
S501: Taking the given initial state as the input of the first subtask's control strategy and obtaining the optimal action sequence of this subtask
First, given an arbitrary initial state s_0 of the virtual human, the subtask with the greatest expected cumulative return in this state is found according to formula 3: o_1 = arg max_o V^π(s_0, o).
For subtask o_1, its optimal action sequence is obtained from the corresponding expected cumulative return matrix Q. Since Q is initialized as a zero matrix, Q(s, a) ≥ 0 at convergence; the larger Q(s, a), the more reasonable it is to execute action a in state s and the more quickly the goal state can be reached.
For some states s, Q(s, ·) has two or more maxima, which means that the expected cumulative returns obtained by taking these actions are equal. In this case the system selects any one of these actions at random and updates the current state to the next state s_1. If the goal state has been reached, it stops; otherwise it reselects the optimal action and updates the state again, until the goal state s_goal of this subtask is reached.
The resulting action sequence is the optimal action sequence under the initial state, μ_1 = μ(s_0) = {a_0, a_1, ...}. The obtained optimal actions are spliced in order, with motion blending applied at the junctions of the clips, to generate the optimal action sequence μ_1 of the first subtask control strategy starting from the given initial state s_0.
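A sketch of this greedy rollout: from the initial state, repeatedly take an action with maximal Q (ties broken at random) until the subtask's goal state is reached; helper names are assumptions, and the final state is returned so it can seed the next subtask's controller, as in step S502 below:

import numpy as np

def rollout_subtask(Q, valid, step_fn, s0, s_goal, max_steps=500, rng=None):
    """Extract the optimal action sequence mu = {a_0, a_1, ...} of one subtask from its Q matrix."""
    rng = rng or np.random.default_rng()
    s, actions_taken = s0, []
    for _ in range(max_steps):
        if s == s_goal:                                   # goal state of this subtask reached
            break
        candidates = np.flatnonzero(valid[s])
        q = Q[s, candidates]
        best = candidates[np.flatnonzero(q == q.max())]   # several actions may tie on Q(s, a)
        a = int(rng.choice(best))                         # pick any maximizer at random
        actions_taken.append(a)
        s = step_fn(s, a)                                 # advance to the next state
    return actions_taken, s                               # s becomes the next subtask's initial state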
S502: Taking the final state of the first subtask controller as the initial state of the subsequent subtask controller and obtaining the optimal action sequences of the subsequent subtasks in turn
Taking the final state of the first subtask controller as the initial state of the subsequent subtask controller, the optimal action subsequences μ_2, μ_3, ... of the subsequent subtasks are obtained, as shown in Fig. 6.
S503: Splicing the optimal action sequences of all subtasks in order to obtain the optimal action sequence of the original combined task
Denote by M_1 = {p^1_1, ..., p^1_i} and M_2 = {p^2_1, ..., p^2_j} the two motion clips to be spliced, where i and j are the total numbers of frames of the two clips. To achieve a smooth transition, linear interpolation is applied to the position of the virtual human's root joint and quaternion spherical linear interpolation to the joint orientations, and the synthesized motion clip is:
M̃ = M_1 ⊕ M_2 = {p^1_1, ..., p^1_(i−k), p_1, ..., p_k, p^2_(k+1), ..., p^2_j} (formula 7)
where
R(p_t) = α(t) · R(p^1_(i−k+t)) + [1 − α(t)] · R(p^2_t) (formula 8)
q(p_t) = slerp(q(p^1_(i−k+t)), q(p^2_t), α(t)) (formula 9)
where R(p_t) denotes the root joint position of pose p_t and q(p_t) denotes the orientation of each joint of pose p_t; in addition,
α(t) = 2((t − 1)/(k − 1))^3 − 3((t − 1)/(k − 1))^2 + 1, 1 ≤ t ≤ k (formula 10)
where the fusion coefficient α(t) satisfies: α(t) = 1 when t ≤ 1; α(t) = 0 when t ≥ k; and α(t) is C^1 continuous everywhere.
In this embodiment, the above process is shown in Fig. 7. Before synthesis, motion clip M_1 has i frames and M_2 has j frames. If the selected blending window is k frames, the final synthesized clip has i + j − k frames. The first i − k frames of the synthesized clip are identical to the first i − k frames of M_1, the last j − k frames are identical to the last j − k frames of M_2, and the middle k frames are obtained by the interpolation above.
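A sketch of formulas 7-10: the last k frames of M1 are blended with the first k frames of M2, with root positions interpolated linearly and joint orientations interpolated by quaternion slerp under the cubic weight α(t); the frame layout (root position plus an array of joint quaternions) is an assumption:

import numpy as np

def fusion_weight(t, k):
    """Formula 10: cubic ease from alpha(1) = 1 down to alpha(k) = 0 (t is 1-based)."""
    if k <= 1:
        return 1.0                                   # degenerate one-frame window
    u = (t - 1) / (k - 1)
    return 2 * u**3 - 3 * u**2 + 1

def slerp(qa, qb, w):
    """Spherical linear interpolation: returns qa when w = 1 and qb when w = 0 (formula 9)."""
    dot = float(np.clip(np.dot(qa, qb), -1.0, 1.0))
    if dot < 0.0:                                    # take the shorter arc
        qb, dot = -qb, -dot
    if dot > 0.9995:                                 # nearly parallel: fall back to lerp
        q = w * qa + (1 - w) * qb
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin(w * theta) * qa + np.sin((1 - w) * theta) * qb) / np.sin(theta)

def splice(M1, M2, k):
    """Formula 7: splice M1 (i frames) and M2 (j frames) over a k-frame blend window.

    Each frame is a pair (root_pos: (3,) array, quats: (N+1, 4) array); the result
    has i + j - k frames.
    """
    i = len(M1)
    out = list(M1[: i - k])                          # first i-k frames of M1 unchanged
    for t in range(1, k + 1):                        # k blended frames in the middle
        a = fusion_weight(t, k)
        r1, q1s = M1[i - k + t - 1]
        r2, q2s = M2[t - 1]
        root = a * np.asarray(r1) + (1 - a) * np.asarray(r2)          # formula 8
        quats = np.array([slerp(np.asarray(qa), np.asarray(qb), a)    # formula 9
                          for qa, qb in zip(q1s, q2s)])
        out.append((root, quats))
    out.extend(M2[k:])                               # last j-k frames of M2 unchanged
    return out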
The invention provides a general and efficient method for planning the virtual human's combined tasks, so that the motion of a virtual character can be synthesized with a high-level means of control. The method abstracts the goal task into subtasks at different levels and trains the control strategy of each subtask in a smaller subproblem space, thereby reducing the scale of the problem and accelerating its solution. A "strategy" refers to a mapping from the virtual human's states to actions (for example, one-to-one or one-to-many mapping relations); it is the basis on which the virtual human selects actions so as to obtain the greatest expected cumulative return from the environment.
The key-state-based combined task planning method for a virtual human proposed by the invention has the following advantages: (1) the combined task is decomposed into multiple subtasks and solved in small-scale sub-state spaces, which greatly reduces the required computing time and storage space; (2) for a virtual human's combined task, because each subtask is planned by divide and conquer, a more accurate controller can be obtained, guiding the virtual human to reach the given goal more quickly; (3) the algorithm makes no assumption about the shape of the controller's value function; it only needs to ensure that each state-action pair can be visited repeatedly and frequently to guarantee convergence to the optimal control strategy with probability one.
It should be noted and understood that various modifications and improvements can be made to the invention described in detail above without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the scope of the claimed technical solution is not limited by any particular exemplary teaching given here.

Claims (6)

1. A combined task planning method for a virtual human, comprising:
Step 1, building the virtual human's behavior graph from motion capture data;
Step 2, finding key states, and decomposing the combined task into subtasks based on the key states;
Step 3, learning the optimal control strategy of each subtask; and
Step 4, computing the optimal action sequence of the combined task from the initial state of the virtual human in the environment;
wherein said step 1 further comprises:
Step 1.1, dividing said motion capture data into motion units;
Step 1.2, clustering the motion units and defining each class of motion units as a behavior;
Step 1.3, calibrating the constraint relations among the behaviors; and
Step 1.4, building the virtual human's behavior graph according to the calibrated constraint relations;
said step 2 further comprises: searching for successful paths of tasks by sparse sampling in the state space, and finding said key states according to the visit frequency of each state;
said step 3 further comprises:
Step 3.1, defining the learning rate and discount factor of the learning model;
Step 3.2, defining the one-step return function and initializing the cumulative return function;
Step 3.3, choosing an arbitrary initial state, choosing an optimal action for this state according to the current value function, updating the state to the next state, and revising the expected cumulative return function; and
Step 3.4, judging whether the expected cumulative return has converged, and if not, repeating step 3.3;
said step 4 further comprises:
Step 4.1, taking the given initial state as the input of the first subtask's control strategy and obtaining the optimal action sequence of this subtask;
Step 4.2, taking the final state of the first subtask controller as the initial state of the subsequent subtask controller, and obtaining the optimal action sequences of the subsequent subtasks in turn;
Step 4.3, splicing the optimal action sequences of all subtasks in order to obtain the optimal action sequence of the original combined task.
2. The combined task planning method for a virtual human according to claim 1, wherein said motion capture data is represented as:
C = {c_1, ..., c_M}
where M is the total number of motion clips, and each motion clip c_i (i = 1, ..., M) consists of a group of poses:
c_i = {p_1, ..., p_T}
where T is the number of frames of the clip, and each pose is represented as:
p_t = {R, q_0, ..., q_N} (t = 1, ..., T)
where R ∈ R^3 denotes the position of the virtual human's root joint in the current pose; q_0 denotes the orientation of the root joint, represented by a unit quaternion (w, x, y, z); q_n (n = 1, ..., N) denotes the orientation of each non-root joint relative to its parent joint; and N is the number of joints of the human model.
3. The combined task planning method for a virtual human according to claim 1, wherein step 2 further comprises:
Step 2.1, sparsely sampling the state space and randomly drawing n_st two-tuples (s_init, s_goal);
Step 2.2, for each two-tuple, taking s_init as the initial state and s_goal as the final state:
training N_train times with trial-and-error search to find successful paths from s_init to s_goal;
counting, for each state s, the accumulated number of times n(s) it is visited on these paths; and
Step 2.3, repeating the following steps until the specified number of subtasks is obtained:
finding the key state s_max satisfying s_max = arg max_s n(s), which serves as the final state of the subtask;
counting, for each state s, the number of times n(s, s_max) it is visited via the key state s_max;
computing n̄(s_max) = avg_s n(s, s_max);
selecting the states s satisfying n(s, s_max) ≥ n̄(s_max) and adding them to the state set of this subtask.
4. The combined task planning method for a virtual human according to claim 3, wherein step 2.1 further comprises:
defining a state as:
s = (B_s, x_i, y_i, z_i, θ)
where B_s denotes a node of the behavior graph; (x_i, y_i, z_i) denotes the relative position between the virtual human and the other objects in Euclidean space, i = 1, ..., n, with n the number of objects in the environment; and θ denotes the angle between the positive x direction and the projection onto the x-z plane of the orientation vector of the virtual human's root joint;
defining an action as:
a = (B_a, x_mid, z_mid)
where B_a denotes the current action; and (x_mid, z_mid) denotes the mid-clip touchdown displacement, i.e. the displacement of the middle frame of the motion clip relative to its initial frame.
5. The combined task planning method for a virtual human according to claim 1, wherein step 3.2 comprises:
defining the one-step return matrix R:
R(s, a) = min R, if s_1 = null; max R, if s_1 = s_goal; −ω_T · T(s, a) + ω_P · P(a), otherwise. (formula 4)
where state s_1 denotes the next state after the virtual human selects action a in state s; T(s, a) describes the physical difference from state s to s_1, expressed by the change in the virtual human's position and orientation, and the smaller its value the smoother the transition; P(a) describes the virtual human's preference for action a, and the larger its value the more this action tends to be selected; ω_T and ω_P are weighting coefficients; max R and min R denote the upper and lower bounds of R; and s_goal denotes the goal state.
6. The combined task planning method for a virtual human according to claim 1, wherein step 4.3 comprises:
denoting by M_1 = {p^1_1, ..., p^1_i} and M_2 = {p^2_1, ..., p^2_j} the two motion clips to be spliced, where i and j are the total numbers of frames of the two clips; applying linear interpolation to the position of the virtual human's root joint and quaternion spherical linear interpolation to the joint orientations, the synthesized motion clip is:
M̃ = M_1 ⊕ M_2 = {p^1_1, ..., p^1_(i−k), p_1, ..., p_k, p^2_(k+1), ..., p^2_j} (formula 7)
where
R(p_t) = α(t) · R(p^1_(i−k+t)) + [1 − α(t)] · R(p^2_t) (formula 8)
q(p_t) = slerp(q(p^1_(i−k+t)), q(p^2_t), α(t)) (formula 9)
where R(p_t) denotes the root joint position of pose p_t and q(p_t) denotes the orientation of each joint of pose p_t; in addition,
α(t) = 2((t − 1)/(k − 1))^3 − 3((t − 1)/(k − 1))^2 + 1, 1 ≤ t ≤ k (formula 10)
where the fusion coefficient α(t) satisfies: α(t) = 1 when t ≤ 1; α(t) = 0 when t ≥ k; and α(t) is C^1 continuous everywhere.
CN201210125122.4A 2012-04-25 2012-04-25 Method for planning combined tasks for virtual human Active CN102708377B (en)

Priority Applications (1)

Application Number: CN201210125122.4A; Priority date: 2012-04-25; Filing date: 2012-04-25; Title: Method for planning combined tasks for virtual human (granted as CN102708377B)


Publications (2)

Publication Number Publication Date
CN102708377A CN102708377A (en) 2012-10-03
CN102708377B true CN102708377B (en) 2014-06-25

Family

ID=46901120






Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant