US8917907B2 - Continuous linear dynamic systems - Google Patents
- Publication number
- US8917907B2
- Authority
- US
- United States
- Prior art keywords
- action
- feature
- models
- sparse coding
- transition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related, expires
Classifications
-
- G06K9/00765—
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/29—Graphical models, e.g. Bayesian networks
- G06F18/295—Markov models or related models, e.g. semi-Markov models; Markov random fields; Networks embedding Markov models
-
- G06K9/00335—
-
- G06K9/6297—
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/84—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using probabilistic graphical models from image or video features, e.g. Markov models or Bayesian networks
- G06V10/85—Markov-related models; Markov random fields
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/49—Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
Definitions
- the present patent document is directed towards systems and methods for segmentation and recognition of actions, including action transitions.
- Vision-based action recognition has wide application.
- vision-based action recognition may be used in driving safety, security, signage, home care, robot training, and other applications.
- in Programming-by-Demonstration (PbD), a task to be trained is often decomposed into primitive action units.
- a human demonstrates a task that is desired to be repeated by a robot. While the human demonstrates the task, the demonstration process is captured by sensors, such as a video or videos using one or more cameras. These videos are segmented into individual unit actions, and the action type is recognized for each segment. The recognized actions are then translated into robotic operations for robot training.
- the image features ideally should satisfy a number of criteria. First, they should be able to identify actions in different demonstration environments. Second, they should support continuous frame-by-frame action recognition. And third, they should have low computational costs.
- the classifier used for such applications should satisfy several conditions. First, it should be able to model temporal image patterns by actions in controlled environments. Second, it should be capable of continuous action recognition. And third, it should be able to be generalized from small training data sets. Existing techniques typically do not satisfy all of these criteria.
- Continuous action recognition is a challenging computer vision task because multiple actions are performed sequentially without clear boundaries, which requires action segmentation and classification to be performed simultaneously.
- action boundary segmentation and action type recognition may be approached separately, or the temporal recognition problem may be converted into identifying representative static templates, such as those that identify unique hand shape, hand orientation, or hand location matching for hand gesture/sign language recognition.
- a Switching Linear Dynamic System (SLDS) switches among a set of Linear Dynamic System (LDS) models. SLDS provides an intuitive framework for describing the continuous but non-linear dynamics of real-world motion, and has proven effective for texture analysis and synthesis and for recognizing bee dances and human actions with accurate motion representations, such as kinematic model parameters, joint trajectories, and joint angle trajectories.
- SLDS applies the learned dynamics in individual actions to estimate the transition between multiple actions, which leads to at least three limitations.
- FIG. 1( a ) depicts a graphical representation of a Linear Dynamic System (LDS).
- FIG. 1( b ) depicts a graphical representation of a Switching Linear Dynamic System (SLDS).
- FIG. 1( c ) depicts a graphical representation of a Continuous Linear Dynamic System (CLDS) according to embodiments of the present invention.
- FIG. 2 depicts a method for training a Continuous Linear Dynamic System according to embodiments of the present invention.
- FIG. 3 depicts a block diagram of a Continuous Linear Dynamic System (CLDS) trainer for developing a CLDS according to embodiments of the present invention.
- FIG. 4 depicts a block diagram of a feature extractor for generating a feature for a set of input sensor data according to embodiments of the present invention.
- FIG. 5 depicts a method for using a Continuous Linear Dynamic System (CLDS) to detect and label actions in input sensor data according to embodiments of the present invention.
- FIG. 6 depicts a block diagram of a Continuous Linear Dynamic System (CLDS) model detector for detecting and labeling actions according to embodiments of the present invention.
- FIG. 7 illustrates a flowchart of the Embedded Optical Flow feature and the Continuous Linear Dynamic System framework for simultaneous action primitive segmentation and recognition from continuous video according to embodiments of the present invention.
- FIGS. 8( a ) and 8 ( b ) graphically depict results of continuous action recognition and segmentation for two in-house datasets according to embodiments of the present invention.
- FIG. 9 graphically depicts results of continuous action recognition and segmentation for the IXMAS dataset according to embodiments of the present invention.
- FIG. 10 depicts a block diagram illustrating an exemplary system which may be used to implement aspects of the present invention.
- aspects of the present invention include systems and methods for segmentation and recognition of action primitives. Embodiments also include using an embedded optical flow feature for model training and/or detection purposes. Embodiments of the present invention include methods that have been encoded upon one or more computer-readable media with instructions for one or more processors or processing units to perform. The method may include a plurality of instructions that are executed by the processor.
- the framework utilizes higher-order information to explicitly model the transition within a single action primitive or between successive action primitives.
- the framework referred to as the Continuous Linear Dynamic System (CLDS) framework, comprises two sets of Linear Dynamic System (LDS) models, one to model the dynamics of individual primitive actions and the other to model the transition between actions.
- the inference process estimates the best decomposition of a whole sequence by continuously alternating between the two sets of models.
- an approximate Viterbi algorithm may be used in the inference process.
- both action type and action boundary may be accurately recognized.
- a computer-readable medium or media comprises one or more sequences of instructions which, when executed by one or more processors, causes steps for recognizing a sequence of actions comprising: segmenting input sensor data into time frames, generating a feature for each time frame, and for a time frame, selecting an action associated with a maximized value of an objective function that comprises a first set of feature-dependent models for intra-action transitions and a second set of feature-dependent models for inter-action transitions.
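The claimed steps — segment the input into time frames, generate a per-frame feature, and pick the action that maximizes an objective combining intra-action and inter-action models — can be sketched as a simple greedy loop. This is an illustrative simplification, not the patent's full inference procedure; the model containers `log_intra` and `log_inter` and their call signatures are assumptions for the example.

```python
import numpy as np

def recognize_sequence(features, actions, log_intra, log_inter):
    """Greedy frame-by-frame labeling sketch.

    features  : list of 1-D feature vectors, one per time frame
    actions   : list of candidate action labels
    log_intra[a](f)     : log-score of feature f continuing action a
    log_inter[(a, b)](f): log-score of f being a transition from a to b
    Both model dictionaries are hypothetical stand-ins for the patent's
    feature-dependent LDS models.
    """
    labels, prev = [], None
    for f in features:
        best, best_score = None, -np.inf
        for a in actions:
            score = log_intra[a](f)
            if prev is not None and prev != a:
                # observation-dependent inter-action term
                score += log_inter[(prev, a)](f)
            if score > best_score:
                best, best_score = a, score
        labels.append(best)
        prev = best
    return labels
```

A true implementation would keep multiple hypotheses per frame (Viterbi-style) rather than committing greedily.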
- the feature-dependent models are based upon linear dynamic system models.
- the objective function models a hidden state at a current time frame as being directly dependent upon a hidden state for a sequence of one or more time frames prior to the current time frame, an estimated action for the sequence of one or more time frames prior to the current time frame, and an estimated action of the current time frame.
- the feature-dependent models for inter-action transitions may include models to model transitions between the same action.
- the feature-dependent models may include models to cover transitions for when an action repeats.
- the input sensor data is a video and the time frame data may be image frames.
- a feature generated for an image frame may be an embedded optical flow feature.
- a computer-readable medium or media comprises one or more sequences of instructions which, when executed by one or more processors, causes steps to be performed that comprise modeling actions and modeling one or more action transitions.
- the action models may comprise a plurality of action dynamic models, wherein each dynamic model comprises a label associated with the action that it models, and one or more transition models model inter-action transitions, wherein each transition dynamic model comprises a label associated with the transition between actions that it models.
- an action may be selected that is associated with the dynamic model that maximizes an objective function, wherein a transition probability from one action to another action is non-scalar and is calculated using at least one transition dynamic model.
- the EOF feature is based on embedding optical flow at interest points using Locality-constrained Linear Coding with weighted average pooling.
- the EOF feature is histogram-like but presents excellent linear separability.
- the EOF feature is able to take advantage of both global and local information by being spatially “global” and temporally “local.” EOF is spatially global in that it considers the distribution of optical flow information from each of a set of frames, and it is also temporally local in that it represents individual frames.
- the temporal evolution of EOF may be modeled by a sequential classifier.
- a computer-implemented method for generating an EOF comprises obtaining a set of local motion features, or optical flows, for an image frame from a video or other sensor. For each local motion feature from a set of local motion features, a sparse coding vector is generated and the EOF is formed to represent the image frame by a weighted pooling of the sparse coding vectors, wherein the weighting is based upon a distribution of local motion features in the image frame.
- the weighted pooling may be done by pooling sparse coding vectors for the image frame by weighting each sparse coding vector in relation to a posterior of a local motion feature corresponding to the sparse coding vector.
- the weighted pooling may be done by weighting each sparse coding vector in inverse proportion to the square root of the posterior of the local motion feature corresponding to the sparse coding vector. For example, in embodiments, the equation y = C P_o^{-1/2}(X) may be used to generate the feature, where y represents the image feature for the image frame, C represents a matrix of sparse coding vectors for the image frame, and P_o(X) represents a matrix of posterior values for the local motion features.
- the pooled sparse coding vectors are also normalized to form the image feature.
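The weighted pooling and normalization described above can be sketched as follows, with the codes stored one row per local feature (so the product y = C P_o^{-1/2}(X) becomes a weighted column sum). The L2 normalization at the end is an assumption about the normalization step.

```python
import numpy as np

def eof_pool(codes, posteriors):
    """Pool sparse codes into one frame feature with posterior weighting.

    codes      : (N, M) matrix, one LLC code per local optical-flow feature
    posteriors : (N,) vector of P_o(x_i) values for the N local features
    Each code is weighted by the inverse square root of its feature's
    posterior, summed, and L2-normalized (normalization choice assumed).
    """
    w = posteriors ** -0.5                  # down-weight frequent motions
    y = (codes * w[:, None]).sum(axis=0)    # weighted sum over local features
    return y / (np.linalg.norm(y) + 1e-12)  # unit-length frame feature
```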
- a Locality-constrained Linear Coding (LLC) codebook may be used to generate the sparse coding vectors.
- at least some of the local motion vectors, or optical flows, may be used to generate a codebook.
- the method may include extracting frames from a video.
- the method may include extracting feature points from the frames in order to obtain optical flows for the frames.
- the EOF features may be used to train a model for action detection, may be used by a trained model for detecting one or more actions, or both.
- connections between components within the figures are not intended to be limited to direct connections. Rather, data between these components may be modified, re-formatted, or otherwise changed by intermediary components. Also, additional or fewer connections may be used. It shall also be noted that the terms “coupled” or “communicatively coupled” shall be understood to include direct connections, indirect connections through one or more intermediary devices, and wireless connections.
- a Linear Dynamic System (LDS) model is a time-series state-space model that comprises a linear dynamics model and a linear observation model.
- FIG. 1( a ) depicts a graphical representation of an LDS model.
- the Markov chain represents the evolution of a continuous hidden state x with prior density x_0 ~ N(mu_0, Sigma_0).
- the state x_t is obtained as the product of the transition matrix F and the previous state x_{t-1}, corrupted by additive white noise v_t ~ N(0, Q).
- an LDS model is defined by the tuple {(mu_0, Sigma_0), (F, Q), (H, R)}.
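To make the tuple concrete, the generative process an LDS defines can be simulated directly from {(mu_0, Sigma_0), (F, Q), (H, R)}: sample the initial state, then alternate linear dynamics and linear observation, each with Gaussian noise. This is a generic LDS sketch, not code from the patent.

```python
import numpy as np

def simulate_lds(mu0, S0, F, Q, H, R, T, rng=None):
    """Sample a trajectory from the LDS {(mu0, S0), (F, Q), (H, R)}.

    x_0 ~ N(mu0, S0); x_t = F x_{t-1} + v_t, v_t ~ N(0, Q);
    y_t = H x_t + w_t, w_t ~ N(0, R).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    x = rng.multivariate_normal(mu0, S0)
    xs, ys = [], []
    for _ in range(T):
        x = F @ x + rng.multivariate_normal(np.zeros(len(x)), Q)  # dynamics
        y = H @ x + rng.multivariate_normal(np.zeros(H.shape[0]), R)  # observation
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)
```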
- P_r(s) represents the prior of label s.
- the likelihood terms conditioned on x_t and s can be obtained from Kalman smoothing.
- An SLDS model describes the dynamics of a more complex process by switching among a set of LDS models over time.
- exact calculation over all possible settings of S is intractable; reported inference algorithms include Viterbi, Variational, and Generalized Pseudo-Bayesian methods, which are known to those of ordinary skill in the art. For instance, the approximate Viterbi algorithm solves the inference problem by sequentially decomposing it into sub-problems. According to Equation (3), we can have
- switching between different LDSs is a discrete first-order Markov process controlled by the state variables s t as illustrated in FIG. 1( b ), which is a graphical model of an SLDS.
- the switching process is defined by the transition probabilities P(s_{t+1} = j | s_t = i) in Equation (5) and an initial state distribution m_0.
- a transition probability is a single scalar value based upon training data and is independent of the input observations, which may be an image feature.
- embodiments of the present invention explicitly model the temporal transitions between neighboring action primitives. In this way, the probability of a transition between two primitives depends not only on the co-occurrence statistics of the two, but also on the observation.
- inter-action transitions are modeled using a set of LDSs, and an inference process outputs an optimal decomposition of a video by continuously alternating between the intra-action LDSs and the inter-action LDSs.
- this novel approach may be referred to herein as Continuous LDS (CLDS).
- a modified Viterbi algorithm is presented below for inferencing both the action type and action boundary in the CLDS framework.
- in Equation (6), SLDS applies the state transition matrix of the next hypothesized LDS model s_{t+1} to estimate the transition from the current action label s_t.
- F is trained to model the dynamics within an action; therefore, the capacity of any F is limited when used to model the transition between actions.
- the duration information of actions has been considered in the duration SLDS framework (dSLDS), where the semi-Markov transition probability is calculated with one added counter variable that keeps increasing (for intra action) or resetting (for inter action).
- it has been proposed to use a second-order SLDS with the following transition equation: x_{t+1} = F(s_{t+1}) x_t + F(s_t) x_{t-1} + v_{t+1}(s_{t+1}).
- these additional LDSs may be referred to as the Transition LDS, and the LDSs for action primitives may be referred to as the Action LDS.
- since a Transition LDS is trained from portions of action transitions, it intuitively serves as a binary classifier for boundary/non-boundary decisions, and effectively it helps bridge the gap between neighboring action primitives.
- a continuous video may be decomposed by continuously alternating between Action LDSs and Transition LDSs; hence, the name for this new framework is aptly Continuous LDS.
- FIG. 1( c ) depicts a graphical representation of a Continuous Linear Dynamic System (CLDS) according to embodiments of the present invention.
- the graphical model in FIG. 1( c ) depicts three levels.
- at the first level, y represents the observation, which, in embodiments, may be an image feature representing a video frame.
- at the next level, x represents the hidden state or latent variable.
- at the top level, s represents the action label.
- the arrows represent the relationships between these elements. It shall be noted that unique to CLDS is the direct influence 120 that the estimated action label for the prior time frame, s_t, has on the hidden state for the current time frame under consideration, namely x_{t+1}.
- the next section explains embodiments of the inference process in CLDS.
- CLDS gains more flexibility during the inference process.
- an observation-dependent transition probability (through the hidden layer X) may be calculated using the corresponding set of LDSs, determined by both the current and the next action labels.
- the inference process of CLDS may also use an approximate Viterbi algorithm, where F is selected by
- F_i is from Action LDS i
- F_ij is from Transition LDS i to j.
- F_ii ≠ F_i; rather, F_ii is from Transition LDS i to i (i.e., the same action is repeating).
- Q, H, and R may be selected similarly.
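The selection rule above — intra-action dynamics use the Action LDS matrix F_i, while any hypothesized change (including a repeat, i to i) uses a Transition LDS matrix F_ij — reduces to a lookup keyed on the label pair. This is a minimal sketch; the dictionary layout is an assumption, and Q, H, and R would be looked up the same way.

```python
import numpy as np

def select_F(s_prev, s_next, at_boundary, F_action, F_trans):
    """Pick the state-transition matrix for one CLDS inference step.

    F_action[i]      : F_i, dynamics inside action i (Action LDS)
    F_trans[(i, j)]  : F_ij, dynamics across a boundary from i to j
                       (Transition LDS); note F_trans[(i, i)] != F_action[i]
    at_boundary      : whether the current step is hypothesized to cross
                       an action boundary
    """
    if not at_boundary:
        return F_action[s_next]        # within-action dynamics, F_i
    return F_trans[(s_prev, s_next)]   # inter-action dynamics, F_ij
```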
- CLDS can easily identify action boundaries: the instances when the state transits between actions mark the corresponding frames as action boundaries.
- the overall objective function may be maximized by recursively maximizing
- the terms conditioned on x_t, s_{t+1}, and s_t may be obtained using Equation (8) and Equation (9), respectively.
- the Transition LDS and Action LDS may be trained jointly by an expectation-maximization (EM) algorithm, which is well known to those of ordinary skill in the art.
- in the E step, the optimal decomposition of the example sequences into intra/inter-action transitions is obtained; in the M step, the LDS model parameters are learned by Maximum a Posteriori estimation. Since the action boundaries are usually given in training, it is straightforward to partition the sequences accordingly to initialize the EM process.
- an approximate decomposition is sufficient if the portions between two boundaries are treated as intra-action segments and the portions that cover the offset (ending) of a first action and the onset (starting) of the following action are used as inter-action transition segments.
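The initialization just described — intra-action segments between boundaries, inter-action segments straddling each boundary — can be sketched as follows. The half-window width around each boundary is an assumption; the patent does not specify one.

```python
def init_segments(n_frames, boundaries, half_window=2):
    """Partition a training sequence for EM initialization.

    boundaries : frame indices where one action ends and the next begins
    Returns (intra, inter): lists of (start, end) frame ranges, end exclusive.
    Inter-action segments cover the offset of one action and the onset of
    the next (half_window frames on each side, an assumed width).
    """
    inter = [(max(0, b - half_window), min(n_frames, b + half_window))
             for b in sorted(boundaries)]
    cuts = [0] + sorted(boundaries) + [n_frames]
    intra = [(cuts[i], cuts[i + 1]) for i in range(len(cuts) - 1)]
    return intra, inter
```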
- FIG. 2 depicts a general methodology for training a Continuous Linear Dynamic System according to embodiments of the present invention.
- the process commences by segmenting ( 205 ) input sensor data into time frames.
- the input sensor data may be video data; however, it shall be noted that the CLDS may be used for other applications and may use other sensor data.
- a feature for each time frame is generated ( 210 ).
- an image feature for the image frame is generated to represent the image frame.
- the image feature is an embedded optical flow feature, which is described in commonly assigned and co-pending U.S. patent application Ser. No. 13/405,986 further identified above.
- the CLDS training process uses the features and known labels to train a CLDS, which comprises one or more intra-action models (Action LDS) and one or more inter-action models (Transition LDS), for continuous action segmentation and recognition.
- the Transition LDS and Action LDS may be trained jointly by an expectation-maximization (EM) algorithm, which is well known to those of ordinary skill in the art.
- in the E step, the optimal decomposition of the example sequences into intra/inter-action transitions is obtained; in the M step, the LDS model parameters are learned by Maximum a Posteriori estimation. Since the action boundaries are usually given in training, it is straightforward to partition the sequences accordingly to initialize the EM process.
- an approximate decomposition is sufficient if the portions between two boundaries are treated as intra-action segments and the portions that cover the offset (ending) of a first action and the onset (starting) of the following action are used as inter-action transition segments.
- FIG. 3 depicts a block diagram of a Continuous Linear Dynamic System (CLDS) trainer for developing a CLDS according to embodiments of the present invention.
- CLDS model trainer 305 utilizes the methods discussed above to train a CLDS model.
- the CLDS model trainer 305 comprises a frame extractor 310 , a feature extractor 315 , and a model trainer 320 .
- the frame extractor 310 receives input sensor data 345 , such as a video, and segments the data into time frames.
- the feature extractor 315 receives the segmented data from the frame extractor 310 and extracts a feature to represent each segmented data frame.
- the feature may be an embedded optical flow feature.
- FIG. 4 depicts a block diagram of a spatial feature extractor 315 for generating an embedded optical flow (EOF) feature according to embodiments of the invention.
- the process flow commences by receiving input sensor data, which for this depicted embodiment is video frames 430 .
- a spatial pyramid technique may be applied, where the image frame may be divided into two or more sections and one or more of the subsequent steps may be processed by sections.
- optical flows are extracted from the frames.
- the optical flow may be computed using, by way of example and not limitation, Harris interest point detection to detect feature points from the frames and a Lucas-Kanade method to calculate the optical flow for each feature point. Other methods for obtaining optical flows which are known to those of ordinary skill in the art may also be employed.
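The Lucas-Kanade step can be illustrated with a minimal single-scale implementation: around each feature point, solve the 2x2 normal equations built from spatial and temporal image gradients. This is an educational sketch, not the patent's implementation; production systems use pyramidal, iterative variants (e.g., OpenCV's calcOpticalFlowPyrLK).

```python
import numpy as np

def lucas_kanade(prev, curr, points, win=2):
    """Minimal single-scale Lucas-Kanade flow at given points.

    prev, curr : 2-D grayscale frames (float arrays)
    points     : list of (row, col) interior feature points
    Solves sum(g g^T) d = -sum(g * It) over a (2*win+1)^2 window,
    returning one (dx, dy) displacement per point.
    """
    Iy, Ix = np.gradient(prev)   # spatial gradients of the first frame
    It = curr - prev             # temporal gradient
    flows = []
    for r, c in points:
        sl = (slice(r - win, r + win + 1), slice(c - win, c + win + 1))
        gx, gy, gt = Ix[sl].ravel(), Iy[sl].ravel(), It[sl].ravel()
        A = np.array([[gx @ gx, gx @ gy],
                      [gx @ gy, gy @ gy]])
        b = -np.array([gx @ gt, gy @ gt])
        flows.append(np.linalg.solve(A, b))  # (dx, dy)
    return np.array(flows)
```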
- motion information is represented based on optical flows because they are fast to compute and robust against many transforms. It shall be noted that, for purposes of explanation, two-dimensional (2D) examples are depicted herein; however, the systems and methods may be applied to different dimensionalities.
- the LLC processor 420 receives the optical flows from the optical flow extractor 415 and uses the optical flows to generate higher-order codes.
- Embodiments of the EOF methodology utilize an optical flow codebook.
- the optical flow codebook 422 may be a previously generated codebook.
- a small set of frames may be used to provide training samples to build the optical flow codebook.
- a new codebook may be generated if the optical flow in the testing/detection video exhibits significantly different distributions from the training frames. Such situations usually arise with changes of task environment factors, such as the set of actions to be recognized, the parameters of the camera or sensor, the complexity of the background, etc.
- the LLC coding-based process converts each optical flow vector x_i into a corresponding higher-dimensional code c_i ∈ R^M.
- the coding step solves the following criteria in general:
- B i contains the k nearest neighbors of x i from B.
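The coding criterion with the k-nearest-neighbor basis B_i has a fast analytical approximation (due to Wang et al.'s LLC work): reconstruct x_i from its k nearest codebook entries under a sum-to-one constraint. The sketch below follows that approximation; the regularizer value is an assumption.

```python
import numpy as np

def llc_code(x, B, k=3, reg=1e-4):
    """Approximate Locality-constrained Linear Coding for one descriptor.

    x : (d,) optical-flow descriptor;  B : (M, d) codebook.
    Returns a sparse M-dimensional code with nonzeros only on the k
    nearest codebook entries, summing to one.
    """
    d2 = ((B - x) ** 2).sum(axis=1)
    idx = np.argsort(d2)[:k]               # indices of B_i, the k-NN basis
    z = B[idx] - x                         # shift basis to the origin
    C = z @ z.T + reg * np.trace(z @ z.T) * np.eye(k)  # regularized covariance
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                           # enforce sum-to-one constraint
    code = np.zeros(len(B))
    code[idx] = w
    return code
```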
- embodiments herein utilize different pooling strategies so that the distance between two image features can be measured by linearly combining per-point contributions; a linear distance measure is then sufficient.
- existing histogram-type features based on average pooling use non-linear distance metrics to achieve good performance, such as the χ² distance based on KL divergence.
- the Bhattacharyya distance has also been used to measure the distance between two distributions of local image patches, which leads to the following criteria:
- a challenge is how to obtain P o (x i ) that is not readily available as that in Gaussian mixture model (GMM).
- computing P_o(x_i) may be performed as follows. The method starts from a uniform prior for each descriptor x_i and basis vector b_j.
- this may be approximated using Hard-VQ, where Equation (4) reduces to a Dirac delta function and P(b_j | x_i) equals 1 for the nearest basis vector and 0 otherwise.
- P_o^{-1/2}(X) = [P_o^{-1/2}(x_1), P_o^{-1/2}(x_2), …, P_o^{-1/2}(x_N)]^T.
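Under the Hard-VQ approximation, a simple way to estimate the posteriors P_o(x_i) used in the weighting vector above is the empirical frequency of each descriptor's nearest codeword within the frame. This collapsing of P(b_j | x_i) to a hard assignment is a simplifying assumption consistent with the Dirac-delta description.

```python
import numpy as np

def hard_vq_posterior(X, B):
    """Estimate P_o(x_i) for weighted pooling via hard vector quantization.

    X : (N, d) local optical-flow descriptors in one frame
    B : (M, d) codebook
    Each descriptor is hard-assigned to its nearest codeword; P_o(x_i) is
    approximated by the empirical frequency of that codeword in the frame.
    """
    d2 = ((X[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)                  # hard assignment per descriptor
    counts = np.bincount(nearest, minlength=len(B))
    return counts[nearest] / len(X)              # empirical codeword frequency
```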
- the pooler 425 normalizes each feature, which may be performed as follows:
- the EOF feature is quite different from prior approaches. As compared to HOOF, embodiments of the methodologies of the current patent document do not generate a histogram to represent the frame. And, as compared to STIP, embodiments do not use temporal interest points but use spatial interest points, which allows the temporal pattern to be modeled by the classifier.
- histogram-type features based on optical flow have been used to describe motion for action recognition. Since histograms are non-Euclidean, modeling the evolution of a histogram requires non-linear sequential models, such as a Non-Linear Dynamic System using a Binet-Cauchy kernel. Unlike traditional histogram-type features based on Vector Quantization (VQ), coding-based image features have better linear separability, as shown by recent work in image classification. Since EOF is based on LLC coding, its temporal evolution can be accurately described using linear models. It shall be noted that there has been no prior work on using first-order coding criteria (such as LLC) to represent a frame for action recognition. It shall also be noted that the weighted pooling method for the EOF feature outperforms the original LLC feature.
- the EOF time series may be used by the model trainer 320 to train a CLDS for continuous action segmentation and recognition.
- the model trainer 320 comprises a module for labeling within action portions 325 that works in conjunction with Action LDS model trainer 330 to train the Action LDS models for intra-actions.
- the model trainer 320 comprises a module for labeling between action portions 335 that works in conjunction with Transition LDS model trainer 340 to train the Transition LDS models for inter-actions.
- the model trainer 320 may train the Transition LDS models and Action LDS models jointly by an expectation-maximization (EM) algorithm, which is well known to those of ordinary skill in the art.
- FIG. 5 depicts a method for using a Continuous Linear Dynamic System (CLDS) to detect and label actions in a video according to embodiments of the present invention.
- the process commences by segmenting ( 505 ) input sensor data into time frames.
- the input sensor data may be video data and the segmented time frames may be image frames, although other sensor data and configurations may be used.
- a feature for each image frame is generated ( 510 ).
- an image feature for the frame may be an embedded optical flow as previously discussed.
- These images features are then input into a CLDS model to perform ( 515 ) continuous segmentation and recognition.
- the CLDS model may use one or more of the inference methods discussed above or known to those of ordinary skill in the art.
- FIG. 6 depicts a block diagram of a Continuous Linear Dynamic System (CLDS) model detector 605 according to embodiments of the present invention.
- the CLDS model detector 605 receives input sensor data 625 and outputs a sequence of labels 630 .
- the CLDS model detector 605 comprises a frame extractor 610 , a feature extractor 615 , and a CLDS model decoder 620 .
- the CLDS model detector performs one or more methods for continuous segmentation and recognition, which include but are not limited to the methods discussed above.
- the frame extractor 610 receives input sensor data 625 , such as a video, and segments the data into time frames.
- the feature extractor 615 receives the segmented data from the frame extractor 610 and extracts a feature to represent each of the segmented data.
- the feature may be an embedded optical flow feature.
- the CLDS model decoder 620 receives the features, which acts as observations for the CLDS model.
- the trained CLDS model uses these features to estimate action labels, including transition labels.
- the CLDS model may use one or more of the inference methods discussed above or known to those of ordinary skill in the art.
- FIG. 7 illustrates a flowchart of the Embedded Optical Flow feature and the Continuous Linear Dynamic System for simultaneous action primitive segmentation and recognition from continuous video according to embodiments of the present invention.
- a CLDS model 705 comprises an action dynamic system 715 and transition dynamic system 720 , which are used to help segment and recognize actions.
- the CLDS model 705 dynamically recognizes actions, including transitions, by dynamically switching between action LDS modeling (e.g., actions 730 and 740 ) and transition LDS modeling (e.g., 735 and 745 ).
- FIG. 8( a ) graphically depicts results of continuous action recognition and segmentation for Set I of the in-house datasets according to embodiments of the present invention.
- Set II captured the actions of moving/grasping/placing three objects and screwing a fourth object with a screwdriver. These actions can be grouped into three primitives, “move-arm,” “move-obj,” and “screw.” Set II is more challenging than Set I in that the location of the object varies after manipulation and the same primitive action may repeat while the boundaries still need to be identified. There were 310 actions performed in 6 sequences, and leave-one-out cross-validation was used. The performance was tested with/without the change of object location obtained by color-based detection (Δ(x,y) in Table 3).
- FIG. 8( b ) graphically depicts example results for Set II according to embodiments of the present invention.
- CLDS instead uses additional models (the Transition LDS) to describe the transitions between successive actions. Hence, it distinguishes the dynamics before, during, and after action primitives better than SLDS. At the same time, CLDS does not have the duration constraint during inference; hence, it is more flexible than dSLDS for action types with inaccurate duration model.
- Such capacity makes CLDS a suitable framework for many continuous time series data analysis problems, such as robotic PbD, sign language recognition, gesture recognition, etc.
- the system includes a central processing unit (CPU) 1001 that provides computing resources and controls the computer.
- the CPU 1001 may be implemented with a microprocessor or the like, and may also include a graphics processor and/or a floating point coprocessor for mathematical computations.
- the system 1000 may also include system memory 1002 , which may be in the form of random-access memory (RAM) and read-only memory (ROM).
- An input controller 1003 represents an interface to various input device(s) 1004 , such as a keyboard, mouse, or stylus.
- a scanner controller 1005 which communicates with a scanner 1006 .
- the system 1000 may also include a storage controller 1007 for interfacing with one or more storage devices 1008 , each of which includes a storage medium, such as magnetic tape or disk, or an optical medium, that may be used to record programs of instructions for operating systems, utilities, and applications, which may include embodiments of programs that implement various aspects of the present invention.
- Storage device(s) 1008 may also be used to store processed data or data to be processed in accordance with the invention.
- the system 1000 may also include a display controller 1009 for providing an interface to a display device 1011 , which may be a cathode ray tube (CRT), a thin film transistor (TFT) display, or other type of display.
- the system 1000 may also include a printer controller 1012 for communicating with a printer 1013 .
- a communications controller 1014 may interface with one or more communication devices 1015 , which enable the system 1000 to connect to remote devices through any of a variety of networks, including the Internet, a local area network (LAN), or a wide area network (WAN), or through any suitable electromagnetic carrier signals, including infrared signals.
- bus 1016 which may represent more than one physical bus.
- various system components may or may not be in physical proximity to one another.
- input data and/or output data may be remotely transmitted from one physical location to another.
- programs that implement various aspects of this invention may be accessed from a remote location (e.g., a server) over a network.
- Such data and/or programs may be conveyed through any of a variety of machine-readable media, including magnetic tape or disk, optical disc, or a transmitter/receiver pair.
- Embodiments of the present invention may be encoded upon one or more non-transitory computer-readable media with instructions for one or more processors or processing units to cause steps to be performed.
- the one or more non-transitory computer-readable media shall include volatile and non-volatile memory.
- alternative implementations are possible, including a hardware implementation or a software/hardware implementation.
- Hardware-implemented functions may be realized using ASIC(s), programmable arrays, digital signal processing circuitry, or the like. Accordingly, the “means” terms in any claims are intended to cover both software and hardware implementations.
- the term “computer-readable medium or media” as used herein includes software and/or hardware having a program of instructions embodied thereon, or a combination thereof.
Abstract
Description
may be used to generate the feature, where y represents the image feature for the image frame, C represents a matrix of sparse coding vectors for the image frame, and P(X) represents a matrix of posterior values for the local motion features. In embodiments, the pooled sparse coding vectors are also normalized to form the image feature.
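As a hedged illustration of such posterior-weighted pooling, the following sketch forms a feature by pooling per-descriptor sparse codes against component posteriors and L2-normalizing the result; the product form y = CᵀP, the array shapes, and all values are illustrative assumptions, not the patented formula:

```python
import numpy as np

def pooled_feature(C, P):
    """Pool per-descriptor sparse codes into one image feature.

    C : (n, d) array, one sparse coding vector per local motion feature
    P : (n, k) array, posterior of each local feature under k components
    Returns a flattened, L2-normalized feature of length d * k.
    """
    y = C.T @ P                      # (d, k) posterior-weighted pooling
    y = y.ravel()
    norm = np.linalg.norm(y)
    return y / norm if norm > 0 else y

rng = np.random.default_rng(0)
C = np.abs(rng.standard_normal((50, 8)))   # hypothetical sparse codes
P = rng.random((50, 4))
P /= P.sum(axis=1, keepdims=True)          # rows sum to 1 (posteriors)
y = pooled_feature(C, P)                   # feature of length 8 * 4 = 32
```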
x_{t+1} = F x_t + v_{t+1},  (1)
y_{t+1} = H x_{t+1} + w_{t+1}.  (2)
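Equations (1) and (2) define a standard linear dynamic system with Gaussian process noise v and observation noise w. As a sketch under assumed parameter values (F, H, Q, R below are all made up for illustration), one can simulate such a system and run a textbook Kalman predict/update over the observations:

```python
import numpy as np

rng = np.random.default_rng(1)
F = np.array([[1.0, 0.1],
              [0.0, 1.0]])          # state transition matrix (Eq. 1)
H = np.array([[1.0, 0.0]])          # observation matrix (Eq. 2)
Q = 0.01 * np.eye(2)                # covariance of process noise v
R = np.array([[0.1]])               # covariance of observation noise w

# simulate the LDS of Equations (1)-(2)
x = np.zeros(2)
ys = []
for _ in range(100):
    x = F @ x + rng.multivariate_normal(np.zeros(2), Q)
    ys.append(H @ x + rng.multivariate_normal(np.zeros(1), R))

def kalman_step(m, P, y):
    """One predict/update cycle of the standard Kalman filter."""
    m_pred = F @ m                       # predict mean
    P_pred = F @ P @ F.T + Q             # predict covariance
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    m_new = m_pred + K @ (y - H @ m_pred)
    P_new = (np.eye(2) - K @ H) @ P_pred
    return m_new, P_new

m, P = np.zeros(2), np.eye(2)
for y in ys:
    m, P = kalman_step(m, P, y)
# m, P now hold the filtered state estimate and its covariance
```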
P(s_{t+1} | y_{t+1}, x_{t+1}, x_t, s_t) ∝ P(y_{t+1} | x_{t+1}, s_{t+1}) P(x_{t+1} | x_t, s_{t+1}) P(s_{t+1} | s_t)  (5)
x_{t+1} = F(s_{t+1}) x_t + v_{t+1}(s_{t+1}),  (6)
y_{t+1} = H(s_{t+1}) x_{t+1} + w_{t+1}(s_{t+1}),  (7)
x_{t+1} = F(s_{t+1}) x_t + F(s_t) x_{t−1} + v_{t+1}(s_{t+1}).
x_{t+1} = F(s_{t+1}, s_t) x_t + v_{t+1}(s_{t+1}, s_t),  (8)
y_{t+1} = H(s_{t+1}, s_t) x_{t+1} + w_{t+1}(s_{t+1}, s_t),  (9)
P(S_T | Y_T, X_T)
in which the first term P(s_{t+1} | y_{t+1}, x_{t+1}, x_t, s_t) ∝ P(y_{t+1} | x_{t+1}, s_{t+1}, s_t) P(x_{t+1} | x_t, s_{t+1}, s_t) P(s_{t+1} | s_t). P(x_{t+1} | x_t, s_{t+1}, s_t) and P(y_{t+1} | x_{t+1}, s_{t+1}, s_t) may be obtained using Equation (8) and Equation (9), respectively.
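In such filters, the observation factor P(y_{t+1} | x_{t+1}, s_{t+1}, s_t) is commonly evaluated as a Gaussian innovation density, N(y; H m_pred, H P_pred Hᵀ + R). A generic sketch of that computation (a standard Kalman-filter identity, not code taken from the patent):

```python
import numpy as np

def innovation_loglik(y, m_pred, P_pred, H, R):
    """Gaussian log-density N(y; H m_pred, H P_pred H^T + R),
    i.e. the observation likelihood given the predicted state."""
    S = H @ P_pred @ H.T + R             # innovation covariance
    r = y - H @ m_pred                   # innovation (residual)
    k = y.shape[0]
    _, logdet = np.linalg.slogdet(S)
    return float(-0.5 * (k * np.log(2.0 * np.pi) + logdet
                         + r @ np.linalg.solve(S, r)))

# 1-D sanity check: innovation covariance S = 1.0 + 0.5 = 1.5
H = np.array([[1.0, 0.0]])
P_pred = np.eye(2)
R = np.array([[0.5]])
ll = innovation_loglik(np.array([0.3]), np.zeros(2), P_pred, H, R)
```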
| TABLE 1 |
| Embodiment of CLDS Inference |
| input: Action LDS θ_i, Transition LDS θ_ij, i, j = 1~K types; observation Y |
| output: Label sequence s, Boundary sequence d |
| 1: L ← K × (T + 1) matrix, B ← K × T matrix, D ← K × T matrix, |
|    X ← K × (T + 1) cell, Σ ← K × (T + 1) cell |
| 2: init L(i, 0), X(i, 0) and Σ(i, 0) with θ_i, i = 1~K |
| {Transition probability} |
| 3: for t = 0 to T − 1 do |
| 4:   for i = 1 to K do |
| 5:     Predict and filter LDS state estimation x^s_{t+1|t,i} and Σ^s_{t+1|t,i} |
|        with X(i, t), Σ(i, t) and θ_i |
| 6:     Ps ← P(s_{t+1} = i | y_{t+1}, x^s_{t+1|t,i}, s_t = i) · L(i, t) |
| 7:     for j = 1 to K do |
| 8:       Predict and filter LDS state estimation x^b_{t+1|t,i,j} and |
|          Σ^b_{t+1|t,i,j} with X(j, t), Σ(j, t) and θ_ij |
| 9:       Pb(j) ← P(s_{t+1} = i | y_{t+1}, x^b_{t+1|t,i,j}, s_t = j) · L(j, t) |
| 10:    end for |
| 11:    j* ← argmax_j Pb(j) |
| 12:    if Pb(j*) < Ps then |
| 13:      L(i, t + 1) ← Ps, B(i, t + 1) ← i, D(i, t + 1) ← 0, |
|          X(i, t + 1) ← x^s_{t+1|t,i}, Σ(i, t + 1) ← Σ^s_{t+1|t,i} |
| 14:    else |
| 15:      L(i, t + 1) ← Pb(j*), B(i, t + 1) ← j*, D(i, t + 1) ← 1, |
|          X(i, t + 1) ← x^b_{t+1|t,i,j*}, Σ(i, t + 1) ← Σ^b_{t+1|t,i,j*} |
| 16:    end if |
| 17:  end for |
| 18: end for |
| {Trace back} |
| 19: i* ← argmax_i L(i, T) |
| 20: for t = T to 1 do |
| 21:   s(t) ← i*, d(t) ← D(i*, t) |
| 22:   i* ← B(i*, t) |
| 23: end for |
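Abstracting away the per-frame LDS predict/filter steps, the procedure in Table 1 is a Viterbi-style dynamic program over K action models plus K×K transition models. The following simplified sketch decodes from precomputed log-likelihood tables; the tables `log_self` and `log_trans` are hypothetical stand-ins for the Action LDS and Transition LDS scores, so this illustrates only the decoding structure, not the full patented method:

```python
import numpy as np

def clds_infer(log_self, log_trans):
    """Viterbi-style CLDS decoding over precomputed log-likelihoods.

    log_self[i, t]   : score for continuing action i at frame t
    log_trans[j, i, t]: score for switching from action j to action i at t
    Returns the label sequence s and boundary indicators d (1 = switch here).
    """
    K, T = log_self.shape
    L = np.full((K, T), -np.inf)     # best log-score ending in action i at t
    B = np.zeros((K, T), dtype=int)  # back-pointer: previous label
    D = np.zeros((K, T), dtype=int)  # 1 if a transition model was used at t
    L[:, 0] = log_self[:, 0]
    B[:, 0] = np.arange(K)
    for t in range(1, T):
        for i in range(K):
            ps = L[i, t - 1] + log_self[i, t]      # stay within action i
            pb = L[:, t - 1] + log_trans[:, i, t]  # switch from some j to i
            j = int(np.argmax(pb))
            if pb[j] > ps:
                L[i, t], B[i, t], D[i, t] = pb[j], j, 1
            else:
                L[i, t], B[i, t], D[i, t] = ps, i, 0
    # trace back (steps 19-23 of Table 1)
    s = np.zeros(T, dtype=int)
    d = np.zeros(T, dtype=int)
    i = int(np.argmax(L[:, -1]))
    for t in range(T - 1, -1, -1):
        s[t], d[t] = i, D[i, t]
        i = B[i, t]
    return s, d

# toy example: action 0 fits the first frames, action 1 the later ones
log_self = np.full((2, 6), -2.0)
log_self[0, :3] = -0.5
log_self[1, 3:] = -0.5
log_trans = np.full((2, 2, 6), -1.0)
s, d = clds_infer(log_self, log_trans)
```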
The generative model from b_j to x_i is initially assumed to be a Gaussian, i.e.,
is the EOF feature from image I, where
In embodiments, the
| TABLE 2 |
| Continuous recognition accuracy using public datasets |
| method | Idiap | IXMAS |
|---|---|---|
| CRF [Ref. 4] | 69.32% | 60.58% |
| LDCRF [Ref. 5] | 87.91% | 57.82% |
| SLDS [Ref. 6] | 71.07% | 53.61% |
| dSLDS [Ref. 7] | 80.71% | 73.42% |
| CLDS | 95.46% | 78.78% |
| TABLE 3 |
| Continuous recognition accuracy using in-house sets |
| method | Set I average | Set II w Δ(x, y) | Set II wo Δ(x, y) |
|---|---|---|---|
| CRF [Ref. 4] | 86.75% | 63.26% | 63.26% |
| LDCRF [Ref. 5] | 89.20% | 72.16% | 71.09% |
| SLDS [Ref. 6] | 85.10% | 73.58% | 66.47% |
| dSLDS [Ref. 7] | 87.38% | 82.82% | 76.18% |
| CLDS | 91.09% | 91.35% | 79.38% |
Claims (15)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201161447502P | 2011-02-28 | 2011-02-28 | |
| US13/406,011 US8917907B2 (en) | 2011-02-28 | 2012-02-27 | Continuous linear dynamic systems |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| US20120219186A1 US20120219186A1 (en) | 2012-08-30 |
| US8917907B2 true US8917907B2 (en) | 2014-12-23 |
Family
ID=46719019
Family Applications (2)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US13/406,011 Expired - Fee Related US8917907B2 (en) | 2011-02-28 | 2012-02-27 | Continuous linear dynamic systems |
| US13/405,986 Expired - Fee Related US8774499B2 (en) | 2011-02-28 | 2012-02-27 | Embedded optical flow features |
Country Status (1)
| Country | Link |
|---|---|
| US (2) | US8917907B2 (en) |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6993462B1 (en) | 1999-09-16 | 2006-01-31 | Hewlett-Packard Development Company, L.P. | Method for motion synthesis and interpolation using switching linear dynamic system models |
| US6999601B2 (en) | 1999-09-16 | 2006-02-14 | Hewlett-Packard Development Company, Lp | Method for visual tracking using switching linear dynamic systems models |
| US20110116711A1 (en) | 2009-11-18 | 2011-05-19 | Nec Laboratories America, Inc. | Locality-constrained linear coding systems and methods for image classification |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8396286B1 (en) * | 2009-06-25 | 2013-03-12 | Google Inc. | Learning concepts for video annotation |
- 2012-02-27: US 13/406,011 patented as US 8,917,907 B2; status: not active, Expired - Fee Related
- 2012-02-27: US 13/405,986 patented as US 8,774,499 B2; status: not active, Expired - Fee Related
Non-Patent Citations (11)
| Title |
|---|
| Fan, X., et al., "Generative Models for Maneuvering Target Tracking," accepted by IEEE Transactions on Aerospace and Electronic Systems, Dec. 2008. |
| Fox et al., "An HDP-HMM for Systems with State Persistence," Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. |
| Huang et al., "Variable Duration Motion Texture for Human Motion Modeling," PRICAI 2006, LNAI 4099, pp. 603-612, 2006. |
| Mesot et al., "A Simple Alternative Derivation of the Expectation Correction Algorithm," IEEE Signal Processing Letters, vol. 16, no. 2, Feb. 2009, pp. 121-124. |
| Mikolajczyk, K., et al., "Action Recognition with Motion-Appearance Vocabulary Forest," IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), pp. 1-8, Jun. 23-28, 2008. |
| Mikolajczyk, K., et al., "Scale & Affine Invariant Interest Point Detectors," International Journal of Computer Vision, 60(1), pp. 63-86, 2004. |
| Natarajan, P., et al., "Graphical Framework for Action Recognition using Temporally Dense STIPs," Workshop on Motion and Video Computing (WMVC '09), Dec. 8-9, 2009. |
| Oh et al., "Learning and Inference in Parametric Switching Linear Dynamic Systems," Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV '05), 2005. |
| Schuldt, C., et al., "Recognizing Human Actions: A Local SVM Approach," Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), vol. 3, pp. 32-36, Aug. 23-26, 2004. |
| Wang, J., et al., "Locality-constrained Linear Coding for Image Classification," IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2010), pp. 3360-3367, Jun. 13-18, 2010. |
| Wang, X., et al., "Feature Context for Image Classification and Object Detection," IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011), pp. 961-968, Jun. 20-25, 2011. |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2021098362A1 (en) * | 2019-11-19 | 2021-05-27 | 腾讯科技(深圳)有限公司 | Video classification model construction method and apparatus, video classification method and apparatus, and device and medium |
| US11967152B2 (en) | 2019-11-19 | 2024-04-23 | Tencent Technology (Shenzhen) Company Limited | Video classification model construction method and apparatus, video classification method and apparatus, device, and medium |
Also Published As
| Publication number | Publication date |
|---|---|
| US20120219186A1 (en) | 2012-08-30 |
| US8774499B2 (en) | 2014-07-08 |
| US20120219213A1 (en) | 2012-08-30 |
Legal Events
- AS (Assignment): Owner: EPSON RESEARCH AND DEVELOPMENT, INC., CALIFORNIA. Assignment of assignors interest; assignors: WANG, JINJUN; XIAO, JING. Reel/frame: 027769/0184. Effective date: 20120227.
- AS (Assignment): Owner: SEIKO EPSON CORPORATION, JAPAN. Assignment of assignors interest; assignor: EPSON RESEARCH AND DEVELOPMENT, INC. Reel/frame: 028136/0995. Effective date: 20120229.
- STCF (Information on status: patent grant): Patented case.
- MAFP (Maintenance fee payment): Payment of maintenance fee, 4th year, large entity (original event code: M1551). Year of fee payment: 4.
- FEPP (Fee payment procedure): Maintenance fee reminder mailed (original event code: REM.); entity status of patent owner: large entity.
- LAPS (Lapse for failure to pay maintenance fees): Patent expired for failure to pay maintenance fees (original event code: EXP.); entity status of patent owner: large entity.
- STCH (Information on status: patent discontinuation): Patent expired due to nonpayment of maintenance fees under 37 CFR 1.362.
- FP (Lapsed due to failure to pay maintenance fee): Effective date: 20221223.