CN112861758A - Behavior identification method based on weak supervised learning video segmentation - Google Patents

Behavior identification method based on weak supervised learning video segmentation

Info

Publication number
CN112861758A
CN112861758A
Authority
CN
China
Prior art keywords
video
frame
length
segment
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110207458.4A
Other languages
Chinese (zh)
Other versions
CN112861758B (en)
Inventor
李策
盛龙帅
姜中博
李欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Mining and Technology Beijing CUMTB
Original Assignee
China University of Mining and Technology Beijing CUMTB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Mining and Technology Beijing CUMTB filed Critical China University of Mining and Technology Beijing CUMTB
Priority to CN202110207458.4A priority Critical patent/CN112861758B/en
Publication of CN112861758A publication Critical patent/CN112861758A/en
Priority to NL2029182A priority patent/NL2029182B1/en
Application granted granted Critical
Publication of CN112861758B publication Critical patent/CN112861758B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/047 - Probabilistic or stochastic networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/048 - Activation functions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/49 - Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a behavior identification method based on weak supervised learning video segmentation. The method comprises the following steps: dividing the whole video into N segments, where N is initially unknown, assigning a class label and a length label to each segment, and using the Viterbi algorithm to generate frame labels for the video segments, which are used to calculate a frame-by-frame cross-entropy loss; finding the best action split points in the initial video segmentation obtained by the Viterbi algorithm, and decomposing the initial segmentation into a visual model, a length model and a context model; connecting the input data sequence in forward propagation using a single-layer GRU network with 256 gated recurrent units and a softmax output to obtain the posterior probability and the length model; defining an auxiliary function and finding the optimal split points; and finally obtaining the most likely segmentation of the complete video from the length model and the auxiliary function. The weakly supervised annotation of the video is thereby fully utilized to segment the actions in the complete video.

Description

Behavior identification method based on weak supervised learning video segmentation
Technical Field
The invention relates to the field of video behavior identification, in particular to a behavior identification method based on weak supervised learning video segmentation.
Background
In recent years, the generation of large amounts of video data has stimulated research on video behavior identification. The overall trend in the field of behavior recognition has shifted from static scenes to dynamic scenes, from the detection and recognition of a single moving target to the detection and analysis of multiple moving targets, and from simple individual behaviors to complex actions and even group behavior recognition and detection. Video behavior datasets such as Breakfast and Salad are discussed continuously in top computer vision conference papers, and the classification and temporal segmentation of the activities in these datasets have become popular topics in video behavior recognition research.
Video behavior recognition mainly relies on two annotation modes of a dataset: full supervision and weak supervision. Full supervision consumes a large amount of labor to delimit and label the action frames and classes in a video, whereas weak supervision only provides the sequence of action classes in a video, without the specific start and end frame of each action, so that temporal action segmentation and labeling must be learned from the action transcripts formed by the action labels. Traditional video behavior recognition algorithms include dynamic time warping, the CDP algorithm, HMMs and the Viterbi algorithm; basic deep learning methods for video behavior recognition include the two-stream method, LSTM, GRU, C3D and I3D. These methods work well for detecting action categories in videos but are not efficient for weakly supervised video segmentation.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a behavior recognition method based on weakly supervised learning video segmentation, which uses frame labels generated by the Viterbi algorithm to calculate a frame-by-frame cross-entropy loss L, updates the network parameters by stochastic gradient descent on the gradient ∇L of the cross-entropy loss, connects the input data sequence in forward propagation using a single-layer GRU network with 256 gated recurrent units and a softmax output, and calculates the posterior probability to obtain the maximum segmentation number of the complete video and accurately segment the different actions in a video.
The technical scheme adopted by the invention is as follows:
the method comprises the following steps that (1) the whole video is divided into N sections, a class label c and a length label L are distributed to each section, and a frame label is generated for the video sections obtained through division by using a Viterbi algorithm and is used for calculating the cross entropy loss L frame by frame; based on the cross entropy loss L of all video frames, updating GRU network parameters by using the random gradient descent of the gradient delta L;
step (2), finding the optimal action split points in the initial video segmentation (ĉ_i, l̂_i) obtained by the Viterbi algorithm in step (1), where i is the video segment number and i ∈ {1, ..., N}, and decomposing (ĉ_i, l̂_i) into a visual model, a length model and a context model;
step (3), connecting the input data sequence in forward propagation using a single-layer GRU network with 256 gated recurrent units and a softmax output, obtaining the visual model of step (2) by dividing the posterior probability by the class prior probability, and obtaining the length model of step (2) using a class-dependent Poisson distribution;
step (4), defining an auxiliary function and finding the optimal split points;
and step (5), obtaining the maximum possible segmentation number of the complete video from the length model obtained in step (3) and the auxiliary function defined in step (4).
The method has the advantage that, unlike common video behavior identification methods that only identify the action categories in a video, the behavior identification method based on weak supervised learning video segmentation uses only weakly supervised labeling of the action categories in the video, segments the actions in the complete video, and can identify the actions in the video accurately.
Drawings
The invention is further illustrated with reference to the following figures and examples.
FIG. 1 is a flow chart of the behavior identification method based on weak supervised learning video segmentation according to an embodiment of the present invention;
FIG. 2 shows frames 81, 171, 366, 480 and 703 of a tea-making action video according to an embodiment of the present invention;
FIG. 3 is the overall network structure according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings. The embodiments described with reference to the drawings are illustrative and are intended to be illustrative of the invention and should not be construed as limiting the invention.
First, the dataset used by the behavior recognition method based on weak supervised learning video segmentation is introduced. The Breakfast dataset is a large-scale dataset for action segmentation, comprising approximately 1712 videos of breakfast preparation, equivalent to about 3.6 million frames or 67 hours of video. It contains 10 coarse breakfast activities, such as making pancakes or making an omelet, each with detailed annotations of steps such as stirring or pouring. The dataset has 48 action classes, with an average of 6.9 action instances per video. The 48 action classes have finer labels giving the start and end frames of each action in the video in text form, typically used for action detection and segmentation. The video length varies from a few seconds to several minutes, and the actions are densely labeled, with only 7% of the frames belonging to the background. Fig. 2 shows example frames of specific actions during a tea-making activity in the dataset; weakly supervised learning only provides the class labels and no specific frame boundaries.
Fig. 1 is a flow chart of the method according to one embodiment of the invention.
Let a video containing T frames be X = {x_1, ..., x_t, ..., x_T}. The whole video is divided into N segments, a class label is assigned to each segment, and the video segment class labels c^N = {c_1, ..., c_i, ..., c_N} and segment length labels l_i ∈ {l_1, ..., l_N} are output, where c_i is the class of the i-th video segment, l_i is the length of the i-th video segment, and i ∈ {1, ..., N}. The class label assigned to frame x_t is defined as c_{n(t)}, where n(t) is the segment number of the t-th frame and t ∈ {1, ..., T}. The most likely segmentation (ĉ_i, l̂_i) of the video is inferred, where (ĉ_i, l̂_i) can be calculated by the following formula:

(ĉ_i, l̂_i) = argmax_{c_i, l_i} p(c_i, l_i | X)   (1)

where p(c_i, l_i | X) denotes the probability of the action class and action length of the i-th segment of video X, ĉ_i denotes the predicted class of the i-th video segment, and l̂_i denotes the predicted length of the i-th video segment; the class label ĉ_i and length l̂_i of each video segment are obtained from formula (1).
The video frame sequence X and its class labels c^N are forwarded through the neural network; c^N is provided as the ground-truth class labels, so only the length label of each video segment needs to be inferred during training. The class label assigned to frame x_t is called c_{n(t)}. Using the Viterbi algorithm, the action classes and lengths (ĉ_i, l̂_i) in the video are written as frame-by-frame class labels c_{n(1)}, ..., c_{n(t)}, ..., c_{n(T)}, where c_{n(t)} ∈ {c_1, ..., c_i, ..., c_N}, and the cross-entropy loss over all video frames is calculated:

L = -Σ_{t=1}^{T} log p(c_{n(t)} | x_t)   (2)

where p(c_{n(t)} | x_t) denotes the probability of the action class c_{n(t)} corresponding to video frame x_t, and -log p(c_{n(t)} | x_t) is the cross-entropy loss of frame x_t. Based on the cross-entropy loss L of all video frames, the GRU network parameters are updated by stochastic gradient descent on its gradient ∇L, and the updated network is used in formula (6).

A buffer is used to store recently processed video frame sequences and their inferred frame labels; K frames are sampled from the buffer and added to the loss function:

L = -Σ_{k=1}^{K} log p(c_k | x_k)   (3)

where x_k denotes the k-th frame of the buffered video frame sequences, K is the total number of buffered video frames, and c_k denotes the class label corresponding to frame x_k.
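As an illustration of formulas (2) and (3), the following is a minimal PyTorch-style sketch of the frame-wise cross-entropy loss over the Viterbi-generated frame labels and of the K-frame buffer term; it is an assumption of how such losses could be coded, not the patent's own implementation, and the names frame_wise_loss, buffer_loss and classifier are hypothetical.

    import random

    import torch
    import torch.nn.functional as F

    def frame_wise_loss(frame_log_probs, frame_labels):
        # Formula (2): L = -sum_t log p(c_{n(t)} | x_t).
        # frame_log_probs: (T, C) log class posteriors of one video (GRU output).
        # frame_labels:    (T,)   Viterbi-generated pseudo-labels c_{n(t)}.
        return F.nll_loss(frame_log_probs, frame_labels, reduction="sum")

    def buffer_loss(buffer, k, classifier):
        # Formula (3): cross-entropy over K frames sampled from a buffer of
        # recently processed frames and their inferred labels.
        # buffer:     list of (feature_vector, label) pairs from earlier videos.
        # classifier: any callable mapping a (K, D) feature batch to (K, C) logits.
        feats, labels = zip(*random.sample(buffer, k))
        logits = classifier(torch.stack(feats))
        return F.cross_entropy(logits, torch.tensor(labels), reduction="sum")

In training, the total loss would be the sum of the two terms, and the GRU parameters would then be updated by stochastic gradient descent on its gradient ∇L as described above.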
FIG. 1 shows the overall system architecture according to the present invention: a video x_1, ..., x_T with T frames is input into the GRU network, the GRU network connects the input data sequence in forward propagation and is followed by Viterbi decoding, and the frame-by-frame class labels generated by the Viterbi algorithm are used to calculate the frame-by-frame cross-entropy loss.
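The single-layer GRU with 256 gated recurrent units and softmax output described above could be sketched as follows in PyTorch; the feature dimension of 64 is a placeholder assumption, while the 48 classes match the Breakfast dataset mentioned earlier.

    import torch
    import torch.nn as nn

    class FrameClassifierGRU(nn.Module):
        # Single-layer GRU with 256 gated recurrent units and a softmax output that
        # maps a sequence of frame features x_1..x_T to per-frame log posteriors
        # log p(c | x_t).
        def __init__(self, feature_dim=64, num_classes=48, hidden_size=256):
            super().__init__()
            self.gru = nn.GRU(feature_dim, hidden_size, num_layers=1, batch_first=True)
            self.out = nn.Linear(hidden_size, num_classes)

        def forward(self, x):
            # x: (1, T, feature_dim) -- one video with T frames.
            h, _ = self.gru(x)                              # (1, T, 256)
            return torch.log_softmax(self.out(h), dim=-1)   # (1, T, num_classes)

    # Example: log posteriors for a 700-frame video with 64-dimensional frame features.
    log_posteriors = FrameClassifierGRU()(torch.randn(1, 700, 64))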
The function in formula (1) is decomposed:

argmax_{N, c_1^N, l_1^N} p(c_1^N, l_1^N | X)   (4)

Assuming that the video frames are independent of each other, the argmax in formula (4) can be converted into formula (5), as follows:

argmax_{N, c_1^N, l_1^N} { Π_{t=1}^{T} p(x_t | c_{n(t)}) · Π_{n=1}^{N} p(l_n | c_n) · p(c_n | c_{n-1}) }   (5)

where n(t) is the number of the segment containing frame t. p(x_t | c_{n(t)}) denotes the probability of video frame x_t given the class label c_{n(t)} and is taken as the visual model; p(l_n | c_n) denotes the probability that the n-th video segment has action length l_n given its action class c_n and is taken as the length model; p(c_n | c_{n-1}) denotes the probability that the action class of a video segment is c_n given that the preceding segment has action class c_{n-1} and is the context model.
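To make the factorization of formula (5) concrete, the following sketch scores one candidate segmentation in log space by combining the visual, length and context models; all names are illustrative placeholders rather than the patent's code.

    def segmentation_log_score(visual_log, segments, length_log, context_log):
        # Log of the factorized score in formula (5) for one candidate segmentation.
        # visual_log:  T x C matrix, visual_log[t][c] = log p(x_t | c)
        # segments:    list of (class c_n, length l_n) pairs with lengths summing to T
        # length_log:  callable (l, c) -> log p(l | c)
        # context_log: callable (c, c_prev) -> log p(c_n | c_{n-1}); c_prev is None for n = 1
        score, t, prev = 0.0, 0, None
        for c, l in segments:
            score += sum(visual_log[t + j][c] for j in range(l))   # visual model term
            score += length_log(l, c)                              # length model term
            score += context_log(c, prev)                          # context model term
            t, prev = t + l, c
        return score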
A single-layer GRU network with 256 gated recurrent units and a softmax output is used to connect the input sequence of T video frames X = {x_1, ..., x_t, ..., x_T} in forward propagation. p(c | x_t) is the softmax score of the GRU network for the action class of the t-th frame x_t; the visual model p(x_t | c) can then be represented by the posterior probability p(c | x_t) divided by p(c), as follows:

p(x_t | c) ∝ p(c | x_t) / p(c)   (6)

where p(c) is the prior distribution, taken as the normalized frame frequency of the actions in the training set: during training, the number of frames labelled with class label c over all video frame sequences is counted and then normalized to obtain the estimate of p(c). If a class label sequence c^N contains a class that has never been seen, the unseen class is represented by p(c) = 1/#classes, where #classes denotes the total number of classes.
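A small sketch of formula (6) and the frame-frequency prior, including the 1/#classes fallback for unseen classes, might look like this (assumed NumPy notation, not the patent's implementation):

    import numpy as np

    def class_prior(frame_label_counts):
        # p(c): normalized frame frequency of each action class in the training set.
        # Classes never seen in training fall back to the uniform value 1 / #classes.
        counts = np.asarray(frame_label_counts, dtype=float)
        prior = counts / counts.sum()
        prior[counts == 0] = 1.0 / len(counts)
        return prior

    def visual_log_model(log_posteriors, log_prior):
        # Formula (6): log p(x_t | c) ∝ log p(c | x_t) - log p(c).
        # log_posteriors: (T, C) array of log p(c | x_t) from the GRU.
        return log_posteriors - log_prior[None, :]

    # Example with hypothetical frame counts for five classes, one of them unseen.
    log_prior = np.log(class_prior([1200, 300, 0, 800, 700]))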
The length model is implemented using a class-dependent Poisson distribution:

p(l | c) = (λ_c^l / l!) · e^{-λ_c}   (7)

where λ_c denotes the average length of action class c, λ_c^l denotes λ_c raised to the power l, λ_c is updated in each iteration, and l! is the factorial of l. When a training sample (X, c^N) contains a class that has never been seen, λ_c is defined as N/T, where N denotes the number of video segments and T denotes the number of frames of video X.
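The class-dependent Poisson length model of formula (7) and the re-estimation of λ_c from the current pseudo-labels could be sketched as follows; the N/T fallback for unseen classes follows the text above, and the function names are placeholders.

    import math

    def poisson_log_length(length, lam):
        # Formula (7) in log space: log p(l | c) = l*log(λ_c) - λ_c - log(l!).
        return length * math.log(lam) - lam - math.lgamma(length + 1)

    def mean_lengths(segments, num_classes, num_segments, num_frames):
        # λ_c: average length of the segments currently labelled c; classes without
        # any segment fall back to N/T as stated in the text above.
        sums = [0.0] * num_classes
        counts = [0] * num_classes
        for c, l in segments:              # segments: list of (class_index, length)
            sums[c] += l
            counts[c] += 1
        fallback = num_segments / num_frames
        return [s / n if n > 0 else fallback for s, n in zip(sums, counts)]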
Defining a helper function Q (t, l, c, g), where t denotes the video frame number, l denotes the length of the last segment, c denotes the class label of the last segment, and g denotes the context of the random syntax of the non-terminator, such that the best split point between actions in the video can be found by equation (5), the helper function generating the best probability score for the video segment before t frames that satisfies the following condition, assuming no new segment when l > 1:
Q(t,l,c,g)=Q(t-1,l-1,c,g)·p(xt|c) (8)
assume a new video segment at the tth frame when l is 1:
Figure BDA0002949823010000045
Figure BDA0002949823010000046
a context of a random syntax representing possible non-terminators,
Figure BDA0002949823010000047
the possible class labels are represented by a list of possible classes,
Figure BDA0002949823010000048
it is shown that the possible lengths are,
Figure BDA0002949823010000049
indicates the restriction condition is composed of
Figure BDA00029498230100000410
The context g of the random syntax of the class label c and the non-terminator can be obtained, and simultaneously the existence g' is satisfied, the possible class can be obtained by g
Figure BDA00029498230100000411
And possibly the context of a random syntax of a non-terminator
Figure BDA00029498230100000412
Figure BDA00029498230100000413
Representing the context of a random syntax in a possible non-terminator
Figure BDA00029498230100000414
The category label in the case is c. At all possible lengths
Figure BDA00029498230100000415
And all of
Figure BDA00029498230100000416
Go on maximize operation, let go through assume class c from
Figure BDA00029498230100000417
Transition to g.
From formula (8) and formula (9), covering the cases l > 1 and l = 1, the maximum possible segmentation of the complete video is obtained as:

max_{l, c, g} { Q(T, l, c, g) · p(l | c) }   (10)

which determines the maximum possible segmentation number N of the complete video. By tracking the maximizing arguments ĉ and l̂ of formula (9), the best class labels ĉ_1, ..., ĉ_N and lengths l̂_1, ..., l̂_N can be obtained.
The results on the Breakfast dataset show that the final motion segmentation frame accuracy in the weakly supervised case is 41.5%.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments, or alternatives may be employed, by those skilled in the art without departing from the spirit or scope of the invention as defined in the appended claims.

Claims (2)

1. A behavior identification method based on weak supervised learning video segmentation is characterized by comprising the following steps:
step (1), initially dividing the whole video into N segments and assigning class labels c^N = {c_1, ..., c_i, ..., c_N} and length labels l_i ∈ {l_1, ..., l_N} to the segments, where c_i is the class of the i-th video segment, l_i is the length of the i-th video segment, and i ∈ {1, ..., N}; generating frame labels for the segmented video segments using the Viterbi algorithm, the frame labels being used to calculate the frame-by-frame cross-entropy loss L; and, based on the cross-entropy loss L of all video frames, updating the GRU network parameters by stochastic gradient descent on its gradient ∇L;
step (2), finding the optimal action split points in the video segmentation (ĉ_i, l̂_i) obtained by the Viterbi algorithm in step (1), and decomposing (ĉ_i, l̂_i) to obtain a visual model, a length model and a context model, where ĉ_i denotes the predicted class of the i-th video segment and l̂_i denotes the predicted length of the i-th video segment;
step (3), connecting the input video frame data sequence in forward propagation using a single-layer GRU network with 256 gated recurrent units and a softmax output to obtain the visual model and the length model p(l_n | c_n) for the input video frame data sequence, where c_n is the class label of the n-th video segment, l_n is the length of the n-th video segment, and p(l_n | c_n) denotes the probability that the action length is l_n given that the action class is c_n;
step (4), defining an auxiliary function and finding the optimal split points between video actions;
and step (5), obtaining the maximum possible segmentation number of the complete video from the length model of step (3) and the auxiliary function of step (4).
2. The behavior identification method based on weak supervised learning video segmentation according to claim 1, characterized in that the step (1) specifically comprises:
setting X as X for video containing T frames1,..,xt,...,xTDividing the whole video into N segments, assigning class labels to each segment, and outputting video segment class labels cN={c1,...ci,...,cNAnd length label of video segment li∈{l1,...,lNIn which c isiIs the category of the ith video, liFor the length of the ith video, i belongs to { 1., N }; will be allocated to frame xtIs defined as cn(t)Where n (T) is the segment number of the tth frame, and T e { 1., T }. Inferring most likely segmentations in video
Figure FDA0002949823000000015
Wherein
Figure FDA0002949823000000016
Can be calculated by the following formula:
Figure FDA0002949823000000017
wherein, p (c)i,li| X) represents the probability of the motion category and the motion length of the ith video in video X,
Figure FDA0002949823000000018
represents a category of the predicted ith segment of video,
Figure FDA0002949823000000019
indicating the length of the predicted i-th segment of video.
the video frame sequence X and its class labels c^N are forwarded through a neural network; c^N is provided as the ground-truth class labels, so that only the length label of each video segment needs to be inferred during training; the class label assigned to frame x_t is called c_{n(t)}; using the Viterbi algorithm, the action classes and lengths (c_i, l_i) in the video are written as frame-by-frame class labels c_{n(1)}, ..., c_{n(t)}, ..., c_{n(T)}, where c_{n(t)} ∈ {c_1, ..., c_i, ..., c_N}, and the cross-entropy loss of all video frames is calculated:

L = -Σ_{t=1}^{T} log p(c_{n(t)} | x_t)   (2)

where p(c_{n(t)} | x_t) denotes the probability of the action class c_{n(t)} corresponding to video frame x_t, and -log p(c_{n(t)} | x_t) denotes the cross-entropy loss of frame x_t; based on the cross-entropy loss L of all video frames, the GRU network parameters are updated by stochastic gradient descent on its gradient ∇L, and the updated network is used in formula (6);

a buffer is used to store recently processed video frame sequences and their inferred frame labels, and K frames sampled from the buffer are added to the loss function:

L = -Σ_{k=1}^{K} log p(c_k | x_k)   (3)

where x_k denotes the k-th frame of the buffered video frame sequences, K is the total number of buffered video frames, and c_k denotes the class label corresponding to frame x_k;
the step (2) specifically comprises:

the function in formula (1) is decomposed:

argmax_{N, c_1^N, l_1^N} p(c_1^N, l_1^N | X)   (4)

assuming that the video frames are independent of each other, the argmax in formula (4) can be converted into formula (5), as follows:

argmax_{N, c_1^N, l_1^N} { Π_{t=1}^{T} p(x_t | c_{n(t)}) · Π_{n=1}^{N} p(l_n | c_n) · p(c_n | c_{n-1}) }   (5)

where n(t) is the number of the segment containing frame t; p(x_t | c_{n(t)}) denotes the probability of video frame x_t given the class label c_{n(t)} and is taken as the visual model, p(l_n | c_n) denotes the probability that the n-th video segment has action length l_n given its action class c_n and is taken as the length model, and p(c_n | c_{n-1}) denotes the probability that the action class of a video segment is c_n given that the preceding segment has action class c_{n-1} and is the context model;
the step (3) specifically comprises:

a single-layer GRU network with 256 gated recurrent units and a softmax output is used to connect the input sequence of T video frames X = {x_1, ..., x_t, ..., x_T} in forward propagation; p(c | x_t) is the softmax score of the GRU network for the action class of the t-th frame x_t, and the visual model p(x_t | c) can be represented by the posterior probability p(c | x_t) divided by p(c), as follows:

p(x_t | c) ∝ p(c | x_t) / p(c)   (6)

where p(c) is the prior distribution, taken as the normalized frame frequency of the actions in the training set; during training, the number of frames labelled with class label c over all video frame sequences is counted and then normalized to obtain the estimate of p(c); if a class label sequence c^N contains a class that has never been seen, the unseen class is represented by p(c) = 1/#classes;
the length model is implemented using a class-dependent Poisson distribution:

p(l | c) = (λ_c^l / l!) · e^{-λ_c}   (7)

where λ_c denotes the average length of action class c, λ_c^l denotes λ_c raised to the power l, λ_c is updated in each iteration, and l! is the factorial of l; when a training sample (X, c^N) contains a class that has never been seen, λ_c is defined as N/T, where N denotes the number of video segments and T denotes the number of frames of video X;
the step (4) specifically comprises:

an auxiliary function Q(t, l, c, g) is defined, where t is the video frame number, l denotes the length of the last segment, c denotes the class label of the last segment, and g denotes the non-terminal context of a stochastic grammar; the best split points between the actions in the video are found via formula (5); the auxiliary function yields the best probability score for a segmentation of the video up to frame t satisfying these conditions; when l > 1, it is assumed that no new segment starts:

Q(t, l, c, g) = Q(t-1, l-1, c, g) · p(x_t | c)   (8)

when l = 1, a new video segment is assumed to start at the t-th frame:

Q(t, 1, c, g) = max_{l̂, ĉ, ĝ} { Q(t-1, l̂, ĉ, ĝ) · p(l̂ | ĉ) · p(c | ĝ) } · p(x_t | c)   (9)

where ĝ denotes a possible stochastic grammar non-terminal context, ĉ denotes a possible class label, and l̂ denotes a possible length; the maximization is restricted to those ĝ from which the class label c and the non-terminal context g can be reached, while there exists a g′ from which the possible class ĉ and the possible non-terminal context ĝ were reached; p(c | ĝ) denotes the probability that the class label is c given the possible non-terminal context ĝ; the maximization is performed over all possible lengths l̂ and all (ĉ, ĝ), so that a transition from ĝ to g is made by hypothesizing class c;
from formula (8) and formula (9), covering the cases l > 1 and l = 1, the maximum possible segmentation of the complete video is obtained as:

max_{l, c, g} { Q(T, l, c, g) · p(l | c) }   (10)

which determines the maximum possible segmentation number N of the complete video; by tracking the maximizing arguments ĉ and l̂ of formula (9), the best class labels ĉ_1, ..., ĉ_N and lengths l̂_1, ..., l̂_N can be obtained.
CN202110207458.4A 2021-02-24 2021-02-24 Behavior identification method based on weak supervised learning video segmentation Active CN112861758B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110207458.4A CN112861758B (en) 2021-02-24 2021-02-24 Behavior identification method based on weak supervised learning video segmentation
NL2029182A NL2029182B1 (en) 2021-02-24 2021-09-14 Weakly supervised learning based method for recognizing behavior through video segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110207458.4A CN112861758B (en) 2021-02-24 2021-02-24 Behavior identification method based on weak supervised learning video segmentation

Publications (2)

Publication Number Publication Date
CN112861758A true CN112861758A (en) 2021-05-28
CN112861758B CN112861758B (en) 2021-12-31

Family

ID=75991121

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110207458.4A Active CN112861758B (en) 2021-02-24 2021-02-24 Behavior identification method based on weak supervised learning video segmentation

Country Status (2)

Country Link
CN (1) CN112861758B (en)
NL (1) NL2029182B1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113813609A (en) * 2021-06-02 2021-12-21 腾讯科技(深圳)有限公司 Game music style classification method and device, readable medium and electronic equipment
CN114118167A (en) * 2021-12-04 2022-03-01 河南大学 Action sequence segmentation method based on self-supervision less-sample learning and aiming at behavior recognition
CN114697763A (en) * 2022-04-07 2022-07-01 脸萌有限公司 Video processing method, device, electronic equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110543911A (en) * 2019-08-31 2019-12-06 华南理工大学 weak supervision target segmentation method combined with classification task
CN111079646A (en) * 2019-12-16 2020-04-28 中山大学 Method and system for positioning weak surveillance video time sequence action based on deep learning
US10824903B2 (en) * 2016-11-16 2020-11-03 Facebook, Inc. Deep multi-scale video prediction
CN111968150A (en) * 2020-08-19 2020-11-20 中国科学技术大学 Weak surveillance video target segmentation method based on full convolution neural network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10824903B2 (en) * 2016-11-16 2020-11-03 Facebook, Inc. Deep multi-scale video prediction
CN110543911A (en) * 2019-08-31 2019-12-06 华南理工大学 weak supervision target segmentation method combined with classification task
CN111079646A (en) * 2019-12-16 2020-04-28 中山大学 Method and system for positioning weak surveillance video time sequence action based on deep learning
CN111968150A (en) * 2020-08-19 2020-11-20 中国科学技术大学 Weak surveillance video target segmentation method based on full convolution neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MUHAMMAD USMAN RAFIQUE; NATHAN JACOBS: "Weakly Supervised Building Segmentation from Aerial Images", IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing Symposium *
陈华锋: "《视频人体行为识别关键技术研究》", 《中国博士学位论文全文数据库 (信息科技辑)》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113813609A (en) * 2021-06-02 2021-12-21 腾讯科技(深圳)有限公司 Game music style classification method and device, readable medium and electronic equipment
CN113813609B (en) * 2021-06-02 2023-10-31 腾讯科技(深圳)有限公司 Game music style classification method and device, readable medium and electronic equipment
CN114118167A (en) * 2021-12-04 2022-03-01 河南大学 Action sequence segmentation method based on self-supervision less-sample learning and aiming at behavior recognition
CN114118167B (en) * 2021-12-04 2024-02-27 河南大学 Action sequence segmentation method aiming at behavior recognition and based on self-supervision less sample learning
CN114697763A (en) * 2022-04-07 2022-07-01 脸萌有限公司 Video processing method, device, electronic equipment and medium
US11699463B1 (en) 2022-04-07 2023-07-11 Lemon Inc. Video processing method, electronic device, and non-transitory computer-readable storage medium
CN114697763B (en) * 2022-04-07 2023-11-21 脸萌有限公司 Video processing method, device, electronic equipment and medium

Also Published As

Publication number Publication date
CN112861758B (en) 2021-12-31
NL2029182B1 (en) 2023-02-15
NL2029182A (en) 2022-09-19

Similar Documents

Publication Publication Date Title
CN112861758B (en) Behavior identification method based on weak supervised learning video segmentation
CN111814854B (en) Target re-identification method without supervision domain adaptation
Behrmann et al. Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation
Fan et al. Watching a small portion could be as good as watching all: Towards efficient video classification
Zhang et al. Category anchor-guided unsupervised domain adaptation for semantic segmentation
CN108133188B (en) Behavior identification method based on motion history image and convolutional neural network
EP3832534A1 (en) Video action segmentation by mixed temporal domain adaptation
CN112560432B (en) Text emotion analysis method based on graph attention network
US20210326638A1 (en) Video panoptic segmentation
US20030004902A1 (en) Outlier determination rule generation device and outlier detection device, and outlier determination rule generation method and outlier detection method thereof
US20220172456A1 (en) Noise Tolerant Ensemble RCNN for Semi-Supervised Object Detection
Chen et al. Learning linear regression via single-convolutional layer for visual object tracking
WO2023109208A1 (en) Few-shot object detection method and apparatus
CN116644755B (en) Multi-task learning-based few-sample named entity recognition method, device and medium
CN110458022B (en) Autonomous learning target detection method based on domain adaptation
CN113283368B (en) Model training method, face attribute analysis method, device and medium
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
CN116363374B (en) Image semantic segmentation network continuous learning method, system, equipment and storage medium
Xiong et al. Contrastive learning for automotive mmWave radar detection points based instance segmentation
CN114821022A (en) Credible target detection method integrating subjective logic and uncertainty distribution modeling
Viet‐Uyen Ha et al. High variation removal for background subtraction in traffic surveillance systems
CN114118207B (en) Incremental learning image identification method based on network expansion and memory recall mechanism
Raychaudhuri et al. Exploiting temporal coherence for self-supervised one-shot video re-identification
Fonseca et al. Model-agnostic approaches to handling noisy labels when training sound event classifiers
Long et al. Diverse target and contribution scheduling for domain generalization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant