CN107194365B - Behavior identification method and system based on middle layer characteristics - Google Patents


Info

Publication number
CN107194365B
CN107194365B CN201710416188.1A
Authority
CN
China
Prior art keywords
component
candidate
component detector
detector
image sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201710416188.1A
Other languages
Chinese (zh)
Other versions
CN107194365A (en)
Inventor
桑农
张士伟
高常鑫
李乐仁瀚
邵远杰
王金
况小琴
何翼
皮智雄
宾言锐
都文鹏
舒娟
吴建雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201710416188.1A
Publication of CN107194365A
Application granted
Publication of CN107194365B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/285 Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V10/464 Salient features, e.g. scale invariant feature transforms [SIFT] using a plurality of salient features, e.g. bag-of-words [BoW] representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a behavior recognition method and system based on middle-layer features. The method is realized as follows: a candidate component detector set is obtained from a sample image sequence; the B% of component detectors with the weakest discriminative power are removed from the candidate component detector set to obtain a new candidate component detector set; the component detectors in the new candidate component detector set are sorted in descending order of weight, and the top-ranked P component detectors are selected as the middle-layer feature extractor of behavior class A; the middle-layer feature extractor of every behavior category is acquired in the same way, and the extractors are combined into a bag of words; the bag of words is used to extract sample middle-layer features from the sample image sequence, and a classifier is trained with these features to obtain a behavior recognition classifier; a test image sequence input into the behavior recognition classifier yields its behavior category. The method has strong recognition capability, high recognition accuracy and strong practicability, and preserves the correlation among components.

Description

Behavior identification method and system based on middle layer characteristics
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a behavior identification method and system based on middle-layer features.
Background
Behavior recognition is a core technology in application fields such as video security monitoring, human-computer interaction, and video retrieval and analysis, and is attracting increasing attention from both industry and academia. However, because behaviors in video are subject to significant disturbances such as motion blur, scale variation, low resolution, background noise, camera motion, and viewpoint variation, behavior analysis remains very challenging.
Existing methods mainly follow two lines. The first uses low-level spatio-temporal local features, such as spatio-temporal interest points, gradient-based features, and trajectory features. Typically, a large number of local descriptors are extracted from a video training set, a "bag of words" is then constructed, and finally a global representation of a behavior is built with a coding technique such as bag of words (BoW) or Fisher vectors (FV). The second line uses high-level template-based features: several pose or viewpoint modes of a specific behavior class are selected manually or by weakly supervised methods, combined into a behavior extractor, and used to extract a high-level representation of the behavior. Both approaches have drawbacks. The first, although robust to intra-class variation, produces a representation too low-level to express the discriminability of motion patterns at higher levels. The second, in contrast, is very good at extracting high-level representations but is considerably more sensitive to intra-class variation. To strike a balance between the two, many researchers have proposed representations based on discriminative behavior components: typically, component detectors are trained from components, and the detectors are used to extract mid-level features from the video.
Existing discriminative component mining techniques either require manual intervention, which demands enormous manpower and material resources when processing large numbers of video samples and thus hardly reaches a practical level, or rely on heuristic rules defined in advance, which lose the correlation between components, so that a minimal component set cannot attain maximal discriminative power.
Therefore, existing behavior recognition suffers from the technical problems of weak recognition capability, heavy consumption of manpower and material resources, impracticality, and loss of correlation between components.
Disclosure of Invention
In view of the above defects or improvement needs of the prior art, the present invention provides a behavior recognition method and system based on middle-layer features, thereby solving the technical problems of weak recognition capability, heavy consumption of manpower and material resources, impracticality, and loss of correlation between components in existing behavior recognition.
To achieve the above object, according to an aspect of the present invention, there is provided a behavior recognition method based on a middle layer feature, including:
(1) extracting a spatiotemporal component set D of the class A behavior category and a spatiotemporal component set N of other behavior categories except the class A from the sample image sequence, and training a component detector by using the spatiotemporal component set D and the spatiotemporal component set N to obtain a candidate component detector set;
(2) combining the middle-layer features of the sample image sequence selected by each component detector in the candidate component detector set to obtain candidate feature vectors, and training a selector by using the candidate feature vectors to obtain weight vectors of the selector;
(3) measuring the discriminative power of each component detector in the candidate component detector set by using the weight vector of the selector, and removing the B% of component detectors with the weakest discriminative power from the candidate component detector set to obtain a new candidate component detector set;
(4) sorting the component detectors in descending order of weight according to the weight of each component detector in the new candidate component detector set, and selecting the top-ranked P component detectors as the middle-layer feature extractor of the class-A behavior category;
(5) acquiring the middle-layer feature extractor of each behavior category among the behavior categories, combining the middle-layer feature extractors into a bag of words, extracting sample middle-layer features of the sample image sequence by using the bag of words, and training a classifier by using the sample middle-layer features to obtain a behavior recognition classifier;
(6) inputting the test image sequence into the behavior recognition classifier to obtain the behavior category of the test image sequence.
Further, the training of the component detector is specifically implemented as follows: the positive samples are spatio-temporal components in the spatio-temporal component set D and the negative samples are spatio-temporal components in the spatio-temporal component set N; for each spatio-temporal component in the set D, the component detector is trained with that one positive sample and a plurality of negative samples.
Further, the candidate feature vector f_c is:

f_c = [f(d_1, υ), f(d_2, υ), …, f(d_m, υ)],

where d_i is the i-th component detector, f(d_i, υ) denotes the middle-layer feature extracted by detector d_i from an image υ in the sample image sequence using a max-pooling quantization function, 1 ≤ i ≤ m, and m denotes the number of component detectors in the candidate component detector set D_c.
Further, the selector is the linear model

Φ_c(f_c) = w·f_c + b,

and its weight vector is obtained by solving

w = argmin_{w,b} (1/2)·‖w‖² + C·Σ_{n=1}^{N} ℓ(y_n, w·x_n + b),

where Φ_c(f_c) denotes the selector, w is the weight vector of the selector, b is the bias of the selector, ℓ is the loss function, C is a penalty factor, y_n denotes the class label of the n-th image in the sample image sequence, x_n denotes the middle-layer feature of the n-th image, and N denotes the total number of images in the sample image sequence.
Further, the specific implementation of step (3) is as follows:

The discriminative power of each component detector in the candidate component detector set is measured with the weight vector of the selector, and recursive removal is applied to the candidate middle-layer feature matrix F_c formed by stacking the candidate feature vectors of all sample images. When k = 1, F_0 = F_c is initialized; when k > 1, the k-th recursion can be represented as:

w_k = SVM(F_{k−1}, y),
S_k = Ψ(w_k, τ),
F_k = F_{k−1}[S_k],

where S_k = [s_1, s_2, …, s_m], s_i ∈ {0, 1}, denotes the component selection flag bits: if s_i = 1 the i-th component detector is selected, and if s_i = 0 it is not selected; y denotes the class label vector of the sample image sequence; w_k denotes the weight vector after the k-th recursion; F_{k−1} and F_k denote the candidate middle-layer feature matrices after the (k−1)-th and k-th recursions; and Ψ(w_k, τ) denotes removing, according to the weight vector w_k after the k-th recursion, detectors at the removal rate τ = B%. After H recursions in total, the new candidate component detector set is obtained.
According to another aspect of the present invention, there is provided a behavior recognition system based on a middle layer feature, including:
the acquisition candidate component detector set module is used for extracting a space-time component set D of the class A behavior category and a space-time component set N of other behavior categories except the class A from the sample image sequence, and training a component detector by using the space-time component set D and the space-time component set N to obtain a candidate component detector set;
the training selector module is used for combining the middle-layer features of the sample image sequence selected by each component detector in the candidate component detector set to obtain candidate feature vectors, and training the selector by using the candidate feature vectors to obtain the weight vectors of the selector;
a removed component detector module, configured to measure the discriminative power of each component detector in the candidate component detector set using the weight vector of the selector, and remove B% of the component detectors in the candidate component detector set that have weak discriminative power to obtain a new candidate component detector set;
the middle-layer feature extractor module is used for sorting the component detectors from big to small according to the weight of each component detector in the new candidate component detector set, and selecting P component detectors which are sorted in the front as the middle-layer feature extractor of the A-type behavior category;
the training classifier module is used for acquiring the middle-layer feature extractors of each behavior category in the behavior categories, combining the middle-layer feature extractors into word bags, extracting the middle-layer features of the samples of the sample image sequence by using the word bags, and training the classifier by using the middle-layer features of the samples to obtain a behavior recognition classifier;
and the behavior recognition module is used for inputting the test image sequence into the behavior recognition classifier to obtain the behavior category of the test image sequence.
Further, the training of the component detector is specifically implemented as follows: the positive samples are spatio-temporal components in the spatio-temporal component set D and the negative samples are spatio-temporal components in the spatio-temporal component set N; for each spatio-temporal component in the set D, the component detector is trained with that one positive sample and a plurality of negative samples.
Further, the candidate feature vector f_c is:

f_c = [f(d_1, υ), f(d_2, υ), …, f(d_m, υ)],

where d_i is the i-th component detector, f(d_i, υ) denotes the middle-layer feature extracted by detector d_i from an image υ in the sample image sequence using a max-pooling quantization function, 1 ≤ i ≤ m, and m denotes the number of component detectors in the candidate component detector set D_c.
Further, the selector is the linear model

Φ_c(f_c) = w·f_c + b,

and its weight vector is obtained by solving

w = argmin_{w,b} (1/2)·‖w‖² + C·Σ_{n=1}^{N} ℓ(y_n, w·x_n + b),

where Φ_c(f_c) denotes the selector, w is the weight vector of the selector, b is the bias of the selector, ℓ is the loss function, C is a penalty factor, y_n denotes the class label of the n-th image in the sample image sequence, x_n denotes the middle-layer feature of the n-th image, and N denotes the total number of images in the sample image sequence.
Further, the specific implementation of the component detector removal module is as follows:

The discriminative power of each component detector in the candidate component detector set is measured with the weight vector of the selector, and recursive removal is applied to the candidate middle-layer feature matrix F_c. When k = 1, F_0 = F_c is initialized; when k > 1, the k-th recursion can be represented as:

w_k = SVM(F_{k−1}, y),
S_k = Ψ(w_k, τ),
F_k = F_{k−1}[S_k],

where S_k = [s_1, s_2, …, s_m], s_i ∈ {0, 1}, denotes the component selection flag bits: if s_i = 1 the i-th component detector is selected, and if s_i = 0 it is not selected; y denotes the class label vector of the sample image sequence; w_k denotes the weight vector after the k-th recursion; F_{k−1} and F_k denote the candidate middle-layer feature matrices after the (k−1)-th and k-th recursions; and Ψ(w_k, τ) denotes removing, according to the weight vector w_k after the k-th recursion, detectors at the removal rate τ = B%. After H recursions in total, the new candidate component detector set is obtained.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
(1) The invention combines the middle-layer features of the sample image sequence selected by each component detector in the candidate component detector set into candidate feature vectors, and trains the selector with the candidate feature vectors to obtain the weight vector of the selector; the correlation among the component detectors is thus considered comprehensively, which ensures that the candidate component detector set has stronger discriminative power as a whole.
(2) The method measures the discriminative power of each component detector in the candidate component detector set with the weight vector of the selector and removes the B% of component detectors with the weakest discriminative power to obtain a new candidate component detector set; behavior components that are clearly non-discriminative are removed, and component selection within the new candidate component detector set gives the method stronger generalization capability.
(3) The invention obtains a candidate component detector set from the sample image sequence; removes the B% of component detectors with the weakest discriminative power to obtain a new candidate component detector set; acquires the middle-layer feature extractors from the new candidate component detector set; builds a bag of words, extracts sample middle-layer features of the sample image sequence with the bag of words, and trains a classifier with these features to obtain a behavior recognition classifier that can then be applied to classify behaviors. The method has strong recognition capability, high recognition accuracy and strong practicability, and preserves the correlation among components. The invention can mine a minimal set of behavior component detectors in a weakly supervised manner, handles illumination change, motion blur, camera motion and viewpoint change well, and more easily meets the requirements of practical applications.
(4) Preferably, considering the complexity of the candidate component detector set, recursive removal can iteratively eliminate the component detectors that are clearly non-discriminative to obtain a new candidate component detector set; component detector selection within this new set gives the invention stronger generalization capability.
Drawings
Fig. 1 is a flowchart of a behavior recognition method based on middle-layer features according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
As shown in fig. 1, a behavior recognition method based on middle layer features includes:
(1) extracting a spatiotemporal component set D of the class A behavior category and a spatiotemporal component set N of other behavior categories except the class A from the sample image sequence, and training a component detector by using the spatiotemporal component set D and the spatiotemporal component set N to obtain a candidate component detector set;
(2) combining the middle-layer features of the sample image sequence selected by each component detector in the candidate component detector set to obtain candidate feature vectors, and training a selector by using the candidate feature vectors to obtain weight vectors of the selector;
(3) measuring the discriminative power of each component detector in the candidate component detector set by using the weight vector of the selector, and removing the B% of component detectors with the weakest discriminative power from the candidate component detector set to obtain a new candidate component detector set;
(4) sorting the component detectors in descending order of weight according to the weight of each component detector in the new candidate component detector set, and selecting the top-ranked P component detectors as the middle-layer feature extractor of the class-A behavior category;
(5) acquiring the middle-layer feature extractor of each behavior category among the behavior categories, combining the middle-layer feature extractors into a bag of words, extracting sample middle-layer features of the sample image sequence by using the bag of words, and training a classifier by using the sample middle-layer features to obtain a behavior recognition classifier;
(6) inputting the test image sequence into the behavior recognition classifier to obtain the behavior category of the test image sequence.
Further, step (1) further comprises:
(1-1) a spatio-temporal component set is extracted from the sample image sequence using dense sampling and multi-scale sampling; smooth and static spatio-temporal components are first discarded, and for each remaining spatio-temporal component, a histogram of optical flow (HOF) descriptor, a whitened histogram of oriented gradients (HOG) descriptor, and a motion boundary histogram (MBH) descriptor are extracted;
(1-2) the spatio-temporal component set D is extracted from the given behavior class A and the spatio-temporal component set N is extracted from the behavior classes other than class A; a cross-validation clustering strategy is applied to the sets D and N to construct a candidate component detector set D_c with expressive power;
(1-3) the positive samples are spatio-temporal components in the set D and the negative samples are spatio-temporal components in the set N; for each spatio-temporal component in the set D, a component detector is trained with that one positive sample and a plurality of negative samples, and all trained component detectors constitute the candidate component detector set D_c.
Preferably, the component detector is a linear discriminant analysis (LDA) classifier or a support vector machine (SVM).
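As an illustration of this exemplar-style training, the following is a minimal sketch assuming the descriptors of the sets D and N have already been computed; the function names, the use of scikit-learn's LinearSVC, and the default value of C are illustrative assumptions rather than part of the patent:

```python
# Hypothetical sketch: each spatio-temporal component in D becomes the
# single positive example of its own linear SVM detector, trained
# against many negative components drawn from N (step (1-3) above).
import numpy as np
from sklearn.svm import LinearSVC

def train_candidate_detectors(D_feats, N_feats, C=1.0):
    """D_feats: (n_pos, d) descriptors of the class-A component set D.
    N_feats: (n_neg, d) descriptors of the other-class component set N.
    Returns one (weight vector, bias) detector per positive component."""
    y = np.concatenate([[1.0], -np.ones(len(N_feats))])
    detectors = []
    for pos in D_feats:
        X = np.vstack([pos[None, :], N_feats])  # 1 positive + many negatives
        svm = LinearSVC(C=C).fit(X, y)
        detectors.append((svm.coef_.ravel(), float(svm.intercept_[0])))
    return detectors
```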
Preferably, dense sampling and multi-scale sampling extract the spatio-temporal component set from the sample image sequence at scales 2^i × 2^j, i, j ∈ {0, 1, 2}; at each scale the spatio-temporal component size is 80 × 80 × 20 and the corresponding sampling interval is set to 20 × 20 × 10. This produces a large number of uninformative spatio-temporal components and reduces efficiency, so spatio-temporal blocks without motion information need to be removed in advance. For each spatio-temporal component p, its average optical flow strength f_p and gradient magnitude g_p are first computed, and only the spatio-temporal components with f_p > t_f and g_p > t_p are kept, where t_f = 0.6 × f_max, t_p = 0.7 × g_max, and f_max and g_max are the maximum optical flow value and maximum gradient value over the spatio-temporal components. The number of cluster centers in cross-validation clustering is set to K = S/10, where S is the total number of component detectors participating in each round of clustering; clusters with more than 3 members are retained in each round, cross-detection is performed at the same time, and the top 5 detected component detectors form a new center.
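A sketch of the motion-based filtering described above, assuming the average optical flow strength f_p and gradient magnitude g_p of every sampled component are already available (the function and variable names are illustrative):

```python
# Keep a spatio-temporal component p only if f_p > t_f and g_p > t_p,
# with t_f = 0.6 * f_max and t_p = 0.7 * g_max as stated in the text.
import numpy as np

def filter_components(flow_strengths, grad_strengths):
    f = np.asarray(flow_strengths)   # f_p for every sampled component
    g = np.asarray(grad_strengths)   # g_p for every sampled component
    t_f, t_p = 0.6 * f.max(), 0.7 * g.max()
    keep = (f > t_f) & (g > t_p)
    return np.flatnonzero(keep)      # indices of informative components
```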
Further, step (2) further comprises:
(2-1) the middle-layer features of the sample image sequence selected by each component detector in the candidate component detector set are combined, giving the candidate feature vector f_c of an image υ in the sample image sequence:

f_c = [f(d_1, υ), f(d_2, υ), …, f(d_m, υ)],

where f(d_i, υ) denotes the middle-layer feature extracted by the i-th component detector d_i from the image υ using a max-pooling quantization function, 1 ≤ i ≤ m, and m denotes the number of component detectors in the candidate component detector set D_c;
(2-2) the selector

Φ_c(f_c) = w·f_c + b

is trained with the candidate feature vectors, and the weight vector of the selector is obtained by solving

w = argmin_{w,b} (1/2)·‖w‖² + C·Σ_{n=1}^{N} ℓ(y_n, w·x_n + b),

where w is the weight vector of the selector, b is the bias of the selector, ℓ is the loss function, C is a penalty factor, y_n denotes the class label of the n-th image in the sample image sequence, x_n denotes the middle-layer feature of the n-th image in the sample image sequence, and N denotes the total number of images in the sample image sequence;
the weight w_i of component detector d_i is the component of w corresponding to d_i, i.e. w = [w_1, w_2, …, w_m]; the weight vector of the selector is thus composed of the weights of all component detectors in the candidate component detector set.
Preferably, C is 1.
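A minimal sketch of the selector training under the stated setting (C = 1). Treating each detector's max-pooled response over an image sequence as one coordinate of f_c, and reading each detector's weight directly off the learned SVM weight vector, are simplifying assumptions for illustration:

```python
# Sketch: f_c concatenates the max-pooled response of every candidate
# detector; a linear SVM over these vectors yields the selector weights.
import numpy as np
from sklearn.svm import LinearSVC

def detector_responses(windows, detectors):
    """windows: (n_windows, d) descriptors of one image sequence.
    Returns f_c, the max-pooled score of each detector (length m)."""
    return np.array([np.max(windows @ w + b) for (w, b) in detectors])

def train_selector(sequences, labels, detectors, C=1.0):
    """labels: +1 for class-A sequences, -1 otherwise."""
    F = np.vstack([detector_responses(s, detectors) for s in sequences])
    sel = LinearSVC(C=C).fit(F, labels)
    w = sel.coef_.ravel()            # w_i scores component detector d_i
    return F, w
```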
Further, the specific implementation of step (3) is as follows:

The discriminative power of each component detector in the candidate component detector set is measured with the weight vector of the selector, and the B% of component detectors with the weakest discriminative power are removed, with B = 3. The removal is performed recursively on the candidate middle-layer feature matrix F_c formed by stacking the candidate feature vectors of all sample images. When k = 1, F_0 = F_c is initialized; when k > 1, the k-th recursion can be represented as:

w_k = SVM(F_{k−1}, y),
S_k = Ψ(w_k, τ),
F_k = F_{k−1}[S_k],

where S_k = [s_1, s_2, …, s_m], s_i ∈ {0, 1}, denotes the component selection flag bits: if s_i = 1 the i-th component detector is selected, and if s_i = 0 it is not selected; y denotes the class label vector of all sample image sequences; w_k denotes the weight vector after the k-th recursion; F_{k−1} and F_k denote the candidate middle-layer feature matrices after the (k−1)-th and k-th recursions. The component detector removal rate is set to τ = 0.03, i.e., each recursion discards the 3% of component detectors with the lowest discriminative power in the candidate component detector set; Ψ(w_k, τ) denotes removing detectors at the removal rate τ according to the weight vector w_k after the k-th recursion. After H recursions in total, the new candidate component detector set D_c[S_H] is obtained.
Preferably, H = 3.
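A sketch of this recursive elimination with the stated values τ = 0.03 and H = 3; ranking detectors by the magnitude of their selector weights is an assumption about how discriminative power is scored here:

```python
# Recursive removal: retrain the selector SVM, drop the tau fraction of
# detectors with the smallest weights, repeat H times
# (w_k = SVM(F_{k-1}, y), F_k = F_{k-1}[S_k]).
import numpy as np
from sklearn.svm import LinearSVC

def recursive_removal(F, y, tau=0.03, H=3, C=1.0):
    """F: (N, m) candidate middle-layer feature matrix; y: class labels.
    Returns the indices of the surviving detectors (the mask S_H)."""
    keep = np.arange(F.shape[1])
    for _ in range(H):
        w = LinearSVC(C=C).fit(F[:, keep], y).coef_.ravel()
        n_drop = max(1, int(round(tau * len(keep))))
        weakest_first = np.argsort(np.abs(w))  # assumed discriminability score
        keep = keep[np.sort(weakest_first[n_drop:])]
    return keep
```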
Preferably, P is 300.
Preferably, the classifier is an SVM classifier.
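Putting steps (4) to (6) together, a hedged end-to-end sketch with P = 300: the top-weighted P detectors of every class form the bag of words, each sequence is encoded by its max-pooled responses to the whole bag, and an SVM classifier performs the final recognition (all names are illustrative):

```python
# Bag-of-detectors encoding and final behavior recognition classifier.
import numpy as np
from sklearn.svm import LinearSVC

def build_bag(per_class_detectors, per_class_weights, P=300):
    """Keep the P highest-weight detectors of each behavior class."""
    bag = []
    for dets, w in zip(per_class_detectors, per_class_weights):
        top = np.argsort(w)[::-1][:P]
        bag.extend(dets[i] for i in top)
    return bag

def encode(windows, bag):
    """Middle-layer feature of one sequence: max-pooled bag responses."""
    return np.array([np.max(windows @ w + b) for (w, b) in bag])

def train_recognizer(sequences, labels, bag, C=1.0):
    X = np.vstack([encode(s, bag) for s in sequences])
    return LinearSVC(C=C).fit(X, labels)  # one-vs-rest multi-class SVM
```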
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (4)

1. A behavior recognition method based on middle layer features is characterized by comprising the following steps:
(1) extracting a spatiotemporal component set D of the class A behavior category and a spatiotemporal component set N of other behavior categories except the class A from the sample image sequence, and training a component detector by using the spatiotemporal component set D and the spatiotemporal component set N to obtain a candidate component detector set;
(2) combining the middle-layer features of the sample image sequence selected by each component detector in the candidate component detector set to obtain candidate feature vectors, and training a selector by using the candidate feature vectors to obtain weight vectors of the selector;
(3) measuring the discriminative power of each component detector in the candidate component detector set by using the weight vector of the selector, and removing the B% of component detectors with the weakest discriminative power from the candidate component detector set to obtain a new candidate component detector set;
(4) sorting the component detectors in descending order of weight according to the weight of each component detector in the new candidate component detector set, and selecting the top-ranked P component detectors as the middle-layer feature extractor of the class-A behavior category;
(5) acquiring the middle-layer feature extractor of each behavior category among the behavior categories, combining the middle-layer feature extractors into a bag of words, extracting sample middle-layer features of the sample image sequence by using the bag of words, and training a classifier by using the sample middle-layer features to obtain a behavior recognition classifier;
(6) inputting the test image sequence into a behavior recognition classifier to obtain the behavior category of the test image sequence;
the candidate feature vector fcComprises the following steps:
Figure FDA0002195611230000011
wherein d isiIn order to be the i-th component detector,
Figure FDA0002195611230000012
for the 1 st component detector d1The middle layer features extracted by using the maximum pooling quantization function in the images upsilon in the sample image sequence,
Figure FDA0002195611230000013
for the 2 nd component detector d2The middle layer features extracted by using the maximum pooling quantization function in the images upsilon in the sample image sequence,
Figure FDA0002195611230000021
for the m-th component detector dmThe middle-layer features are extracted by utilizing a maximum pooling quantization function in an image upsilon in a sample image sequence, i is more than or equal to 1 and less than or equal to m, and m represents a candidate component detector set DcThe number of middle component detectors;
the weight vector of the selector is obtained from the selector

Φ_c(f_c) = w·f_c + b

by solving

w = argmin_{w,b} (1/2)·‖w‖² + C·Σ_{n=1}^{N} ℓ(y_n, w·x_n + b),

where Φ_c(f_c) denotes the selector, w is the weight vector of the selector, b is the bias of the selector, ℓ is the loss function, C is a penalty factor, y_n denotes the class label of the n-th image in the sample image sequence, x_n denotes the middle-layer feature of the n-th image in the sample image sequence, and N denotes the total number of images in the sample image sequence;
the specific implementation manner of the step (3) is as follows:
using the weight vector of the selector to measure the discriminative power of each component detector in the candidate set of component detectors, and using recursive removal, candidate mid-level feature matrices
Figure FDA0002195611230000024
When k is 1, F is initialized0=FcWhen k > 1, the k-th recursion can be represented as follows:
wk=SVM(Fk-1,y),
Figure FDA0002195611230000025
Fk=Fk-1[Sk]
wherein S isk=[s1,s2…sm],siE {0,1}, indicates the component select flag bit, if si1, the ith component detector is selected, if si0, then the ith component detector is not selected, y represents the class label vector of the sample image sequence, wkRepresents the weight vector after the k-th recursion, Fk-1Represents the candidate mid-level feature matrix after the k-1 recursion, FkRepresenting the candidate mid-level feature matrix after the k-th recursion,
Figure FDA0002195611230000026
representing the vector w according to the weights after the k-th recursionkAnd removing at a rate of τ ═ B%, for a total of H recursions, to obtain a new candidate part detector set.
2. The behavior recognition method based on middle-layer features as claimed in claim 1, wherein the training of the component detector is implemented as follows: the positive samples are spatio-temporal components in the spatio-temporal component set D and the negative samples are spatio-temporal components in the spatio-temporal component set N; for each spatio-temporal component in the set D, the component detector is trained with that one positive sample and a plurality of negative samples.
3. A behavior recognition system based on middle-layer features, comprising:
a candidate component detector set acquisition module, configured to extract a spatio-temporal component set D of behavior class A and a spatio-temporal component set N of the behavior classes other than class A from the sample image sequence, and to train component detectors with the sets D and N to obtain a candidate component detector set;
a selector training module, configured to combine the middle-layer features of the sample image sequence selected by each component detector in the candidate component detector set into candidate feature vectors, and to train a selector with the candidate feature vectors to obtain the weight vector of the selector;
a component detector removal module, configured to measure the discriminative power of each component detector in the candidate component detector set with the weight vector of the selector, and to remove the B% of component detectors with the weakest discriminative power to obtain a new candidate component detector set;
a middle-layer feature extractor module, configured to sort the component detectors in descending order of weight in the new candidate component detector set and to select the top-ranked P component detectors as the middle-layer feature extractor of the class-A behavior category;
a classifier training module, configured to acquire the middle-layer feature extractor of each behavior category, combine the extractors into a bag of words, extract sample middle-layer features of the sample image sequence with the bag of words, and train a classifier with the sample middle-layer features to obtain a behavior recognition classifier;
and a behavior recognition module, configured to input the test image sequence into the behavior recognition classifier to obtain the behavior category of the test image sequence;
the candidate feature vector fcComprises the following steps:
Figure FDA0002195611230000031
wherein d isiIn order to be the i-th component detector,
Figure FDA0002195611230000032
for the 1 st component detector d1The middle layer features extracted by using the maximum pooling quantization function in the images upsilon in the sample image sequence,for the 2 nd component detector d2The middle layer features extracted by using the maximum pooling quantization function in the images upsilon in the sample image sequence,
Figure FDA0002195611230000041
for the m-th component detector dmThe middle-layer features are extracted by utilizing a maximum pooling quantization function in an image upsilon in a sample image sequence, i is more than or equal to 1 and less than or equal to m, and m represents a candidate component detector set DcThe number of middle component detectors;
the weight vector of the selector is obtained from the selector

Φ_c(f_c) = w·f_c + b

by solving

w = argmin_{w,b} (1/2)·‖w‖² + C·Σ_{n=1}^{N} ℓ(y_n, w·x_n + b),

where Φ_c(f_c) denotes the selector, w is the weight vector of the selector, b is the bias of the selector, ℓ is the loss function, C is a penalty factor, y_n denotes the class label of the n-th image in the sample image sequence, x_n denotes the middle-layer feature of the n-th image in the sample image sequence, and N denotes the total number of images in the sample image sequence;
the specific implementation manner of the component detector removal module is as follows:

the discriminative power of each component detector in the candidate component detector set is measured with the weight vector of the selector, and recursive removal is applied to the candidate middle-layer feature matrix F_c; when k = 1, F_0 = F_c is initialized, and when k > 1, the k-th recursion can be represented as:

w_k = SVM(F_{k−1}, y),
S_k = Ψ(w_k, τ),
F_k = F_{k−1}[S_k],

where S_k = [s_1, s_2, …, s_m], s_i ∈ {0, 1}, denotes the component selection flag bits: if s_i = 1 the i-th component detector is selected, and if s_i = 0 it is not selected; y denotes the class label vector of the sample image sequence; w_k denotes the weight vector after the k-th recursion; F_{k−1} and F_k denote the candidate middle-layer feature matrices after the (k−1)-th and k-th recursions; and Ψ(w_k, τ) denotes removing, according to the weight vector w_k after the k-th recursion, detectors at the removal rate τ = B%; after H recursions in total, the new candidate component detector set is obtained.
4. The behavior recognition system based on middle-layer features as claimed in claim 3, wherein the training of the component detector is implemented as follows: the positive samples are spatio-temporal components in the spatio-temporal component set D and the negative samples are spatio-temporal components in the spatio-temporal component set N; for each spatio-temporal component in the set D, the component detector is trained with that one positive sample and a plurality of negative samples.
CN201710416188.1A 2017-06-06 2017-06-06 Behavior identification method and system based on middle layer characteristics Expired - Fee Related CN107194365B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710416188.1A CN107194365B (en) 2017-06-06 2017-06-06 Behavior identification method and system based on middle layer characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710416188.1A CN107194365B (en) 2017-06-06 2017-06-06 Behavior identification method and system based on middle layer characteristics

Publications (2)

Publication Number Publication Date
CN107194365A CN107194365A (en) 2017-09-22
CN107194365B true CN107194365B (en) 2020-01-03

Family

ID=59877941

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710416188.1A Expired - Fee Related CN107194365B (en) 2017-06-06 2017-06-06 Behavior identification method and system based on middle layer characteristics

Country Status (1)

Country Link
CN (1) CN107194365B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108268834B (en) * 2017-12-25 2021-09-28 西安电子科技大学 Behavior identification method based on behavior component space-time relation

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103500340A (en) * 2013-09-13 2014-01-08 南京邮电大学 Human body behavior identification method based on thematic knowledge transfer
CN106022300A (en) * 2016-06-02 2016-10-12 中国科学院信息工程研究所 Traffic sign identifying method and traffic sign identifying system based on cascading deep learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103500340A (en) * 2013-09-13 2014-01-08 南京邮电大学 Human body behavior identification method based on thematic knowledge transfer
CN106022300A (en) * 2016-06-02 2016-10-12 中国科学院信息工程研究所 Traffic sign identifying method and traffic sign identifying system based on cascading deep learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Action Recognition by Learning Mid-level Motion Features; Alireza Fathi et al.; 2008 IEEE Conference on Computer Vision and Pattern Recognition; 20080805; full text *
Boosted Exemplar Learning for Action Recognition and Annotation; Tianzhu Zhang et al.; IEEE Transactions on Circuits and Systems for Video Technology; 20110731; Vol. 21, No. 7, pp. 853-866 *
Learning Mid-Level Features For Recognition; Y-Lan Boureau et al.; 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition; 20100805; pp. 2559-2566 *
Research on behavior recognition methods based on feature representation (基于特征表示的行为识别方法研究); Chen Feifei (陈飞飞); China Doctoral Dissertations Full-text Database, Information Science and Technology; 20160715 (No. 7); pp. 40-50 *

Also Published As

Publication number Publication date
CN107194365A (en) 2017-09-22

Similar Documents

Publication Publication Date Title
Mao et al. What can help pedestrian detection?
CN106127197B (en) Image saliency target detection method and device based on saliency label sorting
CN110738160A (en) human face quality evaluation method combining with human face detection
CN107622280B (en) Modularized processing mode image saliency detection method based on scene classification
Jiang et al. Social behavioral phenotyping of Drosophila with a 2D–3D hybrid CNN framework
CN110599463A (en) Tongue image detection and positioning algorithm based on lightweight cascade neural network
CN106203374A (en) A kind of characteristic recognition method based on compressed sensing and system thereof
CN111814690A (en) Target re-identification method and device and computer readable storage medium
Sun et al. Brushstroke based sparse hybrid convolutional neural networks for author classification of Chinese ink-wash paintings
Awang et al. Vehicle counting system based on vehicle type classification using deep learning method
Chen et al. Multi-modality gesture detection and recognition with un-supervision, randomization and discrimination
Naseer et al. Pixels to precision: features fusion and random forests over labelled-based segmentation
Rodrigues et al. Evaluation of Transfer Learning Scenarios in Plankton Image Classification.
CN108596244A (en) A kind of high spectrum image label noise detecting method based on spectrum angle density peaks
CN107194365B (en) Behavior identification method and system based on middle layer characteristics
Moller et al. Active learning for the classification of species in underwater images from a fixed observatory
CN109492702A (en) Pedestrian based on sorting measure function recognition methods, system, device again
CN106934339B (en) Target tracking and tracking target identification feature extraction method and device
CN110689066B (en) Training method combining face recognition data equalization and enhancement
Dong et al. Scene-oriented hierarchical classification of blurry and noisy images
Boudhane et al. Optical fish classification using statistics of parts
Abdulmunem et al. 3D GLOH features for human action recognition
CN113903004A (en) Scene recognition method based on middle-layer convolutional neural network multi-dimensional features
CN114463574A (en) Scene classification method and device for remote sensing image
CN110555342B (en) Image identification method and device and image equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200103

Termination date: 20210606