CN106529467B - Group behavior recognition method based on multi-feature fusion - Google Patents
- Publication number
- CN106529467B CN201610976817.1A
- Authority
- CN
- China
- Prior art keywords
- feature
- people
- behavior
- information
- scene
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/53—Recognition of crowd images, e.g. recognition of crowd congestion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/35—Categorising the entire scene, e.g. birthday party or wedding scene
- G06V20/36—Indoor scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/35—Categorising the entire scene, e.g. birthday party or wedding scene
- G06V20/38—Outdoor scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/26—Techniques for post-processing, e.g. correcting the recognition result
- G06V30/262—Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
- G06V30/274—Syntactic or semantic context, e.g. balancing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a group behavior recognition method based on multi-feature fusion. Characteristic information is extracted at three different levels: features are extracted for each individual person, chiefly everyone's position, size, and motion information in every frame, together with per-person features extracted by a convolutional neural network; semantic features are extracted for the interactions between people, mainly considering interpersonal action relationships and relative orientation relationships; and scene information is extracted from the environment in which the people in the group behavior are located. A fully connected conditional random field model fuses this characteristic information to realize recognition of the group behavior. Because the method considers multiple kinds of feature information simultaneously, it can describe group behavior more comprehensively and effectively, improves the group behavior recognition rate, and has important application value in video surveillance.
Description
Technical field
The invention belongs to the fields of image processing and pattern recognition, and in particular relates to a group behavior recognition method based on multi-feature fusion.
Background technique
Behavior recognition is a prominent frontier of computer vision. Research on recognizing the behavior of a single person, or of the interaction between two people, has already achieved significant results. In recent years, as demand from video surveillance, human-computer interaction, and content-based video retrieval keeps growing, group behavior recognition has gradually become a research hotspot of computer vision and pattern recognition. However, the number of people involved in a group behavior is large and not fixed, and the variability of interpersonal interactions and the complexity of scenes pose great challenges to group behavior research.
In recent years, much work on group behavior recognition has been devoted to studying the influence of semantic information on recognition, with some success. For analyzing group behavior under video surveillance, what needs most consideration is semantic information, i.e., the interactions between people and the role each person plays in the particular group. Choi proposed a semantic descriptor, Spatio-Temporal Local (STL), which mainly uses the relative relationships between people's postures to describe their interactions. This descriptor for capturing semantic relations builds on the Shape Context algorithm from the pattern classification field: centered on one person in the scene, it captures the spatial positions and facing directions of the surrounding people relative to that person, finally representing them as a histogram. The STL feature effectively captures interpersonal spatial relations and some of the interaction, but its drawback is that it does not describe interpersonal action relationships, so the recognition effect is not ideal. Lan proposed an appearance-based action semantic descriptor, Action Context, which uses the action relationships between each person and the nearby surrounding people to better describe the current person's behavior. This descriptor works relatively well for group behaviors with large action diversity, but it is sensitive to viewpoint changes, which keeps the recognition rate low. Takuhiro combined the advantages of the methods of Lan and Choi: on the basis of interpersonal action relationships, relative orientation relationships between people are also considered, making the feature insensitive to viewpoint change and improving the recognition results to some extent, though they are still not ideal. From the above methods it can be seen that the characteristic information they consider is relatively simple. For group behaviors with variable numbers of people and complex interactions, characteristic information should be extracted from many aspects and integrated; only in this way can group behavior be described more comprehensively and effectively.
Summary of the invention
In a group behavior there are relatively many people, and each person's behavior differs; if they are all treated as one group behavior and only interaction features are extracted and analyzed, the result is clearly insufficient. Only by considering more feature information that can effectively describe the group behavior, and by considering these features comprehensively, can group behavior recognition be carried out more significantly. The object of the invention is to propose a group behavior recognition method based on multi-feature fusion, characterized in that the method comprises the following steps:
Step 1: feature extraction at three different levels. Individual characteristic information is extracted for each single person; interaction features are extracted for the interactions between people, where the interactions of interest are mainly interpersonal action relationships and relative orientation relationships; and scene information is extracted from the environment in which the people in the group behavior are located.
Step 2: feature fusion. The interpersonal interaction features are fused with the scene information, and a support vector machine (SVM) classifier with a radial basis function kernel produces behavior scores, which serve as the unary potential of a fully connected conditional random field model; the characteristic information extracted for each single person serves as the binary (pairwise) potential of the model. All extracted features are thereby fused in one model to carry out group behavior recognition.
As a further improvement of the present invention, step 1 specifically includes:
Step 1-1: extract individual characteristic information, mainly each person's position information, size information (height), and motion information (position and size are provided in the database); these three features mainly reflect each person's characteristic appearance. In addition, a convolutional neural network (CNN) extracts features for each single person, supplementing the three preceding kinds of characteristic information.
Step 1-2: extract features for the interactions between people. Taking each person in turn as the center, the nearby people around him are regarded as his context; from his own behavior and the behavior exhibited by the nearby people, a behavior context feature is extracted, denoted the AC descriptor. This descriptor captures only interpersonal action relationships, so on this basis the relative orientation relationship between each person and the nearby people around him is also considered, and a relative context feature is extracted, denoted the RAC descriptor.
Step 1-3: the scene in which the people in a group behavior are located also provides necessary clues for behavior recognition, so scene information is extracted from the environment, mainly of three kinds: outdoor, indoor, and automobile. The extraction proceeds in two steps: first, the scene is classified as outdoor or indoor using the spatial pyramid matching method; second, the scene picture is observed with an eye tracker to obtain regions of interest, which are analyzed to see whether automobiles are present in the scene.
As a further improvement of the present invention, step 2 specifically includes:
Step 2-1: compute the unary potential of the fully connected conditional random field model. The interpersonal interaction features, i.e., the AC descriptor and the RAC descriptor, are each fused with the scene information to obtain new feature vectors; an SVM classification model yields behavior scores, which are converted to probabilities by softmax; the maximum over the two probabilities yields a new probability vector, and this result serves as the unary potential of the model.
Step 2-2: feature fusion. All characteristic information extracted for each single person serves as the binary potential of the fully connected conditional random field model; the model learns automatically from the unary and binary potentials and carries out group behavior recognition.
Beneficial effects
Existing group behavior recognition methods extract features primarily for the interactions between people, and treat everyone in the scene as one group for analysis. In real video surveillance scenes, however, there are often multiple groups, each engaged in a different activity. For example: there are 5 people in total in a scene, of whom 4 stand in a circle talking while one person just walks past to the side; this person and the other 4 are not one group, because the behaviors they exhibit differ. Treating everyone as one group for analysis is clearly unreasonable. Moreover, current group behavior research methods do not take the scene information of people's surroundings into account, yet scene information can provide clues for behavior recognition. For example: if we know that a behavior occurs outdoors, with automobiles, zebra crossings, or traffic lights present, we can judge that it is unlikely to be talking or queuing, and more likely to be crossing the road; if it occurs indoors, it cannot be crossing the road or waiting to cross. Introducing scene information therefore has definite significance for group behavior analysis, and our experimental results confirm that considering scene information is effective. Our approach is thus: consider the features of each single person, the interpersonal interaction features, and the scene information; fuse this characteristic information with a fully connected conditional random field model; and realize automatic group division (the basis for dividing groups being that everyone belonging to the same group has similar position, size, and motion information), so as to achieve a better group behavior recognition effect.
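The group-division criterion just described (members of one group share similar position, size, and motion) can be sketched roughly as follows; the thresholds, the similarity test, and all names are illustrative assumptions for this sketch, not values from the patent:

```python
import math

def same_group(p, q, pos_th=2.0, size_th=0.5, motion_th=0.5):
    """Return True if two person records look like members of one group."""
    d_pos = math.dist(p["pos"], q["pos"])
    d_size = abs(p["size"] - q["size"])
    d_mot = math.dist(p["motion"], q["motion"])
    return d_pos < pos_th and d_size < size_th and d_mot < motion_th

def split_groups(people):
    """Union pairwise-similar people into groups (connected components)."""
    parent = list(range(len(people)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    for i in range(len(people)):
        for j in range(i + 1, len(people)):
            if same_group(people[i], people[j]):
                parent[find(i)] = find(j)
    groups = {}
    for i in range(len(people)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

On the 5-person example above (4 talkers standing close together, 1 walker passing by), this split yields two groups.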
Description of the drawings
Fig. 1: Main flow chart of the invention.
Fig. 2: Extracting features for a single person with a convolutional neural network (CNN).
Fig. 3: Gaze trajectory map and gaze heat map obtained from the eye-tracking experiment.
Fig. 4: Recognition results of the proposed method on the Collective Activity Dataset.
Specific embodiments
The invention will be further described with examples with reference to the accompanying drawings. It should be noted that the described examples are only intended to aid understanding of the invention and do not limit it in any way.
The group behavior recognition method based on multi-feature fusion includes the following steps:
Step 1: feature extraction in three parts. Individual characteristic information is extracted for each single person, interaction features are extracted for the interactions between people, and scene information is extracted from the environment in which the people in the group behavior are located.
Step 2: the interpersonal interaction features are fused with the scene information; an SVM classifier with a radial basis function kernel produces behavior scores that serve as the unary potential of a fully connected conditional random field model, while the characteristic information extracted for each single person serves as the binary potential; all extracted features are fused in one model to carry out group behavior recognition.
The feature extraction process includes:
Step 1-1: for each single person, mainly consider position information, size information (height), and motion information (position and size are provided in the database), and extract features for each single person with a convolutional neural network (CNN).
Step 1-2: extract features for the interactions between people. Taking each person in turn as the center, regard the nearby people around him as his context; from his own behavior and the behavior of the nearby people, extract the behavior context feature, denoted the AC descriptor. On this basis, also consider the relative orientation relationship between each person and the nearby people around him, and extract the relative context feature, denoted the RAC descriptor.
Step 1-3: extract scene information from the environment, mainly of three kinds: outdoor, indoor, and automobile. The extraction proceeds in two steps: first, classify the scene as outdoor or indoor using spatial pyramid matching; second, observe the scene picture with an eye tracker to obtain regions of interest, and analyze them to see whether automobiles are present in the scene.
Step 1-1: carry out feature extraction for each person in each frame. The specific operations are:
(1) The group behavior database we use, the Collective Activity Dataset, provides each person's three-dimensional position information, so each person's position and size (height) information can be obtained. An optical flow method (HOF) extracts each person's motion information, indicating whether the person is moving or static.
(2) A convolutional neural network (CNN) extracts features for each person. A CNN performs semantic integration at a high level through multiple convolution and down-sampling operations, so features extracted by a CNN can contain very rich information about the whole image patch; compared with ordinary appearance features, they describe each person's global information more effectively. The convolution operation convolves a neighborhood of the image to obtain neighborhood features; it can enhance the original signal features and reduce noise. The subsequent down-sampling operation integrates the feature points in a neighborhood into a new feature; its purpose is dimensionality reduction, reducing the feature dimension while keeping a certain invariance (to rotation, translation, scaling, etc.). Fig. 2 shows the structure of the CNN we adopt, where Cx denotes a convolutional layer and Sx a down-sampling layer. The database gives each person's detection box; since CNN feature extraction requires pictures of a consistent size, each person's detection box is normalized to the same size (60 × 60 in our experiments). As can be seen from the figure, we use three convolutional layers and two down-sampling layers, and the feature extracted for each person is finally 160-dimensional.
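As a rough illustration of the Cx/Sx pipeline just described, the sketch below runs a tiny three-convolution, two-pooling forward pass in plain numpy on a 60 × 60 crop. The filter sizes, the random weights, and the resulting dimensionality (64 here, not the 160 of the patent's network) are illustrative assumptions:

```python
import numpy as np

def conv_valid(img, kernel):
    """2-D valid convolution of a single-channel image with one kernel."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return np.maximum(out, 0.0)          # ReLU-style nonlinearity

def pool2(img):
    """2x2 mean pooling (a 'down-sampling layer' Sx in Fig. 2)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    x = img[:h, :w]
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def cnn_feature(crop, rng):
    """C1 -> S2 -> C3 -> S4 -> C5 on a 60x60 crop, then flatten."""
    x = conv_valid(crop, rng.standard_normal((5, 5)))   # C1: 60 -> 56
    x = pool2(x)                                        # S2: 56 -> 28
    x = conv_valid(x, rng.standard_normal((5, 5)))      # C3: 28 -> 24
    x = pool2(x)                                        # S4: 24 -> 12
    x = conv_valid(x, rng.standard_normal((5, 5)))      # C5: 12 -> 8
    return x.ravel()                                    # 64-dim here
```

A trained network would of course use learned, multi-channel filters; the sketch only shows how the successive convolutions and poolings turn a normalized detection box into a fixed-length feature vector.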
Step 1-2: extract features for the interactions between people. The specific operations are:
(1) Extraction of the behavior context feature (AC). This feature considers the behavior of each person and of the nearby people around him. A HOG feature is extracted for each person and classified by SVM to obtain a score for each behavior class: Ai = [S1i, S2i, …, SKi], where Sni denotes the score for behavior label n obtained by the SVM classifier for the i-th person. With the i-th person as center, the region of the nearby people around him (dis ∈ (0.5 × h, 2 × h), where h corresponds to each person's height) is regarded as the context region, and the context feature is extracted from this region. The context region is divided into M subregions, with Nm(i) denoting the people around the i-th person that fall in the m-th subregion; the m-th sub-context feature is the elementwise maximum of their behavior score vectors:
Cim = [max over j ∈ Nm(i) of S1j, …, max over j ∈ Nm(i) of SKj]
For example, if there are 2 people close to the center person in the first sub-context region, their behavior scores are taken out and the maximum of each respective behavior score yields the first sub-context feature. Concatenating the sub-context features gives Ci = [Ci1, …, CiM], and the behavior context feature is ACi = [Ai, Ci].
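The AC construction above can be sketched as follows, with the context collapsed to a single subregion (M = 1) for brevity; the function name and the neighbor test are illustrative:

```python
import numpy as np

def ac_descriptor(scores, i, heights, positions):
    """AC_i = [A_i, C_i]: person i's own SVM behavior scores concatenated
    with the elementwise max of the score vectors of nearby people
    (distance in (0.5h, 2h)), here with a single context subregion."""
    h = heights[i]
    d = np.linalg.norm(positions - positions[i], axis=1)
    neighbor = (d > 0.5 * h) & (d < 2.0 * h)   # excludes person i (d = 0)
    K = scores.shape[1]
    context = scores[neighbor].max(axis=0) if neighbor.any() else np.zeros(K)
    return np.concatenate([scores[i], context])
```

With M > 1, the same max-pooling would simply be repeated per subregion and the results concatenated.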
(2) Extraction of the relative behavior context feature (RAC). RAC not only considers the behavior features but also captures the relative relationship between the center person and the surrounding people; for example, if the center person faces right and another person around him faces left, their relative relationship is defined as facing opposite directions. Because the AC descriptor does not account for relative orientation, it is sensitive to viewpoint changes; the RAC descriptor overcomes this defect and is an improvement over AC. The extraction of the RAC descriptor is similar to that of AC. Because behavior and orientation are considered simultaneously, the behavior feature is K-dimensional with K = U × V, where U is the number of behavior classes and V the number of orientation classes. From the HOG feature and the orientation of the i-th person obtained by SVM classification, each person's relative behavior score vector Âi is obtained. From Âi, the relative context descriptor of the m-th sub-context region is the elementwise maximum Ĉim = max over j ∈ Nm(i) of Âj; the relative descriptor of the entire context region is Ĉi = [Ĉi1, …, ĈiM], so the relative behavior descriptor of the i-th person is RACi = [Âi, Ĉi].
Step 1-3: extract scene information from the environment of the people in the group behavior. The specific operations are:
(1) Classifying the scene as outdoor or indoor with the spatial pyramid matching algorithm.
The spatial pyramid method counts the distribution of image feature points at different resolutions (corresponding to different levels of the pyramid), thereby obtaining the spatial information of the image. First, scale-invariant feature transform (SIFT) descriptors are extracted from every frame of all video sequences; the descriptors of all pictures are clustered with K-means to generate a visual dictionary whose size is set to M = 200. The frequency with which all visual words of each frame occur in each cell of each level is then counted; with the pyramid level set to L = 2, the representation has dimension M × (1 + 4 + 16), so each frame can finally be represented by a 4200-dimensional feature vector. Classification of the scene is finally realized according to spatial pyramid matching.
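The dimensions above (M = 200 words, L = 2, giving a 4200-dimensional vector) can be checked with a small sketch. The inputs here are assumed to be already-quantized visual-word ids with positions normalized to [0, 1)², an assumption made for illustration:

```python
import numpy as np

def spatial_pyramid(words, xy, m=200, levels=2):
    """Histogram visual words over 1x1, 2x2 and 4x4 grids (levels 0..L).

    words: (n,) visual-word ids in [0, m); xy: (n, 2) positions in [0, 1)^2.
    Returns the concatenated histograms: m * (1 + 4 + 16) dims for L = 2.
    """
    feats = []
    for l in range(levels + 1):
        g = 2 ** l                                   # g x g grid at this level
        cell = np.minimum((xy * g).astype(int), g - 1)
        idx = cell[:, 0] * g + cell[:, 1]            # flat cell index
        for c in range(g * g):
            feats.append(np.bincount(words[idx == c], minlength=m))
    return np.concatenate(feats)
```

The matching kernel would additionally weight levels before comparison; only the representation is sketched here.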
(2) Detecting whether there are automobiles in the scene.
When the human visual system processes a complex scene, it concentrates visual attention on a few objects in the scene, striving to obtain the main information in the shortest time; the regions these objects occupy in the scene are called regions of interest (ROI). Extracting the regions of interest of an image and analyzing them can improve the efficiency of information processing. An eye tracker records the user's eye movement data and can also draw gaze trajectory maps, gaze heat maps, etc., intuitively reflecting the regions or objects in the picture that genuinely interest the user. We observe the pictures with an eye tracker (Tobii Studio 3.3.1 software), obtain the regions of interest, and analyze these regions to see whether automobiles are present. The specific procedure: the observer sits about 65 cm from the screen; each picture is presented for 8 s, followed by a gray picture for 2 s, whose purpose is to relieve the observer's visual fatigue; eye movement calibration is carried out before the eye-tracking test begins.
After the test is completed, gaze trajectory maps and gaze heat maps are available, as shown in Fig. 3. Gaze trajectory map: records the observer's gaze track through the whole experiment; blue circles indicate fixation points, the size of a circle indicates the fixation duration (the larger the circle, the longer the fixation), the numbers in the circles give the fixation order, and blue lines indicate saccades. Gaze heat map: indicates the observer's degree of attention to each part of the picture with different colors, so that the regions the subject attends to most, and those ignored, can be seen intuitively; the deeper the color, the longer the fixation time, with red indicating the most attended regions, yellow and green indicating relatively low gaze levels, and uncolored regions not fixated at all.
The eye movement data are exported as an Excel table recording the number of regions of interest, the horizontal and vertical coordinates of each region's center, and the fixation time of each region of interest. The data are preprocessed by discarding fixations shorter than 100 ms; that is, if the observer fixates a region too briefly, he is probably not interested in it. From the eye movement data the center of each ROI is obtained; combined with the gaze heat map, a 180 × 120 rectangular area is extracted around each ROI center, and these rectangular areas are then analyzed to see whether automobiles are present. The specific algorithm steps:
1) Extract SIFT features from each extracted 180 × 120 rectangular area of every frame;
2) Choose several target object (automobile) pictures, extract their SIFT features, compute the Euclidean distances between the features, and average them to obtain a threshold;
3) Compute the Euclidean distance between the region features and the target object features and compare it with the threshold; if the computed Euclidean distance is less than the threshold, the rectangular area is considered similar to an automobile, and it can then be judged that such a target object (automobile) exists in the scene.
The required scene information is obtained and expressed with a 3-dimensional binary feature vector S = [outdoor indoor automobile]; for example, S = [1 0 1] corresponds to the scene information: outdoors, with automobiles present.
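Steps 1)–3) and the scene vector S can be sketched together as follows. Real SIFT descriptors are replaced here by generic feature vectors, so everything in this sketch is an illustrative assumption rather than the exact pipeline:

```python
import numpy as np

def car_threshold(templates):
    """Mean pairwise Euclidean distance among the car template features."""
    d = [np.linalg.norm(a - b)
         for k, a in enumerate(templates) for b in templates[k + 1:]]
    return float(np.mean(d))

def scene_vector(outdoor, roi_feats, templates):
    """S = [outdoor, indoor, automobile]: an ROI counts as a car if its
    feature is closer to any car template than the averaged threshold."""
    thr = car_threshold(templates)
    has_car = any(np.linalg.norm(f - t) < thr
                  for f in roi_feats for t in templates)
    return [int(outdoor), int(not outdoor), int(has_car)]
```

With an outdoor scene and one ROI feature close to the templates, this yields [1, 0, 1], the example given in the text.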
Step 2: carry out feature fusion with the fully connected conditional random field model and complete group behavior recognition.
Step 2-1: compute the unary potential. The extracted AC descriptor and RAC descriptor are each fused with the scene information feature vector S to obtain the new feature vectors Scene_AC and Scene_RAC. An SVM classifier trained on these two feature vectors yields behavior scores, which are converted into probabilities by softmax; the maximum over the two is then taken:
Pi(yi) = max(Pi(yi|d1), Pi(yi|d2))
where Pi(yi) is the probability that the behavior label of the i-th person is yi, Pi(yi|d1) is the probability computed from the feature vector Scene_AC, and Pi(yi|d2) the probability computed from Scene_RAC. Each person's unary potential is then expressed as:
ψu(yi) = -log(Pi(yi))    (4)
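The max-of-softmax step and Eq. (4) can be sketched directly; the two score vectors below stand in for real SVM outputs:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())          # shift for numerical stability
    return e / e.sum()

def unary_potential(scores_ac, scores_rac):
    """psi_u(y_i) = -log P_i(y_i), with P_i the elementwise max of the
    softmax probabilities from the Scene_AC and Scene_RAC classifiers."""
    p = np.maximum(softmax(scores_ac), softmax(scores_rac))
    return -np.log(p)
```

Note that, as in the text, the elementwise max is used without renormalizing; a more confident label under either classifier therefore gets a lower unary potential.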
Step 2-2: compute the binary potential of the model and carry out behavior recognition with the fully connected conditional random field model. The binary potential expresses the long-range relationships between people in a group, because everyone in the same group has similar position information, size information, motion information (all static or all moving), and certain higher-level information (expressed with the CNN feature). The binary potential over all people is expressed as:
ψp(yi, yj) = u(yi, yj) k(fi, fj)    (5)
where u(yi, yj) is a label compatibility term and k(fi, fj) is a Gaussian kernel sum over the features extracted for each person: the CNN feature cnni, the position information pi, the size information si, and the motion information mi, with weights w. The Gaussian kernel sum can be computed as a weighted sum of Gaussian kernels, one per feature channel:
k(fi, fj) = Σc w(c) exp(-||fi(c) - fj(c)||² / (2θc²))
Inference and learning for the model can use maximum a posteriori estimation.
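One plausible reading of Eq. (5) and the kernel sum can be sketched as follows, using a Potts-style label compatibility for u(yi, yj); the compatibility choice, weights, and bandwidths are assumptions for illustration, not values from the patent:

```python
import numpy as np

def gaussian_kernel(fi, fj, weights, sigmas):
    """k(f_i, f_j) = sum_c w_c * exp(-||f_i^c - f_j^c||^2 / (2 sigma_c^2)),
    summed over the feature channels cnn, pos, size and motion."""
    total = 0.0
    for c in fi:                      # c in {"cnn", "pos", "size", "motion"}
        d2 = np.sum((np.asarray(fi[c]) - np.asarray(fj[c])) ** 2)
        total += weights[c] * np.exp(-d2 / (2.0 * sigmas[c] ** 2))
    return total

def pairwise_potential(yi, yj, fi, fj, weights, sigmas):
    """psi_p = u(y_i, y_j) * k(f_i, f_j), with a Potts compatibility that
    penalizes differing labels on people with similar features."""
    u = 0.0 if yi == yj else 1.0
    return u * gaussian_kernel(fi, fj, weights, sigmas)
```

Under this reading, two people with near-identical position, size, motion, and CNN features pay a high cost for carrying different behavior labels, which is exactly the grouping pressure the text describes.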
The effectiveness of the invention can be further illustrated by the following experiment:
The database most used in current group behavior recognition research is the Collective Activity Dataset, because it was shot in genuinely different scenes, the people in each group activity differ, and the videos are essentially low-resolution surveillance-style sequences shot by a hand-held camera in daily life, largely presenting a realistic video surveillance scene; this database is therefore adopted here for the experiments. This group behavior database contains 44 video sequences covering 5 common group behaviors: queuing, talking, walking, crossing the road, and waiting, and 8 postures: forward, backward, left, right, front-left, back-left, front-right, and back-right. The database additionally provides each person's position in the scene and height, which is convenient for our research. Leave-one-out testing is adopted here: since the database contains 44 video sequences, we use one of the 44 sequences in turn as the test sample and the remaining 43 as training samples, so that every sequence is tested once, and finally the average is taken as our recognition result.
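The leave-one-out protocol just described can be sketched in plain Python: each of the 44 sequences serves once as the test sample while the remaining 43 train.

```python
def leave_one_out(n_sequences=44):
    """Yield (train, test) index splits: one held-out sequence per split."""
    for i in range(n_sequences):
        test = [i]
        train = [j for j in range(n_sequences) if j != i]
        yield train, test
```

The per-split recognition results would then be averaged to give the figures reported in Table 1.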
Table 1

| Method | Mean (%) | Crossing (%) | Waiting (%) | Queuing (%) | Walking (%) | Talking (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Choi et al. [4] | 65.9 | 55.4 | 64.6 | 63.3 | 57.9 | 83.6 |
| Choi et al. [5] | 70.9 | 76.4 | 76.4 | 78.7 | 36.8 | 85.7 |
| Lan et al. [7] | 79.7 | 68 | 69 | 76 | 80 | 99 |
| Takuhiro et al. [8] | 73.2 | 63 | 87 | 89 | 49 | 78 |
| Our method | 79.9 | 67.5 | 85.2 | 99.5 | 74.8 | 71.2 |
The experimental results on the group behavior database can be observed in Table 1 and Fig. 4. In Table 1 we give the recognition rate of every behavior class and the total average recognition rate. It can be seen that our method is effective relative to most existing research methods, although it is only 0.2% higher than the method proposed by Lan. None of these methods takes scene information into account, yet prior knowledge accumulated from daily life tells us that crossing-the-road behavior, or waiting at the roadside to cross, cannot occur indoors; likewise, if we know that a behavior occurs outdoors with automobiles present, it is more likely to be crossing the road. It can be seen that scene information provides certain important clues for group behavior recognition, and our experimental results show this to be feasible and effective.
The above describes only specific embodiments of the invention; clearly, any modification or partial replacement made by persons in this field under the guidance of the technical solution of the invention belongs to the scope defined by the claims of the invention.
Claims (2)
1. A group behavior recognition method based on multi-feature fusion, characterized in that the method comprises the following steps:
Step 1, feature extraction: feature extraction is carried out in three parts: single-person feature information is extracted for each individual, interaction features are extracted from the interactions between people, and scene information is extracted from the environment in which the people involved in the group behavior are located;
Step 2, feature fusion: the interpersonal interaction features are fused with the scene information, and a support vector machine classification algorithm with a radial basis function kernel is used to obtain behavior scores, which serve as the unary potential of a fully connected conditional random field model; the feature information extracted for each single person serves as the binary (pairwise) potential of the fully connected conditional random field model; all extracted features are thus fused in one model to carry out group behavior recognition;
The step 1 specifically includes:
Step 1-1, for the feature information extracted for a single person, each person's location information, size information, and motion information are considered; these three kinds of information belong to the most basic appearance representation; in addition, a convolutional neural network is used to extract features for each single person: this extraction method takes features from the whole picture and, through multiple convolutional layers and down-sampling layers, the finally obtained features are combinations of high-level semantics, so this method describes the behavior and posture information of a single person better than simple appearance features;
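As an illustration only (not part of the claims), the basic per-person cues of Step 1-1 could be assembled into one feature vector roughly as follows; the function name, the box format, and the placeholder CNN appearance vector are all assumptions, not the patent's own implementation:

```python
import numpy as np

def single_person_feature(box, prev_box, cnn_feat):
    """Concatenate location, size and motion cues with a CNN appearance feature.

    box, prev_box: (x, y, w, h) bounding boxes in the current/previous frame.
    cnn_feat: appearance vector from a convolutional network (placeholder here).
    """
    x, y, w, h = box
    px, py, pw, ph = prev_box
    location = [x + w / 2.0, y + h / 2.0]      # person centre
    size = [w, h]                               # person extent
    motion = [location[0] - (px + pw / 2.0),    # centre displacement
              location[1] - (py + ph / 2.0)]
    return np.concatenate([location, size, motion, cnn_feat])

feat = single_person_feature((10, 20, 4, 8), (8, 20, 4, 8), np.zeros(16))
# feat has 2 + 2 + 2 + 16 = 22 dimensions
```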
Step 1-2, feature extraction is carried out for the interaction between people: taking each person in turn as the center, the people near him are regarded as his context, and a behavior context feature, denoted the AC descriptor, is extracted from his own behavior and the behaviors exhibited by the nearby people; this descriptor captures the action relationships between people; to improve the robustness of the feature, the relative orientation relationship between each person and the nearby people around him is further considered on this basis, and a relative context feature, denoted the RAC descriptor, is extracted; the concrete operations are as follows:
(1) Extraction of the action context (AC) feature. This feature considers the behavior of each person and of the people near him. HOG features are extracted for every person and classified by an SVM to obtain a score for each behavior class: Ai=[S1i, S2i, ..., SKi], where Sni denotes the score of behavior label n for the i-th person given by the SVM classifier. Taking the i-th person as the center, the region of nearby people around him (dis ∈ (0.5 × h, 2 × h), where h is the person's height) is regarded as the context area, and context features are extracted from this region.
The context area is divided into M sub-regions, and Nm(i) denotes the people in the m-th sub-region around the i-th person. For example, if there are 2 people close to him in the first context sub-region, their behavior score vectors are taken out and the maximum of each corresponding behavior score is kept, giving the first sub-context feature C1(i); in general Cm(i) is the element-wise maximum of Aj over the people j in Nm(i), and Ci=[C1(i), ..., CM(i)]. The action context feature is then ACi=[Ai, Ci];
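A minimal sketch of the AC descriptor computation described above, assuming the per-person behavior scores are already given as a matrix; the function and variable names are hypothetical:

```python
import numpy as np

def ac_descriptor(scores, subregions):
    """Action-context (AC) descriptor for one centre person.

    scores: (P, K) SVM behaviour scores, row 0 = the centre person (A_i).
    subregions: list of M lists of row indices of nearby people per sub-region.
    """
    A_i = scores[0]
    K = scores.shape[1]
    C = []
    for members in subregions:
        if members:                       # element-wise max over neighbours
            C.append(scores[members].max(axis=0))
        else:                             # empty sub-region -> zero scores
            C.append(np.zeros(K))
    return np.concatenate([A_i] + C)      # AC_i = [A_i, C_1(i), ..., C_M(i)]

scores = np.array([[0.9, 0.1],
                   [0.2, 0.7],
                   [0.6, 0.3]])
ac = ac_descriptor(scores, [[1, 2], []])   # M = 2 sub-regions
# ac = [0.9, 0.1, 0.6, 0.7, 0.0, 0.0]
```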
(2) Extraction of the relative action context (RAC) feature. RAC considers not only the behavior features but also captures the relative relationship between the center person and the people around him: when the center person faces right and another person around him faces left, their relative relationship is defined as facing opposite directions. The AC descriptor does not account for this relative orientation relationship. The RAC descriptor is extracted in a way similar to AC, but since behavior and orientation are considered jointly, its behavior feature has dimension K = U × V, where U is the number of behavior classes and V is the number of orientation classes. From the HOG features and the SVM-classified orientation of the i-th person, his relative behavior score RAi is obtained; from RAi, the relative context descriptor of the m-th sub context area, RCm(i), is the element-wise maximum of RAj over the people j in Nm(i), and the relative descriptor of the entire context area is RCi=[RC1(i), ..., RCM(i)]. The relative action context descriptor of the i-th person is therefore RACi=[RAi, RCi];
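The RAC computation can be sketched the same way; note the patent does not specify how the joint behavior-orientation score of dimension K = U × V is formed, so the outer product used here is purely an assumption for illustration:

```python
import numpy as np

def rac_descriptor(beh_scores, ori_scores, subregions):
    """Relative action-context (RAC) descriptor (sketch).

    beh_scores: (P, U) behaviour scores; ori_scores: (P, V) orientation scores;
    the joint K = U * V score is taken here as their outer product, flattened.
    """
    joint = np.einsum('pu,pv->puv', beh_scores, ori_scores)
    joint = joint.reshape(len(beh_scores), -1)   # (P, U*V)
    RA_i = joint[0]                              # centre person's joint score
    RC = [joint[m].max(axis=0) if m else np.zeros(joint.shape[1])
          for m in subregions]                   # element-wise max per region
    return np.concatenate([RA_i] + RC)           # RAC_i = [RA_i, RC_i]

beh = np.array([[1.0, 0.0], [0.0, 1.0]])   # U = 2 behaviour classes
ori = np.array([[0.0, 1.0], [1.0, 0.0]])   # V = 2 orientation classes
rac = rac_descriptor(beh, ori, [[1]])       # M = 1 sub-region
```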
Step 1-3, scene information is extracted from the environment in which the people are located, considering three kinds of scene information: outdoor, indoor, and car; the extraction of scene information proceeds in two steps: first, the scene is classified as outdoor or indoor using a spatial pyramid method; second, an eye tracker is used to observe the scene picture and obtain regions of interest, which are analyzed to detect whether a car is present in the scene; the three kinds of extracted scene information are fused and represented by a 3-dimensional binary vector: if the scene contains the information corresponding to a position, that position is set to 1, otherwise to 0.
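The 3-dimensional binary scene vector of Step 1-3 is straightforward to sketch; the function and argument names are hypothetical:

```python
def scene_vector(is_outdoor, is_indoor, has_car):
    """3-dimensional binary scene vector [outdooor? no: outdoor, indoor, car]
    (Step 1-3): 1 if the corresponding information is present, else 0."""
    return [int(is_outdoor), int(is_indoor), int(has_car)]

# an outdoor street scene with a car present:
v = scene_vector(True, False, True)   # -> [1, 0, 1]
```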
2. The group behavior recognition method according to claim 1, characterized in that the step 2 specifically comprises:
Step 2-1, the interpersonal interaction features, i.e. the AC descriptor and the RAC descriptor, are each fused with the scene information to obtain new feature vectors; these are classified using the support vector machine classification algorithm with a radial basis function kernel to obtain behavior scores, which are each converted into probabilities by softmax; the two probability vectors are combined by taking the element-wise maximum to obtain a new probability vector, and the result is used as the unary potential of the fully connected conditional random field model;
Calculating the unary potential: the extracted AC descriptor and RAC descriptor are each fused with the scene information feature vector S, giving new feature vectors Scene_AC and Scene_RAC; an SVM classifier is trained on these two feature vectors to obtain behavior scores; the score vectors are converted by softmax and arranged into matrices, and the element-wise maximum of the two is taken;
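A sketch of the unary-potential fusion of Step 2-1, assuming the two SVM score vectors for Scene_AC and Scene_RAC are already computed; names are hypothetical:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-d score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def unary_potential(svm_scores_ac, svm_scores_rac):
    """Fuse the Scene_AC and Scene_RAC score vectors into a unary potential:
    softmax each score vector, then take the element-wise max (Step 2-1)."""
    p_ac = softmax(np.asarray(svm_scores_ac, dtype=float))
    p_rac = softmax(np.asarray(svm_scores_rac, dtype=float))
    return np.maximum(p_ac, p_rac)

u = unary_potential([2.0, 0.0, 0.0], [0.0, 0.0, 2.0])
# the fused vector keeps the strongest evidence from either descriptor
```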
Step 2-2, the feature information extracted for each single person is used as the binary (pairwise) potential of the fully connected conditional random field model, whereby all extracted features are fused in one model and group behavior recognition is carried out.
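For illustration, the energy of a fully connected CRF combining the unary potentials of Step 2-1 with pairwise potentials from Step 2-2 could look like the following sketch; the pairwise tensor layout is an assumption, and inference over this energy is not shown:

```python
import numpy as np

def crf_energy(labels, unary, pairwise):
    """Energy of a fully connected CRF over the people in the scene (sketch).

    labels: (P,) behaviour label per person.
    unary: (P, K) unary potentials from the fused scores (Step 2-1).
    pairwise: (P, P, K, K) binary potentials derived from the single-person
              features (Step 2-2); this tensor structure is an assumption.
    """
    P = len(labels)
    e = sum(unary[i, labels[i]] for i in range(P))
    e += sum(pairwise[i, j, labels[i], labels[j]]
             for i in range(P) for j in range(P) if i < j)
    return e

unary = np.array([[1.0, 0.0], [0.0, 1.0]])
pairwise = np.zeros((2, 2, 2, 2))
e = crf_energy([0, 1], unary, pairwise)   # -> 2.0
```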
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610976817.1A CN106529467B (en) | 2016-11-07 | 2016-11-07 | Group behavior recognition methods based on multi-feature fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610976817.1A CN106529467B (en) | 2016-11-07 | 2016-11-07 | Group behavior recognition methods based on multi-feature fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106529467A CN106529467A (en) | 2017-03-22 |
CN106529467B true CN106529467B (en) | 2019-08-23 |
Family
ID=58349991
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610976817.1A Active CN106529467B (en) | 2016-11-07 | 2016-11-07 | Group behavior recognition methods based on multi-feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106529467B (en) |
Families Citing this family (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109214403B (en) * | 2017-07-06 | 2023-02-28 | 斑马智行网络(香港)有限公司 | Image recognition method, device and equipment and readable medium |
CN107679462B (en) * | 2017-09-13 | 2021-10-19 | 哈尔滨工业大学深圳研究生院 | Depth multi-feature fusion classification method based on wavelets |
US20190272466A1 (en) * | 2018-03-02 | 2019-09-05 | University Of Southern California | Expert-driven, technology-facilitated intervention system for improving interpersonal relationships |
CN108647641B (en) * | 2018-05-10 | 2021-04-27 | 北京影谱科技股份有限公司 | Video behavior segmentation method and device based on two-way model fusion |
CN110659397B (en) * | 2018-06-28 | 2022-10-04 | 杭州海康威视数字技术股份有限公司 | Behavior detection method and device, electronic equipment and storage medium |
CN109299657B (en) * | 2018-08-14 | 2020-07-03 | 清华大学 | Group behavior identification method and device based on semantic attention retention mechanism |
CN111326253A (en) * | 2018-12-14 | 2020-06-23 | 深圳先进技术研究院 | Method for evaluating multi-modal emotional cognitive ability of patients with autism spectrum disorder |
CN109620266B (en) * | 2018-12-29 | 2021-12-21 | 中国科学院深圳先进技术研究院 | Method and system for detecting anxiety level of individual |
CN109977856B (en) * | 2019-03-25 | 2023-04-07 | 中国科学技术大学 | Method for identifying complex behaviors in multi-source video |
CN109977872B (en) * | 2019-03-27 | 2021-09-17 | 北京迈格威科技有限公司 | Motion detection method and device, electronic equipment and computer readable storage medium |
CN110348296B (en) * | 2019-05-30 | 2022-04-12 | 北京市遥感信息研究所 | Target identification method based on man-machine fusion |
CN110263723A (en) * | 2019-06-21 | 2019-09-20 | 王森 | The gesture recognition method of the interior space, system, medium, equipment |
CN110309790B (en) * | 2019-07-04 | 2021-09-03 | 闽江学院 | Scene modeling method and device for road target detection |
CN110796081B (en) * | 2019-10-29 | 2023-07-21 | 深圳龙岗智能视听研究院 | Group behavior recognition method based on relational graph analysis |
CN112131944B (en) * | 2020-08-20 | 2023-10-17 | 深圳大学 | Video behavior recognition method and system |
CN112188171A (en) * | 2020-09-30 | 2021-01-05 | 重庆天智慧启科技有限公司 | System and method for judging visiting relationship of client |
CN113569645B (en) * | 2021-06-28 | 2024-03-22 | 广东技术师范大学 | Track generation method, device and system based on image detection |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103500340B (en) * | 2013-09-13 | 2017-02-08 | 南京邮电大学 | Human body behavior identification method based on thematic knowledge transfer |
JP6194777B2 (en) * | 2013-11-29 | 2017-09-13 | 富士通株式会社 | Operation determination method, operation determination apparatus, and operation determination program |
CN104063721B (en) * | 2014-07-04 | 2017-06-16 | 中国科学院自动化研究所 | A kind of human behavior recognition methods learnt automatically based on semantic feature with screening |
CN105631462A (en) * | 2014-10-28 | 2016-06-01 | 北京交通大学 | Behavior identification method through combination of confidence and contribution degree on the basis of space-time context |
CN104966052A (en) * | 2015-06-09 | 2015-10-07 | 南京邮电大学 | Attributive characteristic representation-based group behavior identification method |
CN105426820B (en) * | 2015-11-03 | 2018-09-21 | 中原智慧城市设计研究院有限公司 | More people's anomaly detection methods based on safety monitoring video data |
CN105574489B (en) * | 2015-12-07 | 2019-01-11 | 上海交通大学 | Based on the cascade violence group behavior detection method of level |
- 2016
- 2016-11-07: application CN201610976817.1A filed in China; granted as patent CN106529467B (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN106529467A (en) | 2017-03-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106529467B (en) | Group behavior recognition methods based on multi-feature fusion | |
CN109389055B (en) | Video classification method based on mixed convolution and attention mechanism | |
He et al. | Multi-scale FCN with cascaded instance aware segmentation for arbitrary oriented word spotting in the wild | |
Wang et al. | Dense trajectories and motion boundary descriptors for action recognition | |
Zhang et al. | Fast and robust occluded face detection in ATM surveillance | |
JP7386545B2 (en) | Method for identifying objects in images and mobile device for implementing the method | |
Betancourt et al. | A sequential classifier for hand detection in the framework of egocentric vision | |
CN111523462A (en) | Video sequence list situation recognition system and method based on self-attention enhanced CNN | |
Aich et al. | Improving object counting with heatmap regulation | |
Arivazhagan et al. | Human action recognition from RGB-D data using complete local binary pattern | |
CN108416780B (en) | Object detection and matching method based on twin-region-of-interest pooling model | |
Vezzani et al. | HMM based action recognition with projection histogram features | |
Ye et al. | Jersey number detection in sports video for athlete identification | |
García-Martín et al. | Robust real time moving people detection in surveillance scenarios | |
Bak et al. | Two-stream convolutional networks for dynamic saliency prediction | |
CN111723773A (en) | Remnant detection method, device, electronic equipment and readable storage medium | |
Russo et al. | Sports classification in sequential frames using CNN and RNN | |
Zhu et al. | A two-stage detector for hand detection in ego-centric videos | |
Afsar et al. | Automatic human action recognition from video using hidden markov model | |
Urabe et al. | Cooking activities recognition in egocentric videos using combining 2DCNN and 3DCNN | |
Zhang et al. | Realgait: Gait recognition for person re-identification | |
Kumar et al. | On-the-fly hand detection training with application in egocentric action recognition | |
CN109299702B (en) | Human behavior recognition method and system based on depth space-time diagram | |
Xu et al. | Semantic Part RCNN for Real-World Pedestrian Detection. | |
Yu et al. | Gender classification of full body images based on the convolutional neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||