CN106156775B - Video-based human body feature extraction method, human body identification method and device

Publication number: CN106156775B (grant); earlier published as CN106156775A
Application number: CN201510148613.4A
Authority: CN (China); original language: Chinese (zh)
Inventor: 黄锐
Original and current assignee: NEC Corp
Filing date: 2015-03-31
Legal status: Active
Prior art keywords: video, image block set, person, human body

Abstract

The invention relates to a video-based human body feature extraction method, a human body recognition method, and corresponding devices. The feature extraction method extracts, from a video, human body features that can represent the appearance characteristics of a person, and comprises the following steps: segmenting the video according to the person's stride cycle to obtain at least one video segment; partitioning each video segment according to body parts to obtain at least one image block set; and extracting a spatio-temporal feature vector from each image block set using a trained Gaussian mixture model. The method divides a walking video into blocks that are physically meaningful in both time and space and uses the resulting image block sets for effective spatio-temporal alignment, so that human body features can be extracted accurately.

Description

Video-based human body feature extraction method, human body identification method and device
Technical Field
The invention relates to the field of video content analysis and video monitoring, in particular to a video-based human body feature extraction method, a human body recognition method and a human body recognition device.
Background
In video content analysis and video surveillance, it is often necessary to identify the people appearing in a video. Commonly used recognition methods include face recognition, human body recognition, and the like. Human body recognition identifies a person by the appearance characteristics of the whole body and is often used to search for specific persons in large volumes of surveillance video; for example, a public security department can narrow the scope of a manual search for suspects or missing persons according to appearance characteristics such as clothing. Another application of human body recognition is access management or people counting by identifying persons wearing a particular uniform.
Currently common human body recognition methods operate on (single or multiple) static images. As in face recognition, features are extracted from an image containing the whole human body, and the identity of an unknown person is determined by comparing the person's features with those of known persons. Since a static image can hardly give a complete description of a person's appearance across different postures, a small body of research has explored video-based human body recognition. The method of reference [1] first segments a video sequence containing a walking person into short videos, each covering roughly the human body plus a small amount of background in space and one stride cycle in time; each short video is treated as a three-dimensional data block, three-dimensional features such as HOG3D (three-dimensional Histograms of Oriented Gradients) are extracted from it to represent the appearance of the human body, and the features of an unknown person are compared with those of known persons to determine the unknown person's identity.
However, the HOG3D features extracted in this way are histogram features over fixed blocks, and there is no way to guarantee that body parts are spatially aligned between consecutive frames of the video; human body features therefore cannot be extracted accurately, and the recognition result is prone to inaccuracy.
References:
[1] Wang et al., "Person Re-identification by Video Ranking," ECCV 2014.
Disclosure of the Invention
Technical problem
In view of the above, the technical problem to be solved by the present invention is how to extract, from a video, human body features that can represent the appearance characteristics of a person, so as to improve the accuracy of video-based human body recognition as much as possible.
Solution
In order to solve the above problem, an embodiment of the present invention provides a video-based human body feature extraction method for extracting human body features capable of representing appearance characteristics of a person from a video, including:
segmenting the video according to the stepping cycle of the person to obtain at least one video segment;
partitioning each video segment according to the body part of the individual to obtain at least one image block set, wherein one image block set comprises image blocks aiming at the same body part in all frame images of the video segments;
and respectively extracting a space-time feature vector from each image block set by using a trained Gaussian mixture model, wherein the space-time feature vector can represent the appearance characteristics of the body part corresponding to the image block set in a video segment associated with the image block set.
In one possible implementation, segmenting the video by a step cycle of the person to obtain at least one video segment includes:
calculating an optical flow intensity signal of each frame image of the video, and obtaining an actual optical flow intensity curve of the video according to the optical flow intensity signal;
carrying out Fourier transform on the optical flow intensity signals of the frame images to obtain regularized signals, and acquiring the main frequency of the regularized signals in a frequency domain;
carrying out inverse Fourier transform on the regularized signals according to the main frequency to obtain an ideal optical flow intensity curve of the video;
and segmenting the video according to the extreme values of the actual optical flow intensity curve and the ideal optical flow intensity curve to obtain at least one video segment.
In one possible implementation, the gaussian mixture model is obtained by training:
segmenting a video sample according to the stepping period to obtain at least one training video segment;
partitioning each training video segment according to the body part to obtain at least one training image block set, wherein one training image block set comprises image blocks aiming at the same body part in all frame images of the training video segment;
for each training image block set, classifying the pixels of each image block in the training image block set according to the bottom layer characteristics, and for each type of pixels, training to obtain a Gaussian model of the following formula 1, wherein the Gaussian mixture model comprises a Gaussian model corresponding to each type of pixel point in the training image block set;
Θ = {(μ_k, σ_k, π_k) : k = 1, …, K}   (Formula 1)
wherein Θ is the Gaussian mixture model formed by the Gaussian models corresponding to each class of pixel points, K is the number of classes, μ_k is the mean of the bottom-layer features of the k-th class of pixel points, σ_k is the variance of the bottom-layer features of the k-th class of pixel points, and π_k is the weight of the bottom-layer features of the k-th class of pixel points.
In a possible implementation manner, extracting a space-time feature vector from each image block set by using a trained gaussian mixture model respectively includes:
aiming at each image block set, extracting the bottom layer characteristics of each pixel point;
obtaining the feature vector of each pixel point according to the relation between the bottom layer features of each pixel point of the image block set and the trained Gaussian mixture model;
and averaging the calculated feature vectors of the pixel points to obtain the space-time feature vector of the image block set.
In order to solve the above problem, an embodiment of the present invention provides a video-based human body recognition method, including:
according to any one of the human body feature extraction methods based on the videos, space-time feature vectors of known people are extracted from videos related to the known people;
according to any one video-based human body feature extraction method in the embodiment of the invention, a space-time feature vector of a figure to be identified is extracted from a video related to the figure to be identified;
comparing the space-time feature vector of the known person with the space-time feature vector of the person to be identified to determine the identity of the person to be identified.
In order to solve the above problem, an embodiment of the present invention provides a video-based human body feature extraction apparatus for extracting human body features capable of representing appearance characteristics of a person from a video, including:
the time division module is used for segmenting the video according to the stepping cycle of the person to obtain at least one video segment;
the space division module is used for dividing each video segment into blocks according to the body part of the individual to obtain at least one image block set, and one image block set comprises image blocks aiming at the same body part in all frame images of the video segments;
and the feature extraction module is used for extracting a space-time feature vector from each image block set by utilizing a trained Gaussian mixture model, wherein the space-time feature vector can represent the appearance characteristics of the body part corresponding to the image block set in a video segment associated with the image block set.
In one possible implementation, the time division module includes:
the optical flow calculation submodule is used for calculating an optical flow intensity signal of each frame image of the video and obtaining an actual optical flow intensity curve of the video according to the optical flow intensity signal;
the Fourier transform submodule is used for carrying out Fourier transform on the optical flow intensity signals of the frame images to obtain regularized signals and acquiring the main frequency of the regularized signals in a frequency domain;
the inverse Fourier transform submodule is used for carrying out inverse Fourier transform on the regularized signals according to the main frequency to obtain an ideal optical flow intensity curve of the video;
and the segmenting submodule is used for segmenting the video according to the extreme values of the actual optical flow intensity curve and the ideal optical flow intensity curve to obtain at least one video segment.
In one possible implementation,
the time division module is also used for segmenting the video sample according to the stepping period to obtain at least one training video segment;
the space division module is used for partitioning each training video segment according to the body part to obtain at least one training image block set, and one training image block set comprises image blocks aiming at the same body part in all frame images of the training video segments;
the device further comprises:
the training module is used for classifying the pixel points of each image block in the training image block set according to the bottom layer characteristics, and for each type of pixel points, training to obtain a Gaussian model of the following formula 1, wherein the Gaussian mixture model comprises the Gaussian model corresponding to each type of pixel point in the training image block set;
Θ = {(μ_k, σ_k, π_k) : k = 1, …, K}   (Formula 1)
wherein Θ is the Gaussian mixture model formed by the Gaussian models corresponding to each class of pixel points, K is the number of classes, μ_k is the mean of the bottom-layer features of the k-th class of pixel points, σ_k is the variance of the bottom-layer features of the k-th class of pixel points, and π_k is the weight of the bottom-layer features of the k-th class of pixel points.
In one possible implementation, the feature extraction module includes:
the bottom layer feature extraction sub-module is used for extracting the bottom layer features of the pixels aiming at the image block sets;
the feature vector extraction submodule is used for obtaining the feature vector of each pixel point of the image block set according to the relation between the bottom layer features and the trained Gaussian mixture model;
and the feature vector averaging submodule is used for averaging the calculated feature vectors of the pixel points to obtain the space-time feature vector of the image block set.
In order to solve the above problem, an embodiment of the present invention provides a video-based human body recognition apparatus, including: the human body feature extraction device based on the video with any structure in the embodiment of the invention,
the video-based human body feature extraction device is used for extracting the space-time feature vector of a known person from a video related to the known person and extracting the space-time feature vector of a person to be identified from the video related to the person to be identified;
the video-based human body recognition apparatus further includes:
and the comparison module is used for comparing the space-time characteristic vector of the known person with the space-time characteristic vector of the person to be identified so as to determine the identity of the person to be identified.
Advantageous effects
According to the embodiments of the invention, the video can be segmented according to the stride cycle, and each video segment can be partitioned according to body parts to obtain image block sets, so that a video of a person walking is divided into blocks that are physically meaningful in both time and space; effective spatio-temporal alignment can then be performed on the image block sets thus obtained, so that human body features can be extracted accurately from them.
Other features and aspects of the present invention will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the invention and, together with the description, serve to explain the principles of the invention.
Fig. 1 is a flowchart of a video-based human body feature extraction method according to an embodiment of the present invention;
FIG. 2 is a flowchart of video segmentation in a video-based human body feature extraction method according to an embodiment of the present invention;
fig. 3 is a schematic diagram illustrating video segmentation in a video-based human body feature extraction method according to an embodiment of the present invention;
FIG. 4 is a flowchart of training a Gaussian mixture model in a video-based human body feature extraction method according to an embodiment of the invention;
FIG. 5 is a flowchart of feature extraction in a video-based human body feature extraction method according to an embodiment of the invention;
FIG. 6 is a flow chart of a video-based human recognition method according to an embodiment of the present invention;
fig. 7 is a block diagram of a video-based human body feature extraction apparatus according to an embodiment of the present invention;
fig. 8 is a block diagram of a video-based human body recognition apparatus according to an embodiment of the present invention.
Detailed Description
Various exemplary embodiments, features and aspects of the present invention will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present invention. It will be understood by those skilled in the art that the present invention may be practiced without some of these specific details. In some instances, methods, procedures, components, and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present invention.
Fig. 1 is a flowchart of a video-based human body feature extraction method according to an embodiment of the present invention. The method is used for extracting human body features capable of representing appearance characteristics of a person from a video, and as shown in fig. 1, the method mainly comprises the following steps:
step 101, segmenting the video according to the stepping cycle of the person to obtain at least one video segment.
Specifically, segmenting the video by the person's stride cycle may involve two steps. First, a video of a walking person may be segmented into video segments according to the person's complete stride cycles; the resulting video segments may have different lengths, but each typically covers one complete stride cycle, e.g., beginning with a left-foot lift and ending with a right-foot landing. Then, each video segment covering a complete stride cycle is further subdivided, for example according to the motion state of the body parts, into four or more sub-segments: left leg lifting, left leg dropping, right leg lifting, and right leg dropping. As another example, after a video segment covering a complete stride cycle has been extracted from the video, the body parts may be located first, and the sub-videos of the head and torso need not be subdivided further, since the head and torso change little over the whole stride cycle (see Fig. 3). The above segmentation is only an example; the length of the video segments may be chosen flexibly according to the duration of the selected video, the body motion state, and so on.
In one possible implementation, as shown in fig. 2, a specific method of video segmentation may include:
step 201, calculating optical flow (optical flow) intensity signals of each frame image of the video, and obtaining an actual optical flow intensity curve of the video according to the optical flow intensity signals;
step 202, performing Fourier transform (FFT) on the optical flow intensity signals of the frame images to obtain regularized signals;
step 203, acquiring the main frequency of the regularized signal in the frequency domain;
step 204, performing inverse Fourier transform (IFFT) on the regularized signal according to the main frequency to obtain an ideal optical flow intensity curve of the video;
step 205, segmenting the video according to the extreme values (local optima) of the actual optical flow intensity curve and the ideal optical flow intensity curve to obtain at least one video segment.
For example, as shown in figs. 2 and 3, optical flow may first be computed for each frame of the video V = (I_1, …, I_t), giving the optical flow intensity of each frame; concatenating the intensities of all frames yields a one-dimensional signal M = (m_1, …, m_t) (the motion profile). The dashed wavy line in fig. 3 is the actual optical flow intensity curve obtained from the signal M. The motion curve is then regularized against an ideal curve of constant frequency, i.e., the irregular actual optical flow intensity curve is converted into a regular one. One implementation is to apply a Fourier transform (FFT) to the signal M, obtain the main frequency m' of the signal in the frequency-domain space, and apply an inverse Fourier transform (IFFT) at the main frequency m' to obtain the solid wavy line, i.e., the ideal optical flow intensity curve M*. The extreme points of the dashed wavy line are then found from the extreme points of the solid wavy line, and the whole long video V is cut at these points into short video segments S_i, each covering either a minimum-to-maximum or a maximum-to-minimum transition. In general, the transition from a maximum to a minimum corresponds to one leg being lifted, and the transition from the minimum to the next maximum corresponds to that leg landing, followed likewise by the other leg, so a complete stride cycle generally spans four extreme-value transitions. Each video segment S_i may be further subdivided according to the motion of the body parts.
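As a concrete illustration, the following Python sketch mirrors steps 201 to 205 under stated assumptions: OpenCV's Farneback estimator stands in for whichever optical flow method an implementation uses, frames are assumed grayscale, and the cuts are placed directly at the ideal curve's extrema (the refinement that searches the actual curve's nearby extrema is omitted).

```python
import cv2
import numpy as np

def flow_intensity(prev_gray, cur_gray):
    # Mean optical-flow magnitude between two consecutive grayscale frames
    # (Farneback flow is one possible estimator, assumed here).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, cur_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return np.linalg.norm(flow, axis=2).mean()

def segment_by_stride_cycle(frames):
    # Actual optical flow intensity signal M = (m_1, ..., m_t).
    m = np.array([flow_intensity(frames[i - 1], frames[i])
                  for i in range(1, len(frames))])
    # FFT regularization: keep only the dominant non-DC frequency.
    spectrum = np.fft.rfft(m - m.mean())
    main = 1 + np.argmax(np.abs(spectrum[1:]))
    ideal_spec = np.zeros_like(spectrum)
    ideal_spec[main] = spectrum[main]
    # IFFT at the main frequency gives the ideal intensity curve M*.
    m_ideal = np.fft.irfft(ideal_spec, n=len(m)) + m.mean()
    # Cut at the extreme points, so each segment spans one
    # minimum-to-maximum or maximum-to-minimum transition.
    ext = [i for i in range(1, len(m_ideal) - 1)
           if (m_ideal[i] - m_ideal[i - 1]) * (m_ideal[i + 1] - m_ideal[i]) < 0]
    cuts = [0] + ext + [len(frames) - 1]
    return [frames[a:b + 1] for a, b in zip(cuts[:-1], cuts[1:])]
```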
And 102, partitioning each video segment according to the body part of the individual to obtain at least one image block set, wherein one image block set comprises image blocks aiming at the same body part in all frame images of the video segments.
Specifically, each frame image of a video segment S_i can be separated into foreground and background, with the foreground divided into six body parts such as head, torso, left arm, right arm, left leg and right leg. Any body-part segmentation algorithm may be used to divide a frame image into image blocks, for example a Deformable Part Model, or a template learned from many aligned examples may be applied directly to the frame image; either way, each frame image is divided into the positions and extents of the six parts (head, torso, left arm, right arm, left leg, right leg). Thus, within a video segment, each image block set collects, for one body part, the image blocks of that part from all frame images of the video segment.
Combining the temporal segmentation and the spatial partitioning yields small blocks, each tied to a specific motion of a specific part, so each image block set can be regarded as a spatio-temporal alignment block S_ijk. Here i indexes the stride cycle within the whole video; j indexes the phase within the stride cycle — e.g., if a person's complete stride cycle (one swing of each leg) is divided into 4 phases, then j = 1, …, 4; and k indexes the body part — e.g., k = 1, …, 6 when the whole body is divided into the 6 parts head, torso, left arm, right arm, left leg and right leg. A walking video can thus be divided into 4 × 6 = 24 spatio-temporally aligned blocks (video blocks) per stride cycle. Each spatio-temporal alignment block has a physical meaning in both time and space: for example, S_113 is the block of the left arm during the first quarter of stride cycle 1, and S_246 is the block of the right leg during the last quarter of stride cycle 2. Compared with HOG3D, which uses a regular grid to divide a video segment into smaller blocks of fixed height, width and length, the spatio-temporal alignment blocks (image block sets) of this embodiment are flexible and accommodate different spatial and temporal configurations. For example, referring to fig. 3, a video segment S_i covering a complete stride cycle can, after temporal segmentation and spatial partitioning, be divided into a head block, a torso block, two right-arm blocks RA1H and RA2H, two left-arm blocks LA1H and LA2H, two right-leg blocks RL1H and RL2H, and two left-leg blocks LL1H and LL2H.
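The grouping of part crops into blocks S_ijk can be sketched as follows; locate_parts is a hypothetical detector standing in for the deformable-part-model or template approach above, and the equal four-way split of each cycle is just one of the temporal subdivisions the text allows.

```python
PARTS = ["head", "torso", "left_arm", "right_arm", "left_leg", "right_leg"]

def build_alignment_blocks(stride_cycles, locate_parts):
    # stride_cycles: list of frame lists, one per complete stride cycle (i).
    # locate_parts: assumed callable, frame -> {part_name: image_block}.
    blocks = {}                                  # (i, j, k) -> image block set
    for i, cycle in enumerate(stride_cycles, start=1):
        n = len(cycle)
        for j in range(1, 5):                    # 4 phases per stride cycle
            for frame in cycle[n * (j - 1) // 4: n * j // 4]:
                crops = locate_parts(frame)
                for k, part in enumerate(PARTS, start=1):
                    blocks.setdefault((i, j, k), []).append(crops[part])
    # e.g. blocks[1, 1, 3] is S_113: left arm, first quarter of cycle 1.
    return blocks
```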
It should be noted that although the above example illustrates one way of dividing the video in time and space, other temporal and spatial divisions may be used; any method is acceptable as long as the person's whole walking process is divided into small blocks with physical meaning in time and space. In this way, when comparing features between persons, the comparison can be made between temporally and spatially corresponding spatio-temporal alignment blocks.
And 103, extracting a space-time feature vector from each image block set by using a trained Gaussian mixture model, wherein the space-time feature vector can represent appearance characteristics of a body part corresponding to the image block set in a video segment associated with the image block set.
In one possible implementation, the Gaussian mixture models (GMMs) of this embodiment are, for each image block set, a combination of several Gaussian models; as shown in fig. 4, a Gaussian mixture model can be obtained through the following training steps:
step 401, segmenting a video sample according to the stepping cycle to obtain at least one training video segment;
step 402, partitioning each training video segment according to the body part to obtain at least one training image block set, wherein one training image block set comprises image blocks aiming at the same body part in all frame images of the training video segment;
step 403, for each training image block set, classifying the pixel points of the image blocks in the set according to their bottom-layer features, and, for each class of pixel points, training a Gaussian model as in Formula 1 below; the Gaussian mixture model comprises the Gaussian models corresponding to all classes of pixel points in the training image block set;
Θ = {(μ_k, σ_k, π_k) : k = 1, …, K}   (Formula 1)
wherein Θ is the Gaussian mixture model formed by the Gaussian models corresponding to each class of pixel points, K is the number of classes, μ_k is the mean of the bottom-layer features of the k-th class of pixel points, σ_k is the variance of the bottom-layer features of the k-th class of pixel points, and π_k is the weight of the bottom-layer features of the k-th class of pixel points.
It should be noted that steps 401 to 403 and steps 101 to 103 may run serially or in parallel. In the serial case, body-part localization can build on the stride-cycle estimate: prior knowledge of the human pose at different phases of the stride cycle helps locate the body parts, and the video segment covering a complete stride cycle is then further partitioned by body part. In the parallel case, an existing body-part localization technique (not depending on the stride-cycle estimate) is used, the two modules run in parallel, and, given multi-core or multi-threaded computing resources, the computation time is reduced.
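Assuming the classify-then-fit procedure of step 403 is realized by standard expectation-maximization, the per-block training can be sketched with scikit-learn's GaussianMixture; K = 16 and the diagonal covariance follow the worked example later in this description and are not mandated by the embodiment.

```python
from sklearn.mixture import GaussianMixture

def train_block_gmm(pixel_features, K=16):
    # pixel_features: (N, 7) array pooling the bottom-layer features of all
    # pixel points of one training image block set across all video samples.
    # EM jointly classifies the pixels and fits one Gaussian per class,
    # which matches the classify-then-fit description of step 403.
    gmm = GaussianMixture(n_components=K, covariance_type="diag",
                          random_state=0).fit(pixel_features)
    # Theta = {(mu_k, sigma_k, pi_k) : k = 1, ..., K}   (Formula 1)
    return gmm.means_, gmm.covariances_, gmm.weights_
```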
In a possible implementation manner, as shown in fig. 5, step 103 may specifically include:
step 501, for each image block set, extracting the bottom-layer features of each pixel point, for example based on the following Formula 2:
f(x, y) = [x', y', I(x, y), ∂I(x, y)/∂x, ∂I(x, y)/∂y]   (Formula 2)
wherein (x', y') represents the relative position of the pixel point in the image block set, I(x, y) represents the color of the pixel point, ∂I(x, y)/∂x and ∂I(x, y)/∂y represent the gradients of the pixel point, and f(x, y) represents the bottom-layer features of the pixel point.
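A sketch of this feature, 7-dimensional as in the worked example below (2-D relative position, 3-channel color, 2-D gradient), follows; treating the color as RGB and using finite-difference gradients of the gray image are assumptions, since Formula 2 leaves the color space and gradient operator open.

```python
import numpy as np

def pixel_features(block, x0=0, y0=0):
    # block: (H, W, 3) image block; (x0, y0): offset of the block's upper-left
    # corner in the image block set's relative coordinate system.
    h, w = block.shape[:2]
    gy, gx = np.gradient(block.mean(axis=2))     # dI/dy, dI/dx of the gray image
    ys, xs = np.mgrid[0:h, 0:w]
    f = np.zeros((h, w, 7))
    f[..., 0], f[..., 1] = xs + x0, ys + y0      # relative position (x', y')
    f[..., 2:5] = block                          # color I(x, y)
    f[..., 5], f[..., 6] = gx, gy                # gradients
    return f.reshape(-1, 7)                      # one 7-dim vector per pixel point
```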
Step 502, obtaining a feature vector of each pixel point of the following formula 3 according to the relation between the bottom layer features and the trained gaussian mixture model, based on the bottom layer features of each pixel point of the image block set.
Φ=[u1,v1,w1…,uk,vk,wk]In the formula 3, the first step is,
wherein, according to the bottom layer characteristics and the k-th Gaussian model of each pixel point in the image block set, calculating
Figure BDA0000691749760000121
And the number of the first and second electrodes,
Figure BDA0000691749760000122
fiis the underlying characteristic of pixel point i, wikIs a pixel point fiThe posterior probability belonging to the kth gaussian model is also the weight vector of the Fisher vector (vector), and p is the probability distribution function.
The above is only one example of the Fisher vector; other calculation methods may be used to extract Fisher vectors, or a Fisher-vector-like algorithm may be used to extract features from the pixel bottom-layer features and the Gaussian mixture model.
Step 503, averaging the feature vectors of the pixel points calculated by the formula 3 to obtain the space-time feature vector of the image block set.
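Putting steps 501 to 503 together, a hedged sketch of the per-block computation follows; it assumes the standard Fisher-vector form with the diagonal-covariance GMM from the training sketch, and folds the per-pixel-then-average structure of steps 502 and 503 into one mean over pixel points.

```python
import numpy as np

def block_fisher_vector(feats, means, variances, weights):
    # feats: (N, 7) bottom-layer features of one image block set;
    # means, variances: (K, 7); weights: (K,) from the trained GMM.
    N, K = feats.shape[0], len(weights)
    # Posteriors w_ik = p(k | f_i) under the Gaussian mixture.
    log_p = -0.5 * (((feats[:, None, :] - means) ** 2) / variances
                    + np.log(2 * np.pi * variances)).sum(axis=2)
    p = weights * np.exp(log_p - log_p.max(axis=1, keepdims=True))
    w = p / p.sum(axis=1, keepdims=True)                     # (N, K)
    parts = []
    for k in range(K):
        d = (feats - means[k]) / np.sqrt(variances[k])       # standardized f_i
        u_k = (w[:, k, None] * d).mean(axis=0) / np.sqrt(weights[k])
        v_k = (w[:, k, None] * (d ** 2 - 1)).mean(axis=0) / np.sqrt(2 * weights[k])
        parts += [u_k, v_k, [w[:, k].mean()]]                # averaged w_k
    # Phi = [u_1, v_1, w_1, ..., u_K, v_K, w_K]: (2*7 + 1)*K dimensions.
    return np.concatenate([np.ravel(x) for x in parts])
```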
Then, the spatio-temporal feature vectors of all image block sets of all video segments of the video may be concatenated to obtain the spatio-temporal feature vector (STFV) of the target person represented by the video.
In particular, the complete description of the walking person is one long feature vector, the concatenation of the spatio-temporal feature vectors of the above blocks (image block sets); in the running example, the spatio-temporal feature vector of the video is the concatenation of the spatio-temporal feature vectors of the 24 image block sets, all extracted over the same fixed arrangement of image block sets so that vectors from different videos correspond block by block.
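The concatenation can be sketched as below; the fixed (i, j, k) ordering is an assumption — any ordering works provided every video uses the same one.

```python
import numpy as np

def stfv(block_vectors, cycles=1, phases=4, parts=6):
    # block_vectors: {(i, j, k): per-block Fisher vector}. For one stride
    # cycle of 4 x 6 = 24 blocks of 240 dimensions each, the result is the
    # 5760-dimensional spatio-temporal feature vector of the person.
    return np.concatenate([block_vectors[i, j, k]
                           for i in range(1, cycles + 1)
                           for j in range(1, phases + 1)
                           for k in range(1, parts + 1)])
```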
The video-based human body feature extraction method described above segments a video by stride cycle and partitions each video segment by body part to obtain image block sets, dividing a walking video into blocks that are physically meaningful in time and space. Consequently, even for persons whose postures differ across videos — out-of-phase steps, unaligned body parts, and so on — effective alignment can be performed on the image block sets obtained by the division, and the features of unknown and known persons can be extracted accurately from them.
Fig. 6 is a flowchart of a video-based human body recognition method according to an embodiment of the present invention. As shown in fig. 6, the method uses the video-based human body feature extraction method of the above embodiments for feature extraction, and specifically includes:
step 601, extracting, according to the above video-based human body feature extraction method, the spatio-temporal feature vectors of known persons from videos related to the known persons;
step 602, extracting a space-time feature vector of a person to be identified from a video related to the person to be identified according to the video-based human feature extraction method;
step 603, comparing the space-time characteristic vector of the known person with the space-time characteristic vector of the person to be identified to determine the identity of the person to be identified.
In the comparison, any vector distance function can be used to compute the distance between the spatio-temporal feature vector of the known person and that of the person to be identified. Depending on the task, it is then determined whether the person to be identified is one specific person (1:1 verification) or which of a plurality of known persons the person to be identified corresponds to (1:N identification).
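A minimal sketch of this comparison, assuming Euclidean distance as the vector distance function; the verification threshold is purely illustrative and would be tuned on validation data.

```python
import numpy as np

def identify(query_vec, gallery):
    # 1:N identification: gallery maps known-person names to their STFVs.
    return min(gallery, key=lambda name: np.linalg.norm(query_vec - gallery[name]))

def verify(query_vec, known_vec, threshold=0.5):
    # 1:1 verification against a single known person's STFV.
    return np.linalg.norm(query_vec - known_vec) < threshold
```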
The following describes, taking one image block set as an example, how a spatio-temporal feature vector is extracted from an image block set and how human body recognition is performed on the extracted features; Fisher vectors are used in this example.
First, the low-level features of all pixel points in the image block set are computed. The bottom-layer features of a pixel point include its relative position, color, gradients and so on, and the specific selection of low-level features may vary. The relative position may be computed by taking the upper-left corner of the first frame in the image block set as the origin of the relative coordinate system, or another chosen position may serve as the origin.
If the bottom-layer features obtained via Formula 2 are 7-dimensional, each pixel point corresponds to a 7-dimensional vector; if the image block set comprises 1000 points, there are 1000 bottom-layer feature vectors.
It should be noted that in the training phase, the same image block set (e.g., the left arm swinging forward) may cover multiple video samples (from different stride cycles of the same person, or from different persons). If 10 video samples are used as training data, 1000 × 10 = 10000 bottom-layer feature vectors are obtained, and these 10000 vectors can be used to train the Gaussian mixture model of that image block set. For example, the image block set may have 16 distinct 7-dimensional Gaussian models, in which case Formula 1 reads Θ = {(μ_k, σ_k, π_k) : k = 1, …, 16}.
Since each video is divided into 24 image block sets, 24 Gaussian mixture models are obtained, each containing 16 Gaussian models. Referring to Formula 1, each Gaussian model is described by 3 parameters: the mean μ_k is 7-dimensional like the low-level features, the weight π_k is 1-dimensional, and the variance σ_k may be 1-dimensional, 7-dimensional, 49-dimensional, etc.
In the testing stage, to compare the videos of two persons, steps 101 to 103 are applied to each video, yielding the 24 image block sets (video blocks) of each person; for each image block set, the corresponding Gaussian mixture model is used and its Fisher vector is computed per Formula 3: Φ = [u_1, v_1, w_1, …, u_K, v_K, w_K]. Here u_k and v_k are 7-dimensional like the low-level features and w_k is 1-dimensional, so each computed Fisher vector may be a (2 × 7 + 1) × 16 = 240-dimensional vector (w_k may also be omitted). Since an image block set may contain many pixel points (e.g., the aforementioned 1000), each pixel point yields a 240-dimensional vector, and the Fisher vectors of all pixel points in the same image block set are averaged into a single vector. Finally, the Fisher vectors of the 24 image block sets within one stride cycle of one person are concatenated into a 24 × 240 = 5760-dimensional vector, which constitutes the feature representation of the person.
The comparison in human body recognition then amounts to computing the distance between two such 5760-dimensional vectors, and the identity of a person can be determined by comparing the extracted feature representations.
By adopting the video-based human body feature extraction method, the human body can be aligned in time and space according to the image block set so as to accurately extract the features of unknown persons and known persons, thereby improving the accuracy of human body recognition.
Fig. 7 is a block diagram of a video-based human body feature extraction apparatus according to an embodiment of the present invention, as shown in fig. 7, the apparatus is used for extracting human body features capable of representing appearance characteristics of a person from a video, and the apparatus may specifically include: a time division module 71, configured to segment the video according to the stepping cycle of the individual to obtain at least one video segment; a space partitioning module 73, configured to partition each video segment according to the body part of the individual to obtain at least one image block set, where one image block set includes image blocks of the same body part in all frame images of the video segment; a feature extraction module 75, configured to extract a spatio-temporal feature vector from each of the image block sets respectively by using a trained gaussian mixture model, where the spatio-temporal feature vector is capable of representing an appearance characteristic of a body part to which the image block set is directed in a video segment associated with the image block set.
In one possible implementation, the time division module 71 includes:
the optical flow calculation submodule 711 is configured to calculate an optical flow intensity signal of each frame image of the video, and obtain an actual optical flow intensity curve of the video according to the optical flow intensity signal;
the fourier transform submodule 713 is configured to perform fourier transform on the optical flow intensity signal of each frame image to obtain a regularized signal, and acquire a main frequency of the regularized signal in a frequency domain;
the inverse Fourier transform submodule 715 is configured to perform inverse Fourier transform on the regularized signal according to the main frequency to obtain an ideal optical flow intensity curve of the video;
a segmentation submodule 717, configured to segment the video according to the extreme values of the actual optical flow intensity curve and the ideal optical flow intensity curve to obtain at least one video segment.
The specific method for time-segmenting the video by each sub-module of the time-segmentation module 71 can refer to fig. 2 and the related description of the above human body feature extraction method based on the video.
In a possible implementation manner, the time division module 71 is further configured to segment the video sample according to the step cycle to obtain at least one training video segment; the space division module 73 is further configured to block each of the training video segments according to the body part to obtain at least one training image block set, where one training image block set includes image blocks of all frame images of the training video segments for the same body part;
the device further comprises: a training module 77, configured to classify, for each training image block set, pixels of each image block in the training image block set according to bottom layer features, and train to obtain, for each type of pixels, a gaussian model as shown in equation 1 below, where the gaussian mixture model includes a gaussian model corresponding to each type of pixel point in the training image block set;
Θ = {(μ_k, σ_k, π_k) : k = 1, …, K}   (Formula 1)
wherein Θ is the Gaussian mixture model formed by the Gaussian models corresponding to each class of pixel points, K is the number of classes, μ_k is the mean of the bottom-layer features of the k-th class of pixel points, σ_k is the variance of the bottom-layer features of the k-th class of pixel points, and π_k is the weight of the bottom-layer features of the k-th class of pixel points.
In one possible implementation, the feature extraction module 75 includes:
the bottom-layer feature extraction sub-module 751 is used for extracting the bottom-layer features of the pixels aiming at the image block sets;
a feature vector extraction submodule 753, configured to obtain the feature vector of each pixel point of the image block set according to the relation between the bottom-layer features of the pixel points and the trained Gaussian mixture model;
the feature vector averaging submodule 755 is configured to average the calculated feature vectors of the pixel points to obtain a space-time feature vector of the image block set.
For the process by which the space partitioning module 73 spatially partitions the video, reference may be made to fig. 3 and the related description in the above embodiment of the video-based human body feature extraction method. An example of the feature extraction module 75 extracting spatio-temporal feature vectors (e.g., Fisher vectors) using the Gaussian mixture model is given by Formulas 1 to 3 and their associated descriptions.
In this embodiment, the video-based human body feature extraction apparatus can segment a video according to the stride cycle and partition each video segment according to body parts to obtain image block sets, thereby dividing a walking video into blocks that are physically meaningful in time and space; effective spatio-temporal alignment is then performed on the image block sets obtained by the division, so that human body features can be extracted accurately from them.
Fig. 8 is a block diagram of a video-based human body recognition apparatus according to an embodiment of the present invention, and as shown in fig. 8, the video-based human body recognition apparatus may mainly include: the video-based human body feature extraction device 81 with any one of the structures in the above embodiments, specifically, the video-based human body feature extraction device 81 is configured to extract a spatio-temporal feature vector of a known person from a video related to the known person, and extract a spatio-temporal feature vector of a person to be identified from a video related to the person to be identified;
in addition, the video-based human body recognition apparatus may further include: and the comparison module 83 is used for comparing the space-time feature vector of the known person with the space-time feature vector of the person to be identified so as to determine the identity of the person to be identified.
Specifically, the specific process of the video-based human body recognition device for human body recognition can be referred to the related description in the above embodiment of the video-based human body recognition method.
According to the embodiment, the human body feature extraction device based on the video can be used for realizing the alignment of the human body in time and space according to the image block set, accurately extracting the human body features, and comparing the known person with the unknown person according to the extracted human body features to obtain more accurate identification results.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (10)

1. A video-based human body feature extraction method for extracting human body features capable of representing appearance characteristics of a person from a video, comprising:
segmenting the video according to the stepping cycle of the person to obtain at least one video segment;
partitioning each video segment according to the body part of the individual to obtain at least one image block set, wherein one image block set comprises image blocks aiming at the same body part in all frame images of the video segments;
and respectively extracting a space-time feature vector from each image block set by using a trained Gaussian mixture model, wherein the space-time feature vector can represent the appearance characteristics of the body part corresponding to the image block set in a video segment associated with the image block set.
2. The method according to claim 1, wherein segmenting the video according to the stepping cycle of the person to obtain at least one video segment comprises:
calculating an optical flow intensity signal of each frame image of the video, and obtaining an actual optical flow intensity curve of the video according to the optical flow intensity signal;
carrying out Fourier transform on the optical flow intensity signals of the frame images to obtain regularized signals, and acquiring the main frequency of the regularized signals in a frequency domain;
carrying out inverse Fourier transform on the regularized signals according to the main frequency to obtain an ideal optical flow intensity curve of the video;
and segmenting the video according to the extreme values of the actual optical flow intensity curve and the ideal optical flow intensity curve to obtain at least one video segment.
3. The method according to claim 1 or 2, wherein the gaussian mixture model is trained by:
segmenting a video sample according to the stepping period to obtain at least one training video segment;
partitioning each training video segment according to the body part to obtain at least one training image block set, wherein one training image block set comprises image blocks aiming at the same body part in all frame images of the training video segment;
for each training image block set, classifying the pixels of each image block in the training image block set according to the bottom layer characteristics, and for each type of pixels, training to obtain a Gaussian model of the following formula 1, wherein the Gaussian mixture model comprises a Gaussian model corresponding to each type of pixel point in the training image block set;
Θ = {(μ_k, σ_k, π_k) : k = 1, …, K}   (Formula 1)
wherein Θ is the Gaussian mixture model formed by the Gaussian models corresponding to each class of pixel points, K is the number of classes, μ_k is the mean of the bottom-layer features of the k-th class of pixel points, σ_k is the variance of the bottom-layer features of the k-th class of pixel points, and π_k is the weight of the bottom-layer features of the k-th class of pixel points.
4. The method of claim 3, wherein extracting a spatio-temporal feature vector from each of the image block sets respectively by using a trained Gaussian mixture model comprises:
aiming at each image block set, extracting the bottom layer characteristics of each pixel point;
obtaining the feature vector of each pixel point according to the relation between the bottom layer features of each pixel point of the image block set and the trained Gaussian mixture model;
and averaging the calculated feature vectors of the pixel points to obtain the space-time feature vector of the image block set.
5. A video-based human body recognition method is characterized by comprising the following steps:
the video-based human feature extraction method of any one of claims 1 to 4, extracting spatio-temporal feature vectors of known persons from videos related to the known persons;
the video-based human feature extraction method according to any one of claims 1 to 4, extracting spatiotemporal feature vectors of a person to be identified from a video related to the person to be identified;
comparing the space-time feature vector of the known person with the space-time feature vector of the person to be identified to determine the identity of the person to be identified.
6. A video-based human body feature extraction device for extracting human body features capable of representing appearance characteristics of a person from a video, comprising:
the time division module is used for segmenting the video according to the stepping cycle of the person to obtain at least one video segment;
the space division module is used for dividing each video segment into blocks according to the body part of the individual to obtain at least one image block set, and one image block set comprises image blocks aiming at the same body part in all frame images of the video segments;
and the feature extraction module is used for extracting a space-time feature vector from each image block set by utilizing a trained Gaussian mixture model, wherein the space-time feature vector can represent the appearance characteristics of the body part corresponding to the image block set in a video segment associated with the image block set.
7. The apparatus of claim 6, wherein the time division module comprises:
the optical flow calculation submodule is used for calculating an optical flow intensity signal of each frame image of the video and obtaining an actual optical flow intensity curve of the video according to the optical flow intensity signal;
the Fourier transform submodule is used for carrying out Fourier transform on the optical flow intensity signals of the frame images to obtain regularized signals and acquiring the main frequency of the regularized signals in a frequency domain;
the inverse Fourier transform submodule is used for carrying out inverse Fourier transform on the regularized signals according to the main frequency to obtain an ideal optical flow intensity curve of the video;
and the segmenting submodule is used for segmenting the video according to the extreme values of the actual optical flow intensity curve and the ideal optical flow intensity curve to obtain at least one video segment.
8. The apparatus according to claim 6 or 7,
the time division module is also used for segmenting the video sample according to the stepping period to obtain at least one training video segment;
the space division module is further configured to divide each training video segment into blocks according to the body part to obtain at least one training image block set, where one training image block set includes image blocks for the same body part in all frame images of the training video segment;
the device further comprises:
the training module is used for classifying the pixel points of each image block in the training image block set according to the bottom layer characteristics, and for each type of pixel points, training to obtain a Gaussian model of the following formula 1, wherein the Gaussian mixture model comprises the Gaussian model corresponding to each type of pixel point in the training image block set;
Θ = {(μ_k, σ_k, π_k) : k = 1, …, K}   (Formula 1)
wherein Θ is the Gaussian mixture model formed by the Gaussian models corresponding to each class of pixel points, K is the number of classes, μ_k is the mean of the bottom-layer features of the k-th class of pixel points, σ_k is the variance of the bottom-layer features of the k-th class of pixel points, and π_k is the weight of the bottom-layer features of the k-th class of pixel points.
9. The apparatus of claim 8, wherein the feature extraction module comprises:
the bottom layer feature extraction sub-module is used for extracting the bottom layer features of the pixels aiming at the image block sets;
the feature vector extraction submodule is used for obtaining the feature vector of each pixel point of the image block set according to the relation between the bottom layer features and the trained Gaussian mixture model;
and the feature vector averaging submodule is used for averaging the calculated feature vectors of the pixel points to obtain the space-time feature vector of the image block set.
10. A video-based human body recognition apparatus, comprising: the video-based human feature extraction apparatus of any one of claims 6 to 9,
the video-based human body feature extraction device is used for extracting the space-time feature vector of a known person from a video related to the known person and extracting the space-time feature vector of a person to be identified from the video related to the person to be identified;
the video-based human body recognition apparatus further includes:
and the comparison module is used for comparing the space-time characteristic vector of the known person with the space-time characteristic vector of the person to be identified so as to determine the identity of the person to be identified.
CN201510148613.4A 2015-03-31 2015-03-31 Video-based human body feature extraction method, human body identification method and device Active CN106156775B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510148613.4A CN106156775B (en) 2015-03-31 2015-03-31 Video-based human body feature extraction method, human body identification method and device

Publications (2)

Publication Number Publication Date
CN106156775A CN106156775A (en) 2016-11-23
CN106156775B true CN106156775B (en) 2020-04-03

Family ID: 57337276

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510148613.4A Active CN106156775B (en) 2015-03-31 2015-03-31 Video-based human body feature extraction method, human body identification method and device

Country Status (1)

Country Link
CN (1) CN106156775B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657534A (en) * 2018-10-30 2019-04-19 百度在线网络技术(北京)有限公司 The method, apparatus and electronic equipment analyzed human body in image
CN110059715A (en) * 2019-03-12 2019-07-26 平安科技(深圳)有限公司 Floristic recognition methods and device, storage medium, computer equipment
CN110929628A (en) * 2019-11-18 2020-03-27 北京三快在线科技有限公司 Human body identification method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101866421A (en) * 2010-01-08 2010-10-20 苏州市职业大学 Method for extracting characteristic of natural image based on dispersion-constrained non-negative sparse coding
CN102024152A (en) * 2010-12-14 2011-04-20 浙江大学 Method for recognizing traffic sings based on sparse expression and dictionary study
CN102521565A (en) * 2011-11-23 2012-06-27 浙江晨鹰科技有限公司 Garment identification method and system for low-resolution video
CN102799873A (en) * 2012-07-23 2012-11-28 青岛科技大学 Human body abnormal behavior recognition method
CN103077383A (en) * 2013-01-09 2013-05-01 西安电子科技大学 Method for identifying human body movement of parts based on spatial and temporal gradient characteristics
CN103902989A (en) * 2014-04-21 2014-07-02 西安电子科技大学 Human body motion video recognition method based on non-negative matrix factorization

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Qi Xingbin et al., "Unsupervised affective scene detection based on GMM clustering in video retrieval," Video Engineering (电视技术), vol. 39, no. 5, 2015-03-02, p. 133 *

Also Published As

Publication number Publication date
CN106156775A (en) 2016-11-23

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant