CN108537209B - Adaptive downsampling method and device based on visual attention theory - Google Patents

Adaptive downsampling method and device based on visual attention theory

Info

Publication number
CN108537209B
CN108537209B (application CN201810379089.5A)
Authority
CN
China
Prior art keywords
video
expression
frequency domain
sampling
video clip
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810379089.5A
Other languages
Chinese (zh)
Other versions
CN108537209A (en)
Inventor
姬秋敏
张灵
陈云华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN201810379089.5A
Publication of CN108537209A
Application granted
Publication of CN108537209B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174 Facial expression recognition
    • G06V 40/176 Dynamic expression
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/269 Analysis of motion using gradient-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention provides an adaptive downsampling method and device based on visual attention theory. The method comprises the following steps: measuring the expression variation of a video clip by a global optical flow method; converting the expression variation of the video clip into frequency-domain space by a discrete cosine transform method to obtain an expression information amount index in the frequency domain; and determining a downsampling factor of the video clip according to the expression information amount index in the frequency domain. The invention converts the time-domain expression information to the frequency domain and obtains an adaptive downsampling factor through frequency analysis, which better simulates human attention and gives the sampled expression frames a stronger ability to represent expressions, thereby solving the technical problem that the prior art is not suitable for continuous spontaneous expression recognition without peak expression frame labeling.

Description

Adaptive downsampling method and device based on visual attention theory
Technical Field
The invention relates to the technical field of face recognition, and in particular to an adaptive downsampling method and device based on visual attention theory.
Background
On one hand, facial expression recognition has broad application prospects in fields such as medical treatment, public safety and human-computer interaction, so its research has practical significance; on the other hand, the expression recognition problem involves many basic problems in image processing, computer vision and optimization theory, such as image registration, image segmentation, image feature extraction, machine learning and optimization algorithms, so its research also has important theoretical significance. Dynamic continuous spontaneous expression recognition has become a research focus in recent years because, compared with static posed expression recognition, it poses greater research challenges and has higher application value. As one of the latest spontaneous expression datasets, the Audio/Visual Emotion Challenge (AVEC) dataset contains about 1.5 million frames of video; the facial expressions in the video are natural and subtle, and the dataset does not label peak expression frames. Several international competitions have been held on this dataset, exposing many deficiencies of existing feature-representation models in dealing with continuous spontaneous expressions. Because continuous spontaneous expression datasets contain huge amounts of video with diverse forms of variation and no peak expression frame labels, the core task of feature representation is the automatic extraction of expression frames. For automatic extraction of expression frames, heuristic downsampling methods achieve good performance, but they require peak expression frame labeling and are therefore not suitable for continuous spontaneous expression recognition without peak expression frame labeling.
Disclosure of Invention
The invention provides an adaptive downsampling method and device based on visual attention theory. The method converts the time-domain expression information to the frequency domain and obtains an adaptive downsampling factor through frequency analysis, which better simulates human attention and gives the sampled expression frames a stronger ability to represent expressions; the method and device are used to solve the technical problem that the prior art is not suitable for continuous spontaneous expression recognition without peak expression frame labeling.
The invention provides an adaptive downsampling method based on visual attention theory, which comprises the following steps:
measuring the expression variation of a video clip by a global optical flow method;
converting the expression variation of the video clip into frequency-domain space by a discrete cosine transform method to obtain an expression information amount index in the frequency domain;
and determining a downsampling factor of the video clip according to the expression information amount index in the frequency domain.
Preferably, measuring the expression variation of the video clip by the global optical flow method specifically comprises:
calculating the temporal feature f(n) of each frame in the video clip by the global optical flow formula of the global optical flow method;
taking the temporal feature vector f of the video clip as the expression variation, the temporal feature vector f of the video clip being the set of temporal features f(n) of each frame in the video clip;
preferably, the global optical flow formula is:
f(n) = \sum_{x} \| \Delta I_n(x) \|_2
where ΔI_n(x) denotes the optical flow between the two frames I_n and I_{n-1} with respect to the pixel vector x, and \| \cdot \|_2 is the second-order norm.
Preferably, converting the expression variation of the video clip into frequency-domain space by the discrete cosine transform method to obtain the expression information amount index in the frequency domain specifically comprises:
converting the temporal feature vector f of the video clip into frequency-domain space by the discrete cosine transform method to obtain the expression information amount index F in the frequency domain.
Preferably, before converting the temporal feature vector f of the video clip into frequency-domain space by the discrete cosine transform method to obtain the expression information amount index F in the frequency domain, the method further comprises the following step:
removing the DC offset from the temporal feature vector f of the video clip by the DC-offset-removal formula;
the DC-offset-removal formula is:
\tilde{f} = f - E(f)
where \tilde{f} is the temporal feature vector after DC-offset removal and E(f) is the expectation of the temporal feature vector f; the DC-removed temporal feature vector \tilde{f} replaces the original temporal feature vector f of the video clip.
Preferably, determining the downsampling factor of the video clip according to the expression information amount index in the frequency domain specifically comprises:
calculating the frequency corresponding to the maximum energy of the video clip by the dominant-frequency formula to obtain the dominant frequency;
obtaining the downsampling factor M by dividing the preset maximum frame number of the video clip by the dominant frequency;
the dominant-frequency formula is:
\beta = \arg\max_{k} | F(k) |
where β is the dominant frequency, F is the expression information amount index in the frequency domain, and |F(k)| is its magnitude.
Preferably, before measuring the expression variation of the video clip by the global optical flow method, the method further comprises: dividing the acquired video into non-overlapping video clips;
after determining the downsampling factor of the video clip according to the expression information amount index in the frequency domain, the method further comprises: sampling each video clip according to its downsampling factor to obtain the sampled video clips of the acquired video.
Preferably, dividing the acquired video into non-overlapping video clips specifically comprises:
sequentially dividing the acquired video I into non-overlapping video clips of N frames each; if fewer than N frames remain, they are taken as the last video clip.
Preferably, the video clip length of N frames is determined as follows:
the video frame rate multiplied by the preset video clip duration equals the video clip length of N frames.
The invention provides an adaptive downsampling device based on visual attention theory, which comprises:
a memory to store instructions;
a processor coupled to the memory, the processor being configured to perform the method described above based on the instructions stored in the memory.
According to the above technical solutions, the invention has the following advantages:
The invention provides an adaptive downsampling method based on visual attention theory, which comprises the following steps: measuring the expression variation of a video clip by a global optical flow method; converting the expression variation of the video clip into frequency-domain space by a discrete cosine transform method to obtain an expression information amount index in the frequency domain; and determining a downsampling factor of the video clip according to the expression information amount index in the frequency domain. The invention converts the time-domain expression information to the frequency domain and obtains an adaptive downsampling factor through frequency analysis, which better simulates human attention and gives the sampled expression frames a stronger ability to represent expressions, thereby solving the technical problem that the prior art is not suitable for continuous spontaneous expression recognition without peak expression frame labeling.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without inventive exercise.
FIG. 1 is a diagram illustrating an embodiment of an adaptive downsampling method based on visual attention theory according to the present invention;
FIG. 2 is a schematic diagram of another embodiment of an adaptive downsampling method based on visual attention theory according to the present invention;
fig. 3 is a schematic diagram of an application example of an adaptive downsampling method based on visual attention theory according to the present invention.
Detailed Description
The invention provides an adaptive downsampling method and device based on visual attention theory. The method converts the time-domain expression information to the frequency domain and obtains an adaptive downsampling factor through frequency analysis, which better simulates human attention and gives the sampled expression frames a stronger ability to represent expressions; the method and device are used to solve the technical problem that the prior art is not suitable for continuous spontaneous expression recognition without peak expression frame labeling.
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the embodiments described below are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, an embodiment of an adaptive downsampling method based on visual attention theory according to the present invention includes:
101. measuring expression variation of the video clip by a global optical flow method;
the expression variation may be a temporal feature of the video segment
Figure GDA0003077659920000041
102. Converting the expression variation of the video clip into frequency-domain space by a discrete cosine transform method to obtain an expression information amount index in the frequency domain;
that is, the time-domain representation of the expression variation of the video clip is converted into a frequency-domain representation (into frequency-domain space), and a frequency-domain parameter (the expression information amount index) is obtained. The expression information amount index in the frequency domain may be the discrete cosine transform of the temporal feature vector f of the video clip.
103. Determining a downsampling factor of the video clip according to the expression information amount index in the frequency domain (the downsampling factor is also referred to as downsampling granularity in the literature).
The downsampling factor may specifically be the maximum frame number of the video clip divided by the frequency at which the energy of the expression information amount index in the frequency domain is maximum.
After step 103, the video segment may be sampled according to the down-sampling factor to obtain a sampled video segment. Sampling a video segment according to a downsampling factor is a well known technique to those skilled in the art. Of course, other video operations may be performed according to the down-sampling factor, which is not limited herein.
The invention converts the time-domain expression information to the frequency domain and obtains an adaptive downsampling factor through frequency analysis, which better simulates human attention and gives the sampled expression frames a stronger ability to represent expressions, thereby solving the technical problem that the prior art is not suitable for continuous spontaneous expression recognition without peak expression frame labeling.
The foregoing is a detailed description of an embodiment of an adaptive downsampling method based on visual attention theory according to the present invention, and another embodiment of an adaptive downsampling method based on visual attention theory according to the present invention is described in detail below.
Referring to fig. 2, another embodiment of an adaptive downsampling method based on visual attention theory according to the present invention includes:
201. dividing the acquired video into non-overlapping video segments;
the specific segmentation method comprises the following steps: sequentially dividing the acquired video I according to the mode that the length of the video clip is N frames to obtain non-overlapping video clips
Figure GDA0003077659920000051
And if the remaining video clip is less than N frames, the video clip is taken as the last video clip. Thus, the number of frames of the first video segments is N frames, and the length of the video segment at the last end is less than or equal to N frames. The video is divided into video segments, dynamic attention of visual information can be increased, for unimportant video segments, the number of video segments needing to be sampled is small, for important video segments, the number of video segments needing to be sampled is large, even the whole video segment needs to be sampled, and in addition, if the acquired video has a plurality of segments needing attention, a plurality of important segments can be identified. If the whole video is not segmented and is taken for identification, the identification accuracy is not high, and only one important segment can be identified. Therefore, the video is divided, so that the accuracy of expression recognition can be improved, and the time for recognition can be reduced.
The video clip length of N frames is determined as follows: the video frame rate multiplied by the preset video clip duration equals the video clip length of N frames. The video clip duration is typically set to 1 s because, according to vision and attention theory, 1 Hz is the maximum rate at which the Human Visual System (HVS) processes video. The frame rate of the video may be 25 fps, 60 fps, and so on; with the clip duration set to 1 s, N is set to 60 when processing a video that displays 60 frames per second and to 25 when processing a video that displays 25 frames per second.
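The segmentation of step 201 can be sketched as follows; this is an illustrative example only, not part of the patent text, and the function name split_into_clips and the use of plain Python lists are assumptions made for the sketch.

```python
def split_into_clips(frames, fps, clip_seconds=1.0):
    """Split a video (a list of frames) into non-overlapping clips of
    N = fps * clip_seconds frames; a shorter remainder becomes the last clip."""
    n = int(round(fps * clip_seconds))                       # clip length N in frames
    return [frames[i:i + n] for i in range(0, len(frames), n)]

# Example: a 25 fps video of 130 frames yields clips of 25, 25, 25, 25, 25 and 5 frames.
```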
202. Calculating the temporal feature f(n) of each frame in the video clip by the global optical flow formula of the global optical flow method;
the global optical flow formula is:
f(n) = \sum_{x} \| \Delta I_n(x) \|_2
where ΔI_n(x) denotes the optical flow between the two frames I_n and I_{n-1} with respect to the pixel vector x, \| \cdot \|_2 is the second-order norm, I_n is the image of the n-th frame, x is a pixel vector, and f(n) is the temporal feature of a single frame.
203. Taking the temporal feature vector f of the video clip as the expression variation;
the temporal feature vector f of the video clip is the set of temporal features f(n) of each frame in the video clip and can be expressed as:
f = \{ f(1), f(2), \ldots, f(N) \}
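A minimal sketch of steps 202-203 is given below, assuming grayscale frames stored as NumPy arrays and using OpenCV's Farneback dense optical flow as one possible flow estimator (the patent does not prescribe a particular optical-flow algorithm); it yields one value per consecutive frame pair, i.e. N-1 values for an N-frame clip.

```python
import cv2
import numpy as np

def temporal_features(clip):
    """Approximate f(n) = sum_x ||dI_n(x)||_2 for each consecutive frame pair
    of a clip of grayscale frames, returning the temporal feature vector f."""
    f = []
    for prev, curr in zip(clip[:-1], clip[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        f.append(np.linalg.norm(flow, axis=2).sum())  # sum of per-pixel flow magnitudes
    return np.asarray(f)
```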
204. Removing the DC offset from the temporal feature vector f of the video clip by the DC-offset-removal formula, and replacing the original temporal feature vector f of the video clip with the DC-removed temporal feature vector \tilde{f};
the DC-offset-removal formula is:
\tilde{f} = f - E(f)
where \tilde{f} is the temporal feature vector after DC-offset removal and E(f) is the expectation of the temporal feature vector f. An important reason for removing the DC offset is that, for real data f, the DC offset, which corresponds to the 0 Hz coefficient of the frequency-domain index, is larger than the coefficients at the other frequencies and could therefore be selected as the dominant frequency in the following calculation. The DC offset is removed to avoid this possibility.
205. Converting the DC-removed temporal feature vector \tilde{f} into frequency-domain space by the discrete cosine transform method to obtain the expression information amount index F in the frequency domain:
F = \mathrm{DCT}(\tilde{f})
where DCT(·) is the discrete cosine transform. If the DC offset is not removed (step 204), step 205 instead converts the temporal feature vector f of the video clip into frequency-domain space by the discrete cosine transform method to obtain the expression information amount index F in the frequency domain.
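Steps 204-205 amount to mean removal followed by a DCT. A sketch using SciPy is shown below; the choice of the type-II, orthonormal DCT is an assumption of this example, since the patent only specifies "discrete cosine transform".

```python
import numpy as np
from scipy.fft import dct

def expression_spectrum(f):
    """Remove the DC offset from the temporal feature vector f and return
    its discrete cosine transform F = DCT(f - E(f))."""
    f_centered = f - f.mean()              # step 204: remove the DC offset
    return dct(f_centered, norm='ortho')   # step 205: expression information amount index F
```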
206. Calculating the frequency corresponding to the maximum energy of the video clip by the dominant-frequency formula to obtain the dominant frequency;
the dominant-frequency formula is:
\beta = \arg\max_{k} | F(k) |
where β is the dominant frequency, F is the expression information amount index in the frequency domain, |F(k)| is its magnitude, and k is the frequency. On a plot with frequency on the horizontal axis and energy on the vertical axis, the energy reaches its maximum when k equals β.
207. Obtaining the downsampling factor M by dividing the preset maximum frame number of the video clip by the dominant frequency;
that is, the downsampling factor M equals the preset maximum frame number of the video clip divided by the dominant frequency. For the first video clips, which each contain N frames, M = N/β; if the last clip contains, for example, 10 frames, its downsampling factor is M = 10/β.
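Steps 206-207 then reduce to an argmax over the spectrum magnitude followed by a division. In the sketch below the 0 Hz bin is skipped explicitly and the factor is rounded down to an integer of at least 1; both choices are assumptions made for the example, not requirements stated in the patent.

```python
import numpy as np

def downsampling_factor(F, clip_len):
    """beta = argmax_k |F(k)| over k >= 1; M = clip_len / beta."""
    beta = int(np.argmax(np.abs(F[1:]))) + 1   # dominant frequency, ignoring the 0 Hz bin
    return max(1, clip_len // beta)            # downsampling factor M
```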
208. Sampling each video clip according to its downsampling factor to obtain the sampled video clips of the acquired video.
According to this embodiment, the invention has the following advantages and effects compared with the prior art:
1. Compared with the prior art, the invention provides an index that reflects the amount of expression information contained in a video frame. Compared with other similar algorithms, this index better matches the way expressions change.
2. The invention establishes a video downsampling time-factor model consistent with the human visual attention mechanism, which improves the intelligence of the algorithm and better conforms to human visual attention.
3. Compared with the prior art, the invention solves the problems that existing automatic expression frame extraction methods have low accuracy and require peak expression frame labeling, and achieves the original goal of adaptive downsampling.
The foregoing is a detailed description of another embodiment of the adaptive downsampling method based on the visual attention theory, and an application example of the adaptive downsampling method based on the visual attention theory is described in detail below. In this application example, the video segment duration is set to 1s, the number of frames N is set to 25, and the frame rate of the processed video is 25 Hz.
Referring to fig. 3, an application example of an adaptive downsampling method based on visual attention theory according to the present invention includes:
the method comprises the following steps: the video is divided into uniform smaller video segments, each segment being dynamically sampled, each segment having its own downsampling factor. Video I is segmented into N non-overlapping segments. Video
Figure GDA0003077659920000071
Containing N frames, i.e.
Figure GDA0003077659920000072
The system processes all frame images of the entire video starting from the first frame. First the system processes N frames of video clips at a time, from m0Starting with 0, the previous N frames form a video clip. Then m0When the number is equal to N,another video segment is formed from the nth frame to the 2N-1 frame (e.g., the system processes the mth frame in this application case)0Beginning to form the second video segment) and so on until the end of the video. If the video is finally less than N frames, a video clip is formed. We choose the parameter N-25 such that the duration of each segment is 1 second, since 1HZ is the maximum limit of the HVS according to vision and attention theory.
Based on changes in the visual information, the human visual system allocates dynamic attention. We regard attention as the number of frames selected from each video clip: if the visual information changes little, little attention is paid and fewer frames are selected.
Step two: facial expressions are quantified as a time-varying signal whose frequency must respond to changes in facial expression. Because the frame rate is high and the face is frontal, optical flow can be used to quantify the facial expression. ΔI_n(x) is the optical flow between the two frames I_n and I_{n-1}; its output is a motion vector. Summing over all pixels in the image forms a one-dimensional signal:
f(n) = \sum_{x} \| \Delta I_n(x) \|_2
where f(n) is the temporal feature of a single frame, x is a pixel vector, and the second-order norm gives its magnitude. For the entire video clip I_{m_0}, the temporal feature vector f can be represented by the following formula:
f = \{ f(1), f(2), \ldots, f(N) \}
step three: to calculate the dominant frequency, the dc offset is first removed:
Figure GDA0003077659920000085
where E () is the expected value operator. An important reason for removing the dc offset is for the actual data
Figure GDA0003077659920000086
The DC offset will be greater than corresponding to a factor of 0Hz
Figure GDA0003077659920000087
And therefore it is possible to select the dc offset as the primary frequency.
Figure GDA0003077659920000088
Is the discrete cosine transform result:
Figure GDA0003077659920000089
here DCT (.) is a discrete cosine transform.
Step four: the frequency with the maximum energy is calculated as follows:
\beta = \arg\max_{k} | F(k) |
where k is the frequency of the signal and |F(k)| is the magnitude of F(k). We downsample the discrete signal by removing samples in which the signal varies little, so we sample at the dominant frequency β.
The downsampling factor M is given by (maximum frequency / dominant frequency). The index of the dominant frequency β can be converted to 2πβ/N, and the maximum frequency index N (i.e., the number of frames, N = 25 in this application example) corresponds to 2π, so the downsampling factor M = N/β is obtained by dividing 2π by 2πβ/N. The samples for a video clip are therefore obtained by keeping every M-th frame of the clip:
\tilde{I}_{m_0} = \{ I_{m_0 + mM} \mid m = 0, 1, \ldots, \beta - 1 \}
when the temporal feature has a high frequency β → N, the downsampling factor is close to 1 and all frames are retained. When the temporal features have low frequencies, the downsampling factor is increased and most of the frames are removed.
An embodiment of the adaptive down-sampling device based on visual attention theory provided by the present invention will be described in detail below:
the invention provides an embodiment of an adaptive down-sampling device based on visual attention theory, which comprises:
a memory to store instructions;
a processor coupled to the memory, the processor configured to perform a method as described above based on instructions stored by the memory.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. An adaptive downsampling method based on visual attention theory, characterized by comprising the following steps:
measuring the expression variation of a video clip by a global optical flow method;
converting the expression variation of the video clip into frequency-domain space by a discrete cosine transform method to obtain an expression information amount index in the frequency domain;
determining a downsampling factor of the video clip according to the expression information amount index in the frequency domain, which specifically comprises:
calculating the frequency corresponding to the maximum energy of the video clip by a dominant-frequency formula to obtain the dominant frequency;
obtaining the downsampling factor M by dividing a preset maximum frame number of the video clip by the dominant frequency;
the dominant-frequency formula being:
\beta = \arg\max_{k} | F(k) |
wherein β is the dominant frequency, F is the expression information amount index in the frequency domain, and |F(k)| is its magnitude.
2. The adaptive downsampling method based on visual attention theory according to claim 1, wherein measuring the expression variation of the video clip by the global optical flow method specifically comprises:
calculating the temporal feature f(n) of each frame in the video clip by a global optical flow formula of the global optical flow method;
taking the temporal feature vector f of the video clip as the expression variation, the temporal feature vector f of the video clip being the set of temporal features f(n) of each frame in the video clip.
3. The adaptive downsampling method based on visual attention theory according to claim 2, wherein the global optical flow formula is:
f(n) = \sum_{x} \| \Delta I_n(x) \|_2
wherein ΔI_n(x) denotes the optical flow between the two frames I_n and I_{n-1} with respect to the pixel vector x, and \| \cdot \|_2 is the second-order norm.
4. The adaptive downsampling method based on visual attention theory according to claim 2, wherein converting the expression variation of the video clip into frequency-domain space by the discrete cosine transform method to obtain the expression information amount index in the frequency domain specifically comprises:
converting the temporal feature vector f of the video clip into frequency-domain space by the discrete cosine transform method to obtain the expression information amount index F in the frequency domain.
5. The adaptive downsampling method according to claim 4, wherein, before converting the temporal feature vector f of the video clip into frequency-domain space by the discrete cosine transform method to obtain the expression information amount index F in the frequency domain, the method further comprises the following step:
removing the DC offset from the temporal feature vector f of the video clip by a DC-offset-removal formula;
the DC-offset-removal formula being:
\tilde{f} = f - E(f)
wherein \tilde{f} is the temporal feature vector after DC-offset removal and E(f) is the expectation of the temporal feature vector f;
the DC-removed temporal feature vector \tilde{f} replaces the original temporal feature vector f of the video clip.
6. The adaptive downsampling method based on visual attention theory according to claim 1, wherein, before measuring the expression variation of the video clip by the global optical flow method, the method further comprises: dividing the acquired video into non-overlapping video clips;
and after determining the downsampling factor of the video clip according to the expression information amount index in the frequency domain, the method further comprises: sampling each video clip according to its downsampling factor to obtain the sampled video clips of the acquired video.
7. The adaptive downsampling method based on visual attention theory according to claim 6, wherein dividing the acquired video into non-overlapping video clips specifically comprises:
sequentially dividing the acquired video I into non-overlapping video clips of N frames each.
8. The adaptive downsampling method based on visual attention theory according to claim 1, wherein the video clip length of N frames is determined as follows:
the video frame rate multiplied by the preset video clip duration equals the video clip length of N frames.
9. An adaptive downsampling device based on visual attention theory, characterized by comprising:
a memory to store instructions;
a processor coupled to the memory, the processor being configured to perform the method of any one of claims 1-8 based on the instructions stored in the memory.
CN201810379089.5A 2018-04-25 2018-04-25 Adaptive downsampling method and device based on visual attention theory Active CN108537209B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810379089.5A CN108537209B (en) 2018-04-25 2018-04-25 Adaptive downsampling method and device based on visual attention theory

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810379089.5A CN108537209B (en) 2018-04-25 2018-04-25 Adaptive downsampling method and device based on visual attention theory

Publications (2)

Publication Number Publication Date
CN108537209A CN108537209A (en) 2018-09-14
CN108537209B (en) 2021-08-27

Family

ID=63478646

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810379089.5A Active CN108537209B (en) 2018-04-25 2018-04-25 Adaptive downsampling method and device based on visual attention theory

Country Status (1)

Country Link
CN (1) CN108537209B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114038036A (en) * 2021-11-09 2022-02-11 北京九州安华信息安全技术有限公司 Spontaneous expression recognition method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007286963A (en) * 2006-04-18 2007-11-01 Nippon Telegr & Teleph Corp <Ntt> Video motion pattern analyzing device, and method and program thereof
CN104769611A (en) * 2012-11-06 2015-07-08 诺基亚技术有限公司 Method and apparatus for summarization based on facial expressions
CN107431635A (en) * 2015-03-27 2017-12-01 英特尔公司 The animation of incarnation facial expression and/or voice driven

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007286963A (en) * 2006-04-18 2007-11-01 Nippon Telegr & Teleph Corp <Ntt> Video motion pattern analyzing device, and method and program thereof
CN104769611A (en) * 2012-11-06 2015-07-08 诺基亚技术有限公司 Method and apparatus for summarization based on facial expressions
CN107431635A (en) * 2015-03-27 2017-12-01 英特尔公司 The animation of incarnation facial expression and/or voice driven

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Recognition of facial expressions and measurement of levels of interest from video; M. Yeasin et al.; IEEE Transactions on Multimedia; 2006-05-15; Vol. 8, No. 3; pp. 500-508 *
Research on object detection algorithms based on the visual attention mechanism; 刘亨立; China Master's Theses Full-text Database, Information Science and Technology; 2017-02-15; pp. I138-3061 *

Also Published As

Publication number Publication date
CN108537209A (en) 2018-09-14

Similar Documents

Publication Publication Date Title
CN109740499B (en) Video segmentation method, video motion recognition method, device, equipment and medium
CN110379416B (en) Neural network language model training method, device, equipment and storage medium
CN108875931B (en) Neural network training and image processing method, device and system
CN110929780A (en) Video classification model construction method, video classification device, video classification equipment and media
CN112929695B (en) Video duplicate removal method and device, electronic equipment and storage medium
CN109961041B (en) Video identification method and device and storage medium
CN114895817B (en) Interactive information processing method, network model training method and device
WO2022227768A1 (en) Dynamic gesture recognition method and apparatus, and device and storage medium
CN116363261B (en) Training method of image editing model, image editing method and device
CN114187624B (en) Image generation method, device, electronic equipment and storage medium
WO2019047655A1 (en) Method and apparatus for use in determining driving behavior of driverless vehicle
CN109064548B (en) Video generation method, device, equipment and storage medium
CN114332318A (en) Virtual image generation method and related equipment thereof
CN114339409A (en) Video processing method, video processing device, computer equipment and storage medium
CN108537209B (en) Adaptive downsampling method and device based on visual attention theory
CN113177526B (en) Image processing method, device, equipment and storage medium based on face recognition
CN111488779A (en) Video image super-resolution reconstruction method, device, server and storage medium
CN109741300B (en) Image significance rapid detection method and device suitable for video coding
EP4152269A1 (en) Method and apparatus of generating 3d video, method and apparatus of training model, device, and medium
CN116611491A (en) Training method and device of target detection model, electronic equipment and storage medium
CN114783454B (en) Model training and audio noise reduction method, device, equipment and storage medium
CN114882151A (en) Method and device for generating virtual image video, equipment, medium and product
CN114612976A (en) Key point detection method and device, computer readable medium and electronic equipment
CN115205094A (en) Neural network training method, image detection method and equipment thereof
CN114419182A (en) Image processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant