CN108537209B - Adaptive downsampling method and device based on visual attention theory - Google Patents

Adaptive downsampling method and device based on visual attention theory

Info

Publication number
CN108537209B
CN108537209B (application CN201810379089.5A)
Authority
CN
China
Prior art keywords
video
expression
frequency domain
sampling
video clip
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810379089.5A
Other languages
Chinese (zh)
Other versions
CN108537209A (en)
Inventor
姬秋敏
张灵
陈云华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN201810379089.5A
Publication of CN108537209A
Application granted
Publication of CN108537209B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174 Facial expression recognition
    • G06V 40/176 Dynamic expression
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/269 Analysis of motion using gradient-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention provides an adaptive downsampling method and device based on visual attention theory. The method comprises the following steps: measuring the expression variation of a video clip by a global optical flow method; converting the expression variation of the video clip into frequency-domain space by a discrete cosine transform method to obtain an expression information amount index in the frequency domain; and determining a downsampling factor of the video clip according to the expression information amount index in the frequency domain. The invention converts the time-domain expression information to the frequency domain and obtains an adaptive downsampling factor through frequency analysis, which better simulates human attention and gives the sampled expression frames a stronger ability to represent expressions, thereby solving the technical problem that the prior art is not suitable for continuous spontaneous expression recognition without peak expression frame labeling.

Description

Adaptive downsampling method and device based on visual attention theory
Technical Field
The invention relates to the technical field of face recognition, and in particular to an adaptive downsampling method and device based on visual attention theory.
Background
On one hand, facial expression recognition has broad application prospects in fields such as medical treatment, public safety and human-computer interaction, so its research has practical significance; on the other hand, the expression recognition problem involves many basic problems in image processing, computer vision and optimization theory, such as image registration, image segmentation, image feature extraction, machine learning and optimization algorithms, so its research also has important theoretical significance. Dynamic continuous spontaneous expression recognition has become a research focus in recent years because, compared with static posed expression recognition, it poses greater research challenges and has higher application value. As one of the latest spontaneous expression datasets, the Audio/Visual Emotion Challenge (AVEC) dataset contains about 1.5 million frames of video; the facial expressions in the video are natural and subtle, and the dataset does not label peak expression frames. Several international competitions have been held on this dataset, exposing many deficiencies of existing feature-representation models in dealing with continuous spontaneous expressions. Because continuous spontaneous expression datasets contain huge amounts of video with diverse forms of variation and no peak expression frame labels, the core task of feature representation is the automatic extraction of expression frames. For automatic extraction of expression frames, heuristic downsampling methods achieve good performance, but they require peak expression frame labeling and are therefore not suitable for continuous spontaneous expression recognition without peak expression frame labeling.
Disclosure of Invention
The invention provides an adaptive downsampling method and device based on visual attention theory. The method converts the time-domain expression information to the frequency domain and obtains an adaptive downsampling factor through frequency analysis, which better simulates human attention and gives the sampled expression frames a stronger ability to represent expressions; the method and device are used to solve the technical problem that the prior art is not suitable for continuous spontaneous expression recognition without peak expression frame labeling.
The invention provides an adaptive downsampling method based on visual attention theory, which comprises the following steps:
measuring the expression variation of a video clip by a global optical flow method;
converting the expression variation of the video clip into frequency-domain space by a discrete cosine transform method to obtain an expression information amount index in the frequency domain;
and determining a downsampling factor of the video clip according to the expression information amount index in the frequency domain.
Preferably, measuring the expression variation of the video clip by the global optical flow method specifically comprises:
calculating the temporal feature f(n) of each frame in the video clip by the global optical flow formula of the global optical flow method;
taking the temporal feature vector f of the video clip as the expression variation, the temporal feature vector f of the video clip being the set of temporal features f(n) of each frame in the video clip;
preferably, the global optical flow formula is:
f(n) = \sum_{x} \| \Delta I_n(x) \|_2
where ΔI_n(x) denotes the optical flow between the two frames I_n and I_{n-1} with respect to the pixel vector x, and \| \cdot \|_2 is the second-order norm.
Preferably, converting the expression variation of the video clip into frequency-domain space by the discrete cosine transform method to obtain the expression information amount index in the frequency domain specifically comprises:
converting the temporal feature vector f of the video clip into frequency-domain space by the discrete cosine transform method to obtain the expression information amount index F in the frequency domain.
Preferably, before converting the temporal feature vector f of the video clip into frequency-domain space by the discrete cosine transform method to obtain the expression information amount index F in the frequency domain, the method further comprises the following step:
removing the DC offset from the temporal feature vector f of the video clip by the DC-offset-removal formula;
the DC-offset-removal formula is:
\tilde{f} = f - E(f)
where \tilde{f} is the temporal feature vector after DC-offset removal and E(f) is the expectation of the temporal feature vector f; the DC-removed temporal feature vector \tilde{f} replaces the original temporal feature vector f of the video clip.
Preferably, determining the downsampling factor of the video clip according to the expression information amount index in the frequency domain specifically comprises:
calculating the frequency corresponding to the maximum energy of the video clip by the dominant-frequency formula to obtain the dominant frequency;
obtaining the downsampling factor M by dividing the preset maximum frame number of the video clip by the dominant frequency;
the dominant-frequency formula is:
\beta = \arg\max_{k} | F(k) |
where β is the dominant frequency, F is the expression information amount index in the frequency domain, and |F(k)| is its magnitude.
Preferably, before measuring the expression variation of the video clip by the global optical flow method, the method further comprises: dividing the acquired video into non-overlapping video clips;
after determining the downsampling factor of the video clip according to the expression information amount index in the frequency domain, the method further comprises: sampling each video clip according to its downsampling factor to obtain the sampled video clips of the acquired video.
Preferably, dividing the acquired video into non-overlapping video clips specifically comprises:
sequentially dividing the acquired video I into non-overlapping video clips of N frames each; if fewer than N frames remain, they are taken as the last video clip.
Preferably, the video clip length of N frames is determined as follows:
the video frame rate multiplied by the preset video clip duration equals the video clip length of N frames.
The invention provides an adaptive downsampling device based on visual attention theory, which comprises:
a memory to store instructions;
a processor coupled to the memory, the processor being configured to perform the method described above based on the instructions stored in the memory.
According to the above technical solutions, the invention has the following advantages:
The invention provides an adaptive downsampling method based on visual attention theory, which comprises the following steps: measuring the expression variation of a video clip by a global optical flow method; converting the expression variation of the video clip into frequency-domain space by a discrete cosine transform method to obtain an expression information amount index in the frequency domain; and determining a downsampling factor of the video clip according to the expression information amount index in the frequency domain. The invention converts the time-domain expression information to the frequency domain and obtains an adaptive downsampling factor through frequency analysis, which better simulates human attention and gives the sampled expression frames a stronger ability to represent expressions, thereby solving the technical problem that the prior art is not suitable for continuous spontaneous expression recognition without peak expression frame labeling.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without inventive exercise.
FIG. 1 is a diagram illustrating an embodiment of an adaptive downsampling method based on visual attention theory according to the present invention;
FIG. 2 is a schematic diagram of another embodiment of an adaptive downsampling method based on visual attention theory according to the present invention;
fig. 3 is a schematic diagram of an application example of an adaptive downsampling method based on visual attention theory according to the present invention.
Detailed Description
The invention provides an adaptive downsampling method and device based on visual attention theory. The method converts the time-domain expression information to the frequency domain and obtains an adaptive downsampling factor through frequency analysis, which better simulates human attention and gives the sampled expression frames a stronger ability to represent expressions; the method and device are used to solve the technical problem that the prior art is not suitable for continuous spontaneous expression recognition without peak expression frame labeling.
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the embodiments described below are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, an embodiment of an adaptive downsampling method based on visual attention theory according to the present invention includes:
101. measuring expression variation of the video clip by a global optical flow method;
the expression variation may be a temporal feature of the video segment
Figure GDA0003077659920000041
102. Converting the expression variation of the video clip into frequency-domain space by a discrete cosine transform method to obtain an expression information amount index in the frequency domain;
that is, the time-domain representation of the expression variation of the video clip is converted into a frequency-domain representation (into frequency-domain space), and a frequency-domain parameter (the expression information amount index) is obtained. The expression information amount index in the frequency domain may be the discrete cosine transform of the temporal feature vector f of the video clip.
103. Determining a downsampling factor of the video clip according to the expression information amount index in the frequency domain (the downsampling factor is also referred to as downsampling granularity in the literature).
The downsampling factor may specifically be the maximum frame number of the video clip divided by the frequency at which the energy of the expression information amount index in the frequency domain is maximum.
After step 103, the video segment may be sampled according to the down-sampling factor to obtain a sampled video segment. Sampling a video segment according to a downsampling factor is a well known technique to those skilled in the art. Of course, other video operations may be performed according to the down-sampling factor, which is not limited herein.
The invention converts the time-domain expression information to the frequency domain and obtains an adaptive downsampling factor through frequency analysis, which better simulates human attention and gives the sampled expression frames a stronger ability to represent expressions, thereby solving the technical problem that the prior art is not suitable for continuous spontaneous expression recognition without peak expression frame labeling.
The foregoing is a detailed description of an embodiment of an adaptive downsampling method based on visual attention theory according to the present invention, and another embodiment of an adaptive downsampling method based on visual attention theory according to the present invention is described in detail below.
Referring to fig. 2, another embodiment of an adaptive downsampling method based on visual attention theory according to the present invention includes:
201. dividing the acquired video into non-overlapping video segments;
the specific segmentation method comprises the following steps: sequentially dividing the acquired video I according to the mode that the length of the video clip is N frames to obtain non-overlapping video clips
Figure GDA0003077659920000051
And if the remaining video clip is less than N frames, the video clip is taken as the last video clip. Thus, the number of frames of the first video segments is N frames, and the length of the video segment at the last end is less than or equal to N frames. The video is divided into video segments, dynamic attention of visual information can be increased, for unimportant video segments, the number of video segments needing to be sampled is small, for important video segments, the number of video segments needing to be sampled is large, even the whole video segment needs to be sampled, and in addition, if the acquired video has a plurality of segments needing attention, a plurality of important segments can be identified. If the whole video is not segmented and is taken for identification, the identification accuracy is not high, and only one important segment can be identified. Therefore, the video is divided, so that the accuracy of expression recognition can be improved, and the time for recognition can be reduced.
The video clip length of N frames is determined as follows: the video frame rate multiplied by the preset video clip duration equals the video clip length of N frames. The video clip duration is typically set to 1 s because, according to vision and attention theory, 1 Hz is the maximum rate at which the Human Visual System (HVS) processes video. The frame rate of the video may be 25 fps, 60 fps, and so on; with the clip duration set to 1 s, N is set to 60 when processing a video that displays 60 frames per second and to 25 when processing a video that displays 25 frames per second.
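The segmentation of step 201 can be sketched as follows; this is an illustrative example only, not part of the patent text, and the function name split_into_clips and the use of plain Python lists are assumptions made for the sketch.

```python
def split_into_clips(frames, fps, clip_seconds=1.0):
    """Split a video (a list of frames) into non-overlapping clips of
    N = fps * clip_seconds frames; a shorter remainder becomes the last clip."""
    n = int(round(fps * clip_seconds))                       # clip length N in frames
    return [frames[i:i + n] for i in range(0, len(frames), n)]

# Example: a 25 fps video of 130 frames yields clips of 25, 25, 25, 25, 25 and 5 frames.
```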
202. Calculating the temporal feature f(n) of each frame in the video clip by the global optical flow formula of the global optical flow method;
the global optical flow formula is:
f(n) = \sum_{x} \| \Delta I_n(x) \|_2
where ΔI_n(x) denotes the optical flow between the two frames I_n and I_{n-1} with respect to the pixel vector x, \| \cdot \|_2 is the second-order norm, I_n is the image of the n-th frame, x is a pixel vector, and f(n) is the temporal feature of a single frame.
203. Taking the temporal feature vector f of the video clip as the expression variation;
the temporal feature vector f of the video clip is the set of temporal features f(n) of each frame in the video clip and can be expressed as:
f = \{ f(1), f(2), \ldots, f(N) \}
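A minimal sketch of steps 202-203 is given below, assuming grayscale frames stored as NumPy arrays and using OpenCV's Farneback dense optical flow as one possible flow estimator (the patent does not prescribe a particular optical-flow algorithm); it yields one value per consecutive frame pair, i.e. N-1 values for an N-frame clip.

```python
import cv2
import numpy as np

def temporal_features(clip):
    """Approximate f(n) = sum_x ||dI_n(x)||_2 for each consecutive frame pair
    of a clip of grayscale frames, returning the temporal feature vector f."""
    f = []
    for prev, curr in zip(clip[:-1], clip[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        f.append(np.linalg.norm(flow, axis=2).sum())  # sum of per-pixel flow magnitudes
    return np.asarray(f)
```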
204. Removing the DC offset from the temporal feature vector f of the video clip by the DC-offset-removal formula, and replacing the original temporal feature vector f of the video clip with the DC-removed temporal feature vector \tilde{f};
the DC-offset-removal formula is:
\tilde{f} = f - E(f)
where \tilde{f} is the temporal feature vector after DC-offset removal and E(f) is the expectation of the temporal feature vector f. An important reason for removing the DC offset is that, for real data f, the DC offset, which corresponds to the 0 Hz coefficient of the frequency-domain index, is larger than the coefficients at the other frequencies and could therefore be selected as the dominant frequency in the following calculation. The DC offset is removed to avoid this possibility.
205. Converting the DC-removed temporal feature vector \tilde{f} into frequency-domain space by the discrete cosine transform method to obtain the expression information amount index F in the frequency domain:
F = \mathrm{DCT}(\tilde{f})
where DCT(·) is the discrete cosine transform. If the DC offset is not removed (step 204), step 205 instead converts the temporal feature vector f of the video clip into frequency-domain space by the discrete cosine transform method to obtain the expression information amount index F in the frequency domain.
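Steps 204-205 amount to mean removal followed by a DCT. A sketch using SciPy is shown below; the choice of the type-II, orthonormal DCT is an assumption of this example, since the patent only specifies "discrete cosine transform".

```python
import numpy as np
from scipy.fft import dct

def expression_spectrum(f):
    """Remove the DC offset from the temporal feature vector f and return
    its discrete cosine transform F = DCT(f - E(f))."""
    f_centered = f - f.mean()              # step 204: remove the DC offset
    return dct(f_centered, norm='ortho')   # step 205: expression information amount index F
```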
206. Calculating the frequency corresponding to the maximum energy of the video clip by the dominant-frequency formula to obtain the dominant frequency;
the dominant-frequency formula is:
\beta = \arg\max_{k} | F(k) |
where β is the dominant frequency, F is the expression information amount index in the frequency domain, |F(k)| is its magnitude, and k is the frequency. On a plot with frequency on the horizontal axis and energy on the vertical axis, the energy reaches its maximum when k equals β.
207. Obtaining the downsampling factor M by dividing the preset maximum frame number of the video clip by the dominant frequency;
that is, the downsampling factor M equals the preset maximum frame number of the video clip divided by the dominant frequency. For the first video clips, which each contain N frames, M = N/β; if the last clip contains, for example, 10 frames, its downsampling factor is M = 10/β.
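Steps 206-207 then reduce to an argmax over the spectrum magnitude followed by a division. In the sketch below the 0 Hz bin is skipped explicitly and the factor is rounded down to an integer of at least 1; both choices are assumptions made for the example, not requirements stated in the patent.

```python
import numpy as np

def downsampling_factor(F, clip_len):
    """beta = argmax_k |F(k)| over k >= 1; M = clip_len / beta."""
    beta = int(np.argmax(np.abs(F[1:]))) + 1   # dominant frequency, ignoring the 0 Hz bin
    return max(1, clip_len // beta)            # downsampling factor M
```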
208. Sampling each video clip according to its downsampling factor to obtain the sampled video clips of the acquired video.
According to this embodiment, the invention has the following advantages and effects compared with the prior art:
1. Compared with the prior art, the invention provides an index that reflects the amount of expression information contained in a video frame. Compared with other similar algorithms, this index better matches the way expressions change.
2. The invention establishes a video downsampling time-factor model consistent with the human visual attention mechanism, which improves the intelligence of the algorithm and better conforms to human visual attention.
3. Compared with the prior art, the invention solves the problems that existing automatic expression frame extraction methods have low accuracy and require peak expression frame labeling, and achieves the original goal of adaptive downsampling.
The foregoing is a detailed description of another embodiment of the adaptive downsampling method based on the visual attention theory, and an application example of the adaptive downsampling method based on the visual attention theory is described in detail below. In this application example, the video segment duration is set to 1s, the number of frames N is set to 25, and the frame rate of the processed video is 25 Hz.
Referring to fig. 3, an application example of an adaptive downsampling method based on visual attention theory according to the present invention includes:
the method comprises the following steps: the video is divided into uniform smaller video segments, each segment being dynamically sampled, each segment having its own downsampling factor. Video I is segmented into N non-overlapping segments. Video
Figure GDA0003077659920000071
Containing N frames, i.e.
Figure GDA0003077659920000072
The system processes all frame images of the entire video starting from the first frame. First the system processes N frames of video clips at a time, from m0Starting with 0, the previous N frames form a video clip. Then m0When the number is equal to N,another video segment is formed from the nth frame to the 2N-1 frame (e.g., the system processes the mth frame in this application case)0Beginning to form the second video segment) and so on until the end of the video. If the video is finally less than N frames, a video clip is formed. We choose the parameter N-25 such that the duration of each segment is 1 second, since 1HZ is the maximum limit of the HVS according to vision and attention theory.
Based on changes in the visual information, the human visual system allocates dynamic attention. We regard attention as the number of frames selected from each video clip: if the visual information changes little, little attention is paid and fewer frames are selected.
Step two: facial expressions are quantified as a time-varying signal whose frequency must respond to changes in facial expression. Because the frame rate is high and the face is frontal, optical flow can be used to quantify the facial expression. ΔI_n(x) is the optical flow between the two frames I_n and I_{n-1}; its output is a motion vector. Summing over all pixels in the image forms a one-dimensional signal:
f(n) = \sum_{x} \| \Delta I_n(x) \|_2
where f(n) is the temporal feature of a single frame, x is a pixel vector, and the second-order norm gives its magnitude. For the entire video clip I_{m_0}, the temporal feature vector f can be represented by the following formula:
f = \{ f(1), f(2), \ldots, f(N) \}
step three: to calculate the dominant frequency, the dc offset is first removed:
Figure GDA0003077659920000085
where E () is the expected value operator. An important reason for removing the dc offset is for the actual data
Figure GDA0003077659920000086
The DC offset will be greater than corresponding to a factor of 0Hz
Figure GDA0003077659920000087
And therefore it is possible to select the dc offset as the primary frequency.
Figure GDA0003077659920000088
Is the discrete cosine transform result:
Figure GDA0003077659920000089
here DCT (.) is a discrete cosine transform.
Step four: the frequency with the maximum energy is calculated as follows:
\beta = \arg\max_{k} | F(k) |
where k is the frequency of the signal and |F(k)| is the magnitude of F(k). We downsample the discrete signal by removing samples in which the signal varies little, so we sample at the dominant frequency β.
The downsampling factor M is given by (maximum frequency / dominant frequency). The index of the dominant frequency β can be converted to 2πβ/N, and the maximum frequency index N (i.e., the number of frames, N = 25 in this application example) corresponds to 2π, so the downsampling factor M = N/β is obtained by dividing 2π by 2πβ/N. The samples for a video clip are therefore obtained by keeping every M-th frame of the clip:
\tilde{I}_{m_0} = \{ I_{m_0 + mM} \mid m = 0, 1, \ldots, \beta - 1 \}
when the temporal feature has a high frequency β → N, the downsampling factor is close to 1 and all frames are retained. When the temporal features have low frequencies, the downsampling factor is increased and most of the frames are removed.
An embodiment of the adaptive down-sampling device based on visual attention theory provided by the present invention will be described in detail below:
the invention provides an embodiment of an adaptive down-sampling device based on visual attention theory, which comprises:
a memory to store instructions;
a processor coupled to the memory, the processor configured to perform a method as described above based on instructions stored by the memory.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. An adaptive downsampling method based on visual attention theory, characterized by comprising the following steps:
measuring the expression variation of a video clip by a global optical flow method;
converting the expression variation of the video clip into frequency-domain space by a discrete cosine transform method to obtain an expression information amount index in the frequency domain;
determining a downsampling factor of the video clip according to the expression information amount index in the frequency domain, which specifically comprises:
calculating the frequency corresponding to the maximum energy of the video clip by a dominant-frequency formula to obtain the dominant frequency;
obtaining the downsampling factor M by dividing a preset maximum frame number of the video clip by the dominant frequency;
the dominant-frequency formula being:
\beta = \arg\max_{k} | F(k) |
wherein β is the dominant frequency, F is the expression information amount index in the frequency domain, and |F(k)| is its magnitude.
2. The adaptive downsampling method based on visual attention theory according to claim 1, wherein measuring the expression variation of the video clip by the global optical flow method specifically comprises:
calculating the temporal feature f(n) of each frame in the video clip by a global optical flow formula of the global optical flow method;
taking the temporal feature vector f of the video clip as the expression variation, the temporal feature vector f of the video clip being the set of temporal features f(n) of each frame in the video clip.
3. The adaptive downsampling method based on visual attention theory according to claim 2, wherein the global optical flow formula is:
f(n) = \sum_{x} \| \Delta I_n(x) \|_2
wherein ΔI_n(x) denotes the optical flow between the two frames I_n and I_{n-1} with respect to the pixel vector x, and \| \cdot \|_2 is the second-order norm.
4. The adaptive downsampling method based on visual attention theory according to claim 2, wherein converting the expression variation of the video clip into frequency-domain space by the discrete cosine transform method to obtain the expression information amount index in the frequency domain specifically comprises:
converting the temporal feature vector f of the video clip into frequency-domain space by the discrete cosine transform method to obtain the expression information amount index F in the frequency domain.
5. The adaptive downsampling method according to claim 4, wherein, before converting the temporal feature vector f of the video clip into frequency-domain space by the discrete cosine transform method to obtain the expression information amount index F in the frequency domain, the method further comprises the following step:
removing the DC offset from the temporal feature vector f of the video clip by a DC-offset-removal formula;
the DC-offset-removal formula being:
\tilde{f} = f - E(f)
wherein \tilde{f} is the temporal feature vector after DC-offset removal and E(f) is the expectation of the temporal feature vector f;
the DC-removed temporal feature vector \tilde{f} replaces the original temporal feature vector f of the video clip.
6. The adaptive downsampling method based on visual attention theory according to claim 1, wherein, before measuring the expression variation of the video clip by the global optical flow method, the method further comprises: dividing the acquired video into non-overlapping video clips;
and after determining the downsampling factor of the video clip according to the expression information amount index in the frequency domain, the method further comprises: sampling each video clip according to its downsampling factor to obtain the sampled video clips of the acquired video.
7. The adaptive downsampling method based on visual attention theory according to claim 6, wherein dividing the acquired video into non-overlapping video clips specifically comprises:
sequentially dividing the acquired video I into non-overlapping video clips of N frames each.
8. The adaptive downsampling method based on visual attention theory according to claim 1, wherein the video clip length of N frames is determined as follows:
the video frame rate multiplied by the preset video clip duration equals the video clip length of N frames.
9. An adaptive downsampling device based on visual attention theory, characterized by comprising:
a memory to store instructions;
a processor coupled to the memory, the processor being configured to perform the method of any one of claims 1-8 based on the instructions stored in the memory.
CN201810379089.5A 2018-04-25 2018-04-25 Adaptive downsampling method and device based on visual attention theory Active CN108537209B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810379089.5A CN108537209B (en) 2018-04-25 2018-04-25 Adaptive downsampling method and device based on visual attention theory

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810379089.5A CN108537209B (en) 2018-04-25 2018-04-25 Adaptive downsampling method and device based on visual attention theory

Publications (2)

Publication Number Publication Date
CN108537209A CN108537209A (en) 2018-09-14
CN108537209B (en) 2021-08-27

Family

ID=63478646

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810379089.5A Active CN108537209B (en) 2018-04-25 2018-04-25 Adaptive downsampling method and device based on visual attention theory

Country Status (1)

Country Link
CN (1) CN108537209B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114038036A (en) * 2021-11-09 2022-02-11 北京九州安华信息安全技术有限公司 Spontaneous expression recognition method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007286963A (en) * 2006-04-18 2007-11-01 Nippon Telegr & Teleph Corp <Ntt> Video motion pattern analyzing device, and method and program thereof
CN104769611A (en) * 2012-11-06 2015-07-08 诺基亚技术有限公司 Method and apparatus for summarization based on facial expressions
CN107431635A (en) * 2015-03-27 2017-12-01 英特尔公司 The animation of incarnation facial expression and/or voice driven

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007286963A (en) * 2006-04-18 2007-11-01 Nippon Telegr & Teleph Corp <Ntt> Video motion pattern analyzing device, and method and program thereof
CN104769611A (en) * 2012-11-06 2015-07-08 诺基亚技术有限公司 Method and apparatus for summarization based on facial expressions
CN107431635A (en) * 2015-03-27 2017-12-01 英特尔公司 The animation of incarnation facial expression and/or voice driven

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Recognition of facial expressions and measurement of levels of interest from video; M. Yeasin et al.; IEEE Transactions on Multimedia; 2006-05-15; Vol. 8, No. 3; pp. 500-508 *
Research on object detection algorithms based on the visual attention mechanism; 刘亨立; China Master's Theses Full-text Database, Information Science and Technology; 2017-02-15; pp. I138-3061 *

Also Published As

Publication number Publication date
CN108537209A (en) 2018-09-14

Similar Documents

Publication Publication Date Title
CN109740499B (en) Video segmentation method, video motion recognition method, device, equipment and medium
CN110379416B (en) Neural network language model training method, device, equipment and storage medium
CN108875931B (en) Neural network training and image processing method, device and system
CN110929780A (en) Video classification model construction method, video classification device, video classification equipment and media
CN112929695B (en) Video duplicate removal method and device, electronic equipment and storage medium
CN109961041B (en) Video identification method and device and storage medium
CN114895817B (en) Interactive information processing method, network model training method and device
WO2022227768A1 (en) Dynamic gesture recognition method and apparatus, and device and storage medium
CN116363261B (en) Training method of image editing model, image editing method and device
CN114187624B (en) Image generation method, device, electronic equipment and storage medium
WO2019047655A1 (en) Method and apparatus for use in determining driving behavior of driverless vehicle
CN109064548B (en) Video generation method, device, equipment and storage medium
CN114332318A (en) Virtual image generation method and related equipment thereof
CN114339409A (en) Video processing method, video processing device, computer equipment and storage medium
CN108537209B (en) Adaptive downsampling method and device based on visual attention theory
CN113177526B (en) Image processing method, device, equipment and storage medium based on face recognition
CN111488779A (en) Video image super-resolution reconstruction method, device, server and storage medium
CN109741300B (en) Image significance rapid detection method and device suitable for video coding
EP4152269A1 (en) Method and apparatus of generating 3d video, method and apparatus of training model, device, and medium
CN116611491A (en) Training method and device of target detection model, electronic equipment and storage medium
CN114783454B (en) Model training and audio noise reduction method, device, equipment and storage medium
CN114882151A (en) Method and device for generating virtual image video, equipment, medium and product
CN114612976A (en) Key point detection method and device, computer readable medium and electronic equipment
CN115205094A (en) Neural network training method, image detection method and equipment thereof
CN114419182A (en) Image processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant