Summary of the invention
Unlike existing methods for extracting moving objects from video sequences, such as the optical flow method, the frame-difference method, kinetic-energy detection, and background subtraction, the present invention takes the coded information in the video bitstream, such as the prediction mode and the motion vector, as its basis. Exploiting the correlation between coded information and visually interesting regions, it identifies the spatial and temporal visual-feature saliency regions in the coded video content, and thereby automatically marks and extracts the region of interest in the video.
According to the characteristics of the human visual system (HVS), the human eye is more sensitive to luminance information than to chrominance information. The inventive method therefore operates on the coded information of the luminance component of the video sequence to automatically mark and extract the region of interest.
The inventive method specifically comprises the following steps:
Step 1: input a video sequence in YUV format whose GOP (Group of Pictures) structure is IPPP, read the luminance component Y of the coded macroblocks, and configure the coding parameters and initialize the parameters;
Step 2: apply intra-frame predictive coding to the first frame (the I frame) of the video sequence;
In video coding standards, the I frame serves as the reference point for random access and contains a large amount of information. Because it cannot exploit the temporal correlation between adjacent frames, it is coded by intra-frame prediction, which uses the coded information of already coded and reconstructed macroblocks in the current frame to predict the current macroblock and thereby eliminate spatial redundancy. Intra-frame predictive coding of the first frame (the I frame) is a conventional practice in video coding.
Step 3: apply inter-frame predictive coding to the current p-th frame, exploiting the correlation of adjacent frames to eliminate temporal redundancy. Record the inter-frame prediction mode type of every coded macroblock in the current frame, denoted Mode_pn;
Wherein p = 1, 2, 3, ..., L-1 indexes the p-th inter-coded video frame, L is the total number of frames coded in the whole video sequence, and n is the sequence number of the n-th coded macroblock in the current coded frame.
Step 4: identify the spatial visual-feature saliency region of the current p-th frame. Specifically: if the inter-frame prediction mode Mode_pn of the current coded macroblock belongs to the sub-split mode set or the intra prediction mode set, i.e. Mode_pn ∈ {8x8, 8x4, 4x8, 4x4} or {Intra16x16, Intra4x4}, mark the macroblock as S_Yp(x, y, Mode_pn) = 1, meaning it belongs to the spatial visual-feature saliency region; otherwise mark S_Yp(x, y, Mode_pn) = 0. Wherein Y denotes the luminance component of the coded macroblock, (x, y) is the position coordinate of the coded macroblock, and p and Mode_pn are defined as above. Traverse all coded macroblocks in the current p-th frame;
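As a minimal sketch of this Step 4 classification, the mode-name strings below stand in for the H.264 partition and intra prediction modes; they are illustrative labels, not a real encoder API:

```python
# Spatial saliency by prediction mode (Step 4, sketch).
# Mode names are illustrative strings, not a real codec API.
SUB_SPLIT_MODES = {"8x8", "8x4", "4x8", "4x4"}
INTRA_MODES = {"Intra16x16", "Intra4x4"}

def spatial_saliency(mode_pn):
    """Return S_Yp = 1 if Mode_pn is in the sub-split or intra mode set
    (texture-rich or abruptly changing content), else 0."""
    return 1 if mode_pn in SUB_SPLIT_MODES or mode_pn in INTRA_MODES else 0

print(spatial_saliency("8x4"))       # sub-split mode -> 1
print(spatial_saliency("Skip"))      # background mode -> 0
print(spatial_saliency("Intra4x4"))  # intra mode -> 1
```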
Fig. 1 shows a schematic flowchart of inter-frame prediction mode selection in the H.264 standard.
Experiments show that, in H.264/AVC standard coding, there is a strong correlation between the predictive coding result and the region of interest of the human eye. For moving regions or texture-rich regions that attract high visual attention, Mode_pn mostly selects the sub-split mode set {8x8, 8x4, 4x8, 4x4}. At shot changes, when the video content changes abruptly, or when a moving object of large motion amplitude appears, visual attention is highest and Mode_pn selects the intra prediction mode set {Intra16x16, Intra4x4}. For smooth background regions of low visual attention, Mode_pn mostly selects the macroblock partition mode set {Skip, 16x16, 16x8, 8x16}. Taking the Claire sequence as an example, Fig. 2 shows the inter-frame prediction mode distribution of the 50th frame of the Claire sequence; in regions of high visual attention, the coded macroblocks mostly select the inter-frame sub-split prediction mode set.
Step 5: record, for each coded macroblock in the p-th frame, the motion vector V_xpn in the horizontal direction and the motion vector V_ypn in the vertical direction, and compute the average motion vector of all coded macroblocks in the previous coded frame in the horizontal direction, V̄_x(p-1), and in the vertical direction, V̄_y(p-1):

V̄_x(p-1) = (1/Num) Σ_n V_x(p-1)n,  V̄_y(p-1) = (1/Num) Σ_n V_y(p-1)n

Wherein V_x(p-1)n and V_y(p-1)n denote the horizontal and vertical motion vectors of each coded macroblock in the previous coded frame; p and n are defined as in Step 3; and Num is the number of macroblocks contained in one coded frame, i.e. the number of accumulated terms. Taking QCIF-format video (176 x 144) as an example, Fig. 3 shows the positions and sequence numbers n of all coded macroblocks (16 x 16) in one coded frame; in this case Num = 11 x 9 = 99.
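The per-frame averages of Step 5 can be sketched as below. The input `mvs`, a list of (Vx, Vy) pairs, one per coded macroblock, is a hypothetical format, and averaging magnitudes rather than signed components is an assumption of this sketch:

```python
def mean_motion_vectors(mvs):
    """Average the horizontal and vertical motion-vector magnitudes over
    the Num macroblocks of one coded frame (Step 5, sketch)."""
    num = len(mvs)  # Num, the number of macroblocks in the frame
    avg_x = sum(abs(vx) for vx, _ in mvs) / num
    avg_y = sum(abs(vy) for _, vy in mvs) / num
    return avg_x, avg_y

# Toy frame of 4 macroblocks:
print(mean_motion_vectors([(2, -2), (4, 0), (0, 1), (2, 1)]))  # (2.0, 1.0)
```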
Step 6: identify the temporal visual-feature saliency region of the current p-th frame. Specifically: if the horizontal motion vector V_xpn of the current coded macroblock is greater than the mean horizontal motion vector V̄_x(p-1) of the coded macroblocks of the previous frame, or the vertical motion vector V_ypn of the current coded macroblock is greater than the mean vertical motion vector V̄_y(p-1) of the coded macroblocks of the previous frame, the macroblock belongs to the temporal visual-feature saliency region and is marked T_Yp(x, y, V_xpn, V_ypn) = 1; otherwise mark T_Yp(x, y, V_xpn, V_ypn) = 0. Traverse all coded macroblocks in the current p-th frame;
Wherein Y denotes the luminance component of the coded macroblock, (x, y) is the position coordinate of the coded macroblock, and p is defined as in Step 3.
Motion perception is one of the most important visual processing mechanisms in the human visual system. Experiments show that coded content with larger motion vectors corresponds exactly to the moving regions the human eye is interested in (such as the head, arms, or a person), while coded content whose motion vectors are small or even zero corresponds to the static background regions of low visual attention. Taking the Akiyo sequence as an example, Fig. 4 shows the motion vector distribution of the 50th frame of the Akiyo sequence; the coded macroblocks generally have larger motion vectors in the face and head-and-shoulder regions, where visual attention is highest.
Whether the motion of the current coded macroblock is judged intense depends strongly on the setting of the decision threshold. To reduce the false-detection rate, the present invention sets the motion-intensity decision thresholds of the horizontal and vertical directions to V̄_x(p-1) and V̄_y(p-1) respectively, where V̄_x(p-1) denotes the average horizontal motion vector of all coded macroblocks in the previous frame and V̄_y(p-1) denotes the average vertical motion vector of all coded macroblocks in the previous frame. This dynamic threshold fully accounts for the temporal correlation of the video sequence: the threshold changes with the mean motion vector of the previous coded frame, which effectively reduces misjudgments and allows the temporal visual-feature saliency region to be obtained quickly and accurately.
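The Step 6 decision against these dynamic thresholds can be sketched as follows; comparing motion-vector magnitudes is an assumption of this sketch:

```python
def temporal_saliency(vx, vy, thr_x, thr_y):
    """Return T_Yp = 1 if either motion-vector component of the current
    macroblock exceeds the corresponding previous-frame mean (Step 6)."""
    return 1 if abs(vx) > thr_x or abs(vy) > thr_y else 0

# Previous-frame means serve as the thresholds:
print(temporal_saliency(5, 0, 2.0, 1.0))  # large horizontal motion -> 1
print(temporal_saliency(1, 1, 2.0, 1.0))  # below both thresholds -> 0
```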
Step 7: mark the region of interest of the current p-th frame. Specifically: traverse all coded macroblocks in the current p-th frame and mark each according to its spatial and temporal visual-feature saliency. The marking rules distinguish the following cases:
If the current coded macroblock has both spatial and temporal visual-feature saliency, i.e. S_Yp(x, y, Mode_pn) = 1 and T_Yp(x, y, V_xpn, V_ypn) = 1, the macroblock is both rich in texture detail and has produced a large motion vector; the human eye's level of interest is highest, so mark ROI_Yp(x, y) = 3;
If it has only temporal visual-feature saliency and no spatial visual-feature saliency, i.e. T_Yp(x, y, V_xpn, V_ypn) = 1 and S_Yp(x, y, Mode_pn) = 0, the current coded macroblock has produced a large motion vector; according to the perceptual characteristics of the HVS, the human eye is highly sensitive to object motion, so the level of interest is second highest; mark ROI_Yp(x, y) = 2;
If the macroblock's motion is weak, so that it has no temporal visual-feature saliency, but it has rich texture information and thus only spatial visual-feature saliency, i.e. S_Yp(x, y, Mode_pn) = 1 and T_Yp(x, y, V_xpn, V_ypn) = 0, the level of interest is third; mark ROI_Yp(x, y) = 1;
If it has neither spatial nor temporal visual-feature saliency, i.e. S_Yp(x, y, Mode_pn) = 0 and T_Yp(x, y, V_xpn, V_ypn) = 0, the current coded macroblock is smooth in texture and slow-moving or static, typically a static background area; it is a non-interesting region for the human eye and the level of interest is lowest; mark ROI_Yp(x, y) = 0;
Wherein ROI_Yp(x, y) denotes the visual-interest priority of the current coded macroblock; T_Yp(x, y, V_xpn, V_ypn) denotes its temporal visual-feature saliency; S_Yp(x, y, Mode_pn) denotes its spatial visual-feature saliency; (x, y) is the position coordinate of the current coded macroblock; Y denotes the luminance component of the macroblock; p indexes the p-th inter-coded video frame; and n is the sequence number of the n-th coded macroblock in the current coded frame.
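As a sketch of this Step 7 case table, note that the four cases (S, T) = (1,1) -> 3, (0,1) -> 2, (1,0) -> 1, (0,0) -> 0 collapse to the single expression ROI = 2T + S:

```python
def roi_priority(s, t):
    """Combine spatial (S) and temporal (T) saliency flags into the
    four-level interest priority of Step 7. The case table
    (1,1)->3, (0,1)->2, (1,0)->1, (0,0)->0 is exactly 2*t + s."""
    return 2 * t + s

for s, t in [(1, 1), (0, 1), (1, 0), (0, 0)]:
    print(s, t, roi_priority(s, t))  # priorities 3, 2, 1, 0 respectively
```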
Step 8: output the video bitstream. Because the luminance component of a coded macroblock has the value range Y ∈ [0, 255], from 0 to 255 representing the 256 levels from completely black to completely white, the present invention sets the luminance component Y of every macroblock in the current p-th frame according to the marked interest priority ROI_Yp(x, y) as follows, and outputs the marked video stream:
If ROI_Yp(x, y) = 3, the level of interest and the visual attention are highest; the luminance component of this coded macroblock is set to 255, the highest output luminance value, i.e. Y_p(x, y) = 255;
If ROI_Yp(x, y) = 2, the level of interest is second highest and the visual attention is high; the luminance component of this coded macroblock is set to 150, a high output luminance value, i.e. Y_p(x, y) = 150;
If ROI_Yp(x, y) = 1, the level of interest is third and the visual attention is low; the luminance component of this coded macroblock is set to 100, a low output luminance value, i.e. Y_p(x, y) = 100;
If ROI_Yp(x, y) = 0, the macroblock lies in a non-interesting region and the visual attention is lowest; the luminance component of this coded macroblock is set to 0, the lowest output luminance value, i.e. Y_p(x, y) = 0.
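The Step 8 luminance mapping is a fixed four-entry table and can be sketched as:

```python
# ROI priority -> output luminance of the marked stream (Step 8).
ROI_LUMA = {3: 255, 2: 150, 1: 100, 0: 0}

def mark_luma(roi_level):
    """Replace the macroblock's luminance component Y with the value
    assigned to its interest priority."""
    return ROI_LUMA[roi_level]

print([mark_luma(r) for r in (3, 2, 1, 0)])  # [255, 150, 100, 0]
```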
Step 9: return to Step 3 and process the next frame, until the whole video sequence has been traversed.
Fig. 5 shows a flowchart of the video region-of-interest marking and extraction method.
Fig. 6 shows the marked region-of-interest output for exemplary video sequences.
Beneficial effect
The method achieves fast extraction of the video region of interest from basic coding information. It exploits the correlation between basic coding information and the visual region of interest of the human eye to identify the spatial and temporal visual-feature saliency regions in the coded video content, combines the two saliency marking results to define the region-of-interest priority, and finally extracts the region of interest automatically. The inventive method can provide an important coding basis for region-of-interest (ROI) based video coding techniques.
Embodiment
Since the human eye is more sensitive to luminance information than to chrominance information, the inventive method codes the luminance component of each video frame. The video sequence is first read in, its luminance component is extracted, and the region-of-interest extraction module of the present invention is invoked to complete automatic marking and extraction of the region of interest.
In an implementation of the invention, a video capture device (such as a digital camera) acquires the video images and transmits them to a computer, where the region of interest is automatically marked according to the coded information in the video bitstream. The spatial visual-feature saliency region is identified from the predictive coding mode of the current coded macroblock; the temporal visual-feature saliency region is then identified from its horizontal and vertical motion vectors, with a dynamic motion-vector decision threshold reducing the impact of different video motion types on extraction accuracy; finally, the region-of-interest classification result is obtained from the spatial and temporal visual-feature saliency, realizing automatic extraction of the region of interest.
In a concrete implementation, the following program is executed on the computer:
The first step: read in the video sequence according to the coding configuration file encoder.cfg and configure the encoder according to the parameters in that file. For example: GOP structure GOP = IPPP; number of coded frames FramesToBeEncoded = 100; frame rate FrameRate = 30 f/s; video width SourceWidth = 176 and height SourceHeight = 144; output file name OutputFile = ROI.264; quantization step values QPISlice = 28 and QPPSlice = 28; motion estimation search range SearchRange = ±16; number of reference frames NumberReferenceFrames = 5; rate-distortion cost function RDOptimization = on; entropy coding type SymbolMode = CAVLC; and initialize the parameters L = number of coded frames and p = 1;
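For illustration only, the configuration values listed in this step could be collected in a simple dictionary; the key names mirror the encoder.cfg parameters in the text, and this is a hypothetical stand-in, not a real configuration parser:

```python
# Hypothetical mirror of the encoder.cfg parameters from the first step.
encoder_cfg = {
    "GOP": "IPPP",
    "FramesToBeEncoded": 100,
    "FrameRate": 30,            # frames per second
    "SourceWidth": 176,
    "SourceHeight": 144,
    "OutputFile": "ROI.264",
    "QPISlice": 28,
    "QPPSlice": 28,
    "SearchRange": 16,          # +/- 16 pixels
    "NumberReferenceFrames": 5,
    "RDOptimization": True,
    "SymbolMode": "CAVLC",
}

L = encoder_cfg["FramesToBeEncoded"]  # initialize L = number of coded frames
p = 1                                 # initialize the frame index
print(L, p)  # 100 1
```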
Second step: read the luminance component values Y of the coded macroblocks frame by frame, in order, from the input video sequence;
The third step: apply intra-frame predictive coding to the first frame (the I frame) of the video sequence;
The fourth step: apply inter-frame predictive coding to the current p-th frame and record the inter-frame prediction mode type Mode_pn of each coded macroblock; wherein p = 1, 2, 3, ..., L-1 indexes the p-th inter-coded video frame, L is the total number of frames coded in the whole video sequence, and n is the sequence number of the n-th coded macroblock in the current coded frame.
The fifth step: identify the spatial visual-feature saliency region. If the inter-frame prediction mode Mode_pn of the current coded macroblock belongs to the sub-split mode set or the intra prediction mode set, i.e. Mode_pn ∈ {8x8, 8x4, 4x8, 4x4} or {Intra16x16, Intra4x4}, mark the macroblock as S_Yp(x, y, Mode_pn) = 1, belonging to the spatial visual-feature saliency region; otherwise mark S_Yp(x, y, Mode_pn) = 0;
The sixth step: if p ≠ 1, record, for each coded macroblock in the p-th frame, the horizontal motion vector V_xpn and the vertical motion vector V_ypn, and compute the average motion vector of all coded macroblocks in the previous coded frame in the horizontal direction, V̄_x(p-1), and in the vertical direction, V̄_y(p-1); otherwise jump to the tenth step;
The seventh step: identify the temporal visual-feature saliency region. If the horizontal motion vector V_xpn of the current coded macroblock is greater than the mean horizontal motion vector V̄_x(p-1) of the previous frame's coded macroblocks, or its vertical motion vector V_ypn is greater than the mean vertical motion vector V̄_y(p-1) of the previous frame's coded macroblocks, i.e. either criterion is met, the macroblock belongs to the temporal visual-feature saliency region and is marked T_Yp(x, y, V_xpn, V_ypn) = 1; otherwise mark T_Yp(x, y, V_xpn, V_ypn) = 0;
The eighth step: mark the video region of interest.
If the current coded macroblock has both spatial and temporal visual-feature saliency, i.e. S_Yp(x, y, Mode_pn) = 1 and T_Yp(x, y, V_xpn, V_ypn) = 1, the human eye's level of interest is highest; mark ROI_Yp(x, y) = 3;
If it has only temporal visual-feature saliency, i.e. T_Yp(x, y, V_xpn, V_ypn) = 1 and S_Yp(x, y, Mode_pn) = 0, the level of interest is second highest; mark ROI_Yp(x, y) = 2;
If it has only spatial visual-feature saliency, i.e. S_Yp(x, y, Mode_pn) = 1 and T_Yp(x, y, V_xpn, V_ypn) = 0, the level of interest is third; mark ROI_Yp(x, y) = 1;
If it has neither spatial nor temporal visual-feature saliency, i.e. S_Yp(x, y, Mode_pn) = 0 and T_Yp(x, y, V_xpn, V_ypn) = 0, it is a non-interesting region for the human eye; mark ROI_Yp(x, y) = 0;
The ninth step: output the video bitstream.
If ROI_Yp(x, y) = 3, the level of interest and the visual attention are highest; set the luminance component of this coded macroblock to 255, the highest output luminance value, i.e. Y_p(x, y) = 255;
If ROI_Yp(x, y) = 2, the level of interest is second highest and the visual attention is high; set the luminance component of this coded macroblock to 150, i.e. Y_p(x, y) = 150;
If ROI_Yp(x, y) = 1, the level of interest is third and the visual attention is low; set the luminance component of this coded macroblock to 100, i.e. Y_p(x, y) = 100;
If ROI_Yp(x, y) = 0, the macroblock lies in a non-interesting region and the visual attention is lowest; set the luminance component of this coded macroblock to 0, the lowest output luminance value, i.e. Y_p(x, y) = 0.
The tenth step: if p ≠ L-1, set p = p + 1 and jump to the fourth step (inter-frame coding of the next P frame); otherwise, end coding.
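As an illustrative, self-contained sketch of the marking loop of the fourth through ninth steps (not the encoder itself): the mode names are stand-in strings, each macroblock is represented by a hypothetical (mode, Vx, Vy) triple, magnitude comparisons against the previous frame's means are assumed, and the first P frame, having no previous-frame threshold, gets T = 0:

```python
SUB_SPLIT = {"8x8", "8x4", "4x8", "4x4"}
INTRA = {"Intra16x16", "Intra4x4"}

def label_sequence(frames):
    """frames: list of coded P frames, each a list of (mode, vx, vy) per
    macroblock. Returns the per-frame ROI priorities 0..3."""
    labels, prev_thr = [], None
    for frame in frames:
        out = []
        for mode, vx, vy in frame:
            s = 1 if mode in SUB_SPLIT or mode in INTRA else 0  # spatial flag
            t = 0  # temporal flag; 0 when no previous frame exists
            if prev_thr is not None:
                t = 1 if abs(vx) > prev_thr[0] or abs(vy) > prev_thr[1] else 0
            out.append(2 * t + s)  # ROI priority
        labels.append(out)
        num = len(frame)  # dynamic thresholds: this frame's mean magnitudes
        prev_thr = (sum(abs(v[1]) for v in frame) / num,
                    sum(abs(v[2]) for v in frame) / num)
    return labels

frames = [[("Skip", 0, 0), ("8x8", 2, 0)],
          [("16x16", 3, 0), ("4x4", 0, 0)]]
print(label_sequence(frames))  # [[0, 1], [2, 1]]
```

The marked output stream would then replace each macroblock's luminance with the value assigned to its priority, as in the ninth step.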
Fig. 6 shows example output of marking the video region of interest with the inventive method. Taking a typical video surveillance sequence (Hall) and an indoor activity sequence (Salesman) as examples, the region of interest is marked using the motion vector distribution and the inter-frame prediction mode selection results: the higher the human eye's interest in a macroblock, the higher the luminance value at that position in the output video, and vice versa. From the marking results in the rightmost column of Fig. 6 it can be seen that the shape of the region of interest obtained by the inventive method is irregular; compared with the region of interest obtained by a traditional moving-object detection method using a fixed-shape template, the marking result of the inventive method is closer to the shape of the target the human eye actually attends to and marks the region of interest more accurately.
The inventive method can also be combined with other fast coding techniques to reduce the coding complexity of the background regions the human eye is not interested in, while guaranteeing the coding quality of the region of interest, thereby further reducing the coding time. It can likewise be used in H.264-based scalable coding to realize selective enhancement coding of the region of interest.