CN105592315A - Video characteristic redundant information compression method and system based on video space-time attribute - Google Patents

Video characteristic redundant information compression method and system based on video space-time attribute

Info

Publication number
CN105592315A
Authority
CN
China
Prior art keywords
video
sift
feature
compression
representative frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510953472.3A
Other languages
Chinese (zh)
Inventor
朱映映
江传华
钟圣华
黄小燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN201510953472.3A priority Critical patent/CN105592315A/en
Publication of CN105592315A publication Critical patent/CN105592315A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/85Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of video data compression and provides a video feature redundant information compression method based on video spatio-temporal attributes. The method comprises the steps of: extracting video representative frames from a video source file; extracting a Shorter-SIFT (S-SIFT) feature from each video representative frame in order to perform video spatial-domain compression; and further extracting a TC-S-SIFT feature from each video representative frame on the basis of the S-SIFT feature in order to perform video temporal-domain compression. The invention also provides a video feature redundant information compression system. The method and the system effectively compress video feature redundant information on the basis of the temporal domain and the spatial domain while maintaining the basic robustness and distinguishability of the video features.

Description

Video feature redundant information compression method and system based on video spatio-temporal attributes
Technical field
The present invention relates to the field of video data compression, and in particular to a video feature redundant information compression method and system based on video spatio-temporal attributes.
Background technology
With the rapid development of multimedia and Internet technology, the number of online videos on the Internet is growing explosively. The sharply increasing video data must be managed and organized effectively so that it can be better analyzed and exploited, for example in video data retrieval in digital libraries, video search on the Internet, lookup of a particular film in video-on-demand, and video intelligence analysis in certain fields. Retrieving or classifying videos, however, is a tedious and time-consuming job, because in raw video data the largest granularity is the whole file and the smallest granularity is a single frame of the media stream, without any intermediate structure. A feature description that can represent the video information is therefore needed to make video classification and retrieval more accurate and convenient.
Depending on its duration (from a few seconds to several hours) and its frame rate (from a dozen to dozens of frames per second), each video file may contain hundreds to tens of thousands of video frames, and every frame is a picture carrying a huge amount of information.
A popular current practice is to use the shot as the processing granularity of a video and to represent a shot by one or several representative frames. Scholars have proposed a large number of shot segmentation and representative-frame extraction algorithms. These algorithms have achieved certain results for different situations and have been applied in various video processing tasks, but most of them suffer from the following defects: 1) they lack generality and are only suitable for particular videos; 2) their robustness is insufficient, so that shots cannot be segmented accurately after the video has undergone some editing and processing; 3) the representative frames cannot fully express the content of a shot, which may cause information loss. These defects make the compression quality of video unsatisfactory, and a large amount of redundant information often remains.
Therefore, a brand-new video feature redundant information compression method is urgently needed.
Summary of the invention
In view of this, an object of the embodiments of the present invention is to provide a video feature redundant information compression method and system, intended to solve the prior-art problems that video compression is not robust enough and that a large amount of redundant information remains.
The embodiments of the present invention are implemented as a video feature redundant information compression method, comprising:
extracting video representative frames from a video source file;
extracting a Shorter-SIFT (S-SIFT) feature from each video representative frame, so as to perform video spatial-domain compression;
further extracting a TC-S-SIFT feature from each video representative frame based on the S-SIFT feature, so as to perform video temporal-domain compression.
Preferably, the step of extracting a Shorter-SIFT (S-SIFT) feature from each video representative frame so as to perform video spatial-domain compression comprises:
detecting and locating the key points in each video representative frame;
assigning a direction to each key point in each video representative frame;
generating key point descriptors.
Preferably, the step of generating key point descriptors specifically comprises:
rotating the coordinate axes to the principal direction of the key point, so as to guarantee rotation invariance;
calculating the gradient orientation histograms in all directions on the small blocks in the vertical direction of each key point neighborhood, so as to form a multi-dimensional descriptor.
Preferably, the step of further extracting a TC-S-SIFT feature from each video representative frame based on the S-SIFT feature so as to perform video temporal-domain compression specifically comprises:
obtaining the multi-dimensional descriptors in each representative frame based on the Shorter-SIFT (S-SIFT) features extracted from each video representative frame;
tracking the feature points carrying S-SIFT features in each video representative frame frame by frame to obtain S-SIFT feature point trajectories, and arranging a series of similar S-SIFT features in time order to form the temporal trajectory of the TC-S-SIFT feature;
calculating the mean of all S-SIFT descriptors in each temporal trajectory as the descriptor of the TC-S-SIFT feature.
In another aspect, the present invention also provides a video feature redundant information compression system, comprising:
an extraction module for extracting video representative frames from a video source file;
a first compression module for extracting a Shorter-SIFT (S-SIFT) feature from each video representative frame, so as to perform video spatial-domain compression;
a second compression module for further extracting a TC-S-SIFT feature from each video representative frame based on the S-SIFT feature, so as to perform video temporal-domain compression.
Preferably, the first compression module specifically comprises:
a detection submodule for detecting and locating the key points in each video representative frame;
an assignment submodule for assigning a direction to each key point in each video representative frame;
a generation submodule for generating key point descriptors.
Preferably, the generation submodule specifically comprises:
a rotation submodule for rotating the coordinate axes to the principal direction of the key point, so as to guarantee rotation invariance;
a calculation submodule for calculating the gradient orientation histograms in all directions on the small blocks in the vertical direction of each key point neighborhood, so as to form a multi-dimensional descriptor.
Preferably, the second compression module specifically comprises:
a descriptor submodule for obtaining the multi-dimensional descriptors in each representative frame based on the Shorter-SIFT (S-SIFT) features extracted from each video representative frame;
an arrangement submodule for tracking the feature points carrying S-SIFT features in each video representative frame frame by frame to obtain S-SIFT feature point trajectories, and arranging a series of similar S-SIFT features in time order to form the temporal trajectory of the TC-S-SIFT feature;
a mean submodule for calculating the mean of all S-SIFT descriptors in each temporal trajectory as the descriptor of the TC-S-SIFT feature.
The technical solution of the present invention can effectively compress video feature redundant information based on the temporal domain and the spatial domain, while keeping the basic robustness and distinguishability of the video features.
Brief description of the drawings
Fig. 1 is a flow chart of the video feature redundant information compression method in an embodiment of the present invention;
Fig. 2 is a detailed sub-step flow chart of step S12 shown in Fig. 1 in an embodiment of the present invention;
Fig. 3 is a detailed sub-step flow chart of step S123 shown in Fig. 2 in an embodiment of the present invention;
Fig. 4 is a detailed sub-step flow chart of step S13 shown in Fig. 1 in an embodiment of the present invention;
Fig. 5 is a structural schematic diagram of the video feature redundant information compression system in an embodiment of the present invention;
Fig. 6 is a schematic diagram of the internal structure of the first compression module 12 shown in Fig. 5 in an embodiment of the present invention;
Fig. 7 is a schematic diagram of the internal structure of the generation submodule 123 shown in Fig. 6 in an embodiment of the present invention;
Fig. 8 is a schematic diagram of the internal structure of the second compression module 13 shown in Fig. 5 in an embodiment of the present invention.
Detailed description of the invention
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further elaborated below in conjunction with the drawings and embodiments. It should be appreciated that the specific embodiments described herein are only intended to explain the present invention and are not intended to limit the present invention.
The specific embodiments of the present invention provide a video feature redundant information compression method, mainly comprising the following steps:
S11, extracting video representative frames from a video source file;
S12, extracting a Shorter-SIFT (S-SIFT) feature from each video representative frame, so as to perform video spatial-domain compression;
S13, further extracting a TC-S-SIFT feature from each video representative frame based on the S-SIFT feature, so as to perform video temporal-domain compression.
The video feature redundant information compression method provided by the present invention can effectively compress video feature redundant information based on the temporal domain and the spatial domain, while keeping the basic robustness and distinguishability of the video features.
The video feature redundant information compression method provided by the present invention is elaborated below.
Referring to Fig. 1, which is a flow chart of the video feature redundant information compression method in an embodiment of the present invention.
In step S11, video representative frames are extracted from a video source file.
In this embodiment, besides uniform sampling, the video representative frames may also be extracted with video shot segmentation techniques; other extraction modes certainly exist and are not limited here. In this embodiment, in view of the characteristics of human vision (the human eye watches video at 12 to 20 frames per second), the solution of the present invention samples video representative frames at 15 frames per second.
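A minimal sketch of the uniform-sampling mode, assuming OpenCV is available; the function name and the fallback when the container reports no frame rate are illustrative choices, not part of the patent:

```python
import cv2

def extract_representative_frames(path, target_fps=15.0):
    """Uniformly sample a video at roughly target_fps frames per second."""
    cap = cv2.VideoCapture(path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps  # fall back if unknown
    step = max(1, round(src_fps / target_fps))         # keep every step-th frame
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```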
In step S12, a Shorter-SIFT (S-SIFT) feature is extracted from each video representative frame, so as to perform video spatial-domain compression.
In this embodiment, the scale-invariant feature transform (SIFT) has good robustness and distinguishability, is widely used in image and video retrieval, and achieves good results. Depending on its content and size, each image can yield from hundreds to thousands of SIFT features, and a SIFT feature has 128 dimensions in the spatial domain. If SIFT features were extracted from all the representative frames of a video, the resulting volume of feature data would be staggering, and the 128 dimensions of the SIFT feature would also bring the curse of dimensionality, making processing very complex and time-consuming. Therefore the present invention does not extract SIFT features from the video representative frames, but extracts Shorter-SIFT (S-SIFT) features instead.
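For reference, a minimal sketch of extracting the standard 128-dimensional SIFT baseline with OpenCV; OpenCV ships no S-SIFT, since the patent's variant modifies the internal steps described next:

```python
import cv2

def extract_sift(frame):
    """Standard 128-dimensional SIFT on one representative frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    return keypoints, descriptors   # descriptors: (n, 128) float32 array
```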
In this embodiment, the extraction of the S-SIFT feature differs from that of the SIFT feature: it effectively exploits a property of human vision (namely the non-uniformity of human sensitivity across visual orientations) and, in every extraction step of the standard SIFT feature, discards the directions to which human vision is insensitive, thereby effectively reducing the SIFT feature from 128 to 96 dimensions in the spatial domain while keeping the characteristics of the SIFT feature. The S-SIFT feature extraction procedure discards the visually insensitive oblique directions in every step.
In this embodiment, in order to verify that S-SIFT keeps the robustness of SIFT, image matching is carried out with both SIFT and S-SIFT; Table 1 below lists the correct match counts of SIFT and S-SIFT.
Table 1: correct match counts of SIFT and S-SIFT
Here Zoom denotes image magnification. The experimental results show that S-SIFT obtains more correct matches than SIFT in most cases, and its robustness is thus well guaranteed.
In this embodiment, step S12 of extracting a Shorter-SIFT (S-SIFT) feature from each video representative frame so as to perform video spatial-domain compression further comprises three sub-steps S121-S123, as shown in Fig. 2.
Referring to Fig. 2, which is a detailed sub-step flow chart of step S12 shown in Fig. 1 in an embodiment of the present invention.
In step S121, the key points in each video representative frame are detected and located.
In this embodiment, key point detection and localization find points that remain stable under measurement at different scales. A scale space is constructed to simulate the multi-scale features of the image data, and the scale space satisfies visual invariance. When we observe an object with our eyes, on the one hand, as the illumination of the background changes, the brightness level and contrast of the image perceived by the retina differ; the analysis of the image by the scale-space operator is therefore required to be unaffected by changes in the grey level and contrast of the image, i.e. to satisfy grey-level invariance and contrast invariance. On the other hand, with respect to a fixed coordinate system, when the relative position between the observer and the object changes, the position, size, angle and shape of the image perceived by the retina differ; the analysis of the image by the scale-space operator is therefore required to be independent of the position, size, angle and affine transformation of the image, i.e. to satisfy translation invariance, scale invariance and affine invariance.
In this embodiment, the scale space L(x, y, σ) of an image is defined as the convolution of a variable-scale Gaussian function G(x, y, σ) with the original image I(x, y):

$$L(x,y,\sigma)=G(x,y,\sigma)*I(x,y)\qquad(1)$$

where $G(x,y,\sigma)=\frac{1}{2\pi\sigma^{2}}e^{-(x^{2}+y^{2})/2\sigma^{2}}$, (x, y) are the spatial coordinates and σ is the scale coordinate. The magnitude of σ determines the degree of smoothing of the image: a large scale corresponds to the general outline of the image and a small scale to its fine details, i.e. a large σ corresponds to a coarse scale (low resolution) and a small σ to a fine scale (high resolution). In order to detect stable key points effectively in the scale space, the difference-of-Gaussian (DoG) scale space is introduced, generated by convolving difference-of-Gaussian kernels of different scales with the image:
$$DoG(x,y,\sigma)=\big(G(x,y,k\sigma)-G(x,y,\sigma)\big)*I(x,y)=L(x,y,k\sigma)-L(x,y,\sigma)\qquad(2)$$
In this embodiment, after the difference-of-Gaussian scale space has been built, extreme points are searched in the scale space to detect key points: each sample point is compared with all of its neighbors to see whether it is larger or smaller than its neighbors in the image domain and the scale domain. The standard SIFT algorithm compares the middle test point with its 8 neighbors at the same scale and the 9 × 2 points corresponding to the two adjacent scales, 26 points in total, to ensure that extreme points are detected in both the scale space and the two-dimensional image space: if a point is the maximum or minimum within the 26-point neighborhood spanning this DoG layer and the layers above and below, it is taken as a feature point of the image at this scale. The S-SIFT algorithm provided by the invention, however, compares the test point with its 4 neighbors in the horizontal and vertical directions at the same scale and the 5 × 2 corresponding points at the two adjacent scales, 14 points in total; if the point is the maximum or minimum within this 14-point neighborhood, it is taken as a feature point of the image.
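A sketch of the S-SIFT extremum test under one assumed reading of the 14-point neighborhood: the 4 in-scale neighbors are taken to be the horizontal and vertical ones (consistent with S-SIFT discarding oblique directions), plus the center and those 4 neighbors at each adjacent scale; the exact neighbor layout is not spelled out in the text:

```python
import numpy as np

def is_s_sift_extremum(dog, s, y, x):
    """dog: 3-D array (scale, row, col) of DoG responses; returns True when
    dog[s, y, x] is the max or min of its assumed 14-point neighborhood."""
    v = dog[s, y, x]
    cross = ((-1, 0), (1, 0), (0, -1), (0, 1))    # horizontal/vertical offsets
    neighbors = [dog[s, y + dy, x + dx] for dy, dx in cross]  # 4 in-scale
    for ds in (-1, 1):                            # 5 points per adjacent scale
        neighbors.append(dog[s + ds, y, x])
        neighbors.extend(dog[s + ds, y + dy, x + dx] for dy, dx in cross)
    neighbors = np.array(neighbors)               # 14 values in total
    return v > neighbors.max() or v < neighbors.min()
```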
In step S122, a direction is assigned to each key point in each video representative frame.
In this embodiment, the feature points in each image have been determined in the previous step. A direction is computed for every feature point and further computation is performed according to this direction: using the gradient direction distribution of the pixels in the key point neighborhood, a direction parameter is assigned to each key point, which makes the operator rotation-invariant. The gradient histogram of each key point is computed with the following formulas:
$$m(x,y;s)=\sqrt{\big(L(x+1,y;s)-L(x-1,y;s)\big)^{2}+\big(L(x,y+1;s)-L(x,y-1;s)\big)^{2}}\qquad(3)$$

$$\theta(x,y;s)=\tan^{-1}\!\left(\frac{L(x,y+1;s)-L(x,y-1;s)}{L(x+1,y;s)-L(x-1,y;s)}\right)\qquad(4)$$
Formulas (3) and (4) give the magnitude and the direction of the gradient at (x, y). In the standard SIFT algorithm the gradient histogram ranges over 0 to 360 degrees with one bin per 10 degrees, 36 bins in total, and the peak direction of the histogram represents the principal direction of the key point. In the S-SIFT algorithm provided by the invention, however, the oblique gradient directions are ignored; with one bin per 10 degrees, there are 24 bins in total.
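A sketch of equations (3)-(4) and of the 24-bin orientation histogram; since the text only says that oblique gradient directions are ignored, the choice below of which twelve 10-degree bins count as oblique (30-degree sectors centred on the diagonals) is an assumption:

```python
import numpy as np

def gradient(L, y, x):
    """Gradient magnitude (eq. 3) and direction in degrees (eq. 4) at (x, y)
    of the Gaussian-smoothed image L (a 2-D float array)."""
    dx = L[y, x + 1] - L[y, x - 1]
    dy = L[y + 1, x] - L[y - 1, x]
    return np.hypot(dx, dy), np.degrees(np.arctan2(dy, dx)) % 360.0

def orientation_histogram(L, points):
    """24-bin S-SIFT histogram: 36 ten-degree bins minus 12 assumed-oblique ones."""
    bins = np.zeros(36)
    for y, x in points:
        m, theta = gradient(L, y, x)
        bins[int(theta // 10) % 36] += m          # magnitude-weighted voting
    keep = [b for b in range(36) if not (30 <= (b * 10 + 5) % 90 < 60)]
    return bins[keep]                             # peak bin = principal direction
```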
In this embodiment, in order to verify that considering only the principal directions when assigning directions to the key points in each video representative frame is feasible, the proportion of the principal directions among the key points is computed in each frame image, as shown in Table 2 below.
Table 2: proportions of the principal directions of the key points
Here Zoom denotes image magnification, Cardinal the principal directions, and Oblique the tilted directions. The experimental results show that the proportion taken by the principal directions of the key points is very large, so the impact of ignoring the oblique directions is rather small.
In step S123, key point descriptors are generated.
In this embodiment, step S123 of generating key point descriptors specifically comprises two sub-steps S1231-S1232, as shown in Fig. 3.
Referring to Fig. 3, which is a detailed sub-step flow chart of step S123 shown in Fig. 2 in an embodiment of the present invention.
In step S1231, the coordinate axes are rotated to the principal direction of the key point, so as to guarantee rotation invariance.
In this embodiment, to generate a key point descriptor the coordinate axes are first rotated to the principal direction of the key point, so as to guarantee rotation invariance.
In step S1232, the gradient orientation histograms in all directions are calculated on the small blocks in the vertical direction of each key point neighborhood, so as to form a multi-dimensional descriptor.
In this embodiment, the standard SIFT algorithm calculates the gradient orientation histograms in 8 directions on the 4*4 blocks around each key point, forming a 4*4*8=128-dimensional descriptor. The S-SIFT algorithm provided by the invention, however, calculates the gradient orientation histograms in 8 directions on the 3*4 blocks in the vertical direction of each key point neighborhood, forming a 3*4*8=96-dimensional descriptor.
In this embodiment, the S-SIFT feature has a lower dimension than standard SIFT while keeping characteristics such as rotation invariance, and the dimension reduction makes the S-SIFT feature faster than the standard SIFT feature in feature matching.
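A sketch of the 96-dimensional descriptor layout (3*4 cells of 8-bin histograms); the cell size, which row of cells is dropped relative to SIFT's 4*4 grid, and the normalization are assumptions for illustration:

```python
import numpy as np

def s_sift_descriptor(mag, theta, cy, cx, cell=4):
    """mag, theta: per-pixel gradient magnitude and direction (degrees),
    already rotated to the keypoint's principal direction; (cy, cx) is the
    keypoint. 3*4 cells of 8-bin histograms -> 96 dimensions."""
    desc = []
    for gy in range(3):                    # 3 cell rows (standard SIFT uses 4)
        for gx in range(4):                # 4 cell columns
            y0, x0 = cy - 6 + gy * cell, cx - 8 + gx * cell
            hist = np.zeros(8)
            for y in range(y0, y0 + cell):
                for x in range(x0, x0 + cell):
                    hist[int(theta[y, x] // 45) % 8] += mag[y, x]
            desc.extend(hist)
    d = np.array(desc)                     # 3 * 4 * 8 = 96 values
    n = np.linalg.norm(d)
    return d / n if n else d               # L2-normalised, as in SIFT
```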
Please continue to refer to Fig. 1. In step S13, a TC-S-SIFT feature is further extracted from each video representative frame based on the S-SIFT feature, so as to perform video temporal-domain compression.
In this embodiment, the TC-S-SIFT (Temporal-Compress and Shorter-SIFT) feature consists of two parts: a temporal trajectory and a descriptor. The representative frames of a video form a sequence in the temporal domain, and a temporal trajectory refers to the S-SIFT key point track obtained by tracking an S-SIFT feature point on the representative frames frame by frame. A temporal trajectory describes the duration and the appearance/disappearance positions of a piece of visual content in the temporal domain, and is formed by arranging a series of similar S-SIFT features in time order. The descriptor of TC-S-SIFT refers to the mean of all S-SIFT descriptors on the temporal trajectory, which amounts to a condensation of a certain piece of visual content over a period of time. In general, retrieving with TC-S-SIFT as the video feature means retrieving with the descriptor of TC-S-SIFT, while the temporal trajectory serves as reference information for locating video segments. Therefore TC-S-SIFT not only compresses the visual content in the temporal domain, removing redundant information and greatly reducing the number of features, but also preserves the robustness and distinguishability of the S-SIFT feature, and additionally records the temporal characteristics of the video.
In this embodiment, TC-S-SIFT extraction is carried out on the basis of tracking the S-SIFT descriptors of the video representative frames. Tracking an S-SIFT descriptor is exactly a process of S-SIFT feature matching: if the ratio of the distance between an S-SIFT descriptor and its second nearest neighbor to the distance between it and its nearest neighbor is greater than a given threshold, the S-SIFT descriptor matches its nearest-neighbor S-SIFT descriptor. The matching rule is defined as follows:
$$\frac{dist(d,d_{2})}{dist(d,d_{1})}\ \ge\ \delta,\qquad d\in D_{set1};\ d_{1},d_{2}\in D_{set2};\ d_{1}\ne d_{2}\qquad(5)$$
where $d_1$ and $d_2$ are respectively the nearest-neighbor and second-nearest-neighbor S-SIFT descriptors of d in $D_{set2}$, dist denotes the Euclidean distance between two S-SIFT descriptors, and δ denotes the ratio threshold. During S-SIFT descriptor tracking, because the visual content changes over time, or because of noise or camera motion, some S-SIFT descriptors disappear during tracking and reappear t frames later, so that one trajectory is broken into several; in this situation, the trajectories broken by the brief disappearance and reappearance of feature points need to be merged into one. A temporal trajectory, however, should not be too long, as matching errors then occur easily. To prevent such matching errors from propagating along the trajectory and accumulating, the present invention stipulates that the length of an S-SIFT descriptor trajectory must not exceed a certain number of representative frames; if the trajectory length is greater than a certain threshold, the trajectory is split until the lengths of all sub-trajectories are less than or equal to the threshold.
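A sketch of the matching rule of equation (5) over raw descriptor arrays; the text does not give a value for the threshold δ, so the default below is an assumption:

```python
import numpy as np

def match_descriptors(set1, set2, delta=1.5):
    """set1, set2: (n, 96) arrays of S-SIFT descriptors from two frames
    (set2 must hold at least two descriptors). Returns the index pairs
    (i, j) satisfying eq. (5); delta is an assumed value."""
    matches = []
    for i, d in enumerate(set1):
        dist = np.linalg.norm(set2 - d, axis=1)   # Euclidean, as in the text
        j1, j2 = np.argsort(dist)[:2]             # nearest, second nearest
        if dist[j2] >= delta * dist[j1]:          # dist(d,d2)/dist(d,d1) >= delta
            matches.append((i, int(j1)))
    return matches
```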
In this embodiment, step S13 of further extracting a TC-S-SIFT feature from each video representative frame based on the S-SIFT feature so as to perform video temporal-domain compression further comprises three sub-steps S131-S133, as shown in Fig. 4.
Referring to Fig. 4, which is a detailed sub-step flow chart of step S13 shown in Fig. 1 in an embodiment of the present invention.
In step S131, the multi-dimensional descriptors in each representative frame are obtained based on the Shorter-SIFT (S-SIFT) features extracted from each video representative frame.
In this embodiment, the S-SIFT feature of each representative frame is extracted with the S-SIFT algorithm described above. An S-SIFT feature consists of two parts: 4 dimensions of weak geometric information (namely X, Y, Scale, Orientation) and a 96-dimensional descriptor. In this embodiment, the weak geometric information of S-SIFT is ignored and only the 96-dimensional descriptor is used.
In step S132, the feature points carrying S-SIFT features in each video representative frame are tracked frame by frame to obtain S-SIFT feature point trajectories, and a series of similar S-SIFT features are arranged in time order to form the temporal trajectory of the TC-S-SIFT feature.
In this embodiment, starting from the first representative frame, each S-SIFT descriptor is tracked frame by frame, thereby forming a temporal trajectory. Suppose a temporal trajectory is expressed as $Track=\{d_{f_s},\dots,d_{f_e}\}$, where $f_i$ denotes a representative frame that the tracked S-SIFT descriptor passes through and $d_{f_i}$ is the S-SIFT feature descriptor in that frame.
In this embodiment, for any two trajectories $Track_1=\{d_{f_s},\dots,d_{f_e}\}$ and $Track_2=\{d_{f_b},\dots,d_{f_o}\}$, if the S-SIFT descriptors $d_{f_e}$ and $d_{f_b}$ match and $t=f_b-f_e<\theta_1$, the two trajectories are merged into $Track=\{d_{f_s},\dots,d_{f_e},d_{f_b},\dots,d_{f_o}\}$, where $\theta_1=3$.
In this embodiment, for any $Track=\{d_{f_s},\dots,d_{f_e}\}$ whose length $n>\theta_2$, let the matching distances of adjacent S-SIFT descriptors in the trajectory be $Dist=\{d_1,d_2,\dots,d_i,\dots,d_{n-1}\}$; if $\max(Dist)=d_i$, the trajectory is split into two trajectories between $f_i$ and $f_{i+1}$. The operation is repeated until the lengths of all trajectories are no greater than $\theta_2$, where $\theta_2$ is taken as 15.
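A sketch of the merge and split rules with θ1 = 3 and θ2 = 15 as in the text; the data layout (parallel lists of descriptor arrays and frame indices) is an illustrative choice, and the endpoint-descriptor match required before merging is assumed to have been established via equation (5):

```python
import numpy as np

THETA1, THETA2 = 3, 15   # merge gap and maximum trajectory length from the text

def try_merge(track1, frames1, track2, frames2):
    """track*: lists of descriptor arrays; frames*: their frame indices.
    Merge when track2 resumes after a gap shorter than THETA1 frames; the
    endpoint descriptors are assumed already matched via eq. (5)."""
    if 0 < frames2[0] - frames1[-1] < THETA1:
        return track1 + track2, frames1 + frames2
    return None

def split_track(track, frames):
    """Recursively split at the weakest adjacent match until length <= THETA2."""
    if len(track) <= THETA2:
        return [(track, frames)]
    gaps = [np.linalg.norm(track[i + 1] - track[i]) for i in range(len(track) - 1)]
    i = int(np.argmax(gaps))
    return (split_track(track[:i + 1], frames[:i + 1]) +
            split_track(track[i + 1:], frames[i + 1:]))
```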
In step S133, the mean of all S-SIFT descriptors in each temporal trajectory is calculated as the descriptor of the TC-S-SIFT feature.
In this embodiment, the TC-S-SIFT descriptor is calculated as the mean of all S-SIFT descriptors in each temporal trajectory, as follows:
$$TC\text{-}S\text{-}SIFT_{d}=\hat{d}=n^{-1}\sum_{f=f_{s}}^{f_{e}}d_{f}\qquad(6)$$
where $d_f$ denotes the S-SIFT descriptor appearing in the f-th representative frame of the temporal trajectory, $f_s$ and $f_e$ indicate that the temporal trajectory starts at the $f_s$-th representative frame and ends at the $f_e$-th representative frame, and n is the trajectory length.
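Equation (6) as a one-line sketch:

```python
import numpy as np

def tc_s_sift_descriptor(track):
    """track: (n, 96) array of the S-SIFT descriptors d_f along one trajectory;
    returns the TC-S-SIFT descriptor, the element-wise mean over the track."""
    return np.asarray(track).mean(axis=0)
```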
In this embodiment, the SIFT feature obtains good results in image retrieval and object recognition, but video is an object that exists simultaneously in the temporal domain and the spatial domain, and the SIFT feature cannot reflect the content of video from both domains. The TC-S-SIFT proposed here is based on S-SIFT tracking and takes the mean of all S-SIFT descriptors in a trajectory as the TC-S-SIFT descriptor; it therefore retains both the robustness of the SIFT feature (invariance to scaling, rotation and brightness changes) and its distinguishability, while compressing the redundant information of the visual content in the temporal domain and greatly reducing the number of features. The TC-S-SIFT feature reflects both the spatial content characteristics and the temporal information of the video and is thus better suited to video content analysis and video retrieval applications.
The video feature redundant information compression method provided by the present invention can effectively compress video feature redundant information based on the temporal domain and the spatial domain, while keeping the basic robustness and distinguishability of the video features. With the proposed video TC-S-SIFT feature, redundant video feature information is effectively compressed, and the feature can be applied to video retrieval: the TC-S-SIFT features of videos are matched to retrieve the videos that match a target video.
With the video feature redundant information compression method provided by the present invention, the temporal trajectories of the video features can be saved when the video TC-S-SIFT features are extracted; with this temporal trajectory information, the method can be applied to near-duplicate video localization. The number of copied videos on the network is multiplying; the TC-S-SIFT features of a copy video are extracted and matched against the TC-S-SIFT features of the source video to find the matching features, and the temporal trajectory information of these features is used to locate the representative-frame positions where the features appear and disappear, thereby obtaining an accurate localization of the copied video.
The specific embodiments of the present invention also provide a video feature redundant information compression system 10, mainly comprising:
an extraction module 11 for extracting video representative frames from a video source file;
a first compression module 12 for extracting a Shorter-SIFT (S-SIFT) feature from each video representative frame, so as to perform video spatial-domain compression;
a second compression module 13 for further extracting a TC-S-SIFT feature from each video representative frame based on the S-SIFT feature, so as to perform video temporal-domain compression.
The video feature redundant information compression system 10 provided by the present invention can effectively compress video feature redundant information based on the temporal domain and the spatial domain, while keeping the basic robustness and distinguishability of the video features.
Referring to Fig. 5, which is a structural schematic diagram of the video feature redundant information compression system 10 in an embodiment of the present invention. In this embodiment, the video feature redundant information compression system 10 comprises an extraction module 11, a first compression module 12 and a second compression module 13.
The extraction module 11 is used for extracting video representative frames from a video source file.
In this embodiment, besides uniform sampling, the video representative frames may also be extracted with video shot segmentation techniques; other extraction modes certainly exist and are not limited here. In this embodiment, in view of the characteristics of human vision (the human eye watches video at 12 to 20 frames per second), the solution of the present invention samples video representative frames at 15 frames per second.
The first compression module 12 is used for extracting a Shorter-SIFT (S-SIFT) feature from each video representative frame, so as to perform video spatial-domain compression.
In this embodiment, the scale-invariant feature transform (SIFT) has good robustness and distinguishability, is widely used in image and video retrieval, and achieves good results. Depending on its content and size, each image can yield from hundreds to thousands of SIFT features, and a SIFT feature has 128 dimensions in the spatial domain. If SIFT features were extracted from all the representative frames of a video, the resulting volume of feature data would be staggering, and the 128 dimensions of the SIFT feature would also bring the curse of dimensionality, making processing very complex and time-consuming. Therefore the present invention does not extract SIFT features from the video representative frames, but extracts Shorter-SIFT (S-SIFT) features instead.
In this embodiment, the extraction of the S-SIFT feature differs from that of the SIFT feature: it effectively exploits a property of human vision (namely the non-uniformity of human sensitivity across visual orientations) and, in every extraction step of the standard SIFT feature, discards the directions to which human vision is insensitive, thereby effectively reducing the SIFT feature from 128 to 96 dimensions in the spatial domain while keeping the characteristics of the SIFT feature. The S-SIFT feature extraction procedure discards the visually insensitive oblique directions in every step.
In this embodiment, the first compression module 12 specifically comprises three submodules: a detection submodule 121, an assignment submodule 122 and a generation submodule 123, as shown in Fig. 6.
Referring to Fig. 6, which is a schematic diagram of the internal structure of the first compression module 12 shown in Fig. 5 in an embodiment of the present invention.
The detection submodule 121 is used for detecting and locating the key points in each video representative frame.
In this embodiment, key point detection and localization find points that remain stable under measurement at different scales. A scale space is constructed to simulate the multi-scale features of the image data, and the scale space satisfies visual invariance. When we observe an object with our eyes, on the one hand, as the illumination of the background changes, the brightness level and contrast of the image perceived by the retina differ; the analysis of the image by the scale-space operator is therefore required to be unaffected by changes in the grey level and contrast of the image, i.e. to satisfy grey-level invariance and contrast invariance. On the other hand, with respect to a fixed coordinate system, when the relative position between the observer and the object changes, the position, size, angle and shape of the image perceived by the retina differ; the analysis of the image by the scale-space operator is therefore required to be independent of the position, size, angle and affine transformation of the image, i.e. to satisfy translation invariance, scale invariance and affine invariance.
In this embodiment, after the difference-of-Gaussian scale space has been built, extreme points are searched in the scale space to detect key points: each sample point is compared with all of its neighbors to see whether it is larger or smaller than its neighbors in the image domain and the scale domain. The standard SIFT algorithm compares the middle test point with its 8 neighbors at the same scale and the 9 × 2 points corresponding to the two adjacent scales, 26 points in total, to ensure that extreme points are detected in both the scale space and the two-dimensional image space: if a point is the maximum or minimum within the 26-point neighborhood spanning this DoG layer and the layers above and below, it is taken as a feature point of the image at this scale. The S-SIFT algorithm provided by the invention, however, compares the test point with its 4 neighbors in the horizontal and vertical directions at the same scale and the 5 × 2 corresponding points at the two adjacent scales, 14 points in total; if the point is the maximum or minimum within this 14-point neighborhood, it is taken as a feature point of the image.
The assignment submodule 122 is used for assigning a direction to each key point in each video representative frame.
In this embodiment, the feature points in each image have been determined in the previous step. A direction is computed for every feature point and further computation is performed according to this direction: using the gradient direction distribution of the pixels in the key point neighborhood, a direction parameter is assigned to each key point, which makes the operator rotation-invariant; the gradient histogram of each key point is computed with formulas (3) and (4) above.
The generation submodule 123 is used for generating key point descriptors.
In this embodiment, the generation submodule 123 specifically comprises two submodules: a rotation submodule 1231 and a calculation submodule 1232, as shown in Fig. 7.
Referring to Fig. 7, which is a schematic diagram of the internal structure of the generation submodule 123 shown in Fig. 6 in an embodiment of the present invention.
The rotation submodule 1231 is used for rotating the coordinate axes to the principal direction of the key point, so as to guarantee rotation invariance.
In this embodiment, to generate a key point descriptor the coordinate axes are first rotated to the principal direction of the key point, so as to guarantee rotation invariance.
The calculation submodule 1232 is used for calculating the gradient orientation histograms in all directions on the small blocks in the vertical direction of each key point neighborhood, so as to form a multi-dimensional descriptor.
In this embodiment, the standard SIFT algorithm calculates the gradient orientation histograms in 8 directions on the 4*4 blocks around each key point, forming a 4*4*8=128-dimensional descriptor. The S-SIFT algorithm provided by the invention, however, calculates the gradient orientation histograms in 8 directions on the 3*4 blocks in the vertical direction of each key point neighborhood, forming a 3*4*8=96-dimensional descriptor.
In this embodiment, the S-SIFT feature has a lower dimension than standard SIFT while keeping characteristics such as rotation invariance, and the dimension reduction makes the S-SIFT feature faster than the standard SIFT feature in feature matching.
Please continue to refer to Fig. 5. The second compression module 13 is used for further extracting a TC-S-SIFT feature from each video representative frame based on the S-SIFT feature, so as to perform video temporal-domain compression.
In this embodiment, the TC-S-SIFT (Temporal-Compress and Shorter-SIFT) feature consists of two parts: a temporal trajectory and a descriptor. The representative frames of a video form a sequence in the temporal domain, and a temporal trajectory refers to the S-SIFT key point track obtained by tracking an S-SIFT feature point on the representative frames frame by frame. A temporal trajectory describes the duration and the appearance/disappearance positions of a piece of visual content in the temporal domain, and is formed by arranging a series of similar S-SIFT features in time order. The descriptor of TC-S-SIFT refers to the mean of all S-SIFT descriptors on the temporal trajectory, which amounts to a condensation of a certain piece of visual content over a period of time. In general, retrieving with TC-S-SIFT as the video feature means retrieving with the descriptor of TC-S-SIFT, while the temporal trajectory serves as reference information for locating video segments. Therefore TC-S-SIFT not only compresses the visual content in the temporal domain, removing redundant information and greatly reducing the number of features, but also preserves the robustness and distinguishability of the S-SIFT feature, and additionally records the temporal characteristics of the video.
In this embodiment, the second compression module 13 specifically comprises three submodules: a descriptor submodule 131, an arrangement submodule 132 and a mean submodule 133, as shown in Fig. 8.
Referring to Fig. 8, which is a schematic diagram of the internal structure of the second compression module 13 shown in Fig. 5 in an embodiment of the present invention.
The descriptor submodule 131 is used for obtaining the multi-dimensional descriptors in each representative frame based on the Shorter-SIFT (S-SIFT) features extracted from each video representative frame.
In this embodiment, the S-SIFT feature of each representative frame is extracted with the S-SIFT algorithm described above. An S-SIFT feature consists of two parts: 4 dimensions of weak geometric information (namely X, Y, Scale, Orientation) and a 96-dimensional descriptor. In this embodiment, the weak geometric information of S-SIFT is ignored and only the 96-dimensional descriptor is used.
The arrangement submodule 132 is used for tracking the feature points carrying S-SIFT features in each video representative frame frame by frame to obtain S-SIFT feature point trajectories, and arranging a series of similar S-SIFT features in time order to form the temporal trajectory of the TC-S-SIFT feature.
In this embodiment, starting from the first representative frame, each S-SIFT descriptor is tracked frame by frame, thereby forming a temporal trajectory. Suppose a temporal trajectory is expressed as $Track=\{d_{f_s},\dots,d_{f_e}\}$, where $f_i$ denotes a representative frame that the tracked S-SIFT descriptor passes through and $d_{f_i}$ is the S-SIFT feature descriptor in that frame.
In this embodiment, for any two trajectories $Track_1=\{d_{f_s},\dots,d_{f_e}\}$ and $Track_2=\{d_{f_b},\dots,d_{f_o}\}$, if the S-SIFT descriptors $d_{f_e}$ and $d_{f_b}$ match and $t=f_b-f_e<\theta_1$, the two trajectories are merged into $Track=\{d_{f_s},\dots,d_{f_e},d_{f_b},\dots,d_{f_o}\}$, where $\theta_1=3$.
In this embodiment, for any $Track=\{d_{f_s},\dots,d_{f_e}\}$ whose length $n>\theta_2$, let the matching distances of adjacent S-SIFT descriptors in the trajectory be $Dist=\{d_1,d_2,\dots,d_i,\dots,d_{n-1}\}$; if $\max(Dist)=d_i$, the trajectory is split into two trajectories between $f_i$ and $f_{i+1}$. The operation is repeated until the lengths of all trajectories are no greater than $\theta_2$, where $\theta_2$ is taken as 15.
The mean submodule 133 is used for calculating the mean of all S-SIFT descriptors in each temporal trajectory as the descriptor of the TC-S-SIFT feature.
In this embodiment, the SIFT feature obtains good results in image retrieval and object recognition, but video is an object that exists simultaneously in the temporal domain and the spatial domain, and the SIFT feature cannot reflect the content of video from both domains. The TC-S-SIFT proposed here is based on S-SIFT tracking and takes the mean of all S-SIFT descriptors in a trajectory as the TC-S-SIFT descriptor; it therefore retains both the robustness of the SIFT feature (invariance to scaling, rotation and brightness changes) and its distinguishability, while compressing the redundant information of the visual content in the temporal domain and greatly reducing the number of features. The TC-S-SIFT feature reflects both the spatial content characteristics and the temporal information of the video and is thus better suited to video content analysis and video retrieval applications.
The video feature redundant information compression system 10 provided by the present invention can effectively compress video feature redundant information based on the temporal domain and the spatial domain, while keeping the basic robustness and distinguishability of the video features. With the proposed video TC-S-SIFT feature, redundant video feature information is effectively compressed, and the feature can be applied to video retrieval: the TC-S-SIFT features of videos are matched to retrieve the videos that match a target video.
With the video feature redundant information compression system 10 provided by the present invention, the temporal trajectories of the video features can be saved when the video TC-S-SIFT features are extracted; with this temporal trajectory information, the system can be applied to near-duplicate video localization. The number of copied videos on the network is multiplying; the TC-S-SIFT features of a copy video are extracted and matched against the TC-S-SIFT features of the source video to find the matching features, and the temporal trajectory information of these features is used to locate the representative-frame positions where the features appear and disappear, thereby obtaining an accurate localization of the copied video.
It should be noted that the units included in the above embodiments are divided only according to functional logic, but the division is not limited to the above as long as the corresponding functions can be realized; in addition, the specific names of the functional units are only for the convenience of mutual distinction and are not intended to limit the protection scope of the present invention.
In addition, one of ordinary skill in the art will appreciate that all or part of the steps of the methods in the above embodiments may be completed by instructing the relevant hardware through a program, and the corresponding program may be stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk or an optical disc.
The foregoing is only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement and improvement made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (8)

1. A video feature redundant information compression method based on video spatio-temporal attributes, characterized in that the method comprises:
extracting video representative frames from a video source file;
extracting a Shorter-SIFT (S-SIFT) feature from each video representative frame, so as to perform video spatial-domain compression;
further extracting a TC-S-SIFT feature from each video representative frame based on the S-SIFT feature, so as to perform video temporal-domain compression.
2. The video feature redundant information compression method according to claim 1, characterized in that the step of extracting a Shorter-SIFT (S-SIFT) feature from each video representative frame so as to perform video spatial-domain compression comprises:
detecting and locating the key points in each video representative frame;
assigning a direction to each key point in each video representative frame;
generating key point descriptors.
3. The video feature redundant information compression method according to claim 2, characterized in that the step of generating key point descriptors specifically comprises:
rotating the coordinate axes to the principal direction of the key point, so as to guarantee rotation invariance;
calculating the gradient orientation histograms in all directions on the small blocks in the vertical direction of each key point neighborhood, so as to form a multi-dimensional descriptor.
4. The video feature redundant information compression method according to claim 1, characterized in that the step of further extracting a TC-S-SIFT feature from each video representative frame based on the S-SIFT feature so as to perform video temporal-domain compression specifically comprises:
obtaining the multi-dimensional descriptors in each representative frame based on the Shorter-SIFT (S-SIFT) features extracted from each video representative frame;
tracking the feature points carrying S-SIFT features in each video representative frame frame by frame to obtain S-SIFT feature point trajectories, and arranging a series of similar S-SIFT features in time order to form the temporal trajectory of the TC-S-SIFT feature;
calculating the mean of all S-SIFT descriptors in each temporal trajectory as the descriptor of the TC-S-SIFT feature.
5. A video feature redundant information compression system based on video spatio-temporal attributes, characterized in that the video feature redundant information compression system comprises:
an extraction module for extracting video representative frames from a video source file;
a first compression module for extracting a Shorter-SIFT (S-SIFT) feature from each video representative frame, so as to perform video spatial-domain compression;
a second compression module for further extracting a TC-S-SIFT feature from each video representative frame based on the S-SIFT feature, so as to perform video temporal-domain compression.
6. The video feature redundant information compression system according to claim 5, characterized in that the first compression module specifically comprises:
a detection submodule for detecting and locating the key points in each video representative frame;
an assignment submodule for assigning a direction to each key point in each video representative frame;
a generation submodule for generating key point descriptors.
7. The video feature redundant information compression system according to claim 6, characterized in that the generation submodule specifically comprises:
a rotation submodule for rotating the coordinate axes to the principal direction of the key point, so as to guarantee rotation invariance;
a calculation submodule for calculating the gradient orientation histograms in all directions on the small blocks in the vertical direction of each key point neighborhood, so as to form a multi-dimensional descriptor.
8. The video feature redundant information compression system according to claim 5, characterized in that the second compression module specifically comprises:
a descriptor submodule for obtaining the multi-dimensional descriptors in each representative frame based on the Shorter-SIFT (S-SIFT) features extracted from each video representative frame;
an arrangement submodule for tracking the feature points carrying S-SIFT features in each video representative frame frame by frame to obtain S-SIFT feature point trajectories, and arranging a series of similar S-SIFT features in time order to form the temporal trajectory of the TC-S-SIFT feature;
a mean submodule for calculating the mean of all S-SIFT descriptors in each temporal trajectory as the descriptor of the TC-S-SIFT feature.
CN201510953472.3A 2015-12-16 2015-12-16 Video characteristic redundant information compression method and system based on video space-time attribute Pending CN105592315A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510953472.3A CN105592315A (en) 2015-12-16 2015-12-16 Video characteristic redundant information compression method and system based on video space-time attribute

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510953472.3A CN105592315A (en) 2015-12-16 2015-12-16 Video characteristic redundant information compression method and system based on video space-time attribute

Publications (1)

Publication Number Publication Date
CN105592315A true CN105592315A (en) 2016-05-18

Family

ID=55931487

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510953472.3A Pending CN105592315A (en) 2015-12-16 2015-12-16 Video characteristic redundant information compression method and system based on video space-time attribute

Country Status (1)

Country Link
CN (1) CN105592315A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1365574A (en) * 1999-06-18 2002-08-21 艾利森电话股份有限公司 A method and a system for generating summarized radio
CN101650830A (en) * 2009-08-06 2010-02-17 中国科学院声学研究所 Compressed domain video lens mutation and gradient union automatic segmentation method and system
CN102184551A (en) * 2011-05-10 2011-09-14 东北大学 Automatic target tracking method and system by combining multi-characteristic matching and particle filtering
CN103631932A (en) * 2013-12-06 2014-03-12 中国科学院自动化研究所 Method for detecting repeated video

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YINGYING ZHU ET AL.: "A Temporal-Compress and Shorter SIFT Research on Web Videos", Knowledge Science, Engineering and Management - 8th International Conference (KSEM 2015), Lecture Notes in Computer Science *
朱映映等 (ZHU Yingying et al.): "基于类型标志镜头与词袋模型的体育视频分类" ("Sports video classification based on type flag shots and bag-of-words model"), 《计算机辅助设计与图形学学报》 (Journal of Computer-Aided Design & Computer Graphics) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111050174A (en) * 2019-12-27 2020-04-21 清华大学 Image compression method, device and system

Similar Documents

Publication Publication Date Title
Senst et al. Crowd violence detection using global motion-compensated lagrangian features and scale-sensitive video-level representation
CN103593464B (en) Video fingerprint detecting and video sequence matching method and system based on visual features
US10140575B2 (en) Sports formation retrieval
US8417037B2 (en) Methods and systems for representation and matching of video content
US8358840B2 (en) Methods and systems for representation and matching of video content
WO2009129243A1 (en) Methods and systems for representation and matching of video content
CN103718193B (en) Method and apparatus for comparing video
CN103198293A (en) System and method for fingerprinting video
WO2014009490A1 (en) A method and an apparatus for the extraction of descriptors from video content, preferably for search and retrieval purpose
Li et al. Structuring lecture videos by automatic projection screen localization and analysis
Li et al. Video synopsis in complex situations
JP5432677B2 (en) Method and system for generating video summaries using clustering
Chamasemani et al. Video abstraction using density-based clustering algorithm
CN105592315A (en) Video characteristic redundant information compression method and system based on video space-time attribute
Abbas et al. Vectors of locally aggregated centers for compact video representation
Gharbi et al. A novel key frame extraction approach for video summarization
Bhaumik et al. Towards redundancy reduction in storyboard representation for static video summarization
Jiang et al. Hierarchical video summarization in reference subspace
Wang et al. Visual saliency based aerial video summarization by online scene classification
Megrhi et al. Spatio-temporal salient feature extraction for perceptual content based video retrieval
Rani et al. Key frame extraction techniques: A survey
Bhaumik et al. Real-time storyboard generation in videos using a probability distribution based threshold
Warhade et al. Effective algorithm for detecting various wipe patterns and discriminating wipe from object and camera motion
Anju et al. Video copy detection using F-sift and graph based video sequence matching
Bhaumik et al. Real-time video segmentation using a vague adaptive threshold

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20160518