CN100348032C - Method and equipment for processing and browsing provided video/audio signal - Google Patents

Method and equipment for processing and browsing provided video/audio signal

Info

Publication number
CN100348032C
CN100348032C
Authority
CN
China
Prior art keywords
video
camera
motion
audio signal
feature point
Prior art date
Legal status
Expired - Fee Related
Application number
CNB2004100983682A
Other languages
Chinese (zh)
Other versions
CN1625246A (en)
Inventor
M. Peter Kuhn
Current Assignee
Sony Corp
Original Assignee
Sony Corp
Priority date
Filing date
Publication date
Application filed by Sony Corp
Priority to CNB2004100983682A
Publication of CN1625246A
Application granted granted Critical
Publication of CN100348032C
Anticipated expiration
Status: Expired - Fee Related


Abstract

The present invention relates to a method and apparatus for processing and browsing a provided video/audio signal. The method comprises hierarchically building a camera motion transition graph, wherein the graph building step comprises providing at least one main camera motion transition graph together with a plurality of graph layouts representing transition paths of other camera motions for the presentation of video sequences; browsing is performed through the camera motion transition graph with keyframes of the camera motion video sequence depicted at the nodes; and browsing is also performed through the camera motion transition graph with a graphical representation of the camera motion depicted at the nodes. A metadata extraction unit is provided with a feature point selection unit and a motion estimation unit (62), which extract at least one feature point representing a characteristic of the video/audio signal from the compressed domain of the video/audio signal. Processing time and cost are thereby reduced, and efficient processing becomes possible.

Description

Method and apparatus for processing and browsing a provided video/audio signal
This application is a divisional application of the invention patent application with the filing date of November 29, 1999, the application number 99815915.8, and the title of invention "Video/audio signal processing method and video/audio signal processing apparatus".
Technical field
The present invention relates to a video/audio signal processing method and a video/audio signal processing apparatus, and provides efficient calculation methods that facilitate applications such as (but not limited to) camera motion extraction and video summarization from the MPEG compressed domain.
Background art
With the growing capacity of video storage devices, there is a need to structure and summarize video content so that users can browse it conveniently. Metadata (that is, data about data) makes video browsing possible, and such metadata is preferably extracted automatically.
Fig. 1 depicts the prior art for extracting motion-related metadata from MPEG (Moving Picture Experts Group) compressed video in the pixel domain. Full decoding of the MPEG video to the pixel domain is performed by an MPEG decoding unit 11. A motion estimation unit 12 calculates motion vectors from the pixel representation of the video stream (by optical flow computation or block matching, both well known to those skilled in the art). A parametric and camera motion calculation unit 13 calculates the motion-related metadata from these motion vectors.
For camera motion estimation in the pixel domain, there is the patent of Ingemar J. Cox and Sebastien Roy, "US 5,751,838, 5/1998: Correction of camera motion between two image frames (382/107)".
In Yi Tong Tse and Richard L. Baker, "Global Zoom/Pan estimation and compensation for video compression", ICASSP 91, 1991, pp. 2725-2728, camera zoom and pan are estimated for video coding. However, this method may produce unreliable results for those camera motion types that are not modeled.
A. Akutsu, Y. Tonomura, H. Hashimoto and Y. Ohba, "Video indexing using motion vectors", SPIE Visual Communications and Image Processing 1992, vol. 1818, pp. 1522-1530, analyzed the camera motion in the pixel domain using the Hough transform; however, the described method does not extract the amount of camera motion.
Jong-Il Park, Nobuyuki Yagi, Kazumasa Enami, Kiyoharu Aizawa and Mitsutoshi Hatori, "Estimation of Camera Parameters from Image Sequence for Model-Based Video Coding", IEEE Trans. CSVT, vol. 4, no. 3, pp. 288-296, June 1994, and Jong-Il Park and Choong Woong Lee, "Robust estimation of camera parameters from image sequence for video composition", Signal Processing: Image Communication, vol. 9, pp. 43-53, 1996, find feature points in the pixel domain using the texture gradient, and determine the camera motion from the motion of these feature points.
Jong-Il Park and Choong Woong Lee, "Robust estimation of camera parameters from image sequence for video composition", Signal Processing: Image Communication, vol. 9, pp. 43-53, 1996, used an outlier rejection method to make the camera motion estimation in the pixel domain more robust.
Y.P. Tan, S.R. Kulkarni and P.J. Ramadge, "A new method for camera motion parameter estimation", Proc. ICIP, 1995, pp. 406-409, describe a recursive least squares method for camera motion estimation in the pixel domain under the assumption of a small total amount of camera motion.
Philippe Joly and Hae-Kwang Kim, "Efficient automatic analysis of camera work and microsegmentation of video using spatiotemporal images", Signal Processing: Image Communication, vol. 8, pp. 295-307, 1996, describe a camera motion estimation algorithm in the pixel domain based on a Sobel operator (or a standard edge detection unit) and a spatio-temporal projection of the edges into histograms. The histograms are analyzed using the Hough transform to extract the edges in the direction of motion.
M.V. Srinivasan, S. Venkatesh and R. Hosie, "Qualitative estimation of camera motion parameters from video sequences", Pattern Recognition, vol. 30, no. 4, pp. 593-606, 1997, extract the camera motion parameters from uncompressed video in the pixel domain, where the amounts of camera pan, tilt, rotation and zoom are provided separately.
Richard R. Schultz and Mark G. Alford, "Multiframe integration via the projective transform with automated block matching feature point selection", ICASSP 99, 1999, proposed a pixel-domain image registration algorithm with pixel resolution, based on a nonlinear projective transformation model, in order to calculate camera translation, rotation, zoom, pan and tilt.
R.S. Jasinschi, T. Naveen, P. Babic-Vovk and A.J. Tabatabai, "Apparent 3-D camera velocity extraction and its applications", IEEE Picture Coding Symposium, PCS99, 1999, describe pixel-domain camera velocity estimation for database query and sprite (stitching) applications.
Due to the huge storage requirements of video content, more and more audio-visual material is compressed in MPEG-1/MPEG-2 or MPEG-4 format. However, the camera motion estimation algorithms developed for the pixel domain cannot be applied directly to the MPEG compressed domain. Time-consuming decoding of the MPEG compressed bitstream and computationally demanding motion estimation in the pixel domain are therefore required before the camera motion estimation can be performed (Fig. 1).
Further, in order to avoid the computational burden of MPEG video decompression to the pixel domain and of pixel-domain motion estimation, camera motion estimation in the compressed domain has been proposed. The previous camera motion estimation in the compressed domain is based on using the MPEG motion vectors and fitting them to a parametric motion model describing the camera motion.
Fig. 2 depicts the current state of the art for extracting motion-related metadata from MPEG compressed video. The MPEG video is parsed by an MPEG bitstream parsing unit 21. From the parsed bitstream, a unit 22 extracts the motion vectors and passes them to a parametric and camera motion calculation unit 23.
V. Kobla, D. Doermann, K-I. Lin and C. Faloutsos, "Compressed domain video indexing techniques using DCT and motion vector information in MPEG video", Proceedings of the SPIE Conference on Storage and Retrieval for Image and Video Databases V, vol. 3022, pp. 200-211, February 1997, determine a "flow vector" from the MPEG compressed-domain motion vectors by using a directional histogram in order to determine the overall translational motion direction. However, this basic model cannot detect camera zoom and rotation.
Roy Wang and Thomas Huang, "Fast Camera Motion Analysis in MPEG domain", ICIP, Kobe, 1999, describe a fast motion analysis algorithm in the MPEG domain. The algorithm is based on using the MPEG motion vectors from the P-frames and on interpolating motion vectors for the I-frames from the B-frames. An outlier-rejection least-squares algorithm for parametric camera motion estimation is used to increase the reliability of the camera motion estimation from these motion vectors.
However, using the MPEG motion vectors for camera motion estimation has several drawbacks.
First, the motion vectors in a compressed MPEG stream do not represent the real motion: they are selected by the encoder for fast and bitrate-efficient compression, and they depend on the motion estimation strategy of the encoder manufacturer, which is not standardized by MPEG and may differ significantly. For example, low-complexity motion estimation algorithms are employed for fast encoding, as compared with high-quality, high-bitrate MPEG encoding using motion estimation algorithms with an increased search range. Compare: Peter Kuhn, "Algorithms, Complexity Analysis and VLSI-Architectures for MPEG-4 Motion Estimation", Kluwer Academic Publishers, June 1999, ISBN 0792385160.
Further, the level of camera motion estimation obtainable using MPEG motion vectors depends significantly on the MPEG group-of-pictures (GOP) structure, the video sampling rate (for example, 30 frames per second) and other factors, and is therefore unreliable for accurate camera motion estimation. For example, some MPEG encoder devices on the market dynamically change the GOP structure for sequences with fast motion.
Further, MPEG motion vectors (particularly small motion vectors) are often heavily influenced by noise and may be unreliable.
Further, long motion vectors may not exist in cases where fast motion estimation algorithms with a constrained motion estimation search area are used.
Further, I-frame-only MPEG video contains no motion vectors at all; algorithms based on using the MPEG motion vectors can therefore not be applied here. I-frame-only MPEG video is a valid MPEG video format which is used in video editing, since it allows frame-accurate cutting. In this field, motion-related metadata is very important, for example for determining the camera work.
Further, some compressed video formats such as DV and MJPEG are based on a DCT (discrete cosine transform) structure similar to that of MPEG, but contain no motion information. Camera motion estimation algorithms based on the motion vectors contained in the compressed stream can therefore not be used in these cases.
Further, interpolating motion vectors for the I-frames from the B-frames fails in cases of fast camera or object motion, where new image content appears.
Summary of the invention
In view of the above-described state of the art, an object of the present invention is to provide a video/audio signal processing method and a video/audio signal processing apparatus for extracting and browsing motion-related metadata from compressed video.
The main applications of the motion metadata within the present invention comprise video summarization, camera motion representation, and motion-based video browsing.
A video/audio signal processing method according to the present invention processes a provided video/audio signal in order to achieve the above object. The method comprises the steps of: extracting at least one compressed-domain feature point representing a characteristic of the video/audio signal within the compressed domain of the video/audio signal; performing motion estimation on the feature point extracted in the extracting step; and tracking the feature point, associated with its motion vector, through a predetermined number of frames constituting the video/audio signal.
In the video/audio signal processing method according to the present invention, the feature points of the video/audio signal are extracted in the compressed domain, motion estimation is performed on the extracted feature points, and the feature points associated with motion vectors are tracked.
Further, a video/audio signal processing apparatus according to the present invention processes a provided video/audio signal in order to achieve the above object. The apparatus comprises: extracting means for extracting at least one compressed-domain feature point representing a characteristic of the video/audio signal within the compressed domain of the video/audio signal; motion estimating means for performing motion estimation on the feature point extracted by the extracting means; and feature point tracking means for tracking the feature point, associated with its motion vector, through a predetermined number of frames constituting the video/audio signal.
In the video/audio signal processing apparatus according to the present invention, the means for extracting compressed-domain feature points extracts the feature points of the video/audio signal in the compressed domain, the means for performing motion estimation performs motion estimation on the extracted feature points, and the feature point tracking means tracks the feature points associated with motion vectors.
Further, a video/audio signal processing method processes and browses a provided video/audio signal in order to achieve the above object. The method comprises the steps of: hierarchically building a camera motion transition graph based on at least one compressed-domain feature point representing a characteristic of the video/audio signal within the compressed domain of the video/audio signal, the feature point being motion-estimated by tracking it, associated with its motion vector, through a predetermined number of frames constituting the video/audio signal, wherein the camera motion transition graph building step comprises the step of providing at least one main camera motion transition graph and a plurality of graph layouts representing transition paths of other camera motions for the presentation of video sequences; browsing through the camera motion transition graph with keyframes of the camera motion video sequence depicted at the nodes; and browsing through the camera motion transition graph with a graphical representation of the camera motion depicted at the nodes.
In the video/audio signal processing method according to the present invention, a camera motion transition graph is built hierarchically, browsing is performed through the camera motion transition graph with keyframes of the camera motion video sequence depicted at the nodes, and browsing is performed through the camera motion transition graph with a graphical representation of the camera motion depicted at the nodes.
Further, a video/audio signal processing apparatus according to the present invention processes and browses a provided video/audio signal in order to achieve the above object. The apparatus comprises: building means for hierarchically building a camera motion transition graph based on at least one compressed-domain feature point representing a characteristic of the video/audio signal within the compressed domain of the video/audio signal, the feature point being motion-estimated by tracking it, associated with its motion vector, through a predetermined number of frames constituting the video/audio signal, wherein the camera motion transition graph building means comprises means for providing at least one main camera motion transition graph and a plurality of graph layouts representing transition paths of other camera motions for the presentation of video sequences; browsing means for browsing through the camera motion transition graph with keyframes of the camera motion video sequence depicted at the nodes; and further browsing means for browsing through the camera motion transition graph with a graphical representation of the camera motion depicted at the nodes.
In the video/audio signal processing apparatus according to the present invention, the graph building means hierarchically builds a camera motion transition graph, first browsing means browses through the camera motion transition graph with keyframes of the camera motion video sequence depicted at the nodes, and second browsing means browses through the camera motion transition graph with a graphical representation of the camera motion depicted at the nodes.
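As an illustration of the transition graph just described, the following is a minimal sketch of one possible in-memory layout: one node per camera motion class holding keyframes, and edges recording the transition paths. All class and field names are hypothetical assumptions for illustration, not taken from the patent.
```python
# Hypothetical sketch of a camera motion transition graph; names are illustrative.
from dataclasses import dataclass, field

@dataclass
class GraphNode:
    motion_class: str                                   # e.g. "pan_left", "zoom_in", "fixed"
    keyframes: list = field(default_factory=list)       # keyframe ids depicted at the node
    segments: list = field(default_factory=list)        # (start_frame, end_frame) shot segments

@dataclass
class CameraMotionTransitionGraph:
    nodes: dict = field(default_factory=dict)           # motion_class -> GraphNode
    edges: list = field(default_factory=list)           # (from_class, to_class) transition paths

    def add_transition(self, prev_class: str, next_class: str) -> None:
        """Record that the video passes from one camera motion class to another."""
        for cls in (prev_class, next_class):
            self.nodes.setdefault(cls, GraphNode(cls))
        self.edges.append((prev_class, next_class))
```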
Further, a video/audio signal processing method according to the present invention extracts a hierarchical decomposition of a complex video selection for browsing in order to achieve the above object. The method comprises the steps of: identifying the video; collecting keyframes from the video shots, each shot representing a video segment; classifying the collected keyframes according to camera motion or global motion information; and building a graphical representation of the video, the graphical representation being based on the results of the classifying step, the time instants, and the camera motion information associated with each part of the video shots, wherein the graphical representation building step comprises the step of representing each class of video shots by a node.
In the video/audio signal processing method according to the present invention, the video is identified, keyframes are collected from the video shots, the collected keyframes are classified, and a graphical representation of the video is built.
Further, a video/audio signal processing apparatus according to the present invention extracts a hierarchical decomposition of a complex video selection for browsing in order to achieve the above object. The apparatus comprises: identifying means for identifying the video; collecting means for collecting keyframes from the video shots, each shot representing a video segment; classifying means for classifying the collected keyframes according to camera motion or global motion information; and building means for building a graphical representation of the video, the graphical representation being based on the results of the classification, the time instants, and the camera motion information associated with each part of the video shots, wherein the graphical representation building means represents each class of video shots by a node.
In the video/audio signal processing apparatus according to the present invention, the identifying means identifies the video, the collecting means collects keyframes from the video shots, the classifying means classifies the collected keyframes, and the graphical representation building means builds the graphical representation of the video.
Further, a video/audio signal processing method according to the present invention processes a provided video/audio signal in order to achieve the above object. The method comprises the step of: extracting at least one compressed-domain feature point representing a characteristic of the video/audio signal within the compressed domain of the video/audio signal.
In the video/audio signal processing method according to the present invention, the feature points of the video/audio signal are extracted in the compressed domain.
Further, a video/audio signal processing apparatus according to the present invention processes a provided video/audio signal in order to achieve the above object. The apparatus comprises: extracting means for extracting at least one compressed-domain feature point representing a characteristic of the video/audio signal within the compressed domain of the video/audio signal.
In the video/audio signal processing apparatus according to the present invention, the feature points of the video/audio signal are extracted in the compressed domain by the means for extracting compressed-domain feature points.
Further, a video/audio signal processing method according to the present invention processes a provided video/audio signal. The method comprises the step of: performing motion estimation on at least one feature point representing a characteristic of the video/audio signal within the compressed domain of the video/audio signal.
In the video/audio signal processing method according to the present invention, motion estimation is performed on the extracted feature points.
Further, a video/audio signal processing apparatus according to the present invention processes a provided video/audio signal. The apparatus comprises: motion estimating means for performing motion estimation on at least one feature point representing a characteristic of the video/audio signal within the compressed domain of the video/audio signal.
In the video/audio signal processing apparatus according to the present invention, motion estimation of the extracted feature points is performed by the motion estimating means.
Description of drawings
Fig. 1 depicts the prior art of motion metadata extraction;
Fig. 2 depicts further prior art of motion metadata extraction;
Fig. 3 depicts an overview of the video browsing and metadata extraction units;
Fig. 4 gives the naming conventions for blocks and macroblocks;
Fig. 5 gives an overview of the concept of compressed-domain feature point motion estimation;
Fig. 6 shows the data flow diagram of the metadata extraction unit;
Fig. 7 illustrates the MPEG bitstream parsing, DCT-coefficient extraction and motion vector extraction units;
Fig. 8 shows the control flow of feature point registration and motion estimation using the IDCT on selected blocks;
Fig. 9 shows the calculation flow of the block relevance metric;
Fig. 10 depicts the control flow of feature point selection and motion estimation in the DCT-domain;
Fig. 11 shows the DCT coefficient numbering of an 8x8 DCT-block;
Fig. 12 shows the data structure of the feature point life-span for video summarization;
Fig. 13 illustrates the camera motion directions;
Fig. 14 gives an overview of an example of the video browsing unit;
Fig. 15 shows the video browsing unit with examples of camera pan, zoom and rotation keyframes;
Fig. 16 gives the graphical representation of the video browsing unit.
Embodiment
Embodiments according to the present invention will now be described with reference to the drawings.
The present invention discloses a new compressed-domain feature point selection and motion estimation algorithm together with several application cases, including camera motion estimation, object motion estimation, video summarization, video transcoding, motion activity measurement, video scene detection and video keyframe detection.
Existing feature point selection methods for object recognition, object tracking, global motion estimation and video summarization are applied in the pixel domain and therefore require time-consuming decoding of the compressed video bitstream.
The disclosed feature point selection algorithm therefore operates directly in the compressed domain and avoids the computational waste and the time consumed in decoding the compressed video stream. After a compressed-domain preselection mechanism has determined the candidate feature points, the computational complexity is greatly reduced.
The feature point selection algorithm employs the texture information contained in the DCT (discrete cosine transform) coefficients and the MPEG (Moving Picture Experts Group) motion vectors (when they exist), and can therefore be applied directly to DCT-based compressed still images (such as Motion JPEG (Joint Photographic Experts Group), MJPEG) and compressed video (such as MPEG-1/MPEG-2/MPEG-4, the ITU-T (International Telecommunication Union - Telecommunication Standardization Sector) Recommendations H.261, H.263, H.26X, or the DV format).
The present disclosure describes the extraction of feature points in the compressed domain (for example using MPEG-1), and the motion estimation of these features utilizing the motion vectors, where they exist in the MPEG compressed domain, together with the prediction error energy.
Further, the invention discloses the following applications of this compressed-domain feature point selection algorithm:
(1) object recognition and classification;
(2) object motion estimation for tracking (using, for example, a parametric motion model or a Kalman filter);
(3) global (camera) motion estimation (using a parametric camera motion model);
(4) motion activity calculation using the motion vectors extracted by this method;
(5) video transcoding (determining a region of interest from the positions of the feature points within a frame and spending more bits on the region of interest by appropriate quantizer control, using the camera motion parameters for re-encoding, or providing the extracted motion vectors for subsequent encoding);
(6) foreground/background segmentation (determining the global motion and the object motion of the feature points within a video scene by tracking the life-span of the feature points);
(7) video summarization and video scene detection (by tracking the life-span of the feature points: when a large number of pre-existing feature points disappears and a large number of new feature points appears, this is an indication that a new scene begins, which can be used for video summarization; see the sketch following this list);
(8) video keyframe detection (detecting keyframes from those parts of the video stream in which a large number of feature points does not change over time);
(9) video browsing (using the feature points and keyframes, together with the associated object/global motion, for the scalable video representation methods described herein);
(10) video stitching (merging smaller parts of several video frames to generate a single large image, with the feature points used as reference points).
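As referenced in item (7), the following minimal sketch illustrates how a scene change could be flagged from feature point life-spans; the set-based bookkeeping and the threshold value are illustrative assumptions, not taken from the patent.
```python
# Hypothetical sketch of application (7): scene change detection from
# feature point life-spans. Names and thresholds are illustrative.
def is_scene_change(prev_ids: set, cur_ids: set, threshold: float = 0.7) -> bool:
    """Flag a new scene when most tracked feature points disappear
    and are replaced by newly registered ones."""
    if not prev_ids:
        return False
    disappeared = len(prev_ids - cur_ids) / len(prev_ids)
    appeared = len(cur_ids - prev_ids) / max(len(cur_ids), 1)
    return disappeared > threshold and appeared > threshold
```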
Fig. 3 depicts an overview of the metadata extraction and video browsing units. The described apparatus comprises a storage medium 31 (comprising optical, magnetic, electrical and electromechanical media, such as CD-ROM, DVD-RAM, DVD-ROM, video tape, hard disk, RAM, ROM, etc.) and a metadata extraction unit 36, which provides the metadata MD 30 to a video browsing unit 35. The metadata extraction unit 36 and the video browsing unit 35 may be implemented on a programmable computer 34, but other implementations are also possible. The video browsing unit 35 is controlled through a user interface unit 32, which interacts with a human user 33.
A first preferred embodiment will now be described in detail.
This part first gives a general overview and then describes, as the first preferred embodiment, the basic method of feature point selection and motion estimation in the compressed domain. A further preferred embodiment describes a method different from the first preferred embodiment, as well as applications of this feature point selection and motion estimation method.
Fig. 4 gives the notation for MPEG macroblocks (MB) of 16x16 pixel size and their blocks of 8x8 pixel size. A reference frame is, in general, a frame at a different point in time compared with the current frame. In this context, the reference frame is generally assumed to be temporally located after the current frame. MBcur is an MB of the current (cur) frame or, in the case of MPEG-4, of the current video object plane (VOP), and MBref is an MB of the reference (ref) frame or, in the case of MPEG-4, of the reference video object plane (VOP); these MBs relate to a different time instant compared with the current frame or VOP. In the present invention, the term "frame" also includes the arbitrarily shaped objects (VOPs) used in MPEG-4. MV is a motion vector, whose components in the x-direction and the y-direction are MVx and MVy, respectively.
The term "intra" is used herein for intra-coded macroblocks in the MPEG standards and the H.26X Recommendations, and for the DCT-only coded blocks of the DV format and MJPEG. "P-type" is used for predictively coded macroblocks in the MPEG standards and H.26X Recommendations, and "B-type" for bidirectionally predicted macroblocks in the MPEG standards and H.26X Recommendations.
Fig. 5 gives a general overview of the feature point extraction and motion estimation method. Feature points are locations exhibiting, for example, an abrupt change in brightness, color or texture, and are therefore suitable for motion estimation and motion tracking. 51 depicts a video object with some edge points in the current frame at t=t0; one of these edge points is, for example, at position 52. In the reference frame at t=t1, this edge point 52 (renumbered 54 in the reference frame at t=t1) has moved to position 55. This movement is associated with the motion vector 53. To find this motion vector, motion estimation techniques are employed within a search area 56 around a motion vector predictor. Some of the methods disclosed by the present invention are techniques for efficiently finding feature points in the compressed domain and for calculating the estimated motion between two associated feature points in the compressed domain. Since two identical feature points at different time instants (or more than two feature points, in the case where several feature points represent one object) have to be linked together in order to find their motion vectors, the invention also discloses a signature technique for feature points in the compressed domain and in the pixel domain. This signature technique is described in more detail in step S83 of Fig. 8.
Fig. 6 depicts the data flow of the metadata extraction unit. The parsing unit 61 is responsible for MPEG bitstream parsing and for DCT-coefficient and motion vector extraction, and is described in more detail in Fig. 7. The parsing unit 61 provides the macroblock type of the current frame (I: intra, B: bidirectionally predicted, P: predicted), the extracted MPEG motion vectors (if they exist for this macroblock type) and the DCT-coefficients (if they exist) to the feature point selection unit and motion estimation unit 62.
The feature point selection unit 63 is controlled by a feature point fidelity parameter. From these input data, it calculates the feature point coordinates of the current frame and passes them to the feature point motion estimation unit 64, the parametric and camera motion calculation unit 65 and the video summarization unit 66. From the feature point selection unit 63, the candidate motion vectors MV(x, y), the required motion vector resolution and the search area are passed to the feature point motion estimation unit 64. The control flow of feature point selection and motion estimation is described in Fig. 8. The feature point motion estimation unit 64 calculates the motion vectors from the feature point coordinates of the current frame and the feature point coordinates of the reference frame, and outputs these motion vectors to the parametric and camera motion calculation unit 65.
The parametric and camera motion calculation unit 65 obtains the motion vectors from the preceding step, calculates the parameters of the parametric motion model and the camera motion parameters, and passes these parameters to the video summarization unit 66.
The video summarization unit 66 comprises the basic steps of a feature point life-span list 67 and of a feature-point-and-motion-based scene change detection and keyframe extraction unit 68.
The feature point life-span list 67 contains the feature point coordinates and signature, the motion vectors associated with the feature point, and the distance measure calculated for the motion vectors; compare, for example, Fig. 12. The feature-point-and-motion-based scene change detection and keyframe extraction unit 68 sends the frame numbers of the scene changes, the keyframes with their associated relevance levels, and the camera motion parameters as metadata to the video browsing unit 35 shown in Fig. 3.
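A minimal sketch of what a feature point life-span entry (compare Fig. 12) could look like follows; the field names are assumptions for illustration, since the text only enumerates the stored quantities.
```python
# Hypothetical sketch of a feature point life-span entry; names are illustrative.
from dataclasses import dataclass, field

@dataclass
class FeaturePointLifeSpan:
    first_frame: int                                     # frame in which the point was registered
    coords: list = field(default_factory=list)           # (x, y) per tracked frame
    signatures: list = field(default_factory=list)       # block signature per frame
    motion_vectors: list = field(default_factory=list)   # (MVx, MVy) per frame
    distances: list = field(default_factory=list)        # SAD/MSE of each match

    @property
    def lifetime(self) -> int:
        return len(self.coords)
```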
The video summarization unit 66 can be controlled with respect to the depth of the summary, i.e. the number of keyframes with their associated relevance levels and camera motion parameters, by an (optional) external profile.
Fig. 7 depicts the parsing unit, which consists of an MPEG bitstream parsing unit 71 that extracts the MPEG video bitstream, for example from an MPEG transport stream. A frame- and macroblock-type extraction unit 72 extracts the macroblock type, and in the case where the current macroblock (MB) is a P-MB or B-MB (or a P-VOP or B-VOP, respectively) 74, the motion vectors for this macroblock (or VOP) are also extracted by a motion vector extraction unit 75. From the pre-parsed bitstream, a DCT-coefficient extraction unit 73 extracts the intra blocks within I-frames, P-frames and B-frames (or I-VOPs, P-VOPs and B-VOPs in MPEG-4).
Fig. 8 depicts the feature point selection and motion estimation process using the IDCT (inverse discrete cosine transform) on selected blocks only.
When using the CIF format (352x288 pixels), full decoding of the MPEG stream (compare Fig. 1) requires 2x396x4 = 3168 IDCT calculations for cur and ref. However, for camera motion estimation, for example, only about 6 feature points (num=6) in cur with their associated motion vectors are necessary for, say, a 6-parameter motion model. In this example, for each feature point, one IDCT calculation in cur and 4 IDCT calculations in ref are needed when using a small [-4, +4] pixel search area (for example, around a predictor), i.e. 5x6 = 30 IDCT calculations. This gives a considerable benefit in terms of the computational requirements of the IDCT, a reduction by a factor of about 100. For large motion, the MPEG motion vectors can also be used as predictors for the search area. When using the MPEG motion vectors as predictors, a search area of [-4, +4] is usually sufficient, but the search area can be chosen as appropriate.
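The arithmetic of the preceding paragraph can be checked directly; the short sketch below reproduces it under the stated assumptions (CIF, 396 macroblocks of 4 luminance blocks each, num=6 feature points, and a [-4, +4] search window touching at most 4 reference 8x8 blocks per point).
```python
# A minimal sketch of the IDCT-count comparison above, under the stated assumptions.
mb_per_cif_frame = 396
blocks_per_mb = 4

full_decode = 2 * mb_per_cif_frame * blocks_per_mb   # decode cur and ref frames fully
feature_based = 6 * (1 + 4)                          # num * (1 cur block + 4 ref blocks)

print(full_decode, feature_based, full_decode / feature_based)
# -> 3168 30 105.6  (roughly the factor-100 reduction claimed)
```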
In Fig. 8, step S81 calculates the block relevance metrics of all 8x8 blocks in the current frame, sorts these blocks according to their relevance, and determines the number "num" of blocks in cur with the highest relevance. The calculation of the block relevance metric is explained in more detail in Fig. 9. Note that only blocks within intra-macroblocks can be selected as "new" relevant points, and that a relevant point (once selected) can be tracked through I-frames, P-frames and B-frames. Step S81 of the preferred embodiment is described in detail in Fig. 9.
In step S82 of Fig. 8, the 8x8 IDCT (and, for blocks in P-macroblocks or B-macroblocks, the MC, motion compensation) is calculated for the "num" selected cur blocks. The 8x8 IDCT and MC calculations are methods well known to those skilled in the art.
In Fig. 8, step S83 performs the block signature extraction for all "num" cur blocks. Two preferred embodiments are disclosed herein for the block signature calculation: a) calculation of the block signature in the pixel domain; and b) calculation of the block signature in the DCT-domain. Since the block signature has to be calculated only for the "num" cur blocks, which have already been transformed to the pixel domain in step S82, no significant additional computational overhead for the pixel-domain block signature arises from this step.
As a simple pixel-domain block signature, all or a selected number of the pixels of the block can be used as the signature, and signature matching can be performed using the SAD (sum of absolute differences), the MSE (mean squared error) or other criteria well known to those skilled in the art, such as the Hausdorff distance. However, since this is not very suitable in terms of representation efficiency, higher-level block feature point signatures in the pixel domain represent the preferred embodiment. These higher-level signatures include edge detection techniques such as those of Canny (John Canny, "A computational approach to edge detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, no. 6, pp. 679-698, 1986), Sobel, Prewitt and Marr/Hildreth (David Marr, Ellen Hildreth, "Theory of edge detection", Proc. of the Royal Society of London B, vol. 207, pp. 187-217, 1980), texture and color classification, and image registration techniques such as Lucas/Kanade (Bruce D. Lucas and Takeo Kanade, "An Iterative Image Registration Technique with an Application to Stereo Vision", International Joint Conference on Artificial Intelligence, pp. 674-679, 1981), together with their matching criteria, which are the preferred embodiment, as well as other techniques well known to those skilled in the art.
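A minimal sketch of the simple pixel-domain signature with SAD matching described above follows; the 8x8 nested-list block representation and the threshold value are illustrative assumptions.
```python
# A minimal sketch of pixel-domain block signature matching via SAD.
def sad(block_a, block_b) -> int:
    """Sum of absolute differences between two 8x8 pixel blocks."""
    return sum(
        abs(block_a[y][x] - block_b[y][x])
        for y in range(8) for x in range(8)
    )

def match_signature(cur_block, candidates, threshold: int = 512):
    """Return the candidate block with the lowest SAD, or None if no candidate
    falls below the (application-dependent, illustrative) threshold."""
    if not candidates:
        return None
    best = min(candidates, key=lambda c: sad(cur_block, c))
    return best if sad(cur_block, best) <= threshold else None
```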
For the DCT-domain block signature calculation, all or a selection of the DCT-coefficients shown in Fig. 11 can be used for the feature point registration. The DCT-coefficients of the DCT-block signature can be obtained from the luminance (Y) blocks only, or optionally also from the chrominance (U, V) DCT-blocks. Here, only the use of the DCT-coefficients of the luminance blocks is described, but those skilled in the art can easily extend this to the chrominance blocks. Preferred embodiments include, depending on the application: a) D00; b) D00, D01, D02, D03; and c) all DCT-coefficients. Preferred embodiments for the distance calculation between C_hv (the signature of the current DCT-block) and D_hv (the DCT-coefficients representing the signature of the compared DCT-block) include:
Distance = Σ_{h=0}^{hmax} Σ_{v=0}^{vmax} p_hv · |C_hv − D_hv|
or
Distance = Σ_{h=0}^{hmax} Σ_{v=0}^{vmax} p_hv · (C_hv − D_hv)²
where the summations run, for example, from h=v=0 to hmax=vmax=7, and each term can optionally be weighted by a weighting factor p_hv. With these parameters, the DCT-block signature can be adapted to various applications; for example, for image stitching of a video sequence, values of h, v, hmax, vmax and p_hv different from those for video summarization or camera motion estimation can be selected. For higher-level DCT-block signatures, preferred embodiments also include DCT-block activity features, DCT-directional features and DCT-energy features, as described in K.R. Rao, P. Yip, "Discrete Cosine Transform - Algorithms, Advantages, Applications", Academic Press, 1990, and in Bo Shen, Ishwar K. Sethi, "Direct feature extraction from compressed images", SPIE vol. 2670, Storage & Retrieval for Image and Video Databases IV, 1996, which are well known to those skilled in the art.
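The two distance measures above can be written compactly as follows; a sketch under the assumption that signatures are 8x8 coefficient arrays and that unspecified weights p_hv default to 1.
```python
# A minimal sketch of the weighted DCT-signature distances above.
def dct_distance(C, D, weights=None, hmax=7, vmax=7, squared=False):
    """Weighted absolute (or squared) coefficient difference between the
    current-block signature C and a compared-block signature D."""
    total = 0.0
    for h in range(hmax + 1):
        for v in range(vmax + 1):
            p = weights[h][v] if weights else 1.0
            diff = C[h][v] - D[h][v]
            total += p * (diff * diff if squared else abs(diff))
    return total
```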
In step S84 of Fig. 8, the motion vector predictor (MV), the reference block position and the search area are calculated for one selected cur block. The motion vector prediction scheme depends strongly on the application. For example, for camera motion extraction using an affine 6-parameter motion model, the position of the feature point in ref can be predicted from the previous frames. The motion of feature points can be predicted similarly for object tracking. In the case of P-macroblocks or B-macroblocks, the motion vectors extracted from the compressed MPEG bitstream can be used as the center of the search area in ref. In this case, and especially in the case where the MPEG motion vectors are small, the search area can be chosen smaller; this means that only 4 IDCT decodings and motion compensations are then already sufficient. In the case of intra-macroblocks, it has to be determined by means of the DCT-block signature or the pixel-domain block signature whether one or several blocks are new. In the case where the block is new, a preferred embodiment then sets a larger search area, depending on the application.
If the block signature indicates that this block already existed in one or several past frames, the next motion direction and search range can be determined from the motion vector history in the block list by motion vector prediction methods well known to those skilled in the art. In step S85 of Fig. 8, the 8x8 IDCT is calculated for all block positions in the I-/P-/B-reference frame that were determined in step S84. The block positions are those within the search area calculated in step S84, centered on the motion vector predictor calculated in step S84. For P-reference macroblocks and B-reference macroblocks, the MC (motion compensation) is also calculated.
This technique is identical to the technique used in MPEG-1/MPEG-2/MPEG-4 standard decoders and is well known to those skilled in the art. Note that the IDCT (and the MC in the case of P-macroblocks and B-macroblocks) is not applied to the entire frame, but only to the small search areas in ref associated with the "num" blocks in cur, and is therefore significantly faster than full decoding of the entire frame.
In step S86 of Fig. 8, an 8x8 motion estimation is performed in the pixel domain for the search positions in ref within the search areas around all predicted MVs calculated in step S84, in order to find the best motion vector in the selected search area in ref for the block in cur. For the 8x8 motion estimation in the pixel domain, preferred embodiments include, but are not limited to, motion estimation methods well known to those skilled in the art such as full-search block matching, pixel-recursive search, etc.; compare Peter Kuhn, "Algorithms, Complexity Analysis and VLSI-Architectures for MPEG-4 Motion Estimation", Kluwer Academic Publishers, June 1999, ISBN 0792385160. Note that for P-macroblocks/B-macroblocks, since the motion vector from the MPEG bitstream is used as the motion vector predictor (although it is defined for a 16x16 macroblock and is not always reliable), the search area (and the required computational power) can be very small. One preferred embodiment of the motion estimation unit is not limited to a block size of 8x8, but can also cover motion estimation using variable block sizes such as 4x4 and 8x8. Another preferred embodiment of the motion estimation is a profile-controlled motion displacement resolution, which can be set, for example, to 1 pixel, 2 pixels or 0.5 pixel, and can be implemented by methods well known to those skilled in the art. Note that when specific features such as Lucas/Kanade features are used, it is preferable, in terms of computational complexity and tracking fidelity, to use a Lucas/Kanade/Tomasi feature tracker within the calculated search area instead of performing block-matching motion estimation on these feature points.
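A minimal sketch of full-search block matching, one of the step S86 embodiments named above, follows; it assumes the search window stays inside the frame (border handling omitted) and uses the SAD criterion for illustration.
```python
# A minimal sketch of full-search 8x8 block matching in a window around a
# predicted motion vector. Frames are 2-D luminance arrays (lists of rows).
def block_match(cur, ref, x0, y0, pred_mv=(0, 0), radius=4):
    """Return the (MVx, MVy) minimizing the SAD of the 8x8 block at (x0, y0)."""
    def block(frame, x, y):
        return [row[x:x + 8] for row in frame[y:y + 8]]
    cur_block = block(cur, x0, y0)
    best_mv, best_cost = pred_mv, float("inf")
    for dy in range(pred_mv[1] - radius, pred_mv[1] + radius + 1):
        for dx in range(pred_mv[0] - radius, pred_mv[0] + radius + 1):
            cand = block(ref, x0 + dx, y0 + dy)
            cost = sum(abs(a - b) for ra, rb in zip(cur_block, cand)
                       for a, b in zip(ra, rb))
            if cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost
```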
In step S87 of Fig. 8, the block signature of the block in ref pointed to by the motion vector of the best-matching 8x8 position (determined in step S86) is calculated using the same method as described in step S83. Note that all pixels of the best-matching 8x8 position have to be transformed to the DCT-domain when a DCT-block signature is used.
In step S88 of Fig. 8, the block position in cur (for which steps S84, S85, S86 and S87 have just been performed), the block signature calculated in step S87, the motion vector, and the distance between the current block and the reference block pointed to by the best motion vector (calculated in step S86; MSE: mean squared error, or SAD: sum of absolute differences, depending on the motion estimation algorithm used) are saved in a data structure, a preferred embodiment of which is the one described in Fig. 12. In the case where the result of the distance calculation is higher than a threshold given by the application and the last of the "num" blocks has been processed, one or more of the following strategies is adopted: taking additional blocks beyond the "num" blocks from the sorted block list, or increasing the search area of the motion estimation unit. This method allows adaptation to different content material and coding structures of the compressed video.
In step S89 of Fig. 8, it is checked whether all "num" blocks determined in step S83 have been processed. If all determined "num" blocks have been processed (Yes), the feature-point-based motion estimation algorithm stops here for this frame; if not all have been processed (No), step S90 is entered.
In step S90 of Fig. 8, the next "num" block position that has not yet been visited for motion estimation is determined, and the loop comprising steps S84, S85, S86, S87 and S88 is executed again.
Fig. 9 depicts a preferred embodiment of the block relevance metric calculation. The block relevance metric represents the suitability of a block for motion estimation or motion tracking, and is usually determined by (but not limited to) visual characteristics such as edges, color or other significant structural gradients. When P-frames or B-frames are available, the motion information contained in the P-macroblocks and B-macroblocks of these frames can be used to help find blocks with high relevance.
In step S91 of Fig. 9, the macroblock counter of the current frame, MBcur, is set to zero. This counter iterates over all macroblocks in the current frame regardless of their macroblock type (I-type, P-type or B-type).
In step S92 of Fig. 9, the macroblock MBref in the reference frame associated with MBcur is selected. If a motion vector exists for MBcur (this information is available because the next coded frame of the compressed bitstream has already been visited), MBref is the macroblock associated with that motion vector. If no motion vector exists for MBcur (or it has a zero-length motion vector), then MBref is the macroblock with the same number as MBcur. The macroblock types of MBcur and MBref are also extracted from the compressed bitstream in this step.
In step S93 of Fig. 9, a condition is tested. In the case where the macroblock type of MBcur is intra and MBref is a P-type or B-type macroblock, step S94 is entered.
In step S98 of Fig. 9, another condition is tested. In the case where the macroblock type of MBcur is P-type and MBref is B-type, step S99 is entered.
In step S104 of Fig. 9, another condition is tested. In the case where the macroblock type of MBcur is intra and MBref is intra, step S105 is entered. Step S105 and the subsequent steps handle all intra DCT-only encoded MPEG formats as well as other formats such as DV or MJPEG.
In step S94 of Fig. 9, the block counter for the DCT-blocks of the intra-macroblock (Fig. 4) is set to zero, and step S95 is entered.
In step S95 of Fig. 9, a preferred embodiment of the relevance calculation for block_MBcur,i is described, where the relevance of this 8x8 DCT-block is defined as follows:
Relevance(block_MBcur,i) = Activity(block_MBcur,i) + k · √(MV_MBcur,x² + MV_MBcur,y²) · DCTenergy(block_MBref,i)
where "k" is a weighting factor chosen according to the application, and differs between selection for motion estimation (for example, by block matching) and selection for tracking (for example, by feature point tracking techniques such as Lucas/Kanade/Tomasi). A preferred embodiment of the activity measure of an 8x8 block in the DCT-domain is defined below, where D_hv are the DCT-coefficients (Fig. 11):
Activity = Σ_{h=0}^{hmax} Σ_{v=0}^{vmax} |D_hv|,  (h, v) ≠ (0, 0)
The value of hmax=vmax is usually chosen as 7, but it can be selected in the range (1...6) in order to obtain a faster but less noise-robust implementation. However, other DCT-activity or edge measures, as defined in K.R. Rao, P. Yip, "Discrete Cosine Transform - Algorithms, Advantages, Applications", Academic Press, 1990, also represent possible embodiments of the present invention. The DCTenergy is defined as:
DCTenergy = Σ_{h=0}^{hmax} Σ_{v=0}^{vmax} |D_hv|
Another preferred embodiment, with reduced computational complexity, is to use for each individual relevance calculation only the sum of the motion vector components (and not the sum of squares), and to set the DCT-energy term to 1.
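Gathering the definitions above, the following sketch implements the step S95 relevance under the reconstruction given here (activity plus a motion-magnitude term weighted by the reference block's DCT energy); the value of k and the nested-list block layout are illustrative assumptions.
```python
# A minimal sketch of the step S95 relevance metric and its DCT measures.
import math

def activity(D, hmax=7, vmax=7):
    """Sum of absolute AC coefficients of an 8x8 DCT block (DC excluded)."""
    return sum(abs(D[h][v]) for h in range(hmax + 1) for v in range(vmax + 1)
               if (h, v) != (0, 0))

def dct_energy(D, hmax=7, vmax=7):
    """Sum of absolute coefficients, DC included."""
    return sum(abs(D[h][v]) for h in range(hmax + 1) for v in range(vmax + 1))

def relevance(cur_block, ref_block, mv, k=1.0):
    """Relevance of a cur block given its MPEG motion vector mv = (MVx, MVy)."""
    return activity(cur_block) + k * math.hypot(*mv) * dct_energy(ref_block)
```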
In Fig. 9, steps S96 and S97 iterate step S95 four times, until all four blocks of MBcur have been processed.
In step S99 of Fig. 9, the block counter for the blocks inside the macroblock (Fig. 4) is set to zero, and step S100 is entered.
In step S100 of Fig. 9, since in a P-macroblock or B-macroblock the macroblock pixels are predicted from a previous frame (and, in the case of a B-frame, also from a future frame) and no new feature points appear here, the relevance of this block is set to zero.
However, in step S101 of Fig. 9, existing block feature points that have been tracked from other frames are still kept in the feature point list of the "num" current block feature points. Note that for these feature points, since the macroblock is of type P or B, the IDCT and MC have to be performed in step S82.
In Fig. 9, steps S102 and S103 iterate steps S100 and S101 four times, until all four blocks of MBcur have been processed.
In step S105 of Fig. 9, the block counter for the blocks inside the macroblock (Fig. 4) is set to zero, and step S106 is entered.
In step S106 of Fig. 9, the block relevance of MBcur is calculated for the case where the current macroblock and the reference macroblock are both intra-macroblocks. The block relevance is calculated as follows:
Relevance(block_MBcur,i) = Activity(block_MBcur,i) + Activity(block_MBref,i)
with
Activity(block_MBref,i) = Σ_{k=0}^{kmax} m_k · Activity(block_MBref,k)
where the calculation of the activity in the DCT-domain is as described above. For the activity calculation of the associated block in the reference frame, the activity measures of the corresponding block and of the kmax adjacent blocks are weighted and added to the activity of the current block. The activity of the adjacent blocks gives a hint for the size of the search area of the subsequent motion estimation. The value kmax depends on the frame size and on application constraints. The values m_k weight the activity of the more distant reference DCT-blocks and are determined according to application constraints; for the preferred embodiment, m_k is small and below 1, but it can also be zero for other (for example, computationally more constrained) embodiments.
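A sketch of the intra/intra relevance of step S106 under the same reading follows, with the corresponding reference block weighted by m_0 and each adjacent block by a smaller weight; the weights and neighbourhood size are illustrative assumptions.
```python
# Hypothetical sketch of the step S106 intra/intra relevance; weights illustrative.
def activity(D):
    """Sum of absolute AC coefficients of an 8x8 DCT block (DC excluded)."""
    return sum(abs(D[h][v]) for h in range(8) for v in range(8) if (h, v) != (0, 0))

def intra_relevance(cur_block, ref_block, ref_neighbours, m=(1.0, 0.25)):
    """Corresponding reference block weighted by m[0], each adjacent block by m[1]."""
    ref_act = m[0] * activity(ref_block) + sum(m[1] * activity(n) for n in ref_neighbours)
    return activity(cur_block) + ref_act
```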
In Fig. 9, steps S107 and S108 iterate step S106 four times, until all four blocks of MBcur have been processed.
In Fig. 9, steps S109 and S110 determine whether all current macroblocks have been processed, and iterate over all macroblocks of the current frame.
In Fig. 9, step S111 covers the sorting of the block relevance list, the merging with the already tracked feature points, and the determination of the "num" blocks to output. The blocks of MBcur are stored according to their block relevance value, and the best "num" feature points must be determined. Sorting algorithms are well known to those skilled in the art. The choice of the number of feature points depends mainly on the target application. For example, for a 6-parameter camera motion estimation based on the affine 6-parameter model, 6 feature points with their associated motion vectors are required. In this case, therefore, at least the 6 blocks with the highest relevance must be selected; for this example, the 6 blocks with the best relevance measure are chosen. For video summarization, the number of representative feature points selected depends on an externally chosen fidelity parameter. For other applications, the maximum number of feature points is limited only by the number of 8x8 blocks in the image. In cases where the tracking of a feature point yields only very short motion vectors (which are often disturbed by noise), or where the subsequent motion estimation process gives insufficient results (i.e., a very high distance measure occurs), a preferred embodiment of the present invention selects the n next-best feature points according to their relevance, until no feature points are left. For feature point tracking applications, the newly calculated block feature points with high relevance must be merged with the existing block feature points tracked from earlier frames.
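A minimal sketch of the sorting and of the next-best fallback described above (the data layout and names are assumptions):

```python
def best_blocks(relevance_list, num):
    """relevance_list: [(block_position, relevance), ...].
    Returns the 'num' blocks with the highest relevance; any standard
    sorting algorithm may be used."""
    return sorted(relevance_list, key=lambda e: e[1], reverse=True)[:num]

def next_best_blocks(relevance_list, rejected, n):
    """Fallback: when tracked points yield only very short (noisy) motion
    vectors or a very high distance measure, take the n next-best
    candidates until no feature points are left."""
    remaining = [e for e in relevance_list if e[0] not in rejected]
    return best_blocks(remaining, n)

ranked = best_blocks([((0, 0), 5.0), ((8, 0), 9.5), ((0, 8), 7.1)], num=2)
```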
The second preferred embodiment will now be described in detail.
Figure 10 depicts the second preferred embodiment of the present invention, which uses DCT-based motion estimation. This method has the advantage that no macroblock of the current block or of the search area has to be converted from the DCT domain into the pixel domain by means of the IDCT. However, where P-frames or B-frames exist in the compressed video bitstream, motion compensation (MC) must be performed in the compressed domain, which introduces some loss of accuracy. DCT-based motion estimation across block boundaries can also cause a loss of accuracy. The main application of the second embodiment of the present invention is expected to be in video fields dominated by Intra-frames, such as the DV and MJPEG compressed bitstreams and the Intra-frame-only MPEG format often used in broadcast services.
In step S121 of Figure 10, the block relevance metric is calculated for all 8x8 blocks in cur, using the same procedure described for step S81 of Fig. 8.
In step S122 of Figure 10, the block signatures of all selected "num" cur blocks are calculated. In principle, both methods described for step S83 of Fig. 8, the one in the DCT domain and the one in the pixel domain, can be used. The advantage of the DCT-domain block signature method described for step S122 of Figure 10 is that no IDCT is needed at this step, and the complete algorithm of Figure 10 can be performed without carrying out any IDCT. However, for P-macroblocks and B-macroblocks, motion compensation is needed either in the compressed domain or in the pixel domain.
In step S123 of Figure 10, the predicted motion vector, the calculated reference block location, and the search area in ref are determined using the same procedure described for step S84 of Fig. 8.
In step S124 of Figure 10, for the P-macroblocks and B-macroblocks of the search area in ref, motion compensation (MC) must be computed in the DCT compressed domain. One of several preferred embodiments is a revised version of the algorithms described in Shih-Fu Chang, David G. Messerschmitt, "Manipulation and Compositing of MC-DCT Compressed Video", IEEE Journal on Selected Areas in Communications, Vol. 13, No. 1, 1995, and in Yoshiaki Shibata, Zhigang Chen, Roy H. Campbell, "A fast degradation-free algorithm for DCT block extraction in the compressed domain", ICASSP 99, 1999.
In step S125 of Figure 10, motion estimation is calculated in the DCT domain for all search positions in ref around the predicted motion vector. For the best search position, the distance metric value and the motion vector are saved. A preferred embodiment for calculating motion estimation in the DCT domain is given, for example, in Ut-Va Koc, K.J. Ray Liu, "DCT-based motion estimation method", U.S. Pat. 5,790,686, August 1998.
In step S126 of Figure 10, the block signature of the best motion vector position in ref is calculated. In principle, both methods described for step S122 of Figure 10, the one in the DCT domain and the one in the pixel domain, can be used. The advantage of the DCT-domain block signature method described for step S83 of Fig. 8 is that no IDCT is needed at this step, and the complete algorithm of Figure 10 can be performed without carrying out any IDCT. The pixel-domain block signature method requires only two IDCTs, one for each of the "num" current blocks and one for the best displaced block found by the compressed-domain motion estimation, so its computational cost is still low.
In step S127 of Figure 10, the position, block signature, motion vector, and distance criterion of the best block position in ref are saved in the block list. In the case where the result of the distance calculation is higher than a threshold given by the application and the last of the "num" blocks has been processed, one or more of the following strategies can be used: increasing the number of "num" blocks taken from the block relevance list, or increasing the search area of the motion estimation unit. This allows the method to adapt to different content material and coding structures of the compressed video.
In step S128 of Figure 10, the next of the determined "num" block positions for which motion estimation has not yet been performed is visited, and the loop comprising steps S123, S124, S125, S126, and S127 is executed again.
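The following sketch illustrates only the core idea behind steps S122 and S125: since the orthonormal DCT preserves inner products (Parseval's theorem), a sum-of-squared-differences match between block-aligned positions can be evaluated directly on DCT coefficients without any IDCT. It is restricted to block-aligned displacements and therefore sidesteps the compressed-domain motion compensation of step S124; the names and the SciPy dependency are assumptions.

```python
import numpy as np
from scipy.fft import dctn

def dct2(block: np.ndarray) -> np.ndarray:
    """Orthonormal 2-D DCT of an 8x8 pixel block."""
    return dctn(block, norm='ortho')

def ssd_in_dct_domain(a: np.ndarray, b: np.ndarray) -> float:
    """For orthonormal DCTs, the SSD of coefficients equals the SSD of pixels."""
    return float(((a - b) ** 2).sum())

def block_aligned_search(cur_dct, ref_dcts):
    """ref_dcts maps a block-grid displacement (dy, dx) to the 8x8 DCT
    block at that position of the search area in ref; returns the best
    displacement and its distance."""
    dist, disp = min((ssd_in_dct_domain(cur_dct, d), off)
                     for off, d in ref_dcts.items())
    return disp, dist

# Example: the block reappears one block position to the right
cur = np.random.rand(8, 8)
refs = {(0, 0): dct2(np.random.rand(8, 8)), (0, 1): dct2(cur)}
print(block_aligned_search(dct2(cur), refs))   # ((0, 1), ~0.0)
```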
The third preferred embodiment will now be described in detail.
Another preferred embodiment of the present invention is video summarization. It is realized by keeping a lifetime list of feature points (which can be distinguished by their feature point signatures) together with their positions in the frame, their motion vectors, their distances (from the motion vector calculation), and their signatures. Where a large number of new feature points appears in a new frame, a scene change is very likely. Similarly, when a large number of feature points has disappeared from one frame to the next, a scene change is also very likely. The keyframe of a scene is chosen as a frame in which a large number of feature points exists and the overall amount of motion is low.
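A toy sketch of this scene-change and keyframe heuristic follows; the 50% ratio and the scoring function are illustrative assumptions, not part of the embodiment.

```python
def likely_scene_changes(appeared, disappeared, alive, ratio=0.5):
    """appeared[t]/disappeared[t]: feature points entering/leaving frame t;
    alive[t]: feature points present in frame t. A frame where a large
    share of the points appears or vanishes is a likely scene change."""
    return [t for t in range(len(alive)) if alive[t] > 0 and
            max(appeared[t], disappeared[t]) / alive[t] >= ratio]

def keyframe_of_scene(frames):
    """frames: [(frame_no, n_feature_points, total_motion), ...].
    Prefer many feature points and a low overall amount of motion."""
    return max(frames, key=lambda f: f[1] / (1.0 + f[2]))[0]
```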
Figure 12 depicts a preferred embodiment of the data structure of the feature point lifetime list for video summarization. There is a linked list for each feature point; to identify each one individually, a feature_point_id is provided, as depicted at 131, 138, and 141. The feature_point_id data structure also contains an object_id field so that one or several feature points can be associated with an object. These feature_point_id entries are connected by pointers into a linked list, 136. Each feature_point_id points to another list (for example 132) of the temporal instances of this feature point in the video stream (for example 134, 135, and 137). Each entry contains the data of the spatio-temporal position of the feature point at a specific time instant (for example, location_0 = (x, y, time)), the data of the motion vector from this feature point at the specific time instant to the same feature point at the next time instant (for example, MV_0 = (MV_x, MV_y)), the distance value (distance_0) from the feature point motion vector calculation, which is used to determine the reliability of the motion vector, and the feature point signature (signature_0) associated with the correct feature point under the same feature_point_id. Note that for some applications some of these data fields are optional, or other data fields are needed.
The temporal instances of these feature points are also connected by linked lists, where the links between the last and first entries can be regarded, for example, as a function of the part of a video in which an object (comprising several feature points) or a specific motion pattern occurred. For these linked lists there is a mechanism that removes feature_point_id entries over time once they no longer occur in the scene. There is also a mechanism for adding new feature_point_id entries, which uses the distance between feature points in signature space. The distance in this signature space determines whether a point is a new feature point or is related to an existing feature point. Another mechanism adds new feature_point_id entries to an existing object based on their spatial distance from that object. From the motion vectors contained in the feature fields of a feature_point_id, the motion trajectory of this feature point over time can be constructed, as is well known to those skilled in the art (for example, by Kalman filtering or by Lucas/Kanade/Tomasi feature tracking, but not limited thereto).
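For illustration, a minimal sketch of this data structure and of the signature-space adding mechanism follows; the field names follow Figure 12 loosely, and the byte-wise distance and the threshold are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class TimeInstance:
    """One temporal instance of a feature point (cf. 134, 135, 137)."""
    frame: int
    location: Tuple[int, int]            # (x, y) position in the frame
    motion_vector: Tuple[float, float]   # (MV_x, MV_y) to the next instance
    distance: float                      # distance_0: motion vector reliability
    signature: bytes                     # signature_0: the block signature

@dataclass
class FeaturePointTrack:
    feature_point_id: int
    object_id: Optional[int] = None      # groups feature points into objects
    instances: List[TimeInstance] = field(default_factory=list)

def signature_distance(a: bytes, b: bytes) -> int:
    """Toy distance in signature space: the number of differing bytes."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def match_or_create(tracks, sig, next_id, threshold=4):
    """Adding mechanism: attach to the closest existing track whose last
    signature is within 'threshold', otherwise open a new track."""
    live = [t for t in tracks if t.instances]
    best = min(live, default=None,
               key=lambda t: signature_distance(t.instances[-1].signature, sig))
    if best and signature_distance(best.instances[-1].signature, sig) <= threshold:
        return best, next_id
    fresh = FeaturePointTrack(feature_point_id=next_id)
    tracks.append(fresh)
    return fresh, next_id + 1
```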
The motion vectors belonging to the several feature_point_id entries of one object_id group (grouped, for example, according to signature and the spatial distance between their positions) can be used to calculate the motion parameters of the object identified by these feature_point_id entries, as known to those skilled in the art. In the case where the rectangular background frame is chosen as the object, such a linked list can similarly be used to represent the camera motion, which is described in detail in the preferred embodiment below.
The fourth preferred embodiment will now be described in detail.
Figure 13 depicts the possible directions of camera motion, comprising zoom, rotation in three directions, and translation in three directions. A preferred embodiment of the present invention uses the extracted motion metadata contained in the data structure of Figure 12 to calculate, in the compressed domain, camera motion of the video sequence such as zoom, pan, and tilt. Fast and efficient calculation of the camera motion is useful, for example, for efficient video browsing (representing keyframes together with their associated camera motion), for video editing (for example, cutting the video at the frames of a zoom-out), and for transcoding from one compressed representation (e.g., MPEG-2) to another (e.g., MPEG-4).
To extract the camera motion parameters from the feature point motion vectors obtained at 62 in Fig. 6, one preferred embodiment uses the camera motion model and the camera motion parameter extraction method of M.V. Srinivasan, S. Venkatesh, R. Hosie, "Qualitative estimation of camera motion parameters from video sequences", Pattern Recognition (Elsevier), Vol. 30, No. 4, pp. 593-606, 1997:
$$u_x = -r_y + Y \cdot r_z + X \cdot r_{zoom}$$
$$u_y = r_x - X \cdot r_z + Y \cdot r_{zoom}$$
In this algorithm, a synthetic motion vector field described by the above equations is calculated from the parameters r_x, r_y, r_z, and r_zoom for each motion vector (u_x, u_y), where X and Y are the pixel coordinates in the image plane. The synthetic motion vector field is then subtracted from the actual motion vector field (provided at step 62 of Fig. 6), and the parallelism of the residual motion vector field is calculated. The residual motion vector field represents the translational component of the camera motion. The optimal parameters r_x, r_y, r_z, and r_zoom are found when all motion vectors of the residual motion vector field are parallel. The algorithm performs, for example, a four-dimensional simplex minimization by varying the parameters r_x, r_y, r_z, and r_zoom until the best approximation to parallelism of the residual (translatoric) motion vector field is obtained. However, other methods well known to those skilled in the art for determining a parametric camera or object motion model from motion vectors are also feasible.
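A sketch of this estimation under stated assumptions follows: the parallelism measure used here (the circular spread of the residual vector angles) is only one of several possibilities, and the SciPy Nelder-Mead optimizer stands in for the four-dimensional simplex minimization.

```python
import numpy as np
from scipy.optimize import minimize

def synthetic_field(params, X, Y):
    """u_x = -r_y + Y*r_z + X*r_zoom,  u_y = r_x - X*r_z + Y*r_zoom."""
    r_x, r_y, r_z, r_zoom = params
    return (-r_y + Y * r_z + X * r_zoom,
            r_x - X * r_z + Y * r_zoom)

def non_parallelism(params, X, Y, u_x, u_y):
    """1 minus the mean resultant length of the residual vector angles:
    zero when all residual (translatoric) vectors are parallel."""
    sx, sy = synthetic_field(params, X, Y)
    ang = np.arctan2(u_y - sy, u_x - sx)
    return 1.0 - np.hypot(np.cos(ang).mean(), np.sin(ang).mean())

def estimate_camera_motion(X, Y, u_x, u_y):
    """Simplex (Nelder-Mead) search over (r_x, r_y, r_z, r_zoom)."""
    res = minimize(non_parallelism, x0=np.zeros(4),
                   args=(X, Y, u_x, u_y), method='Nelder-Mead')
    return res.x

# Example: a field that is pure zoom plus a constant pan component;
# the search should recover r_z ~ 0 and r_zoom ~ 0.1
X, Y = [a.ravel().astype(float) for a in np.meshgrid(range(-4, 5), range(-4, 5))]
u_x, u_y = 0.1 * X + 2.0, 0.1 * Y + 1.0
print(estimate_camera_motion(X, Y, u_x, u_y))
```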
Figure 14 depicts an outline of an example of the graphical representation of the video browsing unit. This video browsing unit (or video browsing user interface) uses motion information (that is, metadata), and in particular camera motion metadata, to enable hierarchical decomposition and video summarization at the shot and keyframe level. A shot is defined here as a sequence of video frames captured by one camera in a single continuous action in time and space. The present invention is general and is not limited to camera motion: it also covers motion- and scene-related metadata, such as parametric object motion, for this video browser. Nor is the invention restricted to rectangular frames; it can also be used for browsing arbitrarily shaped objects together with their associated motion metadata. In the example below, the video browsing unit is described for the common case of camera motion and rectangular frames. For video browsing, a hierarchical camera motion state transition graph model is used.
First, segments of similar motion metadata are identified using gradient and clustering techniques well known to those skilled in the art, and from these a collection of keyframes is derived to represent each video segment. The camera motion transition arcs between the keyframes of the individual segments are described by the camera motion parameters, and these camera motion parameters are represented visually in the browser. The amount of camera motion is depicted in the video browser, allowing the user to distinguish visually between small and large camera motion, or between a slow and a fast camera zoom.
Figure 14 depicts, for example, the case of three motion metadata states: camera pan, camera zoom, and camera rotation.
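As a toy illustration of how per-frame camera parameters could be mapped onto such states and their transitions (the thresholds and the precedence order are assumptions):

```python
def classify_state(r_x, r_y, r_z, r_zoom, thresh=0.05):
    """Map estimated camera parameters onto a browser state."""
    if abs(r_zoom) >= thresh:
        return 'zoom'
    if abs(r_z) >= thresh:
        return 'rotation'
    if max(abs(r_x), abs(r_y)) >= thresh:
        return 'pan'
    return 'static'

def state_runs(per_frame_params):
    """Collapse per-frame states into (state, first, last) runs; the arcs
    of the transition graph connect consecutive runs."""
    runs = []
    for t, p in enumerate(per_frame_params):
        s = classify_state(*p)
        if runs and runs[-1][0] == s:
            runs[-1] = (s, runs[-1][1], t)
        else:
            runs.append((s, t, t))
    return runs

print(state_runs([(0, 0.2, 0, 0)] * 3 + [(0, 0, 0, 0.3)] * 2))
# [('pan', 0, 2), ('zoom', 3, 4)]
```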
Item 151 of Figure 14 depicts a camera pan state with a camera pan constant of 0.5 in the X direction. The arrow depicts the direction of the camera pan, and its length the relative speed of the camera motion. One preferred graphical representation of a camera pan is a mosaic representation composed of the successive frames of the camera pan. The generation of such a mosaic representation is well known to those skilled in the art; see M. Irani, P. Anandan, J. Bergen, R. Kumar, S. Hsu, "Efficient representations of video sequences and their applications", Signal Processing: Image Communication, Vol. 8, 1996.
Item 152 of Figure 14 depicts a preferred embodiment of the graphical representation of a camera zoom state in the state transition graph, here a camera zoom by a factor of 2 over time. The thumbnail (that is, the keyframe) in the camera zoom representation depicts the center of the camera zoom. The length of the arrows in the camera zoom window represents the relative camera zoom speed. Arrows pointing towards the center indicate a zoom-out, and arrows pointing outwards from the center indicate a zoom-in.
Item 153 of Figure 14 depicts a preferred embodiment of the graphical representation of a camera rotation, where the thumbnail in the icon represents a frame at the focus of the camera rotation. The arrow depicts the direction of the rotation, and its length represents the relative speed of the camera rotation.
Each camera motion icon represents a specific camera motion state, and the arrows between the camera motion icons represent the transitions between these specific camera motion states. Transitions can be found simply by, for example, gradient techniques, or by applying thresholds to the amount of each type of camera motion between successive frames. More advanced algorithms well known to those skilled in the art can also be used, however. The center of a zoom is determined by the intersection point of all (artificially extended) motion vectors.
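A least-squares sketch of this intersection follows, extending each motion vector to an infinite line and minimizing the summed squared perpendicular distances; this particular formulation is an assumption for illustration.

```python
import numpy as np

def zoom_center(points, vectors):
    """points[i]: pixel position of motion vector i; vectors[i]: its
    direction. Solve sum_i P_i x = sum_i P_i p_i, where P_i projects
    onto the normal of the (artificially extended) line i."""
    A, b = np.zeros((2, 2)), np.zeros(2)
    for p, v in zip(points, vectors):
        d = np.asarray(v, float) / np.linalg.norm(v)
        P = np.eye(2) - np.outer(d, d)
        A += P
        b += P @ np.asarray(p, float)
    return np.linalg.solve(A, b)

# Example: three motion vectors radiating from (10, 20)
print(zoom_center([(0, 0), (30, 20), (10, 50)],
                  [(10, 20), (-20, 0), (0, -30)]))   # ~ [10. 20.]
```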
Figure 15 depicts an expanded view of the video browsing unit of Figure 14. One preferred function is a BROWSE command (in a preferred embodiment, a mouse click, a function key press, or a stylus tap) on one of the three state icons (161, 163, 164), which causes a more detailed representation to be displayed. When the BROWSE command is given on the pan state window 161, the keyframe representation of the camera pan is shown, as at 162. When the BROWSE command is given on the zoom state window 163, the keyframe representation of the camera zoom is shown, as at 166. In 166, part of the keyframe 168 is visually marked (a preferred embodiment is a frame in a different color around the center of the zoom focus area). When a command is given on this colored frame (in a preferred embodiment, a mouse or stylus click), the next lower hierarchy level of the same motion metadata is displayed graphically, as at 167. When the BROWSE command is given on the rotation state window 164, the keyframe representation of the camera rotation is shown, as at 165. Another preferred embodiment of the functionality includes a PLAY command (in a preferred embodiment, a mouse double-click, a function key press, or a stylus tap) on one of the three state icons (161, 163, 164) or on one of the keyframe representations (162, 165, 166, 167), which plays the part of the video sequence exhibiting this specific metadata (in this example, the specific camera motion).
Figure 16 depicts another preferred embodiment of the functionality of the video browsing unit: when a GRAPH command (in a preferred embodiment, a mouse button, function key, or stylus click) is given on one of the three state icons (171, 173, 174) or on their keyframe representations (compare Figure 15), a graphical representation of the metadata is displayed in a coordinate representation (in a preferred embodiment, the camera motion metadata along a time/frame-number axis).
Industrial Applicability
As described in detail above, the video/audio signal processing method according to the present invention is suitable for processing a supplied video/audio signal. The method comprises the steps of: extracting, in a compressed domain of the video/audio signal, at least one compressed-domain feature point representing a characteristic of the video/audio signal; performing motion estimation on the feature point extracted in the extracting step; and tracking the feature point, associated with its motion vector, through a predetermined number of frames constituting the video/audio signal.

Therefore, in the video/audio signal processing method according to the present invention, feature points of the video/audio signal are extracted in the compressed domain, motion estimation is performed on the extracted feature points, and the feature points are tracked together with their motion vectors. A reduction in processing time or cost can thereby be achieved, and efficient processing becomes possible.

Furthermore, the video/audio signal processing apparatus according to the present invention is suitable for processing a supplied video/audio signal. The apparatus comprises: extraction means for extracting, in a compressed domain of the video/audio signal, at least one compressed-domain feature point representing a characteristic of the video/audio signal; motion estimation means for performing motion estimation on the feature point extracted by the extraction means; and feature point tracking means for tracking the feature point, associated with its motion vector, through a predetermined number of frames constituting the video/audio signal.

Therefore, in the video/audio signal processing apparatus according to the present invention, the means for extracting compressed-domain feature points extracts the feature points of the video/audio signal in the compressed domain, the motion estimation means performs motion estimation on the extracted feature points, and the feature point tracking means tracks the feature points together with their motion vectors. A reduction in processing time or cost can thereby be achieved, and efficient processing becomes possible.
Furthermore, a video/audio signal processing method according to the present invention is for processing and browsing a supplied video/audio signal. The method comprises the steps of: hierarchically establishing a camera motion transition graph, wherein the graph establishing step comprises the step of providing a graph layout having at least one main camera motion transition graph and a plurality of nodes representing other camera motions, with transition paths for the video sequence annotated on them; browsing through the camera motion transition graph by means of keyframes of the camera motion video sequence annotated at the nodes; and browsing through the camera motion transition graph by means of graphical representations of the camera motion annotated at the nodes.

Therefore, in the video/audio signal processing method according to the present invention, a camera motion transition graph is hierarchically established, browsing is performed through the camera motion transition graph by means of the keyframes of the camera motion video sequence annotated at the nodes, and browsing is performed through the camera motion transition graph by means of the graphical representations of the camera motion annotated at the nodes. A reduction in processing time or cost can thereby be achieved, and efficient processing becomes possible.

Furthermore, the video/audio signal processing apparatus according to the present invention is suitable for processing and browsing a supplied video/audio signal. The apparatus comprises: establishing means for hierarchically establishing a camera motion transition graph, wherein the graph establishing means comprises means for providing a graph layout having at least one main camera motion transition graph and a plurality of nodes representing other camera motions with transition paths for the video sequence annotation; browsing means for browsing through the camera motion transition graph by means of keyframes of the camera motion video sequence annotated at the nodes; and browsing means for browsing through the camera motion transition graph by means of graphical representations of the camera motion annotated at the nodes.

Therefore, in the video/audio signal processing apparatus according to the present invention, the graph establishing means hierarchically establishes the camera motion transition graph, the first browsing means browses through the camera motion transition graph by means of the keyframes of the camera motion video sequence annotated at the nodes, and the second browsing means browses through the camera motion transition graph by means of the graphical representations of the camera motion annotated at the nodes. A reduction in processing time or cost can thereby be achieved, and efficient processing becomes possible.
Furthermore, the video/audio signal processing method according to the present invention is suitable for extracting a hierarchical decomposition of a complex video selection for browsing. The method comprises the steps of: identifying the video; collecting keyframes from the video shots representing each video segment; classifying the collection of keyframes according to camera motion or global motion information; and establishing a graphical representation of the video, the graphical representation being based on the results of the classifying step, on time, and on the camera motion information associated with each part of a video shot, wherein the graphical representation establishing step comprises the step of representing each classification of the video shots by a node.

Therefore, in the video/audio signal processing method according to the present invention, the video is identified, keyframes are collected from the video shots, the collected keyframes are classified, and a graphical representation of the video is established. A reduction in processing time or cost can thereby be achieved, and efficient processing becomes possible.

Furthermore, the video/audio signal processing apparatus according to the present invention is suitable for extracting a hierarchical decomposition of a complex video selection for browsing. The apparatus comprises: identifying means for identifying the video; collecting means for collecting keyframes from the video shots representing each video segment; classifying means for classifying the collection of keyframes according to camera motion or global motion information; and establishing means for establishing a graphical representation of the video, the graphical representation being based on the results of the classification, on time, and on the camera motion information associated with each part of a video shot, wherein the graphical representation establishing means comprises means for representing each classification of the video shots by a node.

Therefore, in the video/audio signal processing apparatus according to the present invention, the identifying means identifies the video, the collecting means collects keyframes from the video shots, the classifying means classifies the collected keyframes, and the establishing means establishes the graphical representation of the video. A reduction in processing time or cost can thereby be achieved, and efficient processing becomes possible.
Furthermore, the video/audio signal processing method according to the present invention is suitable for processing a supplied video/audio signal. The method comprises the step of: extracting, in a compressed domain of the video/audio signal, at least one compressed-domain feature point representing a characteristic of the video/audio signal.

In the video/audio signal processing method according to the present invention, feature points of the video/audio signal are extracted in the compressed domain. A reduction in processing time or cost can thereby be achieved, and efficient processing becomes possible.

Furthermore, the video/audio signal processing apparatus according to the present invention is suitable for processing a supplied video/audio signal. The apparatus comprises: extraction means for extracting, in a compressed domain of the video/audio signal, at least one compressed-domain feature point representing a characteristic of said video/audio signal.

Therefore, in the video/audio signal processing apparatus according to the present invention, the means for extracting compressed-domain feature points extracts the feature points of the video/audio signal in the compressed domain. A reduction in processing time or cost can thereby be achieved, and efficient processing becomes possible.
Furthermore, the video/audio signal processing method according to the present invention is suitable for processing a supplied video/audio signal. The method comprises the step of: performing motion estimation on at least one feature point representing a characteristic of the video/audio signal in a compressed domain of the video/audio signal.

Therefore, in the video/audio signal processing method according to the present invention, motion estimation is performed on the extracted feature points. A reduction in processing time or cost can thereby be achieved, and efficient processing becomes possible.

Furthermore, the video/audio signal processing apparatus according to the present invention is suitable for processing a supplied video/audio signal. The apparatus comprises: motion estimation means for performing motion estimation on at least one feature point representing a characteristic of the video/audio signal in a compressed domain of the video/audio signal.

Therefore, in the video/audio signal processing apparatus according to the present invention, motion estimation is performed on the extracted feature points by the motion estimation means. A reduction in processing time or cost can thereby be achieved, and efficient processing becomes possible.

Claims (2)

1. A method of processing and browsing a supplied video/audio signal, comprising the steps of:
hierarchically establishing a camera motion transition graph based on at least one compressed-domain feature point representing a characteristic of said video/audio signal in a compressed domain of said video/audio signal, wherein said compressed-domain feature point, associated with its motion vector, is tracked through a predetermined number of frames constituting said video/audio signal so as to perform motion estimation on said compressed-domain feature point, and wherein the camera motion transition graph establishing step comprises the step of: providing a graph layout having at least one main camera motion transition graph and a plurality of nodes representing other camera motions, with transition paths for video sequence annotation;
browsing through the camera motion transition graph by means of keyframes of the camera motion video sequence annotated at the nodes; and
browsing through the camera motion transition graph by means of graphical representations of the camera motion annotated at the nodes.
2. An apparatus for processing and browsing a supplied video/audio signal, comprising:
establishing means for hierarchically establishing a camera motion transition graph based on at least one compressed-domain feature point representing a characteristic of said video/audio signal in a compressed domain of said video/audio signal, wherein said compressed-domain feature point, associated with its motion vector, is tracked through a predetermined number of frames constituting said video/audio signal so as to perform motion estimation on said compressed-domain feature point, and wherein the camera motion transition graph establishing means comprises: means for providing a graph layout having at least one main camera motion transition graph and a plurality of nodes representing other camera motions, with transition paths for video sequence annotation;
browsing means for browsing through the camera motion transition graph by means of keyframes of the camera motion video sequence annotated at the nodes; and
further browsing means for browsing through the camera motion transition graph by means of graphical representations of the camera motion annotated at the nodes.
CNB2004100983682A 1999-11-29 1999-11-29 Method and equipment for processing and browsing provided video/audio signal Expired - Fee Related CN100348032C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2004100983682A CN100348032C (en) 1999-11-29 1999-11-29 Method and equipment for processing and browsing provided video/audio signal

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CNB998159158A Division CN100387061C (en) 1999-11-29 1999-11-29 Video/audio signal processing method and video/audio signal processing apparatus

Publications (2)

Publication Number Publication Date
CN1625246A CN1625246A (en) 2005-06-08
CN100348032C true CN100348032C (en) 2007-11-07

Family

ID=34766639

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2004100983682A Expired - Fee Related CN100348032C (en) 1999-11-29 1999-11-29 Method and equipment for processing and browsing provided video/audio signal

Country Status (1)

Country Link
CN (1) CN100348032C (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106791862A * 2015-11-19 2017-05-31 掌赢信息科技(上海)有限公司 Video encoding method and device
US11228754B2 (en) * 2016-05-06 2022-01-18 Qualcomm Incorporated Hybrid graphics and pixel domain architecture for 360 degree video

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5708767A (en) * 1995-02-03 1998-01-13 The Trustees Of Princeton University Method and apparatus for video browsing based on content and structure
WO1998052356A1 (en) * 1997-05-16 1998-11-19 The Trustees Of Columbia University In The City Of New York Methods and architecture for indexing and editing compressed video over the world wide web

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhong, D. et al., "Clustering Methods for Video Browsing and Annotation", Proceedings of SPIE, Bellingham, US, Vol. 2670, No. 2, 1996. *

Also Published As

Publication number Publication date
CN1625246A (en) 2005-06-08

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20071107

Termination date: 20091229