WO2000045604A1 - Signal processing method and video/voice processing device - Google Patents
Signal processing method and video/voice processing device
- Publication number
- WO2000045604A1 (PCT/JP2000/000423)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- video
- audio
- segments
- feature
- scene
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/503—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7834—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7847—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
- G06F16/785—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content using colour or luminescence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7847—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
- G06F16/7864—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content using domain-transform features, e.g. DCT or wavelet transform coefficients
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/20—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/503—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
- H04N19/51—Motion estimation or motion compensation
- H04N19/573—Motion compensation with multiple frame prediction using two or more reference frames in a given prediction direction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/14—Picture signal circuitry for video frequency region
- H04N5/147—Scene change detection
Definitions
- The present invention relates to a signal processing method for detecting and analyzing patterns that reflect the semantic structure underlying a signal, and to a video/audio processing device that detects and analyzes video and audio patterns reflecting the semantic structure underlying a video signal.
- A typical 30-minute television program contains hundreds of shots. With the conventional video extraction technology described above, the user therefore had to examine a storyboard in which a huge number of extracted shots were arranged, and understanding such a storyboard placed a heavy burden on the user.
- Conventional video extraction technology also has the problem that shots in a conversation scene, in which two persons are photographed alternately as the speaker changes, are often redundant. As described above, a shot sits too low in the hierarchy to serve as the unit for extracting video structure and carries a large amount of wasted information, so conventional technology that extracts such shots was not convenient for the user.
- Some video extraction techniques, such as those described in "A. Merlino, D. Morey and M. Maybury, Broadcast news navigation using story segmentation, Proc. of ACM Multimedia 97, 1997" and in Japanese Patent Application Laid-Open No. H10-1369297, use highly specialized knowledge about a specific content genre, such as news or football games.
- This conventional video extraction technology can achieve good results for the target genre, but is of no use for other genres, and because it is tied to particular genres it cannot easily be generalized.
- The present invention has been made in view of such circumstances. It solves the above-described problems of conventional video extraction technology, and its object is to provide a signal processing method and a video/audio processing device that extract high-level video structure from a wide variety of video data.
- A signal processing method according to the present invention detects and analyzes patterns reflecting the semantic structure of the content of a signal. It extracts at least one feature representing the characteristics of each segment formed from a series of consecutive frames constituting the signal, measures the similarity between pairs of segments using those features, and detects pairs of segments whose temporal distance is within a predetermined time threshold and whose mutual dissimilarity is within a predetermined dissimilarity threshold.
- Such a signal processing method detects similar segments in a signal and combines them into a scene.
- A video/audio processing apparatus that achieves the above object detects and analyzes video and/or audio patterns reflecting the semantic structure of the content of a supplied video signal. It comprises: feature extraction means for extracting at least one feature representing the characteristics of each video and/or audio segment formed from a series of consecutive video and/or audio frames constituting the video signal; similarity measurement means for calculating, from those features, a metric measuring the similarity between pairs of video and/or audio segments; and grouping means for detecting pairs of video and/or audio segments whose temporal distance is equal to or less than a predetermined time threshold and whose dissimilarity is equal to or less than a predetermined dissimilarity threshold, and for grouping temporally adjacent segments into scenes that reflect the semantic structure of the content of the video signal.
- FIG. 1 is a diagram explaining the structure of video data to which the present invention is applied, showing the modeled structure of the video data.
- FIG. 2 is a diagram for explaining a scene.
- FIG. 3 is a block diagram illustrating a configuration of a video and audio processing device shown as an embodiment of the present invention.
- FIG. 4 is a flowchart illustrating a series of steps for detecting a scene in the video/audio processing apparatus.
- FIG. 5 is a diagram illustrating dynamic feature amount sampling processing in the video and audio processing apparatus.
- FIG. 6 is a diagram illustrating the dissimilarity threshold.
- FIG. 7 is a diagram illustrating the time threshold.
- FIG. 8 is a flowchart illustrating a series of steps in grouping segments in the video / audio processing apparatus.
- An embodiment to which the present invention is applied is a video and audio processing apparatus for automatically searching for and extracting desired contents from recorded video data.
- The video data to be processed in the present invention is modeled as a structure hierarchized into three levels: frames, segments, and scenes. That is, at the lowest layer, video data is composed of a series of frames. At the next layer above the frames, video data is composed of segments formed from series of consecutive frames. At the highest layer, video data is composed of scenes, formed by grouping segments on the basis of meaningful association.
- This video data includes both video and audio information.
- The frames in the video data include video frames, each a single still image, and audio frames, each representing audio information sampled over a short period, typically several tens to several hundreds of milliseconds.
- A segment composed of a sequence of video frames shot continuously by a single camera is generally called a shot.
- Segments include video segments and audio segments, and they are the basic units of video structure.
- Many definitions are possible for these segments, especially for audio segments. For example, the following can be considered.
- An audio segment may be formed by delimiting at silent periods in the video data, detected by a generally well-known method. Alternatively, as described in "D. Kimber and L. Wilcox, Acoustic Segmentation for Audio Browsers, Xerox PARC Technical Report", an audio segment may be formed from a series of audio frames classified into a small number of categories such as speech, music, noise, and silence.
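The first of these definitions, delimiting audio segments at silent periods, can be sketched minimally as follows. This assumes per-frame energy values are already available; the function name and the threshold value are illustrative assumptions, not from the patent.

```python
# Hedged sketch: segment an audio stream at silent periods. Frames whose
# energy is at or below a threshold count as silence; each maximal run of
# non-silent frames becomes one audio segment.

def segment_by_silence(frame_energies, silence_threshold=0.1):
    """Return (start, end) frame-index pairs (end exclusive) of non-silent runs."""
    segments = []
    start = None
    for i, energy in enumerate(frame_energies):
        if energy > silence_threshold:
            if start is None:
                start = i                    # a non-silent run begins here
        else:
            if start is not None:
                segments.append((start, i))  # the run ended at a silent frame
                start = None
    if start is not None:                    # run extends to the last frame
        segments.append((start, len(frame_energies)))
    return segments

# Two bursts of audio separated by silence -> two segments
print(segment_by_silence([0.0, 0.5, 0.6, 0.0, 0.0, 0.8, 0.9, 0.7]))
```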
- Audio segments may also be determined as described in "S. Pfeiffer, S. Fischer and E. Wolfgang, Automatic Audio Content Analysis, Proceedings of ACM Multimedia 96, Nov. 1996, pp. 21-30": a large change in a certain feature between two consecutive audio frames is detected as an audio cut point, and segments are delimited at those cut points.
- Scenes describe the content of video data at a higher, semantic level. Segments obtained by video segment (shot) detection or audio segment detection are grouped into meaningful units using features that represent segment characteristics, such as the amount of perceptual activity in a segment. Scenes are subjective and depend on the content or genre of the video data, but here a scene is a grouping of a repeating pattern of video or audio segments whose features are similar to each other. Specifically, as shown in FIG. 2, when two speakers are talking with each other, the video segments alternate with the speakers. In video data having such a repeating pattern, the series of video segments A of one speaker and the series of video segments B of the other speaker are grouped together to form one scene. Such repeating patterns are closely related to the high-level semantic structure of video data, and scenes indicate this high-level semantic cohesion.
- The video/audio processing apparatus 10 shown in FIG. 3 measures the similarity between segments of video data using the segment features described above and combines those segments into scenes. It thereby extracts the video structure automatically, and it can be applied to both video and audio segments.
- The video/audio processing device 10 comprises: a video division unit 11 that divides the stream of input video data into video segments, audio segments, or both; a video segment memory 12 that stores the division information of the video data; a video feature extraction unit 13, serving as feature extraction means, that extracts features from each video segment; an audio feature extraction unit 14, serving as feature extraction means, that extracts features from each audio segment; a segment feature memory 15 that stores the video segment and audio segment features; a scene detection unit 16, serving as grouping means, that groups video segments and audio segments into scenes; and a feature similarity measurement unit 17, serving as similarity measurement means, that measures the similarity between two segments.
- The video division unit 11 receives a stream of video data consisting of video data and audio data in any of various digitized formats, including compressed formats such as MPEG1 (Moving Pictures Experts Group phase 1), MPEG2 (Moving Pictures Experts Group phase 2), and so-called DV (Digital Video), and divides this video data into video segments, audio segments, or both.
- the video division unit 11 can directly process the compressed video data without completely expanding the compressed video data.
- The video division unit 11 processes the input video data and divides it into video segments and audio segments. The video division unit 11 supplies the division information, which is the result of dividing the input video data, to the video segment memory 12 at the subsequent stage. It also supplies the division information to the video feature extraction unit 13 and the audio feature extraction unit 14 at the subsequent stage, for the video segments and audio segments respectively.
- the video segment memory 12 stores the division information of the video data supplied from the video division unit 11. In addition, the video segment memory 12 supplies division information to the scene detection unit 16 in response to an inquiry from a scene detection unit 16 described later.
- the video feature extracting unit 13 extracts a feature for each video segment obtained by dividing the video data by the video dividing unit 11.
- the video feature quantity extraction unit 13 can directly process the compressed video data without completely expanding it.
- the video feature extraction unit 13 supplies the extracted feature of each video segment to the segment feature memory 15 at the subsequent stage.
- the audio feature extraction unit 14 extracts a feature for each audio segment obtained by dividing the video data by the video division unit 11.
- the audio feature quantity extraction unit 14 can directly process the compressed audio data without completely expanding it.
- The audio feature extraction unit 14 supplies the extracted feature of each audio segment to the segment feature memory 15 at the subsequent stage.
- The segment feature memory 15 stores the features of the video segments and audio segments supplied from the video feature extraction unit 13 and the audio feature extraction unit 14, respectively.
- The segment feature memory 15 supplies the stored features and segments to the feature similarity measurement unit 17 in response to inquiries from the feature similarity measurement unit 17 described later.
- The scene detection unit 16 starts from individual segments, detects repeating patterns of similar segments within the segment group, and groups such segments into the same scene.
- The scene detection unit 16 collects segments into scenes and gradually grows each group, continuing until all segments are grouped, and finally generates and outputs the detected scenes.
- the scene detection unit 16 uses the feature similarity measurement unit 17 to determine how similar the two segments are.
- the feature similarity measuring unit 17 measures the similarity between two segments.
- The feature similarity measurement unit 17 queries the segment feature memory 15 to retrieve the features of a given segment. Since similar segments repeated close together in time are almost always part of the same scene, the video/audio processing apparatus 10 detects scenes by detecting and grouping such segments.
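The grouping criterion, that two segments belong to the same scene when both their dissimilarity and their temporal distance fall under the respective thresholds, can be sketched as follows. This is a hedged illustration rather than the patent's actual algorithm (which is detailed with FIG. 8); the union-find merging, the scalar toy features, and the threshold values are all assumptions.

```python
# Hedged sketch: merge segments into one scene when their dissimilarity is
# at most d_thresh AND their temporal distance is at most t_thresh. A small
# union-find makes the grouping transitive.

def group_into_scenes(features, times, dissim, d_thresh, t_thresh):
    n = len(features)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i in range(n):
        for j in range(i + 1, n):
            if abs(times[j] - times[i]) <= t_thresh and \
               dissim(features[i], features[j]) <= d_thresh:
                union(i, j)

    scenes = {}
    for i in range(n):
        scenes.setdefault(find(i), []).append(i)
    return sorted(scenes.values())

# Alternating speakers A, B, A, B plus one distant unrelated segment
feats = [0.0, 1.0, 0.05, 1.05, 5.0]
times = [0, 10, 20, 30, 200]
print(group_into_scenes(feats, times, lambda a, b: abs(a - b), 0.1, 60))
```

The alternating segments collapse into two interleaved groups (one per speaker), while the temporally distant segment stays alone, mirroring the conversation-scene example of FIG. 2.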
- Such a video/audio processing device 10 detects a scene by performing the series of processes shown schematically in FIG. 4. First, in step S1, the video/audio processing device 10 performs video division: it divides the video data input to the video division unit 11 into video segments, audio segments, or, if possible, both.
- The video/audio processing device 10 places no particular prerequisites on the video division method applied.
- The video/audio processing device 10 performs video division by methods such as those described in "G. Ahanger and T.D.C. Little, A survey of technologies for parsing and indexing digital video, J. of Visual Communication and Image Representation 7:28-43, 1996"; such video division methods are well known in this technical field, and any of them is applicable.
- Next, in step S2, the video/audio processing device 10 extracts features: using the video feature extraction unit 13 and the audio feature extraction unit 14, it calculates the features representing the characteristics of each segment. For example, the video/audio processing device 10 can calculate the time length of each segment, video features such as color histograms and texture features, and audio features such as frequency analysis results, levels, pitches, and activity measurement results. The applicable features are, of course, not limited to these.
- In step S3, the video/audio processing device 10 measures the similarity of segments using the features: the feature similarity measurement unit 17 performs dissimilarity measurement and determines, on the basis of this metric, how similar two segments are. The dissimilarity metric is calculated using the features extracted in the previous step S2.
- Finally, the video/audio processing device 10 groups the segments in step S4: using the dissimilarity metric calculated in step S3 and the features extracted in step S2, it finds similar segments repeated close together in time and groups them. The video/audio processing device 10 outputs the groups finally generated in this way as detected scenes.
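The four steps S1 to S4 can be sketched as a pipeline of interchangeable stages. Everything here is an illustrative assumption: the stage callables are toy stand-ins, not the units of FIG. 3, and the names are hypothetical.

```python
# Hedged sketch of the processing steps S1-S4 as a pipeline.

def detect_scenes(video_data, divide, extract, dissimilarity, group):
    segments = divide(video_data)                  # S1: video division
    features = [extract(s) for s in segments]      # S2: feature extraction
    # S3: pairwise dissimilarity measurement
    n = len(segments)
    d = {(i, j): dissimilarity(features[i], features[j])
         for i in range(n) for j in range(i + 1, n)}
    return group(segments, d)                      # S4: grouping into scenes

# Toy stand-ins: split on the sentinel None, feature = mean value,
# group = put everything into one scene
data = [[1, 2], None, [3, 4]]
scenes = detect_scenes(
    data,
    divide=lambda v: [s for s in v if s is not None],
    extract=lambda s: sum(s) / len(s),
    dissimilarity=lambda a, b: abs(a - b),
    group=lambda segs, d: [segs],
)
print(scenes)
```

The design point this illustrates is the one the text makes: the division, feature, metric, and grouping stages are independent, so any well-known division method can be plugged into S1 without changing the rest.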
- the video and audio processing device 10 can detect a scene from the video data. Therefore, the user can use the result to summarize the content of the video data and quickly access interesting points in the video data.
- The video/audio processing device 10 divides the video data input to the video division unit 11 into video segments, audio segments, or, if possible, both. There are many techniques for automatic video division, and as described above, the video/audio processing apparatus 10 imposes no special prerequisites on the division method. On the other hand, the accuracy of scene detection in later processing essentially depends on the accuracy of the underlying video division. Still, scene detection in the video/audio processing device 10 can tolerate some errors during video division. In particular, over-detection of segments is preferable to under-detection: as long as over-detected segments are similar, the video/audio processing apparatus 10 can merge them into the same scene during scene detection.
- a feature is an attribute of a segment that represents the characteristics of the segment and provides data for measuring the similarity between different segments.
- The video/audio processing device 10 calculates the features representing each segment using the video feature extraction unit 13 and the audio feature extraction unit 14.
- the feature values considered to be effective when used in the video / audio processing device 10 include, for example, the following.
- a necessary condition of these feature amounts that can be applied in the video and audio processing device 10 is that dissimilarity can be measured.
- the video and audio processing apparatus 10 may simultaneously perform the feature extraction and the above-described video division for efficiency. The features described below enable such processing.
- As features, first consider features related to images, referred to below as video features. Since a video segment is composed of consecutive video frames, appropriate video frames can be extracted from a video segment to describe it, and the content of the segment can be characterized by the extracted frames. That is, the similarity of video segments can be replaced by the similarity of appropriately extracted video frames. Video features are therefore among the important features usable in the video/audio processing device 10. A video feature alone can represent only static information, but by applying the method described later, the video/audio processing apparatus 10 can also extract the dynamic characteristics of a video segment based on video features.
- The video/audio processing apparatus 10 uses color features and video correlation as the video features.
- Color in video is important material for determining whether two videos are similar. Judgment of image similarity using color histograms is well known, as described, for example, in "G. Ahanger and T.D.C. Little, A survey of technologies for parsing and indexing digital video, J. of Visual Communication and Image Representation 7:28-43, 1996".
- A color histogram is obtained by dividing a three-dimensional color space such as HSV or RGB into N regions and calculating the relative frequency with which the pixels of the video fall in each region. The resulting information gives an N-dimensional vector.
- For compressed video data, a color histogram can be extracted directly from the compressed data, as described, for example, in U.S. Patent No. 5,708,767.
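A minimal sketch of the histogram idea and of one possible dissimilarity between two histograms (the L1 distance). For brevity the pixels here are single channel values rather than points in a 3-D color space such as HSV or RGB; the bin count, names, and distance choice are illustrative assumptions.

```python
# Hedged sketch: an N-bin histogram as a normalized (relative-frequency)
# vector, plus the L1 distance between two such vectors.

def color_histogram(pixels, n_bins=4, max_value=256):
    """Relative frequency of pixel values per bin -> n_bins-dimensional vector."""
    hist = [0] * n_bins
    for p in pixels:
        hist[p * n_bins // max_value] += 1   # map value to its bin index
    total = len(pixels)
    return [h / total for h in hist]

def l1_dissimilarity(h1, h2):
    """Sum of absolute bin differences; 0 for identical histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

h_dark = color_histogram([0, 10, 20, 30])          # all pixels in bin 0
h_bright = color_histogram([250, 240, 230, 220])   # all pixels in bin 3
print(l1_dissimilarity(h_dark, h_bright))
```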
- Such a histogram represents the overall tone of the video, but does not include time information. Therefore, the video / audio processing device 10 calculates a video correlation as another video feature amount.
- A structure in which multiple similar segments alternate with each other is a strong indicator of a single integrated scene. For example, in a conversation scene the camera position alternates between the two speakers, but it usually returns to approximately the same position each time the same speaker is shot again.
- The correlation between grayscale reduced images is thus a good indicator of segment similarity. An M × N reduced grayscale image is computed, where M and N can both be small, for example 8 × 8; in other words, these reduced grayscale images are interpreted as MN-dimensional feature vectors.
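A hedged sketch of comparing two reduced grayscale images treated as flattened MN-dimensional vectors. Pearson correlation is used here as one plausible similarity measure; the patent text does not pin down the exact formula, so treat this as an illustrative assumption.

```python
# Hedged sketch: Pearson correlation between two equal-size flattened
# grayscale images (e.g. 8x8 reductions). Assumes non-constant images.
import math

def correlation(img_a, img_b):
    """Pearson correlation of two flattened grayscale images; 1.0 = identical shape."""
    n = len(img_a)
    ma = sum(img_a) / n
    mb = sum(img_b) / n
    cov = sum((a - ma) * (b - mb) for a, b in zip(img_a, img_b))
    va = sum((a - ma) ** 2 for a in img_a)
    vb = sum((b - mb) ** 2 for b in img_b)
    return cov / math.sqrt(va * vb)

# Identical reduced frames, as when the camera returns to the same speaker,
# correlate perfectly
frame = [10, 20, 30, 40, 50, 60, 70, 80]
print(correlation(frame, frame))
```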
- the audio feature is a feature that can represent the content of the audio segment, and the video / audio processing apparatus 10 can use frequency analysis, pitch, level, and the like as the audio feature.
- the video and audio processing device 10 can determine the distribution of frequency information in a single audio frame by performing frequency analysis such as Fourier transform.
- the video / audio processing apparatus 10 includes, for example, an FFT (Fast Fourier Transform) component, a frequency histogram, a power spectrum, and other features to represent the distribution of frequency information over one audio segment. Amount can be used.
- the video / audio processing apparatus 10 can also use pitches such as an average pitch and a maximum pitch, and audio levels such as an average loudness and a maximum loudness as effective audio feature amounts representing audio segments.
- Other feature values include video / audio common feature values.
- These common features provide useful information for representing the characteristics of a segment in a scene.
- the video / audio processing device 10 uses the segment length and the activity as the video / audio common feature amount.
- the video / audio processing device 10 can use the segment length as the video / audio common feature amount. This segment length is the length of time in the segment.
- a scene has rhythmic features that are unique to the scene. This rhythmic feature manifests itself as a change in segment length within the scene.
- For example, a rapid succession of short segments characterizes a commercial.
- The segments in a conversation scene are longer than those in a commercial, and a conversation scene is characterized by interleaved segments that are similar to each other.
- The video/audio processing apparatus 10 can therefore use the segment length, which has the above characteristics, as a feature for both video and audio.
- An activity is an indicator that shows how dynamic or static the content of a segment is. For example, if it is visually dynamic, the activity represents the degree to which the camera moves quickly along the object or the object being photographed changes rapidly.
- This activity is calculated indirectly by measuring the average interframe dissimilarity of features such as color histograms.
- Assuming the dissimilarity metric for a feature F, measured between frame i and frame j, is written d_F(i, j), the video activity V_F is defined by the following equation (1):
- V_F = (1 / (f − b)) Σ_{i=b}^{f−1} d_F(i, i+1) … (1)
- where b and f are the frame numbers of the first and last frames in one segment, respectively.
- The video/audio processing device 10 can calculate the video activity V_F using, for example, the histogram described above.
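A minimal sketch of this activity measure: the mean dissimilarity between consecutive frame features of one segment. Per-frame histograms and the L1 distance are used here as illustrative choices, matching the text's suggestion of histogram-based computation.

```python
# Hedged sketch of equation (1): the activity of a segment is the average
# dissimilarity d between consecutive frames b..f, i.e. the sum over f-b
# consecutive pairs divided by f-b.

def activity(frame_features, d):
    """Mean dissimilarity between consecutive frame features of one segment."""
    pairs = zip(frame_features, frame_features[1:])
    return sum(d(a, b) for a, b in pairs) / (len(frame_features) - 1)

# L1 distance between two histograms, as an illustrative d_F
l1 = lambda h1, h2: sum(abs(a - b) for a, b in zip(h1, h2))

static_seg = [[1.0, 0.0]] * 4                        # identical histograms
dynamic_seg = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # alternating content
print(activity(static_seg, l1), activity(dynamic_seg, l1))
```

A static segment yields zero activity; rapid frame-to-frame change yields a high value, matching the indicator described in the text.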
- The features described above, including the video features, basically represent the static information of a segment.
- To represent the characteristics of a segment accurately, dynamic information must also be considered. The video/audio processing device 10 therefore represents dynamic information by the feature sampling method described below.
- The video/audio processing device 10 extracts one or more static features at different time points within one segment, for example as shown in FIG. 5. The video/audio processing apparatus 10 determines the number of extracted features by balancing fidelity maximization against data-redundancy minimization in the segment representation. For example, if a certain image in a segment can be designated as the key frame of the segment, a histogram calculated from that key frame is a feature to be extracted. Which of the samples extractable as features in the target segment to select is determined by the sampling method described later.
- Alternatively, instead of extracting feature amounts at fixed points as described above, the video/audio processing apparatus 10 may extract a statistical representative value over the entire segment.
- Such statistical representative values can be used when the feature value can be represented as a real n-dimensional vector, which covers the most well-known video and audio features, such as histograms and power spectra; the case where only a dissimilarity metric is available and such values cannot be used is described later.
- When the number of samples is determined in advance to be k, the video/audio processing apparatus 10 uses the well-known k-means clustering method, described in "L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley and Sons, 1990", to automatically split the features of the entire segment into k groups of different values. Then, the video/audio processing apparatus 10 selects from each of the k groups the centroid value of the group, or a sample close to this centroid value, as a sample value. The complexity of this processing in the video/audio processor 10 increases only linearly with the number of samples.
- Alternatively, the video/audio processing device 10 forms k groups using the k-medoids algorithm, based on "L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley and Sons, 1990". Then, the video/audio processing apparatus 10 uses the medoid of each of the k groups as the sample value for that group.
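As an illustrative sketch of the sample-selection idea (assuming, for brevity, scalar feature values and a naive k-means; the patent applies the cited k-means and k-medoids methods to full feature vectors):

```python
def kmeans_representatives(values, k, iters=20):
    """Split a segment's feature samples into k groups by k-means and
    return, for each non-empty group, the sample nearest its centroid."""
    # initialise centroids with k evenly spaced samples
    vals = sorted(values)
    centroids = [vals[i * (len(vals) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            # assign each sample to its nearest centroid
            j = min(range(k), key=lambda c: abs(v - centroids[c]))
            groups[j].append(v)
        # recompute centroids (keep the old one if a group emptied)
        centroids = [sum(g) / len(g) if g else centroids[j]
                     for j, g in enumerate(groups)]
    # pick the actual sample nearest each centroid as the representative
    reps = []
    for j, g in enumerate(groups):
        if g:
            reps.append(min(g, key=lambda v: abs(v - centroids[j])))
    return reps
```

A segment whose per-frame features form two tight clusters thus yields one representative sample per cluster.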
- The method of constructing the dissimilarity metric for feature quantities representing the extracted dynamic features, based on the dissimilarity metric of the underlying static feature quantities, is described later.
- In this way, the video/audio processing device 10 can represent dynamic features by extracting and using a plurality of static features.
- the video and audio processing device 10 can extract various feature amounts. In general, each of these features is often insufficient by itself to represent the features of a segment. Therefore, the video and audio processing device 10 can select a set of feature amounts that complement each other by combining these various feature amounts. For example, the video and audio processing device 10 can obtain more information than the information of each feature by combining the above-described color histogram and video correlation.
- the video/audio processing device 10 measures the similarity of segments in the feature similarity measurement unit 17, using a dissimilarity metric: a function that calculates a real value measuring the degree of dissimilarity between two feature amounts.
- For a dissimilarity metric, a small value indicates that two features are similar, and a large value indicates that they are dissimilar.
- A function that calculates the dissimilarity of two segments S1 and S2 with respect to the feature value F is defined as the dissimilarity metric d_F(S1, S2). Such a function must satisfy the relations given by the following equation (2).
- the video/audio processor 10 introduces the L1 distance.
- The L1 distance d(A, B) between A and B is given by the following equation (3): d(A, B) = Σ_{i=1}^{n} |A_i - B_i| ... (3)
- the subscript i indicates the i-th element of each of the n-dimensional vectors A and B.
- As described above, the video/audio processing device 10 extracts static feature values at various points in a segment as features representing dynamic features. The video/audio processing apparatus 10 then determines the similarity between two extracted dynamic features using dissimilarity metrics. These dynamic-feature dissimilarity metrics are often best determined using the dissimilarity of the most similar static feature pair selected from the two dynamic features. In this case, the dissimilarity metric between two extracted dynamic features SF1 and SF2 is defined as in the following equation (4): d_F(SF1, SF2) = min_{F1 ∈ SF1, F2 ∈ SF2} d_F(F1, F2) ... (4)
- The function d_F(F1, F2) in the above equation (4) denotes the dissimilarity metric for the static feature F on which it is based. In some cases, instead of taking the minimum value of the feature dissimilarity, the maximum value or the average value may be used.
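A small sketch of equations (3) and (4) (illustrative only; each dynamic feature is taken to be a collection of static n-dimensional feature vectors):

```python
def l1_distance(a, b):
    """Equation (3): L1 distance between two n-dimensional vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def dynamic_dissimilarity(sf1, sf2, combine=min):
    """Equation (4): dissimilarity between dynamic features SF1 and SF2,
    each a collection of static feature vectors. The default takes the
    minimum over all static pairs; max (or a mean) may be substituted,
    as the text notes."""
    return combine(l1_distance(f1, f2) for f1 in sf1 for f2 in sf2)
```

With `combine=max`, the most dissimilar static pair is used instead of the most similar one.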
- For the video/audio processing apparatus 10, it is often not enough to determine the similarity of segments using only a single feature; information from many features related to the same segment must be combined. As one such method, the video/audio processing apparatus 10 calculates the dissimilarity based on various features as a weighted combination of those features. That is, when k feature quantities F1, F2, ..., Fk are present, the video/audio processing apparatus 10 uses the combined dissimilarity metric d_F(S1, S2) given by the following equation (5): d_F(S1, S2) = Σ_{i=1}^{k} w_i d_Fi(S1, S2) ... (5), where w_i is the weight given to feature Fi.
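Equation (5) amounts to a weighted sum of per-feature metrics; a sketch (the example metrics and weights below are hypothetical):

```python
def combined_dissimilarity(s1, s2, metrics, weights):
    """Equation (5): combine k per-feature dissimilarity metrics
    d_F1..d_Fk into a single weighted dissimilarity d_F(S1, S2)."""
    return sum(w * d(s1, s2) for d, w in zip(metrics, weights))
```

Each entry of `metrics` is one per-feature dissimilarity function, and `weights` holds the corresponding w_i.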
- In this way, the video/audio processing apparatus 10 can measure the similarity between segments by calculating the dissimilarity metric using the feature amounts extracted in step S2 in FIG. 4.
- the video/audio processing device 10 uses the dissimilarity metric and the extracted features to group segments that are close in time and similar to each other, and outputs the finally generated groups as detected scenes.
- the video / audio processing apparatus 10 performs two basic processes when detecting a scene by grouping segments.
- The video/audio processing device 10 first detects, as a first process, groups of similar segments that are temporally close to each other; most of the groups obtained by this process are parts of the same scene. Then, as a second process, the video/audio processing device 10 combines groups of segments whose time ranges overlap into one. The video/audio processing apparatus 10 repeats such processing, starting from a state in which each segment is independent. In this way, the video/audio processing device 10 gradually builds larger groups of segments and outputs the finally generated groups as the set of scenes.
- the video and audio processing device 10 uses two constraints to control the processing operation.
- As the first constraint, the video/audio processing apparatus 10 uses a dissimilarity threshold δ_sim that determines how similar two segments must be in order to be regarded as belonging to the same scene. For example, as shown in FIG. 6, the video/audio processing apparatus 10 determines, for a certain segment, whether another segment falls in its similarity region or its dissimilarity region.
- The dissimilarity threshold δ_sim may be set by the user, or may be determined automatically by the video/audio processing apparatus 10 as described later.
- As the second constraint, the video/audio processing device 10 uses a time threshold T as the maximum separation on the time axis at which two segments can still be regarded as belonging to the same scene. For example, as shown in FIG. 7, the video/audio processing apparatus 10 combines two similar segments A and B that are close to each other, within the time threshold T, into the same scene, but does not combine two segments B and C that are far apart, outside the time threshold T. In this way, because of the time constraint imposed by the time threshold T, the video/audio processing apparatus 10 never puts segments that are similar but far apart on the time axis into the same scene.
- When the time threshold T is set to a time corresponding to 6 to 8 shots, generally good results are obtained; the time threshold T is therefore used in units of 6 to 8 shots.
- The video/audio processing apparatus 10 groups segments using a hierarchical clustering method. In this algorithm, the dissimilarity metric d_C(C1, C2) between two clusters C1 and C2 is defined, as shown in the following equation (6), as the minimum dissimilarity between elements included in the two clusters: d_C(C1, C2) = min_{S1 ∈ C1, S2 ∈ C2} d_F(S1, S2) ... (6)
- In step S12, the video/audio processing apparatus 10 generates the initial set of clusters.
- the video and audio processing device 10 regards each of the N segments as a different cluster. That is, in the initial state, there are N clusters.
- Each cluster C has a start time C^start and an end time C^end, and the elements included in each cluster are managed as a list ordered by C^start.
- the video and audio processing apparatus 10 initializes the variable t to 1 in step S13, and determines whether or not the variable t is larger than the time threshold T in step S14.
- If the variable t is larger than the time threshold T, the video/audio processing device 10 proceeds to step S23; if it is not, the process proceeds to step S15. Here, since the variable t is 1, the video/audio processing apparatus 10 proceeds to step S15.
- In step S15, the video/audio processing device 10 calculates the dissimilarity metric d_C and detects the two most similar clusters among the N clusters.
- Here, since the variable t is 1, the video/audio processing apparatus 10 calculates the dissimilarity metric d_C between adjacent clusters and detects the most similar cluster pair among them.
- The variable t, representing the time interval of the target clusters, is given in segment units; since the clusters are arranged in chronological order, up to t clusters before and after a given cluster are targets of the dissimilarity calculation.
- The two detected clusters are denoted Ci and Cj, and the dissimilarity value between these clusters Ci and Cj is denoted d_ij.
- In step S16, the video/audio processing device 10 determines whether or not the dissimilarity value d_ij is greater than the dissimilarity threshold δ_sim.
- If the dissimilarity value d_ij is greater than the dissimilarity threshold δ_sim, the video/audio processing device 10 shifts the processing to step S21; if it is smaller than the threshold δ_sim, the process proceeds to step S17.
- Here, it is assumed that the dissimilarity value d_ij is smaller than the dissimilarity threshold δ_sim.
- Then, in step S17, the video/audio processing device 10 joins the cluster Cj to the cluster Ci. That is, the video/audio processing device 10 adds all the elements of the cluster Cj to the cluster Ci.
- Next, in step S18, the video/audio processing device 10 removes the cluster Cj from the set of clusters. If the start time Ci^start changes as a result of combining the two clusters Ci and Cj, the video/audio processing apparatus 10 re-sorts the elements of the cluster set based on the start time C^start.
- the video and audio processing device 10 subtracts 1 from the variable N in step S19.
- the video and audio processing device 10 determines whether or not the variable N is 1 in step S20.
- The video/audio processing apparatus 10 shifts the processing to step S23 when the variable N is 1, and to step S15 when the variable N is not 1.
- In step S15, the video/audio processing apparatus 10 again calculates the dissimilarity metric d_C and detects the two most similar clusters among the N - 1 clusters. Again, since the variable t is 1, the video/audio processing device 10 calculates the dissimilarity metric d_C between adjacent clusters and detects the most similar cluster pair among them.
- In step S16, the video/audio processing device 10 determines whether or not the dissimilarity value d_ij is greater than the dissimilarity threshold δ_sim. Again, it is assumed that the dissimilarity value d_ij is smaller than the dissimilarity threshold δ_sim. The video/audio processing device 10 then performs the processing from step S17 to step S20.
- The video/audio processing apparatus 10 repeats such processing, and if, as a result of decrementing the variable N, it determines in step S20 that the variable N is 1, it proceeds to step S23 and combines clusters that contain only a single segment. Eventually, in this case, all the segments are put together into one cluster, and the series of processes ends.
- When the video/audio processing apparatus 10 determines in step S16 that the dissimilarity value d_ij is greater than the dissimilarity threshold δ_sim, the process proceeds to step S21. In step S21, clusters that overlap in time are repeatedly combined; that is, when the time interval [Ci^start, Ci^end] of a cluster Ci overlaps the time interval of another cluster, the two clusters are combined into one.
- Here, by sorting the clusters based on their start times C^start, the video/audio processing apparatus 10 can detect the overlapping clusters and combine them into one.
- In step S15, the video/audio processing device 10 again calculates the dissimilarity metric d_C and detects the two most similar clusters among the clusters currently existing. Here, since the variable t is 2, the video/audio processor 10 calculates the dissimilarity metric d_C between each cluster and the clusters up to two positions away, and detects the most similar cluster pair among them.
- Similarly, when the variable t is 3, the video/audio processing apparatus 10 calculates in step S15 the dissimilarity metric d_C between each cluster and the clusters up to three positions away, and detects the most similar cluster pair among them.
- The video/audio processing apparatus 10 repeats such processing, and when, as a result of incrementing the variable t, it determines in step S14 that the variable t is larger than the time threshold T, it proceeds to step S23 and performs processing to combine clusters containing only a single segment. That is, the video/audio processing apparatus 10 regards an isolated cluster as a cluster containing only a single segment, and when a series of such clusters exists, it collects and combines them. This process groups together segments that have no similarity relationship with neighboring scenes. Note that the video/audio processing device 10 does not necessarily need to perform this step.
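Putting steps S12 through S20 together, the time-constrained clustering can be sketched as follows (a simplified illustration: segments are represented by scalar feature values in temporal order, d_C is the minimum element dissimilarity of equation (6), and the overlap-merging of step S21 and the final single-segment merge of step S23 are omitted):

```python
def detect_scenes(features, delta_sim, T):
    """Time-constrained agglomerative clustering (steps S12-S20, sketched).

    features: one feature value per segment, in temporal order.
    delta_sim: dissimilarity threshold; T: time threshold in segment units.
    Returns clusters as lists of segment indices (candidate scenes).
    """
    # step S12: initially every segment is its own cluster
    clusters = [[i] for i in range(len(features))]

    def d_c(c1, c2):
        # equation (6): minimum dissimilarity between elements of two clusters
        return min(abs(features[i] - features[j]) for i in c1 for j in c2)

    t = 1
    while t <= T:                          # step S14
        merged = True
        while merged and len(clusters) > 1:
            # step S15: most similar pair among clusters at most t apart
            pairs = [(d_c(clusters[a], clusters[b]), a, b)
                     for a in range(len(clusters))
                     for b in range(a + 1, min(a + t + 1, len(clusters)))]
            d, a, b = min(pairs)
            if d > delta_sim:              # step S16: nothing close enough
                merged = False             # widen the time window instead
            else:
                clusters[a] += clusters[b]  # steps S17-S19: join and remove
                del clusters[b]
        t += 1
    return clusters
```

Two pairs of near-identical segments separated by a large feature jump come out as two clusters, i.e. two candidate scenes.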
- In this way, the video/audio processing apparatus 10 can generate the detected scenes by grouping the clusters.
- As described above, the video/audio processing apparatus 10 may let the user set the dissimilarity threshold δ_sim, or may determine the threshold automatically.
- The optimum value of the dissimilarity threshold δ_sim depends on the content of the video data. For example, for video data with widely varied content, the dissimilarity threshold δ_sim needs to be set to a high value; conversely, for video data with more uniform content, it must be set to a lower value.
- When the dissimilarity threshold δ_sim is high, the number of detected scenes is small; when the dissimilarity threshold δ_sim is low, the number of detected scenes is large.
- The video/audio processing device 10 can also determine an effective dissimilarity threshold δ_sim automatically. For example, as one method, the video/audio processor 10 computes the mean value μ and the standard deviation σ of the dissimilarities between segment pairs.
- The video/audio processing apparatus 10 does not need to find the dissimilarity between all segment pairs; a sufficient number of segment pairs may be randomly selected from the set of all segment pairs and their dissimilarities measured, since the resulting mean μ and standard deviation σ are sufficiently close to the true values.
- The video/audio processing apparatus 10 can then automatically determine an appropriate dissimilarity threshold δ_sim using the mean μ and standard deviation σ thus obtained.
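The sampling-based estimate can be sketched like this; note that the final combination rule δ_sim = μ + a·σ (with a tunable constant a) is an assumption, since the text states only that the mean and the standard deviation are used:

```python
import random
import statistics

def estimate_dissimilarity_threshold(segments, dissimilarity,
                                     n_pairs=500, a=-0.5, seed=0):
    """Estimate delta_sim from randomly sampled segment pairs.

    Rather than computing the dissimilarity of all pairs, sample n_pairs
    random pairs; their mean mu and standard deviation sigma approximate
    the true values. The rule mu + a * sigma is hypothetical.
    """
    rng = random.Random(seed)
    dists = [dissimilarity(*rng.sample(segments, 2)) for _ in range(n_pairs)]
    mu = statistics.mean(dists)
    sigma = statistics.stdev(dists)
    return mu + a * sigma
```

With a negative constant a, the threshold falls below the mean pairwise dissimilarity, so only clearly similar pairs are grouped.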
- To determine whether segments belong to the same group, the video/audio processing apparatus 10 need not use only a single dissimilarity metric; various dissimilarity metrics for heterogeneous features can also be combined using weighting functions.
- Such weightings of the feature amounts can be obtained by trial and error; however, when the feature amounts are of qualitatively different types, it is usually difficult to determine an appropriate weighting.
- Therefore, the video/audio processing apparatus 10 detects a scene structure for each feature amount and synthesizes the detected scene structures into a single scene structure; in this way, scene detection that takes all the features into account can be realized.
- each result of detecting a scene for each feature amount is referred to as a scene layer.
- That is, the video/audio processing apparatus 10 detects a scene layer based on each feature amount, and can obtain, for example, a scene layer for the color histogram and a scene layer for the segment length. The video/audio processing device 10 can then combine these scene layers into a single scene structure.
- Also, using the same method as for combining structures based on qualitatively different feature amounts, the video/audio processing apparatus 10 can combine scene layers obtained from information in the video domain and the audio domain into a single scene structure.
- For the video/audio processing device 10 to combine different scene layers into a single scene structure, it is necessary to determine how to combine scene boundaries, because there is no guarantee that the scene boundaries of different layers are aligned with one another.
- The boundary points of the i-th scene layer Xi are given by a series of times indicating scene boundaries: t_i1, t_i2, ..., t_i|Xi|.
- To combine various scene layers into a single structure, the video/audio processing apparatus 10 first selects one scene layer as the basis for aligning boundary points. The video/audio processing device 10 then determines, for each boundary point t_i1, t_i2, ..., t_i|Xi|, whether or not it becomes a scene boundary in the finally generated scene structure, based on the boundaries of the other scene layers.
- Here, B_i(t) is a logical function indicating whether or not there is a boundary point of the i-th scene layer Xi near a certain time t.
- The meaning of "near" changes according to the situation of the scene layer Xi; for example, when combining scene layers based on video information and audio information, about 0.5 seconds is appropriate.
- the video and audio processing device 10 can combine different scene layers into a single scene structure.
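The boundary-combination step might be sketched as follows; the voting rule used to accept a base boundary is a hypothetical illustration built on the proximity function B_i(t):

```python
def merge_scene_layers(base_boundaries, other_layers, eps=0.5, min_agree=1):
    """Combine scene layers into one scene structure.

    base_boundaries: boundary times of the layer chosen as the alignment base.
    other_layers: boundary-time lists for the remaining scene layers.
    A base boundary t is kept when at least min_agree other layers have a
    boundary within eps seconds of t (i.e. B_i(t) is true), a hypothetical
    decision rule.
    """
    def B(layer, t):
        # logical proximity function: is any boundary of this layer near t?
        return any(abs(b - t) <= eps for b in layer)

    merged = []
    for t in base_boundaries:
        votes = sum(B(layer, t) for layer in other_layers)
        if votes >= min_agree:
            merged.append(t)
    return merged
```

With the suggested 0.5-second proximity, a base boundary survives only if some other layer (e.g. the audio layer) places a boundary almost at the same time.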
- As described above, the video/audio processing device 10 extracts a scene structure. It has already been verified by experiments that this method in the video/audio processing apparatus 10 can extract a scene structure from video data having various contents, such as TV dramas and movies.
- Moreover, the processing in the video/audio processing apparatus 10 is completely automatic: it does not require user intervention to set the above-described dissimilarity threshold and time threshold, and it can automatically determine appropriate thresholds in response to changes in the content of the video data.
- the video and audio processing device 10 does not require the user to know the semantic structure of the video data in advance.
- Because the processing in the video/audio processing device 10 is very simple and imposes a small calculation load, it can be applied to home electronic devices such as set-top boxes, digital video recorders, and home servers.
- Furthermore, the video/audio processing device 10 can provide a new high-level access basis for video browsing. The video/audio processor 10 enables easy content-based access to video data by visualizing the content of the video data using a high-level video structure such as scenes, rather than segments. For example, by displaying scenes, the video/audio processing apparatus 10 allows the user to quickly grasp the gist of a program and quickly find the part of interest.
- the video and audio processing device 10 outputs the video data as a result of the scene detection.
- the signal processing method according to the present invention is a signal processing method for detecting and analyzing a pattern that reflects the semantic structure of the content of a supplied signal, A segment formed from a series of consecutive frames that make up the signal
- The video/audio processing apparatus according to the present invention is a video/audio processing apparatus that detects and analyzes a video and/or audio pattern reflecting the semantic structure of the content of a supplied video signal, and includes: feature extraction means for extracting at least one feature amount representing a feature from a video and/or audio segment formed from a series of video and/or audio frames constituting the video signal; similarity measuring means for calculating, for each of the feature amounts, a metric for measuring the similarity between pairs of video and/or audio segments, and for measuring the similarity between those pairs based on this metric; and grouping means for detecting, using the feature amounts and the metric, two video and/or audio segments whose temporal distance is within a predetermined time threshold and whose dissimilarity is equal to or less than a predetermined dissimilarity threshold, and for grouping temporally continuous video and/or audio segments into scenes reflecting the semantic structure of the content of the video signal.
- Therefore, the video/audio processing device can detect and combine similar video and/or audio segments in a video signal and output them as scenes, thereby extracting a high-level video structure.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/647,301 US6928233B1 (en) | 1999-01-29 | 2000-01-27 | Signal processing method and video signal processor for detecting and analyzing a pattern reflecting the semantics of the content of a signal |
EP00901939A EP1081960B1 (en) | 1999-01-29 | 2000-01-27 | Signal processing method and video/voice processing device |
DE60037485T DE60037485T2 (de) | 1999-01-29 | 2000-01-27 | Signalverarbeitungsverfahren und Videosignalprozessor zum Ermitteln und Analysieren eines Bild- und/oder Audiomusters |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP11/23064 | 1999-01-29 | ||
JP2306499 | 1999-01-29 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2000045604A1 (en) | 2000-08-03 |
Family
ID=12099994
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2000/000423 WO2000045604A1 (en) | 1999-01-29 | 2000-01-27 | Signal processing method and video/voice processing device |
Country Status (4)
Country | Link |
---|---|
US (1) | US6928233B1 (ja) |
EP (1) | EP1081960B1 (ja) |
DE (1) | DE60037485T2 (ja) |
WO (1) | WO2000045604A1 (ja) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108881744A (zh) * | 2018-07-31 | 2018-11-23 | 成都华栖云科技有限公司 | 一种视频新闻演播室自动识别方法 |
Families Citing this family (60)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7644282B2 (en) | 1998-05-28 | 2010-01-05 | Verance Corporation | Pre-processed information embedding system |
US6737957B1 (en) | 2000-02-16 | 2004-05-18 | Verance Corporation | Remote control signaling using audio watermarks |
US20020159750A1 (en) * | 2001-04-26 | 2002-10-31 | Koninklijke Philips Electronics N.V. | Method for segmenting and indexing TV programs using multi-media cues |
JP4615166B2 (ja) * | 2001-07-17 | 2011-01-19 | パイオニア株式会社 | 映像情報要約装置、映像情報要約方法及び映像情報要約プログラム |
AU2002336445B2 (en) | 2001-09-07 | 2007-11-01 | Intergraph Software Technologies Company | Image stabilization using color matching |
CN1643902A (zh) * | 2002-03-11 | 2005-07-20 | 皇家飞利浦电子股份有限公司 | 一种显示信息的系统和方法 |
US20070201746A1 (en) * | 2002-05-20 | 2007-08-30 | Konan Technology | Scene change detector algorithm in image sequence |
AU2003282763A1 (en) * | 2002-10-15 | 2004-05-04 | Verance Corporation | Media monitoring, management and information system |
US7375731B2 (en) * | 2002-11-01 | 2008-05-20 | Mitsubishi Electric Research Laboratories, Inc. | Video mining using unsupervised clustering of video content |
US20040130546A1 (en) * | 2003-01-06 | 2004-07-08 | Porikli Fatih M. | Region growing with adaptive thresholds and distance function parameters |
US20040143434A1 (en) * | 2003-01-17 | 2004-07-22 | Ajay Divakaran | Audio-Assisted segmentation and browsing of news videos |
RU2005126623A (ru) * | 2003-01-23 | 2006-01-27 | Интегрэф Хадвеа Текнолоджис Кампэни (US) | Способ демультиплексирования видеоизображений (варианты) |
US20060239501A1 (en) | 2005-04-26 | 2006-10-26 | Verance Corporation | Security enhancements of digital watermarks for multi-media content |
GB0406512D0 (en) | 2004-03-23 | 2004-04-28 | British Telecomm | Method and system for semantically segmenting scenes of a video sequence |
JP4552540B2 (ja) * | 2004-07-09 | 2010-09-29 | ソニー株式会社 | コンテンツ記録装置、コンテンツ再生装置、コンテンツ記録方法、コンテンツ再生方法及びプログラム |
US20080095449A1 (en) * | 2004-08-09 | 2008-04-24 | Nikon Corporation | Imaging Device |
JP5032846B2 (ja) * | 2004-08-31 | 2012-09-26 | パナソニック株式会社 | 監視装置および監視記録装置、それらの方法 |
US7650031B2 (en) * | 2004-11-23 | 2010-01-19 | Microsoft Corporation | Method and system for detecting black frames in a sequence of frames |
US8020004B2 (en) | 2005-07-01 | 2011-09-13 | Verance Corporation | Forensic marking using a common customization function |
US8781967B2 (en) | 2005-07-07 | 2014-07-15 | Verance Corporation | Watermarking in an encrypted domain |
JP4670584B2 (ja) * | 2005-10-25 | 2011-04-13 | ソニー株式会社 | 表示制御装置および方法、プログラム並びに記録媒体 |
WO2007049378A1 (ja) * | 2005-10-25 | 2007-05-03 | Mitsubishi Electric Corporation | 映像識別装置 |
US10324899B2 (en) * | 2005-11-07 | 2019-06-18 | Nokia Technologies Oy | Methods for characterizing content item groups |
WO2007053112A1 (en) * | 2005-11-07 | 2007-05-10 | Agency For Science, Technology And Research | Repeat clip identification in video data |
JP5212610B2 (ja) * | 2006-02-08 | 2013-06-19 | 日本電気株式会社 | 代表画像又は代表画像群の表示システム、その方法、およびそのプログラム並びに、代表画像又は代表画像群の選択システム、その方法およびそのプログラム |
US8639028B2 (en) * | 2006-03-30 | 2014-01-28 | Adobe Systems Incorporated | Automatic stacking based on time proximity and visual similarity |
KR100803747B1 (ko) * | 2006-08-23 | 2008-02-15 | 삼성전자주식회사 | 요약 클립 생성 시스템 및 이를 이용한 요약 클립 생성방법 |
US8116566B2 (en) * | 2006-08-28 | 2012-02-14 | Colorado State University Research Foundation | Unknown pattern set recognition |
US8558952B2 (en) * | 2007-05-25 | 2013-10-15 | Nec Corporation | Image-sound segment corresponding apparatus, method and program |
US8189657B2 (en) * | 2007-06-14 | 2012-05-29 | Thomson Licensing, LLC | System and method for time optimized encoding |
JP5060224B2 (ja) * | 2007-09-12 | 2012-10-31 | 株式会社東芝 | 信号処理装置及びその方法 |
WO2010014067A1 (en) * | 2008-07-31 | 2010-02-04 | Hewlett-Packard Development Company, L.P. | Perceptual segmentation of images |
US9407942B2 (en) * | 2008-10-03 | 2016-08-02 | Finitiv Corporation | System and method for indexing and annotation of video content |
CN102056026B (zh) * | 2009-11-06 | 2013-04-03 | 中国移动通信集团设计院有限公司 | 音视频同步检测方法及其系统、语音检测方法及其系统 |
US9400842B2 (en) | 2009-12-28 | 2016-07-26 | Thomson Licensing | Method for selection of a document shot using graphic paths and receiver implementing the method |
JP2012060238A (ja) * | 2010-09-06 | 2012-03-22 | Sony Corp | 動画像処理装置、動画像処理方法およびプログラム |
US9607131B2 (en) | 2010-09-16 | 2017-03-28 | Verance Corporation | Secure and efficient content screening in a networked environment |
US8923548B2 (en) | 2011-11-03 | 2014-12-30 | Verance Corporation | Extraction of embedded watermarks from a host content using a plurality of tentative watermarks |
US9323902B2 (en) | 2011-12-13 | 2016-04-26 | Verance Corporation | Conditional access using embedded watermarks |
US9113133B2 (en) * | 2012-01-31 | 2015-08-18 | Prime Image Delaware, Inc. | Method and system for detecting a vertical cut in a video signal for the purpose of time alteration |
WO2013150789A1 (ja) * | 2012-04-05 | 2013-10-10 | パナソニック株式会社 | 動画解析装置、動画解析方法、プログラム、及び集積回路 |
US9571606B2 (en) | 2012-08-31 | 2017-02-14 | Verance Corporation | Social media viewing system |
US20140075469A1 (en) | 2012-09-13 | 2014-03-13 | Verance Corporation | Content distribution including advertisements |
US8869222B2 (en) | 2012-09-13 | 2014-10-21 | Verance Corporation | Second screen content |
US8670649B1 (en) * | 2012-10-10 | 2014-03-11 | Hulu, LLC | Scene detection using weighting function |
US8983150B2 (en) | 2012-12-17 | 2015-03-17 | Adobe Systems Incorporated | Photo importance determination |
US8897556B2 (en) | 2012-12-17 | 2014-11-25 | Adobe Systems Incorporated | Photo chapters organization |
US9262793B2 (en) | 2013-03-14 | 2016-02-16 | Verance Corporation | Transactional video marking system |
US9251549B2 (en) | 2013-07-23 | 2016-02-02 | Verance Corporation | Watermark extractor enhancements based on payload ranking |
US9208334B2 (en) | 2013-10-25 | 2015-12-08 | Verance Corporation | Content management using multiple abstraction layers |
JP2017514345A (ja) | 2014-03-13 | 2017-06-01 | ベランス・コーポレイション | 埋め込みコードを用いた対話型コンテンツ取得 |
US9881084B1 (en) * | 2014-06-24 | 2018-01-30 | A9.Com, Inc. | Image match based video search |
KR102306538B1 (ko) * | 2015-01-20 | 2021-09-29 | 삼성전자주식회사 | 콘텐트 편집 장치 및 방법 |
EP3151243B1 (en) * | 2015-09-29 | 2021-11-24 | Nokia Technologies Oy | Accessing a video segment |
US20180025749A1 (en) | 2016-07-22 | 2018-01-25 | Microsoft Technology Licensing, Llc | Automatic generation of semantic-based cinemagraphs |
US20180101540A1 (en) * | 2016-10-10 | 2018-04-12 | Facebook, Inc. | Diversifying Media Search Results on Online Social Networks |
KR101924634B1 (ko) * | 2017-06-07 | 2018-12-04 | 네이버 주식회사 | 콘텐츠 제공 서버, 콘텐츠 제공 단말 및 콘텐츠 제공 방법 |
CN107295362B (zh) * | 2017-08-10 | 2020-02-21 | 上海六界信息技术有限公司 | 基于图像的直播内容筛选方法、装置、设备及存储介质 |
US10733454B2 (en) * | 2018-06-29 | 2020-08-04 | Hewlett Packard Enterprise Development Lp | Transformation of video streams |
CN112104892B (zh) * | 2020-09-11 | 2021-12-10 | 腾讯科技(深圳)有限公司 | 一种多媒体信息处理方法、装置、电子设备及存储介质 |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH07193748A (ja) * | 1993-12-27 | 1995-07-28 | Nippon Telegr & Teleph Corp <Ntt> | 動画像処理方法および装置 |
EP0711078A2 (en) * | 1994-11-04 | 1996-05-08 | Matsushita Electric Industrial Co., Ltd. | Picture coding apparatus and decoding apparatus |
JPH10257436A (ja) * | 1997-03-10 | 1998-09-25 | Atsushi Matsushita | 動画像の自動階層構造化方法及びこれを用いたブラウジング方法 |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH05282379A (ja) * | 1992-02-06 | 1993-10-29 | Internatl Business Mach Corp <Ibm> | Moving image management method and management apparatus |
US5532833A (en) | 1992-10-13 | 1996-07-02 | International Business Machines Corporation | Method and system for displaying selected portions of a motion video image |
IL109649A (en) * | 1994-05-12 | 1997-03-18 | Electro Optics Ind Ltd | Movie processing system |
JPH08181995A (ja) | 1994-12-21 | 1996-07-12 | Matsushita Electric Ind Co Ltd | Moving image encoding apparatus and moving image decoding apparatus |
US5708767A (en) * | 1995-02-03 | 1998-01-13 | The Trustees Of Princeton University | Method and apparatus for video browsing based on content and structure |
US5821945A (en) | 1995-02-03 | 1998-10-13 | The Trustees Of Princeton University | Method and apparatus for video browsing based on content and structure |
JP3780623B2 (ja) | 1997-05-16 | 2006-05-31 | Hitachi, Ltd. | Method for describing moving images |
US6774917B1 (en) * | 1999-03-11 | 2004-08-10 | Fuji Xerox Co., Ltd. | Methods and apparatuses for interactive similarity searching, retrieval, and browsing of video |
2000
- 2000-01-27 WO PCT/JP2000/000423 patent/WO2000045604A1/ja active IP Right Grant
- 2000-01-27 EP EP00901939A patent/EP1081960B1/en not_active Expired - Lifetime
- 2000-01-27 US US09/647,301 patent/US6928233B1/en not_active Expired - Lifetime
- 2000-01-27 DE DE60037485T patent/DE60037485T2/de not_active Expired - Lifetime
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108881744 (zh) * | 2018-07-31 | 2018-11-23 | 成都华栖云科技有限公司 | Automatic recognition method for video news studios |
Also Published As
Publication number | Publication date |
---|---|
EP1081960A1 (en) | 2001-03-07 |
DE60037485D1 (de) | 2008-01-31 |
DE60037485T2 (de) | 2008-12-04 |
EP1081960A4 (en) | 2004-12-08 |
EP1081960B1 (en) | 2007-12-19 |
US6928233B1 (en) | 2005-08-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2000045604A1 (en) | Signal processing method and video/voice processing device | |
JP4683253B2 (ja) | AV signal processing apparatus and method, program, and recording medium | |
WO2000045603A1 (fr) | Signal processing method and video/audio signal processing device | |
Wang et al. | Survey of compressed-domain features used in audio-visual indexing and analysis | |
JP4253989B2 (ja) | ビデオの類似性探索方法及び記録媒体 | |
US8442384B2 (en) | Method and apparatus for video digest generation | |
EP1374097B1 (en) | Image processing | |
Nam et al. | Dynamic video summarization and visualization | |
KR20060008897 (ko) | Method and apparatus for summarizing music videos using content analysis | |
JP2004526372 (ja) | Streaming video bookmarks | |
JP2000311180 (ja) | Feature set selection method, video image class statistical model generation method, video frame classification and segmentation method, video frame similarity determination method, computer-readable medium, and computer system | |
CN1938714 (zh) | Method and system for semantic segmentation of scenes in a video sequence | |
EP1067786B1 (en) | Data describing method and data processor | |
US20020140721A1 (en) | Creating a multimedia presentation from full motion video using significance measures | |
JP2000285243 (ja) | Signal processing method and video/audio processing apparatus | |
US20090132510A1 (en) | Device for enabling to represent content items through meta summary data, and method thereof | |
JP2000285242 (ja) | Signal processing method and video/audio processing apparatus | |
JP4702577B2 (ja) | Content playback order determination system, method, and program | |
EP2306719B1 (en) | Content reproduction control system and method and program thereof | |
JP5257356B2 (ja) | Content division position determination device, content viewing control device, and program | |
JPH10187182 (ja) | Video classification method and apparatus | |
KR20050033075 (ko) | Unit and method for detecting content properties in a sequence of video images | |
Shao et al. | Automatically generating summaries for musical video | |
JP2000287166 (ja) | Data description method and data processing apparatus | |
Geng et al. | Hierarchical video summarization based on video structure and highlight |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): US |
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE |
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
WWE | Wipo information: entry into national phase |
Ref document number: 2000901939 Country of ref document: EP |
WWE | Wipo information: entry into national phase |
Ref document number: 09647301 Country of ref document: US |
WWP | Wipo information: published in national office |
Ref document number: 2000901939 Country of ref document: EP |
WWG | Wipo information: grant in national office |
Ref document number: 2000901939 Country of ref document: EP |