WO2000048397A1 - Signal processing method and video/audio processing device - Google Patents
Signal processing method and video/audio processing device
- Publication number
- WO2000048397A1 (PCT/JP2000/000762)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- video
- segment
- audio
- segments
- sub
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/76—Television signal recording
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/68—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/683—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7834—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7847—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
- G06F16/785—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content using colour or luminescence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/10—Indexing; Addressing; Timing or synchronising; Measuring tape travel
- G11B27/102—Programmed access in sequence to addressed parts of tracks of operating record carriers
- G11B27/105—Programmed access in sequence to addressed parts of tracks of operating record carriers of operating discs
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/10—Indexing; Addressing; Timing or synchronising; Measuring tape travel
- G11B27/19—Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
- G11B27/28—Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording
Definitions
- the present invention relates to a signal processing method for measuring the similarity between arbitrary different segments constituting a signal, and to a video and audio processing device for measuring the similarity between arbitrary different video and/or audio segments constituting a video signal.
- in such a search, the similarity of contents is first measured numerically.
- the items are then ranked in order of similarity based on the similarity metric for the target item, so that in the resulting list the most similar items appear near the beginning.
- the above-described search technique based on key frames is limited to searches based on the similarity of shots.
- a typical 30-minute TV program contains hundreds of shots, so the conventional search technique described above produces a large number of extracted shots, and searching such a huge amount of data is a heavy burden.
- the present invention has been made in view of such circumstances, and it is an object of the present invention to solve the above-described problems of the conventional search technique and to provide a signal processing method and a video/audio processing device for performing searches based on the similarity of segments at various levels in various video data.
- a signal processing method according to the present invention extracts, from the sub-segments included in a segment constituting a supplied signal, a signature defined by a representative segment, which is a sub-segment representing the content of the segment, and a weighting function that assigns a weight to the representative segment.
- such a signal processing method according to the present invention thus extracts signatures for segments.
- the video and audio processing apparatus according to the present invention extracts, for a video and/or audio segment constituting a supplied video signal, a signature defined by a representative segment, which is a video and/or audio sub-segment representing the content of the segment, and a weighting function that assigns a weight to the representative segment; it comprises means for selecting, from the groups obtained by classifying the video and/or audio sub-segments based on an arbitrary attribute, the groups to be targets of the signature, selecting one representative segment from each selected group, and calculating a weight for the obtained representative segment.
- FIG. 1 is a diagram for explaining the structure of video data applied in the present invention, illustrating the structure of modeled video data.
- FIG. 2 is a diagram illustrating a video frame signature for a shot.
- FIG. 3 is a diagram illustrating a shot signature for a scene.
- FIG. 4 is a diagram for explaining an audio segment signature for a scene.
- FIG. 5 is a diagram illustrating a shot signature for a television program.
- FIG. 6 is a block diagram illustrating a configuration of a video and audio processing device shown as an embodiment of the present invention.
- FIG. 7 is a flowchart illustrating a series of steps in extracting a signature in the video / audio processing apparatus.
- FIG. 8 is a diagram illustrating a scene applied to specifically explain a series of steps in FIG.
- FIG. 9 is a diagram illustrating r segments selected from the scene shown in FIG. 8.
BEST MODE FOR CARRYING OUT THE INVENTION
- An embodiment to which the present invention is applied is a video and audio processing apparatus that automatically extracts data representing an arbitrary set in video data in order to automatically search for and extract desired contents from the video data.
- a description will be given of video data targeted in the present invention.
- the video data targeted in the present invention is modeled as shown in FIG. 1, and has a hierarchical structure at the level of frames, segments, and programs.
- the video data is composed of segments consisting of a plurality of hierarchies between a program representing the entire video data, which is the highest layer, and a series of frames, which are the lowest layer.
- among the segments in video data, there are those formed from a series of continuous frames, and those obtained by organizing such sequences of frames into scenes based on certain relations; some scenes are further organized based on certain relations. In a broad sense, a single frame can also be regarded as a type of segment.
- a segment in video data is a general term for a continuous portion of video data, including programs and frames, irrespective of the level of the hierarchy.
- the segment may be an intermediate structure having some meaning, such as a structure formed from a series of the above-described continuous frames or an intermediate structure with the scene.
- for example, if a segment X is completely contained within a different segment Y, segment X is defined to be a sub-segment of segment Y.
- Such video data generally includes both video and audio information. That is, in this video data, a frame includes a video frame, which is a single still image, and an audio frame, which represents audio information sampled over a short period, generally several tens to several hundred milliseconds.
- the segment includes a video segment and an audio segment.
- a video segment includes a so-called shot, consisting of a series of video frames continuously shot by a single camera, and a scene, obtained by grouping shots into meaningful units using feature values representing their characteristics.
- audio segments can be formed, for example, by delimiting them at silence periods in the video data detected by generally well-known methods.
- the video and audio processing apparatus shown as an embodiment to which the present invention is applied automatically extracts signatures, which are general features characterizing the contents of segments in the video data described above, and compares the similarity of two signatures. It is applicable to both video and audio segments, and the resulting similarity metric provides a generic tool for searching and classifying segments.
- a signature generally identifies a certain object, and is some data that identifies the object with high accuracy using less information than the object.
- fingerprints are one type of human signature. That is, comparing the similarity of two sets of fingerprints attached to a certain object makes it possible to accurately determine whether or not the same person has given the fingerprint.
- signatures for video and audio segments allow video and audio segments to be identified.
- this signature is given as a weighted set of the above-described sub-segments obtained by dividing the segment.
- a signature S for a certain segment X is, as described later, defined as a pair S = &lt;R, W&gt;, where R is a set of representative segments whose elements are sub-segments representing the segment X, and W is a weighting function that assigns a weight to each element of R.
- here, the term "representative frame", which denotes a representative frame, is extended so that a representative segment is called an r segment. The set of all r segments included in a signature is called the r segments of that signature, and the type of the r segments is called the r type of the signature. When it is necessary to specify the r type of a signature, it is prefixed to the term "signature"; for example, a video frame signature is a signature whose r segments are all video frames, and a shot signature is a signature whose r segments are the above-described shots. On the other hand, a segment described by a certain signature S is referred to as the target segment of the signature S. Signatures can use r segments that include video segments, audio segments, or a combination of both.
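As an illustration only (not part of the patent), the pair S = &lt;R, W&gt; and the weighting by group size described above can be sketched in Python. The names `Signature` and `make_signature` are hypothetical, and a sub-segment is represented here by an opaque feature vector:

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical representation: a sub-segment is an opaque feature vector.
Feature = List[float]

@dataclass
class Signature:
    """A signature S = <R, W>: r segments plus their weights."""
    r_segments: List[Feature]   # representative sub-segments (r segments)
    weights: List[float]        # one weight per r segment, summing to 1

def make_signature(groups: Sequence[Sequence[Feature]],
                   pick: Callable[[Sequence[Feature]], Feature]) -> Signature:
    """Build a signature from grouped sub-segments: one r segment per group,
    weighted by the group's share of all sub-segments."""
    total = sum(len(g) for g in groups)
    r = [pick(g) for g in groups]
    w = [len(g) / total for g in groups]
    return Signature(r, w)

# Example: three groups of sizes 1, 4 and 4, as in the conversation scene
# discussed later in the description.
groups = [[[0.0]], [[1.0]] * 4, [[2.0]] * 4]
sig = make_signature(groups, pick=lambda g: g[0])
print(sig.weights)  # [1/9, 4/9, 4/9]
```

The `pick` callable stands in for the r-segment selection step described later (arbitrary choice, or the sub-segment closest to the group mean).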
- Such a signature has several properties that effectively describe the segment.
- signatures are most important in that they not only describe short segments, such as shots, but also make it possible to describe longer segments, such as an entire scene or the entire video data.
- A signature makes it possible to characterize a segment with a small amount of data.
- the weight assigned to each r segment indicates the importance or relevance of each r segment, and allows the segment to be identified.
- since a segment can be decomposed into a set of simpler sub-segments, those sub-segments can be used as r segments.
- Such signatures can be created arbitrarily by the user via a computer-assisted user interface, but for most applications it is desirable to extract them automatically.
- the video frame signature for a shot is a signature whose r segment is a still image.
- One way to create such a signature is to use the key frame of each shot as the r segment, and to use as the weight the ratio of the video frames in the shot that closely match the key frame to all video frames in the shot.
- the shot signature for a scene is a signature whose r segments are shots.
- shots in a scene can be classified into n groups.
- in this case, a signature consisting of n r segments can be created. That is, for each group, one shot is selected to serve as the r segment, and the weight given to each r segment can be the ratio of the number of shots constituting its group to the total number of shots.
- signatures are not limited to using only visual information; an audio segment signature for a scene, as shown in FIG. 4, can be cited as an example of such a signature.
- the audio segment signature for a scene uses a set of audio segments as an r segment. For example, consider a scene consisting of multiple people talking to each other. In this case, if it is possible to automatically distinguish the speakers, a short speech segment for each speaker can be used as the r segment.
- signatures can be used not only to describe short segments, but also to describe entire videos.
- there are shots by which a particular television program can be clearly distinguished from other television programs.
- Such shots are used repeatedly in the television program. For example, the logo shot at the beginning of a news program, as shown in FIG. 5, and the shots showing the newscaster correspond to this. In this case, it is appropriate to assign the same weight to the logo shot and the newscaster shot, since the weighting indicates the importance of the shots.
- the video and audio processing device 10, which automatically extracts such signatures and compares the similarity of two signatures, includes, as shown in FIG. 6, a CPU (Central Processing Unit) 11, which is a means for controlling the operation of each unit and executing the programs to extract segment signatures; a ROM (Read Only Memory) 12, which is a read-only memory storing the programs executed by the CPU 11, the numerical values used, and the like; a RAM (Random Access Memory) 13, which functions as a work area storing the input segment and the sub-segments, r segments, and the like obtained by dividing the input segment; an HDD (Hard Disk Drive); and an interface (I/F) 15 through which segments are input.
- the CPU 11 reads out and executes the program stored in the ROM 12, and performs a series of processing steps as shown in FIG. 7 to extract the signature.
- first, in step S1, the video and audio processing device 10 divides a segment input via the I/F 15 into sub-segments.
- each sub-segment obtained here becomes a candidate r segment, that is, a candidate for an r segment.
- the video / audio processing apparatus 10 does not particularly limit the method of dividing the segment into sub-segments, and may use any method as long as it is applicable. Such methods are highly dependent on the type of subsegment used.
- such a method decomposes a segment into a smaller set of segments. Specifically, for example, when the r segments are video frames, the video/audio processing apparatus 10 can decompose the segment easily: the set of all video frames (still images) in the segment becomes the set of candidate r segments. If the r segments are shots, the video/audio processing apparatus 10 may use, for example, the shot detection method described in "B.
- to detect audio sub-segments, the video/audio processing apparatus 10 may use, for example, the methods described in "D. Kimber and L. Wilcox, Acoustic Segmentation for Audio Browsers, Xerox PARC Technical Report" or "S. Pfeiffer, S. Fischer and W. Effelsberg, Automatic Audio Content Analysis, Proceedings of ACM Multimedia 96".
- in this way, the video and audio processing device 10 divides a segment into sub-segments irrespective of the type of the segment. If the segment is a frame, the video and audio processing device 10 need not perform the division step.
- next, in step S2, the video and audio processing device 10 groups mutually similar sub-segments. That is, since a group of sub-segments similar to each other is considered to best represent the content of the target segment, the video and audio processing apparatus 10 detects and groups sub-segments that are similar to each other.
- here, sub-segments that are similar to each other are sub-segments for which the dissimilarity metric on the feature amounts of the sub-segments, described later, takes a small value.
- the video and audio processing device 10 does not particularly limit the method of grouping sub-segments similar to each other, and may employ any method that is applicable.
- for example, the video/audio processing apparatus 10 may employ a well-known clustering technique, such as those described in "L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley and Sons, 1990".
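For illustration only, the grouping step can be sketched with a deliberately simple single-pass threshold scheme. This is a hypothetical stand-in, not the specific clustering method of Kaufman and Rousseeuw cited above; the function name `group_similar` and the scalar "features" are assumptions:

```python
def group_similar(items, dissimilarity, threshold):
    """Greedy single-pass grouping: assign each item to the first existing
    group whose first member is within `threshold` of it; otherwise start
    a new group. A simple stand-in for a full clustering algorithm."""
    groups = []
    for x in items:
        for g in groups:
            if dissimilarity(x, g[0]) <= threshold:
                g.append(x)
                break
        else:
            groups.append([x])
    return groups

# Toy example with scalar "features": two natural clusters emerge
items = [0.0, 0.1, 5.0, 5.1, 0.05]
groups = group_similar(items, lambda a, b: abs(a - b), threshold=1.0)
print([len(g) for g in groups])  # [3, 2]
```

Any algorithm that groups sub-segments whose dissimilarity metric is small would serve the same role here.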
- features are attributes of a segment that represent its characteristics and provide data for measuring the similarity between different segments.
- although the video/audio processing apparatus 10 does not depend on any specific details of the features, feature amounts considered effective in the video/audio processing apparatus 10 include, for example, the following video features, audio features, and video/audio common features.
- Color in a video is an important factor in determining whether two videos are similar. Judging the similarity of images using color histograms is well known, as described in, for example, "G. Ahanger and T.D.C. Little, A survey of technologies for parsing and indexing digital video, J. of Visual Communication and Image Representation 7:28-43, 1996".
- a color histogram is obtained by dividing a three-dimensional color space such as HSV or RGB into n regions and calculating the relative frequency of appearance of the pixels of the video in each region; the obtained information gives an n-dimensional vector.
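A minimal sketch of such a histogram, assuming RGB pixels with 8-bit channels and a uniform subdivision of the color cube (the function name and bin layout are illustrative, not from the patent):

```python
def color_histogram(pixels, bins_per_channel=4):
    """n-dimensional color histogram: divide RGB space into
    bins_per_channel**3 regions and return the relative frequency of
    the pixels falling in each region."""
    n = bins_per_channel
    hist = [0.0] * (n ** 3)
    for r, g, b in pixels:  # channel values assumed in [0, 255]
        ri = min(r * n // 256, n - 1)
        gi = min(g * n // 256, n - 1)
        bi = min(b * n // 256, n - 1)
        hist[ri * n * n + gi * n + bi] += 1
    total = len(pixels)
    return [h / total for h in hist]  # relative proportions sum to 1

# Two reddish and two bluish pixels land in two bins with weight 0.5 each
pixels = [(255, 0, 0), (250, 5, 3), (0, 0, 255), (0, 2, 250)]
h = color_histogram(pixels)
print(sum(h), max(h))  # 1.0 0.5
```

The resulting vector is the n-dimensional feature the text describes (here n = 64).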
- a color histogram can be directly extracted from the compressed data, for example, as described in U.S. Patent # 5,708,767.
- a structure in which multiple similar segments alternate with one another is a strong indicator of a single structure. For example, in a conversation scene, the camera position alternates between two speakers, but the camera usually returns to approximately the same position when shooting the same speaker again.
- the correlation between reduced grayscale images of the video is a good indicator of the similarity of sub-segments.
- each image is decimated and reduced to a grayscale image of size M × N, and the image correlation is calculated using these reduced images.
- small values of M and N are both sufficient, for example 8 × 8.
- these reduced grayscale images are interpreted as MN-dimensional feature vectors.
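The reduction step described above can be sketched as simple block averaging (an illustrative implementation; the patent does not specify the decimation method):

```python
def reduce_grayscale(image, M=8, N=8):
    """Decimate a grayscale image (list of rows of intensities) to an
    M x N thumbnail by block averaging, and flatten it into an
    MN-dimensional feature vector. Assumes the image dimensions are
    divisible by M and N."""
    H, W = len(image), len(image[0])
    bh, bw = H // M, W // N
    vec = []
    for i in range(M):
        for j in range(N):
            block = [image[i * bh + y][j * bw + x]
                     for y in range(bh) for x in range(bw)]
            vec.append(sum(block) / len(block))  # mean intensity of block
    return vec

# A 16x16 constant image reduces to a 64-dimensional constant vector
img = [[100] * 16 for _ in range(16)]
v = reduce_grayscale(img)
print(len(v), v[0])  # 64 100.0
```

The correlation between two sub-segments can then be computed on these MN-dimensional vectors, for example via a normalized dot product.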
- as feature amounts different from the video feature amounts described above, there are feature amounts related to audio.
- hereinafter, these will be referred to as audio features.
- an audio feature is a feature that can represent the content of an audio segment; examples include frequency analysis, pitch, and level.
- the video/audio processing apparatus 10 may use, for example, FFT (Fast Fourier Transform) components, a frequency histogram, a power spectrum, and other features to represent the distribution of frequency information over one audio sub-segment.
- the video/audio processing apparatus 10 can also use the average pitch and the maximum pitch, and audio levels such as the average volume and the maximum volume, as effective audio feature amounts representing an audio sub-segment.
- Still another feature quantity is a video-audio common feature quantity. Although this is neither a video feature nor an audio feature in particular, it provides useful information for representing characteristics of a sub-segment in the video and audio processing device 10.
- the video / audio processing device 10 uses the segment length and the activity as the video / audio common feature.
- the video / audio processing device 10 can use the segment length as the video / audio common feature amount.
- This segment length is the length of time in a segment.
- a scene has rhythm characteristics that are unique to the scene.
- such rhythmic features manifest themselves as changes in segment length within the scene. For example, quickly alternating short segments are characteristic of commercials.
- the segments in a conversation scene are longer than those in a commercial, and a conversation scene is characterized in that its interleaved segments are similar to each other.
- the video / audio processing apparatus 10 can use the segment length having such a characteristic as the video / audio common feature amount.
- the video / audio processing device 10 can use the activity as the video / audio common feature amount.
- Activity is an index indicating how dynamic or static the content of a segment feels. For example, when it is visually dynamic, activity indicates the degree to which the camera moves quickly along an object or the object being photographed changes rapidly. The activity is calculated indirectly by measuring the average inter-frame dissimilarity of a feature such as a color histogram.
- if the dissimilarity metric for the feature F measured between frame i and frame j is defined as d_F(i, j), the video activity V_F is defined by the following equation (1):

  V_F = ( Σ_{i=b}^{f-1} d_F(i, i+1) ) / (f - b)   ... (1)

- where b and f are the frame numbers of the first and last frames in one segment, respectively.
- the video and audio processing device 10 calculates the video activity V_F using, for example, the color histogram described above.
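Equation (1), as reconstructed above (the averaging form is an assumption consistent with "average value of inter-frame dissimilarity"), can be sketched directly:

```python
def activity(frames, dissimilarity):
    """Video activity V_F: average inter-frame dissimilarity
    d_F(i, i+1) over the frames b..f of a segment, per equation (1)
    as reconstructed in the text."""
    b, f = 0, len(frames) - 1
    if f <= b:
        return 0.0  # a single frame has no inter-frame change
    return sum(dissimilarity(frames[i], frames[i + 1])
               for i in range(b, f)) / (f - b)

# A static segment yields low activity; a changing one, high activity
d = lambda a, b: abs(a - b)          # toy scalar "feature" dissimilarity
static = [1.0, 1.0, 1.0, 1.0]
dynamic = [0.0, 1.0, 0.0, 1.0]
print(activity(static, d), activity(dynamic, d))  # 0.0 1.0
```

In the apparatus described here, `dissimilarity` would be the histogram-based metric rather than this scalar toy.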
- the video and audio processing apparatus 10 extracts such features from the sub-segments, detects sub-segments that are similar to each other by a clustering algorithm, and groups them.
- the dissimilarity criterion which is a function for calculating a real value for measuring the similarity between two sub-segments, will be described later.
- step S3 the video and audio processing apparatus 10 selects a target group for signature from the similar groups obtained by grouping the subsegments.
- the video and audio processing device 10 considers the number of sub-segments classified into each group.
- a threshold is set for the number of subsegments present in the group.
- this threshold is usually given as a ratio of the number of sub-segments included in a group to the number of all sub-segments. That is, the video and audio processing apparatus 10 sets, among the obtained groups, those whose ratio of elements exceeds the threshold as target groups for the signature.
- the video / audio processing apparatus 10 can also set an arbitrary constant k as the number of r segments. In this case, the video and audio processing apparatus 10 arranges all groups in the order of the number of elements included therein, and selects only k groups in descending order of the number of elements as target groups for signature.
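The two selection policies just described (a minimum share of all sub-segments, or the k largest groups) can be sketched as follows; the function name and keyword arguments are illustrative:

```python
def select_groups(groups, ratio_threshold=None, k=None):
    """Select signature target groups either by a minimum share of all
    sub-segments (ratio_threshold) or as the k largest groups."""
    total = sum(len(g) for g in groups)
    if ratio_threshold is not None:
        return [g for g in groups if len(g) / total > ratio_threshold]
    # Arrange all groups by element count and keep the k largest
    return sorted(groups, key=len, reverse=True)[:k]

# Nine sub-segments in groups of sizes 1, 4 and 4
groups = [[1], [2, 3, 4, 5], [6, 7, 8, 9]]
print([len(g) for g in select_groups(groups, ratio_threshold=0.2)])  # [4, 4]
print([len(g) for g in select_groups(groups, k=2)])                  # [4, 4]
```

With a 0.2 threshold the one-element group (share 1/9) is discarded; with k = 2 the same two groups survive.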
- the video and audio processing device 10 selects a target group for signature from the groups.
- the video and audio processing device 10 selects the r segment in step S4. That is, the video and audio processing device 10 selects one of the sub-segments constituting each group selected in step S3.
- for example, the video and audio processing device 10 can select an arbitrary sub-segment from each group. As a more sophisticated approach, the video and audio processing apparatus 10 may select as the r segment the sub-segment most similar to the mean or median of the sub-segments in each group. In this way, the video and audio processing apparatus 10 selects one r segment from each group.
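One concrete way to realize "the sub-segment most similar to the mean or median" without computing a mean in feature space is to pick the medoid; this substitution is an assumption, sketched here for illustration:

```python
def medoid(group, dissimilarity):
    """Pick the sub-segment with the smallest total dissimilarity to all
    others in the group (the medoid), as a stand-in for 'the sub-segment
    closest to the mean or median'."""
    return min(group,
               key=lambda x: sum(dissimilarity(x, y) for y in group))

# Toy scalar "sub-segments": the central value is chosen, the outlier is not
group = [0, 1, 2, 3, 10]
print(medoid(group, lambda a, b: abs(a - b)))  # 2
```

For feature-vector sub-segments, `dissimilarity` would be the L1 distance introduced later in the description.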
- step S5 the video and audio processing device 10 calculates a weight for each of the r segments.
- specifically, the video/audio processing apparatus 10 sets the weight of each r segment to the ratio of the number of sub-segments included in the corresponding group to the total number of sub-segments.
- the video and audio processing apparatus 10 extracts the signature for each segment by performing the above-described series of steps for all the segments.
- This scene shows two people talking to each other: it starts with a shot showing both people, after which shots alternately showing each speaker continue.
- first, the video and audio processing device 10 divides the scene into shots as sub-segments in step S1 in FIG. 7. That is, in this case, the video and audio processing apparatus 10 detects and divides out nine different sub-segments, as shown in FIG. 8, by using the shot detection method.
- next, in step S2 in FIG. 7, the video and audio processing apparatus 10 classifies and groups sub-segments similar to each other. That is, in this case, based on the visual similarity of the shots, the video and audio processing device 10 classifies the shots of the scene shown in FIG. 8 into three groups: a first group consisting only of the first shot showing both persons, and second and third groups each consisting of the four shots of one speaker.
- the video and audio processing device 10 selects a group necessary for characterizing a scene in step S3 in FIG.
- in this case, the video and audio processing apparatus 10 decides to use all of the first through third groups for the shot signature.
- the video and audio processing apparatus 10 selects one shot from each group as an r segment in step S4 in FIG.
- the video and audio processing apparatus 10 selects each of the three shots shown in FIG. 9 as r segments from the first to third groups.
- step S5 in FIG. 7 the video and audio processing apparatus 10 calculates a weight corresponding to the ratio of the number of shots included in each group for each of the first to third groups.
- the first group has one shot as an element, and the second and third groups have four shots each. Therefore, the video and audio processing apparatus 10 obtains weights of 1/9, 4/9, and 4/9 for the first to third groups, respectively.
- as a result, the video and audio processing apparatus 10 obtains the r segments and weights shown in FIG. 9 as the signature for the scene shown in FIG. 8.
- here, the similarity between two segments is defined as the similarity between their signatures based on the r segments. It should be noted that, in practice, it is the above-described dissimilarity metric or similarity metric that is defined.
- (r, w) represents an r segment and its associated weight, as described above.
- For the dissimilarity metric, a small value indicates that the two segments are similar, and a large value indicates that they are dissimilar.
- as one such metric, the video/audio processor 10 introduces the L1 distance.
- the L1 distance d(A, B) between two n-dimensional vectors A and B is given by the following equation (3):

  d(A, B) = Σ_{i=1}^{n} |A_i - B_i|   ... (3)

- where the subscript i indicates the i-th element of each of the n-dimensional vectors A and B.
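Equation (3) translates directly into code:

```python
def l1_distance(A, B):
    """L1 (city-block) distance between two n-dimensional feature
    vectors, equation (3): d(A, B) = sum_i |A_i - B_i|."""
    assert len(A) == len(B), "vectors must have the same dimension"
    return sum(abs(a - b) for a, b in zip(A, B))

print(l1_distance([1, 2, 3], [1, 4, 0]))  # 5
```

This is the per-pair ground distance used by the signature-level metrics that follow.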
- the video and audio processing device 10 measures, by the CPU 11 described above, the similarity between two signatures represented using the above dissimilarity metric, and defines the similarity of the target segments of the two signatures, based on the similarity of their r segments, by one of the following methods.
- the video and audio processing apparatus 10 calculates a distance between two signatures using a weighted minimum value represented by the following equation (4).
- the video and audio processing device 10 calculates a distance between two signatures by using a weighted average distance represented by the following equation (5).
- the video and audio processing device 10 calculates the distance between the two signatures using the weighted median distance shown in the following equation (6).
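The bodies of equations (4)-(6) are not reproduced in this text, so the following is only one plausible reading of the weighted minimum and weighted average, sketched for illustration (the exact weighting scheme in the patent may differ; the weighted median is omitted here):

```python
def weighted_min_distance(sig1, sig2, d):
    """One plausible reading of the weighted minimum (equation (4)):
    the smallest weighted pairwise distance between r segments."""
    return min(w1 * w2 * d(r1, r2)
               for r1, w1 in zip(*sig1) for r2, w2 in zip(*sig2))

def weighted_average_distance(sig1, sig2, d):
    """One plausible reading of the weighted average (equation (5)):
    pairwise r-segment distances averaged with the products of the
    weights (which sum to 1)."""
    return sum(w1 * w2 * d(r1, r2)
               for r1, w1 in zip(*sig1) for r2, w2 in zip(*sig2))

# Signatures as (r_segments, weights); scalar r segments for illustration
s1 = ([0.0, 10.0], [0.5, 0.5])
s2 = ([0.0, 10.0], [0.5, 0.5])
d = lambda a, b: abs(a - b)
print(weighted_min_distance(s1, s2, d))      # 0.0
print(weighted_average_distance(s1, s2, d))  # 5.0
```

Identical signatures give a weighted minimum of 0; the weighted average is nonzero whenever distinct r segments carry mass, which is why the patent offers several alternative metrics.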
- as a fourth method, the video and audio processing apparatus 10 calculates the distance between two signatures using the earth mover's distance described in "Y. Rubner, C. Tomasi and L.J. Guibas, A Metric for Distributions with Applications to Image Databases, Proceedings of the 1998 IEEE International Conference on Computer Vision, Bombay, India, January 1998", which was originally used for color signatures of still images, and is given by the following equation (7).
- For the earth mover's distance, an m × n cost matrix C is defined.
- C_ij are the values that minimize the function.
- by using the algorithm described in "Y. Rubner, C. Tomasi and L.J. Guibas, A Metric for Distributions with Applications to Image Databases, Proceedings of the 1998 IEEE International Conference on Computer Vision, Bombay, India, January 1998", the video and audio processor 10 can minimize the function shown in equation (7) subject to the constraints shown in equation (8) and detect the values of C_ij. In the video and audio processing device 10, the distance between two signatures is defined as the minimum value of the function shown in equation (7).
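The general earth mover's distance requires the linear-program solver of Rubner et al., but for the special case of two normalized one-dimensional distributions on unit-spaced bins it has a simple closed form (the sum of absolute differences of the cumulative distributions), which is enough to illustrate the idea; this restriction is an assumption made only for this sketch:

```python
from itertools import accumulate

def emd_1d(w1, w2):
    """Earth mover's distance between two normalized 1-D distributions on
    unit-spaced bins: total work = sum of |CDF1 - CDF2| over the bins.
    (The general case in Rubner et al. needs a linear-program solver.)"""
    assert len(w1) == len(w2), "distributions must share the same bins"
    diffs = [a - b for a, b in zip(w1, w2)]
    return sum(abs(c) for c in accumulate(diffs))

# Identical distributions need no work; moving all mass two bins costs 2
print(emd_1d([0.5, 0.5], [0.5, 0.5]))            # 0.0
print(emd_1d([1.0, 0.0], [0.0, 1.0]))            # 1.0
print(emd_1d([1.0, 0.0, 0.0], [0.0, 0.0, 1.0]))  # 2.0
```

The appeal of this metric for signatures is visible even in 1-D: it accounts for both the weights and how far mass must move, rather than comparing bins independently.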
- in this manner, the video/audio processing device 10 obtains the similarity between two segments as the similarity of their signatures based on the r segments by any one of these methods. Then, the video and audio processing device 10 determines whether or not to group the segments based on the obtained similarity between the segments.
- as a result, the video and audio processing apparatus 10 can group units of video data, including programs and frames, regardless of the level of the hierarchy.
- as described above, the video/audio processing apparatus 10 automatically extracts signatures at various levels of video data and, by comparing the similarity of two signatures, makes it possible to compare similarities between the corresponding segments.
- the video / audio processing apparatus 10 is capable of grouping segments in various layers of video data, and is applicable to different types of video data.
- the video and audio processing device 10 can be a general-purpose tool for automatically searching for and extracting an arbitrary structure of video data.
- the present invention is not limited to the above-described embodiment.
- feature amounts used when grouping mutually similar sub-segments may be other than those described above.
- it is sufficient that sub-segments that are related to each other can be grouped based on some information.
- as described above in detail, the signal processing method according to the present invention is a signal processing method for extracting, from the sub-segments included in a segment constituting a supplied signal, a signature defined by a representative segment, which is a sub-segment representing the contents of the segment, and a weighting function that assigns a weight to the representative segment. The method includes a group selection step of selecting the groups to be targets of the signature from the groups obtained by classifying the sub-segments based on an arbitrary attribute, a representative segment selection step of selecting one representative segment from each group selected in the group selection step, and a weight calculation step of calculating a weight for the representative segment obtained in the representative segment selection step.
- the signal processing method according to the present invention can thereby extract signatures for segments and, using these signatures, compare the similarity between different segments irrespective of the segment hierarchy of the signal.
- the signal processing method according to the present invention can therefore search, based on similarity, for segments having desired contents among segments of various layers in various signals.
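The three steps recited here (group selection, representative segment selection, weight calculation) can be sketched as follows. The greedy distance-threshold grouping, the choice of the member nearest the group mean as representative, and the coverage-fraction weight are assumptions made for illustration; the method leaves the classification attribute and the weighting function open:

```python
import numpy as np

def extract_signature(subsegment_features, n_groups=2, dissimilarity=0.5):
    # 1. Classify sub-segments into groups. Here a greedy rule puts a
    #    sub-segment into the first group whose founding member lies within
    #    `dissimilarity`; any classification attribute would do.
    groups = []
    for feat in subsegment_features:
        for group in groups:
            if np.linalg.norm(feat - group[0]) < dissimilarity:
                group.append(feat)
                break
        else:
            groups.append([feat])
    # 2. Group selection step: keep the largest groups as signature targets.
    selected = sorted(groups, key=len, reverse=True)[:n_groups]
    signature = []
    for group in selected:
        # 3. Representative segment selection: the member nearest the mean.
        mean = np.mean(group, axis=0)
        representative = min(group, key=lambda f: np.linalg.norm(f - mean))
        # 4. Weight calculation: fraction of sub-segments the group covers.
        weight = len(group) / len(subsegment_features)
        signature.append((representative, weight))
    return signature

# Hypothetical features for five sub-segments forming two visual clusters.
feats = [np.array([1.0, 0.0]), np.array([0.95, 0.05]), np.array([0.9, 0.1]),
         np.array([0.0, 1.0]), np.array([0.05, 0.95])]
sig = extract_signature(feats)
print([round(w, 2) for _, w in sig])  # → [0.6, 0.4]
```

The resulting signature is small relative to the segment it summarises, which is what makes cross-hierarchy comparison cheap.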
- the video and audio processing apparatus according to the present invention extracts, for a video and/or audio segment constituting a supplied video signal, a signature defined by representative segments, namely representative video and/or audio sub-segments included in the segment, and by a weighting function that assigns a weight to each representative segment. The apparatus comprises execution means for selecting the groups to be targets of the signature from among groups obtained by classifying the sub-segments based on an arbitrary attribute, selecting one representative segment from each selected group, and calculating a weight for each representative segment so obtained.
- the video and audio processing apparatus according to the present invention can thereby extract signatures for video and/or audio segments and, using these signatures, compare the similarity between different video and/or audio segments regardless of the hierarchy of video and/or audio segments in the video signal. The video and audio processing apparatus according to the present invention can therefore search, based on similarity, for video and/or audio segments having desired contents among video and/or audio segments of various layers in various video signals.
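Because a signature abstracts away a segment's level in the hierarchy, searching across layers reduces to ranking stored signatures against a query signature. The following sketch illustrates such a search; the segment names, features, and distance measure are all hypothetical:

```python
import numpy as np

def sig_distance(sig_a, sig_b):
    # Weighted nearest-representative distance, symmetrised (illustrative only).
    def directed(src, dst):
        return sum(w * min(np.linalg.norm(f - other) for other, _ in dst)
                   for f, w in src) / sum(w for _, w in src)
    return 0.5 * (directed(sig_a, sig_b) + directed(sig_b, sig_a))

def search_segments(query_sig, library):
    """Rank stored segments of any hierarchy level by signature distance."""
    return sorted(library, key=lambda item: sig_distance(query_sig, item[1]))

# Hypothetical library: a scene, a shot, and a whole program, each reduced
# to a signature of (feature, weight) pairs.
library = [
    ("scene-12", [(np.array([0.8, 0.2]), 1.0)]),
    ("shot-341", [(np.array([0.1, 0.9]), 1.0)]),
    ("program-2", [(np.array([0.75, 0.25]), 0.5), (np.array([0.2, 0.8]), 0.5)]),
]
query = [(np.array([0.78, 0.22]), 1.0)]
print(search_segments(query, library)[0][0])  # → scene-12
```

The same ranking works whether the stored entry is a frame run, a scene, or a program, since only the signatures are compared.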
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Library & Information Science (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Signal Processing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Television Signal Processing For Recording (AREA)
- Image Analysis (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/673,232 US6710822B1 (en) | 1999-02-15 | 2000-02-10 | Signal processing method and image-voice processing apparatus for measuring similarities between signals |
EP00902920A EP1073272B1 (en) | 1999-02-15 | 2000-02-10 | Signal processing method and video/audio processing device |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP11/36338 | 1999-02-15 | ||
JP3633899 | 1999-02-15 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2000048397A1 (fr) | 2000-08-17 |
Family
ID=12467056
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2000/000762 WO2000048397A1 (fr) | Signal processing method and video/audio processing device | 1999-02-15 | 2000-02-10 |
Country Status (4)
Country | Link |
---|---|
US (1) | US6710822B1 (ja) |
EP (1) | EP1073272B1 (ja) |
KR (1) | KR100737176B1 (ja) |
WO (1) | WO2000048397A1 (ja) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7778470B2 (en) * | 2003-09-30 | 2010-08-17 | Kabushiki Kaisha Toshiba | Moving picture processor, method, and computer program product to generate metashots |
US8200061B2 (en) | 2007-09-12 | 2012-06-12 | Kabushiki Kaisha Toshiba | Signal processing apparatus and method thereof |
Families Citing this family (41)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- WO2000045596A1 (fr) * | 1999-01-29 | 2000-08-03 | Sony Corporation | Data description method and data processing unit |
- KR20020059706A (ko) * | 2000-09-08 | 2002-07-13 | J.G.A. Rolfes | Apparatus for reproducing an information signal stored on a storage medium |
- JP2002117407A (ja) * | 2000-10-10 | 2002-04-19 | Satake Corp | Moving image retrieval method and apparatus |
US7031980B2 (en) * | 2000-11-02 | 2006-04-18 | Hewlett-Packard Development Company, L.P. | Music similarity function based on signal analysis |
US20020108112A1 (en) * | 2001-02-02 | 2002-08-08 | Ensequence, Inc. | System and method for thematically analyzing and annotating an audio-visual sequence |
- KR100438269B1 (ko) * | 2001-03-23 | 2004-07-02 | LG Electronics Inc. | Method for automatically detecting anchor shots in a news video browsing system |
CA2386303C (en) | 2001-05-14 | 2005-07-05 | At&T Corp. | Method for content-based non-linear control of multimedia playback |
US20030033602A1 (en) * | 2001-08-08 | 2003-02-13 | Simon Gibbs | Method and apparatus for automatic tagging and caching of highlights |
US7091989B2 (en) * | 2001-08-10 | 2006-08-15 | Sony Corporation | System and method for data assisted chroma-keying |
US7319991B2 (en) * | 2001-12-11 | 2008-01-15 | International Business Machines Corporation | Computerized cost estimate system and method |
EP1531458B1 (en) * | 2003-11-12 | 2008-04-16 | Sony Deutschland GmbH | Apparatus and method for automatic extraction of important events in audio signals |
- DE60319710T2 (de) * | 2003-11-12 | 2009-03-12 | Sony Deutschland Gmbh | Method and apparatus for automatic dissection of segmented audio signals |
US7818444B2 (en) | 2004-04-30 | 2010-10-19 | Move Networks, Inc. | Apparatus, system, and method for multi-bitrate content streaming |
- WO2006035883A1 (ja) * | 2004-09-30 | 2006-04-06 | Pioneer Corporation | Image processing device, image processing method, and image processing program |
US11216498B2 (en) * | 2005-10-26 | 2022-01-04 | Cortica, Ltd. | System and method for generating signatures to three-dimensional multimedia data elements |
US7602976B2 (en) * | 2006-02-17 | 2009-10-13 | Sony Corporation | Compressible earth mover's distance |
US20070204238A1 (en) * | 2006-02-27 | 2007-08-30 | Microsoft Corporation | Smart Video Presentation |
US7577684B2 (en) * | 2006-04-04 | 2009-08-18 | Sony Corporation | Fast generalized 2-Dimensional heap for Hausdorff and earth mover's distance |
US8682654B2 (en) * | 2006-04-25 | 2014-03-25 | Cyberlink Corp. | Systems and methods for classifying sports video |
EP1959449A1 (en) * | 2007-02-13 | 2008-08-20 | British Telecommunications Public Limited Company | Analysing video material |
US8478587B2 (en) * | 2007-03-16 | 2013-07-02 | Panasonic Corporation | Voice analysis device, voice analysis method, voice analysis program, and system integration circuit |
WO2008117232A2 (en) * | 2007-03-27 | 2008-10-02 | Koninklijke Philips Electronics N.V. | Apparatus for creating a multimedia file list |
US8195038B2 (en) * | 2008-10-24 | 2012-06-05 | At&T Intellectual Property I, L.P. | Brief and high-interest video summary generation |
- WO2011062071A1 (ja) * | 2009-11-19 | 2011-05-26 | NEC Corporation | Device and method for classifying audio and image sections |
- JP2012060238A (ja) * | 2010-09-06 | 2012-03-22 | Sony Corp | Moving image processing device, moving image processing method, and program |
- CN102591892A (zh) * | 2011-01-13 | 2012-07-18 | Sony Corporation | Data segmenting device and method |
TW201236470A (en) * | 2011-02-17 | 2012-09-01 | Acer Inc | Method for transmitting internet packets and system using the same |
- CN105355214A (zh) * | 2011-08-19 | 2016-02-24 | Dolby Laboratories Licensing Corporation | Method and device for measuring similarity |
- TWI462576B (zh) * | 2011-11-25 | 2014-11-21 | Novatek Microelectronics Corp | Edge detection method and circuit for fixed patterns |
US9185456B2 (en) | 2012-03-27 | 2015-11-10 | The Nielsen Company (Us), Llc | Hybrid active and passive people metering for audience measurement |
US8737745B2 (en) * | 2012-03-27 | 2014-05-27 | The Nielsen Company (Us), Llc | Scene-based people metering for audience measurement |
- WO2013157190A1 (ja) * | 2012-04-20 | 2013-10-24 | Panasonic Corporation | Speech processing device, speech processing method, program, and integrated circuit |
- KR101421984B1 (ko) * | 2012-10-16 | 2014-07-28 | Mokpo National Maritime University Industry-Academic Cooperation Foundation | Fast generation method for digital holograms based on temporal filtering of depth information |
- FR3004054A1 (fr) * | 2013-03-26 | 2014-10-03 | France Telecom | Generation and playback of a stream representative of audiovisual content |
US9396256B2 (en) * | 2013-12-13 | 2016-07-19 | International Business Machines Corporation | Pattern based audio searching method and system |
- KR102306538B1 (ko) * | 2015-01-20 | 2021-09-29 | Samsung Electronics Co., Ltd. | Apparatus and method for editing content |
WO2017087003A1 (en) * | 2015-11-20 | 2017-05-26 | Hewlett Packard Enterprise Development Lp | Segments of data entries |
- CN107888843A (zh) * | 2017-10-13 | 2018-04-06 | Shenzhen Xunlei Network Technology Co., Ltd. | Audio mixing method, device, storage medium and terminal equipment for user-created content |
US11315585B2 (en) * | 2019-05-22 | 2022-04-26 | Spotify Ab | Determining musical style using a variational autoencoder |
US11355137B2 (en) | 2019-10-08 | 2022-06-07 | Spotify Ab | Systems and methods for jointly estimating sound sources and frequencies from audio |
US11366851B2 (en) | 2019-12-18 | 2022-06-21 | Spotify Ab | Karaoke query processing system |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- JPH07193748A (ja) * | 1993-12-27 | 1995-07-28 | Nippon Telegr & Teleph Corp <Ntt> | Moving image processing method and apparatus |
EP0711078A2 (en) * | 1994-11-04 | 1996-05-08 | Matsushita Electric Industrial Co., Ltd. | Picture coding apparatus and decoding apparatus |
- JPH10257436A (ja) * | 1997-03-10 | 1998-09-25 | Atsushi Matsushita | Automatic hierarchical structuring method for moving images and browsing method using the same |
EP0907147A2 (en) * | 1997-09-26 | 1999-04-07 | Matsushita Electric Industrial Co., Ltd. | Clip display method and display device therefor |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5664227A (en) * | 1994-10-14 | 1997-09-02 | Carnegie Mellon University | System and method for skimming digital audio/video data |
- JPH08181995A (ja) | 1994-12-21 | 1996-07-12 | Matsushita Electric Ind Co Ltd | Moving image encoding device and moving image decoding device |
US5805733A (en) * | 1994-12-12 | 1998-09-08 | Apple Computer, Inc. | Method and system for detecting scenes and summarizing video sequences |
US5870754A (en) * | 1996-04-25 | 1999-02-09 | Philips Electronics North America Corporation | Video retrieval of MPEG compressed sequences using DC and motion signatures |
US5872564A (en) * | 1996-08-07 | 1999-02-16 | Adobe Systems Incorporated | Controlling time in digital compositions |
US6195458B1 (en) * | 1997-07-29 | 2001-02-27 | Eastman Kodak Company | Method for content-based temporal segmentation of video |
US6373979B1 (en) * | 1999-01-29 | 2002-04-16 | Lg Electronics, Inc. | System and method for determining a level of similarity among more than one image and a segmented data structure for enabling such determination |
US6236395B1 (en) * | 1999-02-01 | 2001-05-22 | Sharp Laboratories Of America, Inc. | Audiovisual information management system |
2000
- 2000-02-10 US US09/673,232 patent/US6710822B1/en not_active Expired - Fee Related
- 2000-02-10 WO PCT/JP2000/000762 patent/WO2000048397A1/ja not_active Application Discontinuation
- 2000-02-10 KR KR1020007011374A patent/KR100737176B1/ko not_active IP Right Cessation
- 2000-02-10 EP EP00902920A patent/EP1073272B1/en not_active Expired - Lifetime
Non-Patent Citations (1)
Title |
---|
See also references of EP1073272A4 * |
Also Published As
Publication number | Publication date |
---|---|
EP1073272A1 (en) | 2001-01-31 |
US6710822B1 (en) | 2004-03-23 |
EP1073272B1 (en) | 2011-09-07 |
EP1073272A4 (en) | 2004-10-06 |
KR100737176B1 (ko) | 2007-07-10 |
KR20010042672A (ko) | 2001-05-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
- WO2000048397A1 (fr) | Signal processing method and video/audio processing device | |
US8467610B2 (en) | Video summarization using sparse basis function combination | |
US6724933B1 (en) | Media segmentation system and related methods | |
US8467611B2 (en) | Video key-frame extraction using bi-level sparsity | |
US6741655B1 (en) | Algorithms and system for object-oriented content-based video search | |
Ardizzone et al. | Automatic video database indexing and retrieval | |
US20120148149A1 (en) | Video key frame extraction using sparse representation | |
- JP3568117B2 (ja) | Method and system for segmentation, classification, and summarization of video images | |
Avrithis et al. | A stochastic framework for optimal key frame extraction from MPEG video databases | |
- JP4258090B2 (ja) | Video frame classification method, segmentation method, and computer-readable storage medium | |
Priya et al. | Shot based keyframe extraction for ecological video indexing and retrieval | |
US20070030391A1 (en) | Apparatus, medium, and method segmenting video sequences based on topic | |
- JP2009095013A (ja) | Video summarization system and computer program for video summarization | |
- JP2006508565A (ja) | Method for summarizing unknown content of video | |
US8165983B2 (en) | Method and apparatus for resource allocation among classifiers in classification systems | |
US6996171B1 (en) | Data describing method and data processor | |
WO2002082328A2 (en) | Camera meta-data for content categorization | |
- CN113766330A (zh) | Method and device for generating recommendation information based on video | |
Boujemaa et al. | Ikona: Interactive specific and generic image retrieval | |
Panchal et al. | Scene detection and retrieval of video using motion vector and occurrence rate of shot boundaries | |
Mohamadzadeh et al. | Content based video retrieval based on hdwt and sparse representation | |
Zhu et al. | Video scene segmentation and semantic representation using a novel scheme | |
EP1008064A1 (en) | Algorithms and system for object-oriented content-based video search | |
- JP4224917B2 (ja) | Signal processing method and video/audio processing apparatus | |
Mervitz et al. | Comparison of early and late fusion techniques for movie trailer genre labelling |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): KR US |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2000902920 Country of ref document: EP |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
WWE | Wipo information: entry into national phase |
Ref document number: 09673232 Country of ref document: US |
|
WWE | Wipo information: entry into national phase |
Ref document number: 1020007011374 Country of ref document: KR |
|
WWP | Wipo information: published in national office |
Ref document number: 2000902920 Country of ref document: EP |
|
WWP | Wipo information: published in national office |
Ref document number: 1020007011374 Country of ref document: KR |
|
WWR | Wipo information: refused in national office |
Ref document number: 1020007011374 Country of ref document: KR |