WO2012093430A1 - Interest section extraction device and interest section extraction method - Google Patents
Interest section extraction device and interest section extraction method
- Publication number
- WO2012093430A1 (PCT/JP2011/006031)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- section
- interest
- feature
- vector
- interval
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/14—Picture signal circuitry for video frequency region
- H04N5/147—Scene change detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7834—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L99/00—Subject matter not provided for in other groups of this subclass
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/02—Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
- G11B27/031—Electronic editing of digitised analogue information signals, e.g. audio or video signals
- G11B27/034—Electronic editing of digitised analogue information signals, e.g. audio or video signals on discs
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/10—Indexing; Addressing; Timing or synchronising; Measuring tape travel
- G11B27/19—Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
- G11B27/28—Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4394—Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/83—Generation or processing of protective or descriptive data associated with content; Content structuring
- H04N21/845—Structuring of content, e.g. decomposing content into time segments
- H04N21/8456—Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/81—Detection of presence or absence of voice signals for discriminating voice from music
Definitions
- The present invention relates to a technique for extracting, from AV content, a section of interest to the user, and particularly to a technique using an audio signal.
- In this description, an "interest section" is a section of the content that is of interest to the user.
- In a conventional moving image photographing device, the user operates a controller (for example, presses an input button on the controller) to set the start time of an interest section, and operates the controller again to set the end time, whereby the interest section is extracted.
- The present invention has been made in view of the above, and an object thereof is to reduce the user's work burden when extracting an interest section from AV content.
- An interest section extraction device extracts a user's interest section, which includes a specified time, based on an audio signal included in a moving image file. It comprises: anchor model storage means for storing in advance anchor models expressing the features of each of a plurality of types of reference sound segments; specified time acquisition means for acquiring the specified time; likelihood vector generating means for calculating, for each unit section of the audio signal, the likelihood of a feature vector expressing the features of that section using each anchor model, and generating a likelihood vector having each likelihood as a component; and interest section extraction means for calculating candidate sections for the interest section based on the likelihood vectors, and extracting, as the interest section, all or part of a candidate section that includes the specified time.
- With this configuration, an appropriate interest section is extracted simply by specifying a time, so the user's workload in extracting the interest section can be reduced.
- The interest section extraction device may further include frequency vector generating means for generating, for each second unit section of the audio signal that is N times as long as the first unit section (N being a natural number of 2 or more), a frequency vector from the N likelihood vectors generated within that section; the candidate section may then be calculated based on the frequency vectors.
- The interest section extraction device may include a component classification unit that classifies the components of the frequency vector into a plurality of component groups, and a feature section calculation unit that calculates a plurality of feature sections, one based on each component group. The candidate section may then be determined from the plurality of feature sections.
- Each component of the centroid vector of the likelihood vectors generated from all sections of the audio signal represents the appearance frequency of a sound segment over the entire AV content. If the frequency-vector components are classified into groups based on these centroid components, the classification reflects differences in the character of the sound environment. Determining the candidate section from feature sections that are each calculated from components sharing the same sound environment therefore allows the character of the sound environment to be reflected in the feature sections.
- The component classification means may generate a centroid vector from the likelihood vectors of all sections of the audio signal and classify each component of the frequency vector based on the magnitude of the corresponding component of the centroid vector. The feature section calculation means may calculate a first feature section based on the components belonging to the first component group of the centroid vector and a second feature section based on the components belonging to the second component group, and the candidate section may be determined from the first feature section and the second feature section.
- The feature section calculation means may assign the components of the centroid vector whose magnitude is at least a predetermined value, and their corresponding anchor models, to the first component group, and those below the predetermined value to the second component group, calculating the first feature section from the first group and the second feature section from the second group. The first feature section is the duration of a sound environment with stable properties, and the second feature section is the duration of a sound environment with sudden properties, so an interest section containing both a stable sound environment and a sudden sound environment can be extracted.
- The interest section may be contained in the first feature section and contain the second feature section. This makes it possible to accurately extract an interest section that includes both a stable sound environment and a sudden sound environment.
- The interest section extraction device may include interest section length acquisition means for acquiring the length of the interest section preset by the user, and feature time extraction means that searches for and extracts a feature time belonging to the second feature section while shifting the time from the specified time in steps of the second unit section. The interest section extraction means shifts a target time from the specified time toward the extracted feature time in steps of the second unit section, and determines whether the target time belongs to the first feature section and whether the length between the target time and the specified time is shorter than the preset interest section length. If both conditions are satisfied, the second unit section containing the target time is added to the interest section.
- This reduces the processing load of the interest section extraction unit.
- The moving image file may correspond to a moving image representing a single content item. In that case, the first feature section and the second feature section are extracted from the whole of one content item, so they can be extracted more accurately.
- The interest section extraction means may arrange interest section data indicating the plurality of interest sections corresponding to a plurality of specified times in order of specified time and store them in an external storage device.
- The specified time acquisition means may automatically acquire the specified time from a section designated by the user, based on the temporal change of the feature amount of each image data item in the moving image file corresponding to that section. The user then only needs to designate a section containing the time of interest, which reduces the burden of designating the specified time.
- The present invention is also an interest section extraction method for extracting a user's interest section including a specified time based on an audio signal included in a moving image file, comprising: an anchor model storing step of storing in advance anchor models expressing the features of each of a plurality of types of reference sound segments; a specified time acquiring step of acquiring the specified time; a likelihood vector generating step of calculating, for each unit section of the audio signal, the likelihood of a feature vector expressing the features of that section using each anchor model, and generating a likelihood vector having each likelihood as a component; and an interest section extracting step of calculating candidate sections based on the likelihood vectors and extracting, as the interest section, all or part of a candidate section that includes the specified time.
- The present invention is also a program that causes a computer to execute interest section extraction processing for extracting a user's interest section including a specified time based on an audio signal included in a moving image file, the program comprising: an anchor model storage step of storing in advance anchor models expressing the features of each of a plurality of types of reference sound segments; a specified time acquisition step of acquiring the specified time; a likelihood vector generation step of calculating, for each unit section of the audio signal, the likelihood of a feature vector expressing the features of that section using each anchor model, and generating a likelihood vector having each likelihood as a component; and an interest section extracting step of calculating candidate sections and extracting, as the interest section, all or part of a candidate section that includes the specified time.
- The present invention is also an interest section extraction integrated circuit that extracts a user's interest section including a specified time based on an audio signal included in a moving image file, comprising: an anchor model storage unit that stores in advance anchor models expressing the features of each of a plurality of types of reference sound segments; a specified time acquisition unit that acquires the specified time; a likelihood vector generation unit that calculates, for each unit section of the audio signal, the likelihood of a feature vector representing the feature amount of that section using each anchor model, and generates a likelihood vector having each likelihood as a component; and an interest section extraction unit that calculates candidate sections based on the likelihood vectors and extracts, as the interest section, all or part of a candidate section that includes the specified time.
- The interest section extraction apparatus expresses the feature amount of the audio signal included in a moving image file, for each first unit section (10 msec) of the signal, using each of a plurality of types of anchor models Ar.
- As shown in FIG. 1, suppose the moving image file records an athletic meet, and the user wants to edit out only the scene within a predetermined length of time before and after the start of a student race. When the user designates a time near the start time within the race scene, the apparatus first extracts, as the interest section, a part of the first feature section corresponding to the entire race scene.
- The interest section is extracted so as to include the starting-gun scene (the second feature section in FIG. 1) that announces the start of the race.
- The interest section is extracted with the second unit section (1 sec), which is 100 times the first unit section, as the minimum unit.
- The duration of a sudden sound environment such as the starting-gun scene is set appropriately in advance; only the end time of the second feature section (feature point Tk) is determined, and the time obtained by tracing back that duration from the end time is regarded as the start time of the second feature section.
- Moving Image File A moving image file is composed of an audio signal and a plurality of image data.
- The audio signal has a waveform as shown in FIG. 2(a); it is a time series of amplitude values. <2-2> Feature quantity vector Below, the generation of the feature quantity vector M from the audio signal is outlined.
- First, the power spectrum S(ω) is calculated for each first unit section (the section between time Tn and time Tn+1, 10 msec) of the audio signal extracted by the audio extraction device 102 (see FIG. 2(b)).
- From the power spectrum, a vector consisting of 26 mel-frequency cepstrum coefficients (MFCC, Mel-Frequency Cepstrum Coefficients) is obtained for the first unit section; this is referred to as the feature quantity vector.
- The feature vector M is calculated for each first unit section (every 10 msec) as shown in FIG. 3. Accordingly, 100 feature quantity vectors M are generated from the audio signal between time 0 sec and time 1 sec.
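The feature extraction above can be sketched in Python. This is a minimal numpy-only illustration, not the patent's implementation: the 16 kHz sampling rate, Hamming window, 512-point FFT, and 40-filter mel bank are assumptions; only the 10 msec first unit section and the 26 MFCC coefficients come from the text.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_per_section(signal, sr=16000, section_ms=10, n_fft=512,
                     n_filters=40, n_coeffs=26):
    """One 26-dimensional MFCC feature vector M per first unit section (10 msec)."""
    frame_len = int(sr * section_ms / 1000)              # samples per section
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    frames = frames * np.hamming(frame_len)              # reduce spectral leakage
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2   # S(omega)

    # Triangular mel filterbank over the power spectrum.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, ce, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, ce):
            fbank[i, k] = (k - lo) / max(ce - lo, 1)
        for k in range(ce, hi):
            fbank[i, k] = (hi - k) / max(hi - ce, 1)

    log_mel = np.log(power @ fbank.T + 1e-10)
    # DCT-II of the log filterbank energies gives the cepstral coefficients.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * n_filters))
    return log_mel @ dct.T                               # shape (n_frames, 26)
```

One second of audio at 16 kHz yields 100 sections and therefore 100 feature vectors M, matching the text.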
- Anchor model The anchor models according to the present embodiment express the features of each of the 1024 types of sound segments used as references when calculating likelihoods, and one anchor model is created per sound segment. Each anchor model is composed of parameters that define a feature amount appearance probability function.
- In this embodiment, the anchor models are created using a GMM (Gaussian Mixture Model).
- Each anchor model Ar is constituted by a feature amount appearance probability function bAr(M), one for each of the 1024 types of sound segments in the first unit section.
- The function bAr exists for each anchor model Ar and takes the 26th-order MFCC vector (feature quantity vector) M as its argument; the likelihood is calculated from it. Which anchor model corresponds to which sound segment is not distinguished.
- The likelihood vector F has as its components the likelihoods Lr calculated for the feature quantity vector M using the respective anchor models Ar; it is therefore a 1024-dimensional vector.
- As described in <2-3>, the feature vector M is generated for each first unit section of the audio signal extracted by the audio extraction device 102.
- FIG. 5 shows likelihood vectors Fn and Fm (n ≠ m) calculated using the anchor models Ar of the 1024 types of sound segments.
- The vertical axis in FIG. 5 is the likelihood, and the horizontal axis indicates the type of anchor model Ar.
- Fn is the likelihood vector in the n-th first unit section from time 0 (the section between time (10 × n) msec and time (10 × (n + 1)) msec), and Fm is the likelihood vector in the m-th first unit section (the section between time (10 × m) msec and time (10 × (m + 1)) msec) (see FIG. 2(a)).
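Computing a likelihood vector from GMM anchor models can be sketched as follows. This is a toy illustration with 4 anchors instead of 1024 and randomly generated parameters; real anchor models would be trained on the audio data stored in the voice data storage device 130, and the mixture count and covariance structure here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_ANCHORS, N_MIX = 26, 4, 3   # the patent uses 1024 anchors; 4 here for brevity

# Hypothetical trained GMM parameters per anchor: mixture weights w,
# component means mu, and diagonal variances var.
anchors = [
    {"w": np.full(N_MIX, 1.0 / N_MIX),
     "mu": rng.normal(size=(N_MIX, DIM)),
     "var": np.full((N_MIX, DIM), 1.0)}
    for _ in range(N_ANCHORS)
]

def gmm_likelihood(M, gmm):
    """b_Ar(M): likelihood of one 26-dim feature vector M under one
    diagonal-covariance Gaussian mixture model."""
    diff = M - gmm["mu"]                                        # (N_MIX, DIM)
    log_norm = -0.5 * (DIM * np.log(2 * np.pi) + np.log(gmm["var"]).sum(axis=1))
    log_comp = log_norm - 0.5 * (diff ** 2 / gmm["var"]).sum(axis=1)
    return float(np.dot(gmm["w"], np.exp(log_comp)))

def likelihood_vector(M):
    """F: one likelihood Lr per anchor model Ar, as described above."""
    return np.array([gmm_likelihood(M, g) for g in anchors])

# One likelihood vector for one first unit section's feature vector.
F = likelihood_vector(rng.normal(size=DIM))
```

With 1024 anchors this yields the 1024-dimensional likelihood vector F of the text, one per 10 msec section.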
- FIG. 6 shows a video editing apparatus 100 equipped with the section of interest extraction apparatus 104 according to the present embodiment.
- the video editing apparatus 100 includes an input device 101, a content storage device 103, a voice extraction device 102, an interest interval extraction device 104, and an interest interval storage device 105.
- the input device 101 is composed of a disk drive device or the like. When the recording medium 110 is loaded, the input device 101 reads a moving image file from the recording medium 110 and stores it in the content storage device 103.
- the content storage device 103 is configured by a hard disk device or the like, and stores a moving image file acquired from the recording medium 110 by the input device 101.
- the voice extraction device 102 acquires a moving image file from the content storage device 103, extracts an audio signal from the acquired moving image file, and inputs the extracted audio signal to the interested section extraction device 104.
- the voice extraction device 102 generates an audio signal as shown in FIG. 2A by performing a decoding process on the encoded audio signal.
- the output device 106 displays an image on the display device 120.
- the output device 106 acquires interest interval data from the interest interval storage device 105, and selects a plurality of image data constituting a part of the moving image file from the content storage device 103 based on the acquired interest interval data. That is, a plurality of pieces of image data associated with time data indicating the time determined from the interest section data are selected.
- the output device 106 causes the external display device 120 to display a digest moving image in which moving images are joined in order from the earliest designated time corresponding to each interest section.
- the voice data storage device 130 is composed of a hard disk device or the like, and stores voice data used when the anchor model creation device 108 creates an anchor model Ar that represents the characteristics of each of a plurality of types of sound elements.
- This audio data is composed of audio signals extracted and decoded in advance from a plurality of moving image files that are separate from the moving image file from which the interest section is extracted.
- the interface device 109 includes an operation unit (not shown) such as a keyboard, and has a function of accepting an input operation from a user and notifying the input information to the interested section extracting device 104 and the anchor model creating device 108.
- the user inputs information related to the designated time and the length of the interest section to the interest section extraction device 104 via the interface device 109.
- Interest Section Extraction Device The interest section extraction device 104 is composed of a memory (not shown) and a processor (not shown); the processor executes a program read into the memory, thereby realizing each of the components shown in the figure. Each component is described in detail below.
- ⁇ 3-2-1> Feature Quantity Vector Generation Unit The feature quantity vector generation unit 201 generates a feature quantity vector from the input audio signal.
- the feature vector generation unit 201 first performs acoustic analysis for each first unit interval on the audio signal input from the speech extraction device 102 to calculate a power spectrum S ( ⁇ ).
- the feature vector generation unit 201 generates a feature vector M (M (1), M (2),..., M (26)) from the calculated power spectrum S ( ⁇ ).
- the feature vector generation unit 201 generates 100 feature vectors M (see FIG. 3).
- The likelihood vector generation unit 202 calculates the likelihood Lr for the feature quantity vector M using the anchor model Ar of each sound segment, and generates a likelihood vector F having each calculated likelihood Lr as a component.
- the likelihood vector generation unit 202 acquires each parameter constituting the anchor model Ar from the anchor model storage unit 107.
- the likelihood vector buffer 203 includes a partial area of the memory, and stores the likelihood vector F generated by the likelihood vector generation unit 202.
- Component Classification Unit The component classification unit 205 reads from the likelihood vector buffer 203 all likelihood vectors F generated from all sections of the audio signal and, according to the relational expression [Equation 1], calculates the centroid vector G by dividing each component of the sum of the likelihood vectors F by the number of first unit sections contained in all sections.
- Each component of the centroid vector G is the average value (also called the normalized cumulative likelihood) of the likelihood Lr of the corresponding anchor model Ar over all first unit sections of the audio signal; that is, it represents the appearance frequency, over the entire audio signal, of the sound segment indicated by that anchor model Ar. The larger a component of the centroid vector G, the more frequently the sound segment indicated by the corresponding anchor model Ar appears.
- the appearance frequency of sound segments is expressed by normalized cumulative likelihood has been described, but the expression of the appearance frequency is not limited to this.
- The component classification unit 205 then sorts the components of the calculated centroid vector G in descending order of magnitude. Components ranked within the top quarter of the total number of anchor model types, that is, within the top 256, are classified as belonging to anchor models Ar with a high appearance frequency (the high frequency group), and the remaining components as belonging to anchor models Ar with a low appearance frequency (the low frequency group).
- FIG. 8 shows how the component classification unit 205 performs this processing. In the histograms (a) and (b) of FIG. 8, the vertical axis indicates the magnitude of each component of the centroid vector G, and the horizontal axis indicates each component Gr of the centroid vector G and the anchor model Ar corresponding to it.
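The centroid computation and component classification can be sketched as below. The function names are illustrative; only the averaging in [Equation 1] and the top-quarter split (1024 / 4 = 256) come from the text, demonstrated here with 8 anchors instead of 1024.

```python
import numpy as np

def classify_components(likelihood_vectors, high_fraction=0.25):
    """Centroid vector G ([Equation 1]) and the split of anchor-model indices
    into high- and low-appearance-frequency groups."""
    F = np.asarray(likelihood_vectors)            # (n_first_sections, n_anchors)
    # Normalized cumulative likelihood: sum over all first unit sections,
    # divided by the number of sections.
    G = F.sum(axis=0) / F.shape[0]
    n_high = int(round(F.shape[1] * high_fraction))   # 1024 / 4 = 256 in the patent
    order = np.argsort(-G)                        # anchor indices, descending by G
    high_group = np.sort(order[:n_high])          # high appearance frequency
    low_group = np.sort(order[n_high:])           # low appearance frequency
    return G, high_group, low_group
```

The two index groups are later used to split each frequency vector into its high-frequency part NFh and low-frequency part NFl.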
- Frequency vector generation unit The frequency vector generation unit 206 generates a frequency vector NF for each second unit section (1 sec), shifting the section used for generation by one second unit section at a time. As shown in FIG. 9, a second unit section corresponds to a set of a plurality of first unit sections, and each component of the frequency vector NF is the normalized cumulative likelihood of the corresponding component of the likelihood vectors F included in the second unit section. An example of the frequency vector NF is shown in the upper part of FIG. 10.
- The frequency vector generation unit 206 stops generating frequency vectors NF when notified of a frequency vector creation end instruction by the interest section extraction unit 209 (described later), and starts generation when notified of a generation start instruction.
- Using the attribute information of each anchor model Ar input from the component classification unit 205 (that is, information indicating whether each anchor model Ar belongs to the high frequency group or the low frequency group), the frequency vector generation unit 206 generates a high-frequency vector NFh and a low-frequency vector NFl.
- Frequency vector buffer The frequency vector buffer 207 includes a partial area of the memory, and stores the low frequency vector NFl and the high frequency vector NFh generated by the frequency vector generation unit 206.
- The lower graph of FIG. 10 shows each component of the low-frequency vector NFl and the high-frequency vector NFh stored in the frequency vector buffer 207 as a line graph.
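The generation of NFh and NFl can be sketched as follows, under the document's parameters (one second unit section = 100 first unit sections) and the component groups produced by the classification step; the function name is illustrative.

```python
import numpy as np

def frequency_vectors(likelihood_vectors, high_group, low_group, per_second=100):
    """One frequency vector per second unit section (1 sec = 100 first unit
    sections): the normalized cumulative likelihood of each component over
    the section, split into a high-frequency part NFh and a low-frequency
    part NFl according to the component groups."""
    F = np.asarray(likelihood_vectors)            # (n_first_sections, n_anchors)
    n_sec = F.shape[0] // per_second
    # Average the likelihood vectors within each second unit section.
    NF = F[:n_sec * per_second].reshape(n_sec, per_second, -1).mean(axis=1)
    return NF[:, high_group], NF[:, low_group]    # NFh, NFl
```

In the patent's dimensions, NFh would have 256 components and NFl 768 per second unit section.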
- Reference vector / threshold generation unit The reference vector / threshold generation unit 204 acquires from the frequency vector buffer 207 the high-frequency vectors NFh contained in a plurality of second unit sections centred on the one corresponding to the specified time, and calculates a reference vector NFh0.
- Specifically, NFh0 is obtained by dividing the sum of the nine high-frequency vectors NFh contained in the second unit section corresponding to the specified time and the four second unit sections before and after it (nine second unit sections in total) by the number of second unit sections (9).
- The reference vector / threshold generation unit 204 further calculates the Euclidean distance between each of the high-frequency vectors NFh used to generate NFh0 and the reference vector NFh0, and sets the largest of these distances as the threshold Rth used to determine whether a section belongs to the first feature section.
- FIG. 12 illustrates this state using the concept of a high-frequency vector space.
- The reference vector / threshold generation unit 204 inputs the generated reference vector NFh0 and threshold Rth to the interest section extraction unit 209.
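The reference vector and threshold computation can be sketched as below; the window of four second unit sections on each side of the specified time (nine in total) follows the text, while the edge handling at the start or end of the signal is an assumption.

```python
import numpy as np

def reference_and_threshold(NFh, t0, half_window=4):
    """Reference vector NFh0: mean of the high-frequency vectors in the nine
    second unit sections centred on the one containing the specified time.
    Threshold Rth: the largest Euclidean distance from NFh0 observed within
    that same window."""
    lo = max(0, t0 - half_window)                 # clamp at signal boundaries
    hi = min(len(NFh), t0 + half_window + 1)
    window = NFh[lo:hi]
    NFh0 = window.mean(axis=0)
    Rth = float(np.linalg.norm(window - NFh0, axis=1).max())
    return NFh0, Rth
```

Geometrically (FIG. 12), Rth is the radius of the smallest sphere centred on NFh0 that contains all nine window vectors.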
- Specified Time Acquisition Unit The specified time acquisition unit 210 acquires information on the specified time from the interface device 109 and inputs it to the reference vector / threshold generation unit 204, the feature point extraction unit 208, and the interest section extraction unit 209.
- Feature Point Extraction Unit The feature point extraction unit 208 calculates the norm of the difference Δ(NFl(T) - NFl(T-1)) between consecutive low-frequency vectors while going back in time from the designated time T0 in 1 sec steps.
- Among the times at which the norm of the difference Δ(NFl(T) - NFl(T-1)) exceeds the threshold Th, it takes the time closest to the specified time T0 as the feature point (feature time). That is, the feature point extraction unit 208 searches backward in time from the specified time T0, in steps of the second unit section, for the feature point Tk that is the end time of the second feature section, and extracts it.
- FIG. 13 shows an example of the norm of the difference Δ(NFl(T) - NFl(T-1)) of the low-frequency vector NFl(T). Since the norm exceeds the threshold Th at time Tk, time Tk is taken as the feature point.
- In this way, the feature point extraction unit 208 uses the specified-time information input from the specified time acquisition unit 210 and the low-frequency vectors stored in the frequency vector buffer 207 to extract the feature point Tk indicating the end time of the second feature section (see FIG. 1). The specified-time information is expressed as the elapsed time from the start time of the moving image file.
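The backward search for the feature point can be sketched as follows; the function name and the behaviour when no change exceeds Th (returning None) are assumptions not stated in the text.

```python
import numpy as np

def feature_point(NFl, t0, Th):
    """Search backward from the specified time T0, one second unit section
    at a time, for the nearest section where the norm of the difference
    between consecutive low-frequency vectors exceeds Th. That time is the
    feature point Tk, the end time of the second feature section."""
    for t in range(t0, 0, -1):
        if np.linalg.norm(NFl[t] - NFl[t - 1]) > Th:
            return t
    return None  # no sudden change in the sound environment found before T0
```

A sudden sound such as a starting gun produces a large jump between consecutive low-frequency vectors, which is exactly what the norm test detects.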
- The interest section extraction unit 209 extracts the first feature section (candidate section) based on the information about the specified time input from the specified time acquisition unit 210 and the high-frequency vectors NFh.
- The interest section extraction unit 209 first identifies the second unit section including the designated time (time T0 in FIG. 14A). It then calculates the Euclidean distance between the reference vector NFhc input from the reference vector / threshold generation unit 204 and the high-frequency vector NFh at each other time.
- The section between the two times TL1 and TL2 at which the Euclidean distance between the high-frequency vector NFh and the reference vector NFhc exceeds the threshold Rth input from the reference vector / threshold generation unit 204 is the first feature section. This corresponds to FIG. 14(a).
- FIG. 13B shows the relationship between the threshold and the Euclidean distance in the high-frequency vector NFh space.
- In other words, the high-frequency vectors NFh in the first feature section lie inside a sphere of radius Rth centered on the reference vector NFhc in the high-frequency vector space shown in the figure.
- The interest section extraction unit 209 moves the second unit section from the specified time T0 toward the feature point Tk extracted by the feature point extraction unit 208, going back in time.
- While shifting the time one second unit section at a time, it calculates the Euclidean distance between the high-frequency vector NFh at the target time and the reference vector NFhc, and determines whether the calculated Euclidean distance exceeds the threshold Rth (that is, whether the second unit section including the target time is included in the first feature section).
- the interest interval extraction unit 209 notifies the frequency vector generation unit 206 of a frequency vector creation end instruction.
- At the same time, the interest section extraction unit 209 determines whether the length between the target time and the designated time T0 is shorter than the preset interest section length.
- If the Euclidean distance does not exceed the threshold Rth (i.e., the target time is included in the first feature section) and the length between the target time and the specified time T0 is shorter than the preset interest section length le (that is, if the conditions for an interest section are determined to be satisfied), the second unit section including the target time becomes part of the interest section.
- The interest section extraction unit 209 then determines whether the length between the target time and the specified time T0 is shorter than the preset interest section length le. If it determines that it is shorter, it sequentially calculates the Euclidean distance between the high-frequency vector NFh at the target time and the reference vector NFhc while advancing from time T0 one second unit section at a time, and performs the same determination as above.
- the section of interest extraction unit 209 notifies the frequency vector generation unit 206 of a frequency vector creation start instruction.
- The interest section extraction unit 209 ends the process when the calculated Euclidean distance exceeds the predetermined threshold Rth or when the total length of the sections specified as the interest section exceeds the preset interest section length.
- At that point, the frequency vector generation unit 206 is notified of a frequency vector creation end instruction, and a section of length le including the feature point Tk is extracted from the first feature section as the interest section (see FIG. 16).
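The expansion of the interest section around the specified time, bounded by the distance threshold Rth and the preset length le, can be sketched as follows (an illustrative Python simplification; it treats times as second-unit-section indices and omits the frequency-vector start/end instructions):

```python
import numpy as np

def extract_interest_section(high_freq_vectors, nfhc, rth, t0, tk, le):
    """Expand the interest section around the specified time t0.  Walk from
    t0 back toward the feature point tk, then forward from t0, keeping
    times whose high-frequency vector lies within Euclidean distance rth
    of the reference vector nfhc, until the section reaches length le."""
    def inside(t):
        return np.linalg.norm(np.asarray(high_freq_vectors[t]) - nfhc) <= rth

    section = []
    t = t0
    while t >= tk and inside(t) and len(section) < le:     # backward pass
        section.insert(0, t)
        t -= 1
    t = t0 + 1
    while t < len(high_freq_vectors) and inside(t) and len(section) < le:  # forward pass
        section.append(t)
        t += 1
    return section  # second unit sections making up the interest section
```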
- The interest section length le is determined in advance by user evaluation using a simple editing application (for example, 60 sec). If the feature point Tk is le or more away from the specified time T0, a period of length le that does not include the feature point Tk is extracted as the interest section, as shown in the figure. This is the case, for example, when the specified time T0 is 1000 sec, the feature point Tk is 900 sec, and the interest section length le is 50 sec: the interest section length le is then shorter than the 100 sec from the feature point Tk to the specified time T0.
- Anchor Model Storage Unit: The anchor model storage unit 107 consists of a part of the memory and stores the anchor models Ar created by the anchor model creation device 108.
- the anchor model storage unit 107 stores the anchor model Ar in advance before performing the interest section extraction process.
- FIG. 18 shows functional blocks of the anchor model creation device 108 according to the present embodiment.
- The anchor model creation device 108 creates anchor models Ar from the audio data stored in the audio data storage device 130 and stores them in the anchor model storage unit 107.
- The anchor model creation device 108 includes a memory (not shown) and a processor (not shown); the processor realizes the configuration shown in FIG. 18 by executing programs read into the memory. That is, as shown in FIG. 18, the anchor model creation device 108 realizes a feature vector generation unit 301, a feature vector classification unit 302, and an anchor model generation unit 303.
- Feature Vector Generation Unit: Like the feature vector generation unit 201 described in <3-2-1> above, the feature vector generation unit 301 divides the data into first unit sections, performs acoustic analysis for each first unit section to calculate the power spectrum S(ω), and generates feature vectors M from the calculated power spectrum S(ω).
- Anchor Model Generation Unit: The anchor model generation unit 303 calculates the feature appearance probability function b_Ar(M) corresponding to each anchor model Ar based on the cluster feature vector of each cluster.
- Operation
- Operation of the Video Editing Device: The operation of the video editing device 100 equipped with the interest section extraction device 104 according to the present embodiment is described below.
- The input device 101 acquires, from the recording medium 110, the moving image file for which the user has instructed that an interest section be extracted and displayed, and stores it in the content storage device 103.
- The audio extraction device 102 extracts an audio signal from the moving image file stored in the content storage device 103.
- The interest section extraction device 104 performs the interest section extraction described below based on the audio signal extracted by the audio extraction device 102.
- The interest section data extracted by the interest section extraction device 104 is stored in the interest section storage device 105.
- the output device 106 selects a plurality of pieces of image data corresponding to the interest section data extracted from the moving image file by the interest section extraction process, and causes the display device 120 to display the selected image data.
- The audio extraction device 102 extracts the audio signal included in the moving image file designated by the user from the content storage device 103 (arrow P1) and inputs it to the feature vector generation unit 201 (arrow P2).
- the feature vector generation unit 201 generates a feature vector from the input audio signal and inputs it to the likelihood vector generation unit 202 (arrow P3).
- The likelihood vector generation unit 202 generates a likelihood vector F for each first unit section from the input feature vector and the anchor models Ar acquired from the anchor model storage unit 107 (arrow P4), and stores it in the likelihood vector buffer 203 (arrow P5).
- The component classification unit 205 acquires all the likelihood vectors F stored in the likelihood vector buffer 203 (arrow P6) and calculates their centroid vector G. For each component of the centroid vector G, it classifies the anchor model Ar corresponding to a component larger than a predetermined threshold into the high-frequency group and the anchor model Ar corresponding to a component smaller than the threshold into the low-frequency group, and inputs attribute information indicating the result to the frequency vector generation unit 206 (arrow P7).
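The centroid-based split into high-frequency and low-frequency groups performed by the component classification unit 205 can be sketched as follows (an illustrative Python sketch; names are assumptions):

```python
import numpy as np

def classify_components(likelihood_vectors, threshold):
    """Compute the centroid vector G of all likelihood vectors F and split
    the anchor-model indices into a high-frequency group (components of G
    at or above the threshold) and a low-frequency group (the rest)."""
    f = np.asarray(likelihood_vectors, dtype=float)
    g = f.mean(axis=0)                    # centroid vector G
    high = np.where(g >= threshold)[0]    # indices of high-frequency anchor models
    low = np.where(g < threshold)[0]      # indices of low-frequency anchor models
    return g, high, low
```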
- The frequency vector generation unit 206 acquires the likelihood vectors F stored in the likelihood vector buffer 203 (arrow P8) and generates the frequency vectors NF. For each frequency vector NF, the frequency vector generation unit 206 then calculates the high-frequency vector NFh and the low-frequency vector NFl based on the attribute information input from the component classification unit 205 and stores them in the frequency vector buffer 207 (arrow P10). This process ends when a frequency vector creation end instruction is notified from the interest section extraction unit 209 and resumes when a frequency vector creation start instruction is notified (arrow P9).
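The construction of a frequency vector NF for one second unit section, and its split into NFh and NFl, can be sketched as follows (an illustrative Python sketch; the exact normalization of the cumulative likelihood is not specified in this text, so unit-norm scaling is assumed here):

```python
import numpy as np

def frequency_vector(likelihood_vectors, start, n):
    """Frequency vector NF of one second unit section: the normalized
    cumulative likelihood of the n first unit sections starting at
    `start` (n = number of first unit sections per second unit section)."""
    window = np.asarray(likelihood_vectors[start:start + n], dtype=float)
    nf = window.sum(axis=0)
    return nf / np.linalg.norm(nf)   # normalize the cumulative likelihoods

def split_frequency_vector(nf, high_idx, low_idx):
    """Split NF into the high-frequency vector NFh and the low-frequency
    vector NFl using the attribute information from the component
    classification unit 205."""
    return nf[high_idx], nf[low_idx]
```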
- The feature point extraction unit 208 acquires the low-frequency vectors NFl from the frequency vector buffer 207 (arrow P11) and extracts the feature point indicating the second feature section using the acquired low-frequency vectors NFl and the specified time information input from the specified time acquisition unit 210 (arrow P12). In doing so, the feature point extraction unit 208 searches for the feature point Tk, which is the end time of the second feature section, while shifting the time by one second unit section at a time from the specified time T0, and extracts it.
- That is, the feature point extraction unit 208 goes back in time from the specified time T0, one second unit section at a time, within the section in which music symbolizing a competition scene is playing (the first feature section), and extracts the end time Tk of the section in which the starting-gun sound occurs (the second feature section).
- the feature point extraction unit 208 inputs the extracted feature point information to the interest section extraction unit 209 (arrow P13).
- The reference vector / threshold generation unit 204 acquires from the frequency vector buffer 207 a plurality of high-frequency vectors NFh, including the one corresponding to the specified time (arrow P17), acquires the specified time information (arrow P19), generates the reference vector NFh0, and calculates the threshold Rth. The reference vector / threshold generation unit 204 then inputs the generated reference vector NFh0 and threshold Rth to the interest section extraction unit 209 (arrow P18).
- The interest section extraction unit 209 determines whether the target time belongs to the first feature section using the high-frequency vectors acquired from the frequency vector buffer 207 (arrow P14), the reference vector NFhc and threshold Rth input from the reference vector / threshold generation unit 204 (arrow P18), and the specified time information from the specified time acquisition unit 210 (arrow P15).
- The interest section extraction unit 209, while shifting the time by one second unit section at a time from the designated time T0 toward the feature point Tk extracted by the feature point extraction unit 208, determines whether the target time belongs to the first feature section and whether the length between the target time and the specified time T0 is shorter than the preset interest section length; if both conditions hold, it sets the second unit section including the target time as part of the interest section.
- In other words, the interest section extraction unit 209 determines whether the target time is included in a section in which music or the like symbolizing a competition scene in the athletic meet is playing.
- the interest section extraction unit 209 extracts the interest section that is included in the first feature section and includes the second feature section by using the calculated first feature section and the feature point information.
- The interest section extraction unit 209 stores interest section data indicating the extracted section in the interest section storage device 105 (arrow P16).
- When extracting a plurality of interest sections corresponding to a plurality of specified times, the interest section extraction unit 209 stores the interest section data in order of specified time (for example, storing them in storage areas with lower address numbers in order of specified time).
- When the output device 106 acquires a plurality of pieces of interest section data from the interest section storage device 105, it therefore does not need to determine the order relation between the interest section data or the designated times corresponding to them, which reduces the processing load on the output device 106.
- The interest section extraction device 104 has been described above based on Embodiments 1 and 2; however, the present invention is of course not limited to the interest section extraction device 104 described in Embodiments 1 and 2.
- The example was described in which, for each first unit section (10 msec) of the audio signal included in the moving image file, a likelihood vector is generated whose components are the likelihoods, under each of a plurality of types of anchor models Ar, of a feature vector representing the features of the audio signal; the components of the likelihood vector are classified into two component groups; and the first feature section (candidate section) and the end time of the second feature section are calculated based on the components belonging to each group. However, the present invention is not limited to this.
- For example, the interest section extraction device 104 may extract an interest section based on the amount of change of a similarity vector whose components are the similarities between a feature vector generated from the audio signal included in the moving image file and vectors representing the anchor models of the plurality of types of sound segments.
- the audio data storage device 130 has been described as storing audio data corresponding to a plurality of AV contents.
- the number and type of AV contents are not particularly limited.
- the feature point Tk may be extracted while the time is advanced from the specified time T0.
- This feature point Tk corresponds to the start time of the second feature section.
- the section after the specified time T0 in the first feature section is extracted as the interest section.
- The specified time acquisition unit 210 acquires the specified time T0 input by the user using the interface device 109.
- the present invention is not limited to this.
- the specified time acquisition unit 210 may automatically acquire the specified time T0 based on the temporal change of the feature amount of each of the plurality of image data included in the moving image file.
- The designated time acquisition unit 210 may calculate shift feature amounts for each of the plurality of image data included in the moving image file by a general clustering method, and calculate the designated time T0 from points where the difference in shift feature amount between image data exceeds a predetermined value. For example, focusing on a shift feature amount representing the background image of each of the plurality of image data, a point where the shift feature amount difference between two image data adjacent on the time axis changes greatly can automatically be taken as the designated time T0.
- the present invention is not limited to this.
- The specified time may instead be given as a section determined by two times designated by the user. Examples of the two times defining this section are the start time and end time of the interest section roughly specified by the user.
- In this case, information on the two given times may be passed to the reference vector / threshold generation unit 204, and the reference vector and the threshold may be generated based on the second unit sections between the two times.
- Alternatively, the information on these two times need not be passed to the reference vector / threshold generation unit 204; instead, the midpoint of the two times may be passed to the feature point extraction unit 208 as the designated time T0.
- the specified time acquisition unit 210 may acquire the specified time automatically in addition to acquiring the specified time by user input.
- For example, using the low-frequency vectors generated by the frequency vector generation unit 206, the Euclidean distance between the low-frequency vector at the previous time and that at the current time may be calculated from the top of the data indicating the low-frequency vectors, and a time at which it exceeds a preset threshold may automatically be determined as the designated time T0.
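This automatic determination of the designated time T0 can be sketched as follows (an illustrative Python sketch; names are assumptions):

```python
import numpy as np

def auto_designated_time(low_freq_vectors, threshold):
    """Scan the low-frequency vectors from the start of the data and return
    the first time at which the Euclidean distance between the vector at
    the previous time and the vector at the current time exceeds the
    preset threshold."""
    for t in range(1, len(low_freq_vectors)):
        d = np.linalg.norm(np.asarray(low_freq_vectors[t]) -
                           np.asarray(low_freq_vectors[t - 1]))
        if d > threshold:
            return t   # automatically determined designated time T0
    return None
```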
- The anchor models Ar for the plurality of types of sound segments are created automatically from the audio data stored in advance in the audio data storage device 130 (so-called unsupervised anchor model creation).
- the present invention is not limited to this.
- Alternatively, the types of sound segments may be limited to a small number (for example, several tens of types), and the user may select the audio data corresponding to each sound segment, assign a type label to each, and create the anchor model of the corresponding sound segment from the audio data bearing the same type label (so-called supervised creation of the anchor models Ar).
- A control program consisting of program code for causing the processor of the interest section extraction device, and various circuits connected to that processor, to execute the interest section extraction processing shown in Embodiment 1 can be recorded on recording media, or distributed and circulated via various communication channels and the like. Such recording media include IC cards, hard disks, optical disks, flexible disks, and ROMs.
- The distributed control program is used by being stored in a memory or the like readable by a processor, and the functions shown in the embodiments are realized by the processor executing the control program. Part of the control program may also be transmitted, via any of various networks, to a separate program-executable device (processor), and executed in that separate device.
- Part or all of the components constituting the interest section extraction device described in the embodiments may be implemented as one or more integrated circuits (IC, LSI, etc.), and an integrated circuit (single chip) may be formed by adding other components.
- The term LSI is used here, but depending on the degree of integration it may be called IC, system LSI, super LSI, or ultra LSI. The method of circuit integration is not limited to LSI; implementation with dedicated circuitry or a general-purpose processor is also possible.
- An FPGA (Field Programmable Gate Array) that can be programmed after LSI manufacture, or a reconfigurable processor in which the connections and settings of circuit cells inside the LSI can be reconfigured, may also be used.
- If integrated-circuit technology that replaces LSI emerges as a result of progress in semiconductor technology or another derived technology, function block integration using that technology is naturally also possible. Application of biotechnology is also conceivable.
- The interest section extraction device and interest section extraction method extract an interest section in which the user is interested from the audio signal of AV content including voices, sounds in the house, sounds when going out, and the like, and are useful as technology for editing AV content.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Library & Information Science (AREA)
- Human Computer Interaction (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Acoustics & Sound (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Television Signal Processing For Recording (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
<1> Overview
The interest section extraction device according to the present embodiment generates, for each first unit section (10 msec) of the audio signal contained in a moving image file, a likelihood vector whose components are the likelihoods of a feature vector expressing the features of the audio signal with respect to each of a plurality of types of anchor models Ar, classifies the components of the likelihood vector into two component groups, and calculates the first feature section (candidate section) and the end time of the second feature section based on the components belonging to each component group.
<2> Data
The data used by the interest section extraction device according to the present embodiment is described below.
<2-1> Moving Image File
A moving image file consists of an audio signal and a plurality of image data. The audio signal has a waveform such as that shown in FIG. 2(a), and is a time series of amplitude values.
<2-2> Feature Vector
The following outlines how the feature vector M is generated from the audio signal.
<2-3> Anchor Model
The anchor models according to the present embodiment express the features of each of the 1024 types of sound segments that serve as the reference when calculating likelihoods, one anchor model being created per sound segment. Each anchor model consists of the parameters that define it.
<2-4> Likelihood Vector
The likelihood vector F has as its components the likelihoods Lr calculated for the feature vector M, which expresses the features of the audio signal, using the anchor models Ar (r = 1, 2, ..., 1024) corresponding to the respective sound segments. The likelihood vector is therefore a 1024-dimensional vector. As described above, the feature vector M is generated for each first unit section of the audio signal extracted by the audio extraction device 102.
<3> Configuration
FIG. 6 shows the video editing device 100 equipped with the interest section extraction device 104 according to the present embodiment.
<3-1> Overall Configuration
As shown in FIG. 6, the video editing device 100 includes an input device 101, a content storage device 103, an audio extraction device 102, an interest section extraction device 104, an interest section storage device 105, an output device 106, an anchor model creation device 108, an audio data storage device 130, and an interface device 109.
<3-2> Interest Section Extraction Device
The interest section extraction device 104 consists of a memory (not shown) and a processor (not shown); the processor realizes the configuration shown in FIG. 7 by executing a program read into the memory. Each component is described in detail below.
<3-2-1> Feature Vector Generation Unit
The feature vector generation unit 201 generates feature vectors from the input audio signal. It first performs acoustic analysis for each first unit section of the audio signal input from the audio extraction device 102 to calculate the power spectrum S(ω), and then generates a feature vector M (M(1), M(2), ..., M(26)) from the calculated power spectrum S(ω). The feature vector generation unit 201 thus generates 100 feature vectors M (see FIG. 3).
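The per-first-unit-section feature extraction described in <3-2-1> can be sketched as follows (an illustrative Python sketch; the pooling of the power spectrum S(ω) into 26 components by equal-width frequency bands is an assumption made here, since the exact mapping from S(ω) to M is not given in this text):

```python
import numpy as np

def feature_vectors(audio, rate=44100, frame_ms=10, n_components=26):
    """Split the audio signal into first unit sections (10 msec frames),
    compute the power spectrum of each frame, and pool it into a
    26-component feature vector M by averaging over 26 frequency bands
    (the band pooling is an illustrative choice)."""
    frame_len = rate * frame_ms // 1000
    n_frames = len(audio) // frame_len
    vectors = []
    for i in range(n_frames):
        frame = audio[i * frame_len:(i + 1) * frame_len]
        power = np.abs(np.fft.rfft(frame)) ** 2      # power spectrum S(w)
        bands = np.array_split(power, n_components)  # 26 frequency bands
        vectors.append(np.array([b.mean() for b in bands]))
    return vectors  # one 26-dim vector M per first unit section
```

With a 44100 Hz signal, one second of audio yields the 100 feature vectors mentioned in the embodiment.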
<3-2-2> Likelihood Vector Generation Unit
The likelihood vector generation unit 202 calculates the likelihood Lr of the feature vector M using the anchor model Ar of each sound segment, and generates a likelihood vector F having the calculated likelihoods Lr as its components. The likelihood vector generation unit 202 acquires the parameters constituting each anchor model Ar from the anchor model storage unit 107.
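The likelihood computation of the likelihood vector generation unit 202 can be sketched as follows (an illustrative Python sketch; modeling each anchor model Ar as a single diagonal-covariance Gaussian is an assumption made for the sketch, since the embodiment only states that each Ar is defined by its parameters):

```python
import numpy as np

def likelihood_vector(m, anchor_means, anchor_vars):
    """Likelihood vector F for one feature vector M: component r is the
    likelihood of M under anchor model Ar, modeled here (illustratively)
    as a diagonal-covariance Gaussian with the given mean and variance."""
    m = np.asarray(m, dtype=float)
    f = []
    for mu, var in zip(anchor_means, anchor_vars):
        mu, var = np.asarray(mu, float), np.asarray(var, float)
        log_l = -0.5 * np.sum(np.log(2 * np.pi * var) + (m - mu) ** 2 / var)
        f.append(np.exp(log_l))
    return np.array(f)   # 1024-dimensional in the embodiment (one component per Ar)
```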
<3-2-3> Likelihood Vector Buffer
The likelihood vector buffer 203 consists of a partial area of the memory and stores the likelihood vectors F generated by the likelihood vector generation unit 202.
<3-2-4> Component Classification Unit
In accordance with the relational expression [Eq. 1], the component classification unit 205 reads all the likelihood vectors F generated from the entire section of the audio signal out of the likelihood vector buffer 203 and calculates the centroid vector G, that is, each component of the sum of these likelihood vectors F divided by the number of first unit sections contained in the entire section.
<3-2-5> Frequency Vector Generation Unit
The frequency vector generation unit 206 generates frequency vectors NF while shifting the section used to generate each frequency vector NF by one second unit section (1 sec) at a time. As shown in FIG. 9, a second unit section corresponds to a set of a plurality of first unit sections. Each component of the frequency vector NF corresponds to the normalized cumulative likelihood of the corresponding component of the likelihood vectors F contained in the second unit section. An example of the frequency vector NF is shown in the upper part of FIG. 10.
<3-2-6> Frequency Vector Buffer
The frequency vector buffer 207 consists of a partial area of the memory and stores the low-frequency vectors NFl and the high-frequency vectors NFh generated by the frequency vector generation unit 206.
<3-2-7> Reference Vector / Threshold Generation Unit
The reference vector / threshold generation unit 204 acquires from the frequency vector buffer the high-frequency vectors NFh contained in a plurality of second unit sections, including the high-frequency vector NFh corresponding to the specified time, and calculates the reference vector NFh0. In the example of FIG. 11, the reference vector NFh0 is obtained by dividing the sum of the nine high-frequency vectors NFh contained in the second unit section corresponding to the specified time and the four second unit sections on either side of it (nine second unit sections in total) by the number of second unit sections (nine).
<3-2-8> Specified Time Acquisition Unit
The specified time acquisition unit 210 acquires information related to the specified time from the interface device 109 and inputs it to the reference vector / threshold generation unit 204, the feature point extraction unit 208, and the interest section extraction unit 209.
<3-2-9> Feature Point Extraction Unit
The feature point extraction unit 208 calculates the norm of the difference Δ(NFl(T) − NFl(T−1)) between the low-frequency vectors NFl(T) and NFl(T−1) while going back in time from the specified time T0 in 1 sec steps.
<3-2-10> Interest Section Extraction Unit
The interest section extraction unit 209 extracts the first feature section (candidate section) based on the information about the specified time input from the specified time acquisition unit 210 and the high-frequency vectors NFh.
<3-2-11> Anchor Model Storage Unit
The anchor model storage unit 107 consists of a part of the memory and stores the anchor models Ar created by the anchor model creation device 108. The anchor model storage unit 107 stores the anchor models Ar in advance, before the interest section extraction processing is performed.
<3-3> Anchor Model Creation Device
FIG. 18 shows the functional blocks of the anchor model creation device 108 according to the present embodiment. The anchor model creation device 108 creates anchor models Ar from the audio data stored in the audio data storage device 130 and stores them in the anchor model storage unit 107.
<3-3-1> Feature Vector Generation Unit
Like the feature vector generation unit 201 described in <3-2-1> above, the feature vector generation unit 301 divides the audio data acquired from the audio data storage device 130 into first unit sections, performs acoustic analysis for each first unit section to calculate the power spectrum S(ω), and generates feature vectors M from the calculated power spectrum S(ω).
<3-3-2> Feature Vector Classification Unit
Based on the number K of anchor models Ar input from the interface device 109, the feature vector classification unit 302 separates the plurality of feature vectors M into K clusters by the K-means method and calculates, for each cluster, a representative feature vector (hereinafter referred to as a cluster feature vector). Each cluster corresponds to one anchor model Ar. In the present embodiment, K = 1024.
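The K-means classification performed by the feature vector classification unit 302 can be sketched as follows (an illustrative plain-NumPy K-means loop; the initialization and iteration count are assumptions):

```python
import numpy as np

def kmeans_cluster_features(feature_vectors, k, iters=20, seed=0):
    """Classify feature vectors M into K clusters with a plain K-means
    loop and return the cluster feature vectors (centroids); each cluster
    corresponds to one anchor model Ar (K = 1024 in the embodiment)."""
    x = np.asarray(feature_vectors, dtype=float)
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]  # initial centroids
    for _ in range(iters):
        # assign each vector to its nearest centroid
        labels = np.argmin(((x[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # recompute each centroid as the mean of its assigned vectors
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return centers, labels
```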
<3-3-3> Anchor Model Generation Unit
The anchor model generation unit 303 calculates the feature appearance probability function b_Ar(M) corresponding to each anchor model Ar based on the cluster feature vector of each cluster.
<4> Operation
<4-1> Operation of the Video Editing Device
The operation of the video editing device 100 equipped with the interest section extraction device 104 according to the present embodiment is described below.
<4-2> Interest Section Extraction Processing
The interest section extraction processing is described in more detail with reference to FIG. 8.
<Modifications>
The interest section extraction device 104 according to the present invention has been described above based on Embodiments 1 and 2, but the present invention is of course not limited to the interest section extraction device 104 shown in Embodiments 1 and 2.
103 Content storage device
104 Interest section extraction device
105 Interest section storage device
106 Interest section extraction unit
107 Anchor model storage unit
108 Anchor model creation device
109 Interface device
130 Audio data storage device
201, 301 Feature vector generation unit
202 Likelihood vector generation unit
202b Frequency vector generation unit
203c Component classification unit
204 Feature point extraction unit
205 Interest section extraction unit
302 Feature vector classification unit
303 Anchor model generation unit
Claims (12)
- An interest section extraction device that extracts a user's interest section including a specified time based on an audio signal contained in a moving image file, the device comprising:
an anchor model storage means that stores in advance anchor models expressing the features of each of a plurality of types of reference sound segments;
a specified time acquisition means that acquires the specified time;
a likelihood vector generation means that, for each unit section of the audio signal, obtains the likelihoods of a feature vector expressing the features of the audio signal using the anchor models, and generates a likelihood vector having the likelihoods as its components; and
an interest section extraction means that calculates candidate sections for the interest section based on the likelihood vectors and extracts all or part of the candidate section including the specified time as the interest section.
- The interest section extraction device according to claim 1, further comprising a frequency vector generation means that, taking the unit section as a first unit section, generates a frequency vector from the N likelihood vectors generated from the audio signal of a second unit section whose length is N times (N being a natural number of 2 or more) that of the first unit section,
wherein the candidate section is calculated based on the frequency vector.
- The interest section extraction device according to claim 2, further comprising:
a component classification means that classifies the components of the frequency vector into a plurality of component groups; and
a feature section calculation means that calculates a plurality of feature sections based on each of the plurality of component groups,
wherein the candidate section is determined by the plurality of feature sections.
- The interest section extraction device according to claim 3, wherein
the component classification means generates a centroid vector from the likelihood vectors of the entire section of the audio signal and, based on the magnitude of each component of the centroid vector, classifies the components of the frequency vector into a first component group and a second component group,
the feature section calculation means calculates a first feature section based on the components of the centroid vector belonging to the first component group and calculates a second feature section based on the components of the centroid vector belonging to the second component group, and
the candidate section is determined by the first feature section and the second feature section.
- The interest section extraction device according to claim 4, wherein the interest section is a section that is included in the first feature section and contains the second feature section.
- The interest section extraction device according to claim 5, further comprising:
an interest section length acquisition means that acquires the length of the interest section preset by the user; and
a feature time extraction means that searches for and extracts a feature time contained in the second feature section while shifting the time from the specified time by one second unit section at a time,
wherein the interest section extraction means, while shifting the time from the specified time toward the feature time extracted by the feature time extraction means by one second unit section at a time, determines whether a target time belongs to the first feature section and the length between the target time and the specified time is shorter than the preset interest section length, and, when it determines that the target time belongs to the first feature section and the length between the target time and the specified time is shorter than the preset interest section length, sets the second unit section including the target time as an interest section.
- The interest section extraction device according to claim 6, wherein the moving image file corresponds to a moving image representing one piece of content.
- The interest section extraction device according to claim 7, wherein the interest section extraction means arranges a plurality of interest sections corresponding to a plurality of specified times in order of the specified times and stores them in an external storage device.
- The interest section extraction device according to claim 8, wherein the specified time acquisition means automatically acquires the specified time from within a section specified by the user, based on the temporal change of the feature amount of each of the image data contained in the moving image file and corresponding to the section specified by the user.
- An interest section extraction method for extracting a user's interest section including a specified time based on an audio signal contained in a moving image file, the method comprising:
an anchor model storing step of storing anchor models expressing the features of each of a plurality of types of reference sound segments;
a specified time acquisition step of acquiring the specified time;
a likelihood vector generation step of obtaining, for each unit section of the audio signal, the likelihoods of a feature vector expressing the features of the audio signal using the anchor models, and generating a likelihood vector having the likelihoods as its components; and
an interest section extraction step of calculating candidate sections for the interest section based on the likelihood vectors and extracting all or part of the candidate section including the specified time as the interest section.
- A program for causing a computer to realize interest section extraction processing that extracts a user's interest section including a specified time based on an audio signal contained in a moving image file, the interest section extraction processing comprising:
an anchor model storing step of storing anchor models expressing the features of each of a plurality of types of reference sound segments;
a specified time acquisition step of acquiring the specified time;
a likelihood vector generation step of obtaining, for each unit section of the audio signal, the likelihoods of a feature vector expressing the features of the audio signal using the anchor models, and generating a likelihood vector having the likelihoods as its components; and
an interest section extraction step of calculating candidate sections for the interest section based on the likelihood vectors and extracting all or part of the candidate section including the specified time as the interest section.
- An integrated circuit for interest section extraction that extracts a user's interest section including a specified time based on an audio signal contained in a moving image file, the integrated circuit comprising:
an anchor model storage unit that stores in advance anchor models expressing the features of each of a plurality of types of reference sound segments;
a specified time acquisition unit that acquires the specified time;
a likelihood vector generation unit that, for each unit section of the audio signal, obtains the likelihoods of a feature vector expressing the features of the audio signal using the anchor models, and generates a likelihood vector having the likelihoods as its components; and
an interest section extraction unit that calculates candidate sections for the interest section based on the likelihood vectors and extracts all or part of the candidate section including the specified time as the interest section.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/582,206 US8942540B2 (en) | 2011-01-05 | 2011-10-28 | Interesting section extracting device, interesting section extracting method |
CN201180012516.2A CN102782750B (zh) | 2011-01-05 | 2011-10-28 | 兴趣区间抽取装置、兴趣区间抽取方法 |
JP2012551746A JP5658285B2 (ja) | 2011-01-05 | 2011-10-28 | 興味区間抽出装置、興味区間抽出方法 |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2011000839 | 2011-01-05 | ||
JP2011-000839 | 2011-01-05 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2012093430A1 true WO2012093430A1 (ja) | 2012-07-12 |
Family
ID=46457300
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2011/006031 WO2012093430A1 (ja) | 2011-01-05 | 2011-10-28 | 興味区間抽出装置、興味区間抽出方法 |
Country Status (4)
Country | Link |
---|---|
US (1) | US8942540B2 (ja) |
JP (1) | JP5658285B2 (ja) |
CN (1) | CN102782750B (ja) |
WO (1) | WO2012093430A1 (ja) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102789780A (zh) * | 2012-07-14 | 2012-11-21 | 福州大学 | 基于谱时幅度分级向量辨识环境声音事件的方法 |
CN114255741A (zh) * | 2022-02-28 | 2022-03-29 | 腾讯科技(深圳)有限公司 | 重复音频检测方法、设备、存储介质 |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012164818A1 (ja) * | 2011-06-02 | 2012-12-06 | パナソニック株式会社 | 興味区間特定装置、興味区間特定方法、興味区間特定プログラム、及び、興味区間特定集積回路 |
US9544704B1 (en) * | 2015-07-16 | 2017-01-10 | Avaya Inc. | System and method for evaluating media segments for interestingness |
US11341185B1 (en) * | 2018-06-19 | 2022-05-24 | Amazon Technologies, Inc. | Systems and methods for content-based indexing of videos at web-scale |
CN111107442B (zh) * | 2019-11-25 | 2022-07-12 | 北京大米科技有限公司 | 音视频文件的获取方法、装置、服务器及存储介质 |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2000298498A (ja) * | 1999-03-11 | 2000-10-24 | Fuji Xerox Co Ltd | オーディオ・ビジュアル記録物をセグメント化する方法およびコンピュータ記憶媒体、並びにコンピュータシステム |
JP2002140712A (ja) * | 2000-07-14 | 2002-05-17 | Sony Corp | Av信号処理装置および方法、プログラム、並びに記録媒体 |
JP2005331940A (ja) * | 2004-05-07 | 2005-12-02 | Mitsubishi Electric Research Laboratories Inc | マルチメディア中の事象を検出する方法 |
JP2008022103A (ja) * | 2006-07-11 | 2008-01-31 | Matsushita Electric Ind Co Ltd | テレビ番組動画像ハイライト抽出装置及び方法 |
JP2008175955A (ja) * | 2007-01-17 | 2008-07-31 | Toshiba Corp | インデキシング装置、方法及びプログラム |
JP2008185626A (ja) * | 2007-01-26 | 2008-08-14 | Toshiba Corp | ハイライトシーン検出装置 |
WO2011033597A1 (ja) * | 2009-09-19 | 2011-03-24 | 株式会社 東芝 | 信号分類装置 |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2960939B2 (ja) | 1989-08-24 | 1999-10-12 | 日本電信電話株式会社 | シーン抽出処理方法 |
US5758257A (en) * | 1994-11-29 | 1998-05-26 | Herz; Frederick | System and method for scheduling broadcast of and access to video programs and other data using customer profiles |
JP3955418B2 (ja) | 1999-08-17 | 2007-08-08 | 株式会社日立国際電気 | 動画像編集装置 |
US7302451B2 (en) | 2004-05-07 | 2007-11-27 | Mitsubishi Electric Research Laboratories, Inc. | Feature identification of events in multimedia |
CN100570712C (zh) * | 2005-12-13 | 2009-12-16 | 浙江大学 | 基于锚模型空间投影序数比较的快速说话人确认方法 |
JP5088030B2 (ja) * | 2007-07-26 | 2012-12-05 | ヤマハ株式会社 | 演奏音の類似度を評価する方法、装置およびプログラム |
JP5206378B2 (ja) * | 2008-12-05 | 2013-06-12 | ソニー株式会社 | 情報処理装置、情報処理方法、及びプログラム |
-
2011
- 2011-10-28 JP JP2012551746A patent/JP5658285B2/ja active Active
- 2011-10-28 CN CN201180012516.2A patent/CN102782750B/zh active Active
- 2011-10-28 US US13/582,206 patent/US8942540B2/en active Active
- 2011-10-28 WO PCT/JP2011/006031 patent/WO2012093430A1/ja active Application Filing
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2000298498A (ja) * | 1999-03-11 | 2000-10-24 | Fuji Xerox Co Ltd | オーディオ・ビジュアル記録物をセグメント化する方法およびコンピュータ記憶媒体、並びにコンピュータシステム |
JP2002140712A (ja) * | 2000-07-14 | 2002-05-17 | Sony Corp | Av信号処理装置および方法、プログラム、並びに記録媒体 |
JP2005331940A (ja) * | 2004-05-07 | 2005-12-02 | Mitsubishi Electric Research Laboratories Inc | マルチメディア中の事象を検出する方法 |
JP2008022103A (ja) * | 2006-07-11 | 2008-01-31 | Matsushita Electric Ind Co Ltd | テレビ番組動画像ハイライト抽出装置及び方法 |
JP2008175955A (ja) * | 2007-01-17 | 2008-07-31 | Toshiba Corp | インデキシング装置、方法及びプログラム |
JP2008185626A (ja) * | 2007-01-26 | 2008-08-14 | Toshiba Corp | ハイライトシーン検出装置 |
WO2011033597A1 (ja) * | 2009-09-19 | 2011-03-24 | 株式会社 東芝 | 信号分類装置 |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102789780A (zh) * | 2012-07-14 | 2012-11-21 | 福州大学 | 基于谱时幅度分级向量辨识环境声音事件的方法 |
CN114255741A (zh) * | 2022-02-28 | 2022-03-29 | 腾讯科技(深圳)有限公司 | 重复音频检测方法、设备、存储介质 |
Also Published As
Publication number | Publication date |
---|---|
JP5658285B2 (ja) | 2015-01-21 |
US20120321282A1 (en) | 2012-12-20 |
CN102782750A (zh) | 2012-11-14 |
JPWO2012093430A1 (ja) | 2014-06-09 |
US8942540B2 (en) | 2015-01-27 |
CN102782750B (zh) | 2015-04-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10262239B2 (en) | Video content contextual classification | |
US10679063B2 (en) | Recognizing salient video events through learning-based multimodal analysis of visual features and audio-based analytics | |
JP5658285B2 (ja) | Interest section extraction device and interest section extraction method | |
US8750681B2 (en) | Electronic apparatus, content recommendation method, and program therefor | |
JP5691289B2 (ja) | Information processing apparatus, information processing method, and program |
JP5533861B2 (ja) | Display control apparatus, display control method, and program |
US8948515B2 (en) | Method and system for classifying one or more images | |
US8892497B2 (en) | Audio classification by comparison of feature sections and integrated features to known references | |
CN109691124B (zh) | 用于自动生成视频亮点的方法和系统 | |
WO2012020667A1 (ja) | Information processing apparatus, information processing method, and program |
JP2011223287A (ja) | Information processing apparatus, information processing method, and program |
JP2011215963A (ja) | Electronic apparatus, image processing method, and program |
JP6039577B2 (ja) | Audio processing device, audio processing method, program, and integrated circuit |
JP5723446B2 (ja) | Interest section identification device, interest section identification method, interest section identification program, and interest section identification integrated circuit |
JP2013126233A (ja) | Video processing apparatus, method, and program |
JP2010161722A (ja) | Data processing apparatus, data processing method, and program |
US9113269B2 (en) | Audio processing device, audio processing method, audio processing program and audio processing integrated circuit | |
TWI780333B (zh) | Method for dynamically processing and playing multimedia content, and multimedia playback device |
JP5254900B2 (ja) | Video reconstruction method, video reconstruction device, and video reconstruction program |
Hauptmann et al. | Informedia@ trecvid2008: Exploring new frontiers |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WWE | Wipo information: entry into national phase | Ref document number: 201180012516.2 Country of ref document: CN |
WWE | Wipo information: entry into national phase | Ref document number: 2012551746 Country of ref document: JP |
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 11855105 Country of ref document: EP Kind code of ref document: A1 |
WWE | Wipo information: entry into national phase | Ref document number: 13582206 Country of ref document: US |
NENP | Non-entry into the national phase | Ref country code: DE |
122 | Ep: pct application non-entry in european phase | Ref document number: 11855105 Country of ref document: EP Kind code of ref document: A1 |