WO2015133782A1 - Content analysis method and device - Google Patents
Content analysis method and device
- Publication number
- WO2015133782A1 (PCT/KR2015/002014)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio content
- section
- content
- feature value
- information
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/93—Discriminating between voiced and unvoiced parts of speech signals
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/57—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/434—Disassembling of a multiplex stream, e.g. demultiplexing audio and video streams, extraction of additional data from a video stream; Remultiplexing of multiplex streams; Extraction or processing of SI; Disassembling of packetised elementary stream
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4394—Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
Definitions
- the present invention is directed to a method and device for analyzing content.
- the multimedia content may include audio content and video content.
- the device may analyze the multimedia content or summarize the content by analyzing the audio content.
- the present invention relates to a method and a device for analyzing content, and to a method of classifying audio content by sections and selectively analyzing the classified audio content by sections.
- the content analysis performance may be improved by selectively analyzing the content according to the feature value of the audio content for each section.
- FIG. 1 is a block diagram illustrating an internal structure of a device for analyzing content according to an exemplary embodiment.
- FIG. 2 is a block diagram illustrating an internal configuration of the primary analyzer 200 of the AME 110 according to an exemplary embodiment.
- FIG. 3 is a block diagram illustrating an internal structure of a feature extractor according to an exemplary embodiment.
- FIG. 4 is a block diagram illustrating an internal structure of the secondary analyzer 400 according to an embodiment.
- FIG. 5 is a block diagram illustrating an internal structure of a content summary unit according to an exemplary embodiment.
- FIG. 6 is a flowchart illustrating a method of analyzing audio content according to an exemplary embodiment.
- FIG. 7 is a flowchart illustrating a method of determining a topic of audio content according to an exemplary embodiment.
- FIG. 8 is a flowchart illustrating a method of generating sports highlight information according to an exemplary embodiment.
- FIG. 9 is an exemplary diagram illustrating an example of sports highlight information according to an embodiment.
- FIG. 10 is a flowchart illustrating a method of generating viewing rating information of content according to an exemplary embodiment.
- FIGS. 11 and 12 are block diagrams illustrating internal structures of devices for analyzing content, according to exemplary embodiments.
- A method of analyzing audio content includes: extracting feature values of the audio content; classifying the audio content by sections based on the extracted feature values; selecting at least one section for analyzing the audio content based on a class to which the audio content of each section belongs; and performing analysis on the audio content of the selected section.
- The classifying may include comparing a feature value of the audio content with feature values stored in a database that holds feature-value information for at least one piece of audio content belonging to each class.
- The extracting of the feature values may include: decomposing the audio content into at least one basis function; selecting, for each section of the decomposed audio content, at least one of the basis functions as a dominant basis function; and extracting a per-section basis function as a feature value of the audio content by using the selected dominant basis functions.
- The extracting of the feature values may include: extracting at least one instantaneous feature value in a predetermined section of the audio content; and extracting a statistical feature value from the at least one instantaneous feature value belonging to the predetermined section.
- The performing of the analysis may include: selecting a section of the audio content belonging to a voice class; and performing at least one of speech recognition and speaker recognition on the audio content of the selected section.
- The performing of the analysis may include determining a topic for the audio content of a predetermined section by using the speech recognition or speaker recognition result.
- The performing of the analysis may include: selecting a section of the audio content belonging to an environmental noise class; and detecting an acoustic event included in the audio content of each selected section.
- The performing of the analysis may include: performing analysis on video content corresponding to the selected section; and correcting the analysis result of the audio content by using the analysis result of the video content.
- A device may include: a receiver configured to receive audio content; and a controller configured to extract a feature value of the audio content, classify the audio content by sections based on the extracted feature value, select at least one section for analyzing the audio content based on a class to which the audio content of each section belongs, and analyze the audio content of the selected section.
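- For illustration, the claimed two-stage flow can be sketched as follows; every name in this sketch (Section, extract_features, classify, analyzers) is a hypothetical placeholder rather than anything specified by the patent:

```python
# Illustrative two-stage pipeline for the claimed method. All function and
# class names here are hypothetical placeholders, not APIs from the patent.
from dataclasses import dataclass

@dataclass
class Section:
    start: float          # seconds
    end: float            # seconds
    label: str            # e.g. "voice", "background", "environmental_noise"

def analyze_audio(samples, sr, extract_features, classify, analyzers):
    """First stage: classify fixed-size sections of the audio; second stage:
    run only the analyzers registered for each section's class."""
    sections, results = [], []
    hop = 5 * sr                                   # 5-second unit sections
    for i in range(0, len(samples), hop):
        chunk = samples[i:i + hop]
        label = classify(extract_features(chunk))  # primary analysis
        sections.append(Section(i / sr, min(i + hop, len(samples)) / sr, label))
    for sec in sections:                           # secondary analysis
        for analyzer in analyzers.get(sec.label, []):
            results.append((sec, analyzer(samples, sr, sec)))
    return sections, results
```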
- The term "part" or "unit" refers to a hardware component such as an FPGA or ASIC, or to software, and a "part" plays certain roles. However, a "part" is not limited to software or hardware.
- A "unit" may reside in an addressable storage medium and may be configured to run on one or more processors.
- a “part” refers to components such as software components, object-oriented software components, class components, and task components, processes, functions, properties, procedures, Subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays and variables.
- the functionality provided within the components and “parts” may be combined into a smaller number of components and “parts” or further separated into additional components and “parts”.
- FIG. 1 is a block diagram illustrating an internal structure of a device for analyzing content according to an exemplary embodiment.
- the device 100 may be a terminal device that can be used by a user.
- the device 100 may be a smart television (TV), an ultra high definition (UHD) TV, a monitor, a personal computer (PC), a notebook computer, a mobile phone, a tablet PC, a navigation terminal, a smartphone, a personal digital assistant (PDA), a portable multimedia player (PMP), or a digital broadcast receiver.
- the device 100 may analyze audio content in two steps.
- the device 100 may extract the characteristic of the audio content and classify the audio content for each section according to the extracted characteristic.
- the device 100 may select a section of audio content to be analyzed according to a class to which each section belongs, and may analyze the selected audio content section.
- the device 100 may select the audio content section according to the analysis method.
- the device 100 may also analyze the video content corresponding to the audio content.
- the device 100 may finally determine analysis information about the multimedia content including the video content and the audio content by comparing the analysis result of the video content with the analysis result of the audio content.
- the analysis information about the content may include, for example, keyword information included in the content for each section, speaker information, and information on whether to include an acoustic event having a predetermined characteristic.
- the device 100 may include an audio mining engine (AME) 110, a video mining engine (VME) 120, a determination module 130, and a content summary unit 140.
- the AME 110 and the VME 120 may analyze audio content and video content, respectively.
- the AME 110 may analyze the audio content in two stages and output analysis information about the audio content.
- the VME 120 may output analysis information about video content.
- the determination module 130 may compare the content analysis information output by the AME 110 and the VME 120 with each other to finally determine the analysis information for each section of the content.
- the determination module 130 may finally determine the analysis information for each section of the multimedia content by correcting the content analysis information output by the AME 110 using the content analysis information output by the VME 120.
- the analysis information for each section of the content may include, for example, keyword information included in a predetermined section, speaker information, and information about whether an acoustic event having a predetermined feature is included, and may further include, as location information of the section, information about the start point and end point of the corresponding section.
- the determination module 130 may output content analysis information in real time by continuously performing analysis on content input in real time.
- the content summary unit 140 may generate and output summary information, topics, highlight scene information, rating information, and the like, for the content of a predetermined section based on the finally determined content analysis information.
- the content summary unit 140 may generate summary information, rating information, and the like, by using keyword information and speaker information included in each section included in the content analysis information.
- the device 100 may output content analysis information or summary information about the content being viewed in real time or the content selected by the user.
- the user may view the content summary information output by the device 100 to grasp a summary, key topics, and the like of the content currently being viewed or selected. Therefore, according to the content analysis method according to an embodiment, the user can check the summary and main information of the content without viewing the entire content, which increases user convenience.
- the device 100 may store the content analysis information generated by the determination module 130 and the summary information generated by the content summary unit 140 in a storage space for future retrieval.
- the device 100 may use content analysis and summary information to search for content. For example, the device 100 may search for a content including a specific keyword using previously stored content analysis and summary information.
- FIG. 2 is a block diagram illustrating an internal configuration of the primary analyzer 200 of the AME 110 according to an exemplary embodiment.
- the primary analyzer 200 of FIG. 2 corresponds to the primary analyzer 111 of FIG. 1.
- the primary analyzer 200 may classify the audio content by extracting a feature of the audio content by a predetermined section and determining a class to which the audio content belongs to each section according to the extracted feature.
- the primary analyzer 200 may output class information of audio content determined for each section.
- the primary analyzer 200 may include a signal input unit 210, a sound source separator 220, a feature extractor 230, a classifier 240, and a section divider 250.
- the signal input unit 210 may receive the audio content and transmit the audio content to the source separation unit 220.
- the sound source separating unit 220 may separate the input audio content for each sound source.
- the audio content may include various sounds for each sound source.
- the audio content may include various sounds such as a human voice, a musical instrument sound, and a horn sound.
- the sound source separator 220 may selectively perform sound source separation of the audio content, and the primary analyzer 200 may classify the audio content for each sound source separated by the sound source separator 220.
- the feature extractor 230 may extract a feature of the audio content.
- the feature extractor 230 may extract the feature of the audio content for each sound source.
- the feature extractor 230 may extract features of the audio content by extracting instantaneous features and extracting statistical features.
- Instantaneous feature extraction may be performed on a very short period of the audio content, and statistical feature extraction may be performed by obtaining statistical values from a certain amount of instantaneous features.
- Statistical features may be extracted as, for example, the mean, standard deviation, skewness, kurtosis, and first- and second-order derivatives of a certain amount of instantaneous feature values.
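- As an illustration of this step, the sketch below computes the statistics named above over a matrix of instantaneous feature values using NumPy and SciPy; the exact statistics and windowing used by the patent may differ.

```python
import numpy as np
from scipy import stats

def statistical_features(inst):
    """inst: (n_frames, n_features) array of instantaneous feature values
    for one section. Returns a single section-level statistics vector."""
    d1 = np.diff(inst, n=1, axis=0)   # first-order derivative (delta)
    d2 = np.diff(inst, n=2, axis=0)   # second-order derivative (delta-delta)
    return np.concatenate([
        inst.mean(axis=0), inst.std(axis=0),
        stats.skew(inst, axis=0), stats.kurtosis(inst, axis=0),
        d1.mean(axis=0), d2.mean(axis=0),
    ])
```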
- the method of extracting the instantaneous feature of the audio content by the feature extractor 230 may be performed using two methods.
- the first method extracts instantaneous features by extracting perceptual features of the audio content.
- Perceptual features include, for example, the spectral centroid, spectral flatness, spectral flux, spectral rolloff, and zero crossing rate of the acoustic signal in the audio content.
- the device 100 may extract features of the audio content by using these perceptual features together with MFCC (Mel-Frequency Cepstral Coefficients), which is one of the methods of extracting features of audio content.
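- For illustration, the sketch below computes the named perceptual features plus MFCCs with librosa. Since librosa exposes no direct spectral-flux feature, onset strength is used here as a common proxy; that substitution is an assumption, not the patent's method.

```python
import numpy as np
import librosa

def perceptual_features(y, sr):
    """Frame-level perceptual features plus MFCCs, stacked per frame."""
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    flatness = librosa.feature.spectral_flatness(y=y)
    rolloff  = librosa.feature.spectral_rolloff(y=y, sr=sr)
    zcr      = librosa.feature.zero_crossing_rate(y)
    # librosa has no direct spectral-flux feature; onset strength is a
    # commonly used proxy for it.
    flux     = librosa.onset.onset_strength(y=y, sr=sr)[np.newaxis, :]
    mfcc     = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    feats = (centroid, flatness, rolloff, zcr, flux, mfcc)
    n = min(f.shape[1] for f in feats)             # align frame counts
    return np.vstack([f[:, :n] for f in feats]).T  # (n_frames, n_features)
```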
- the second method is to extract instantaneous features by extracting intrinsic properties of audio content.
- the feature of the audio content may be extracted by obtaining a basis function for the audio content signal. The second method will be described in more detail later with reference to FIG. 3.
- the feature extractor 230 may extract features using only one of the two methods, or may extract features of the audio content by combining the first method and the second method. For example, the feature extractor 230 may determine the final feature value of the audio content by comparing the feature value from the first method with the feature value from the second method.
- the feature extractor 230 may extract instantaneous features in a predetermined section by using the above-described methods of extracting instantaneous features of audio content, and may extract statistical features from the plurality of instantaneous features of the predetermined section.
- the feature extractor 230 may output statistical features of the audio content acquired for each section to the classifier 240.
- the classifier 240 may classify the audio content of each section according to the feature value for each section output by the feature extractor 230.
- the classifier 240 may classify the audio content for each unit section of a preset size.
- the classifier 240 may classify each section of the audio content into one of three classes: voice, background music, and environmental noise.
- the section including the speaker's voice may be classified as a voice.
- the section including the musical instrument and the music sound may be classified as a background sound.
- a section that includes noises that may be generated in a predetermined environment or that does not correspond to voice or background sounds may be classified as environmental noise.
- Models of audio content feature values that may be classified as voice, background sound, environmental noise, and the like may be stored in the database 260 in advance.
- the classifier 240 may classify the audio content by sections by comparing the feature values stored in the database 260 for voice, background sound, environmental noise, and the like with the feature values extracted by the feature extractor 230.
- When the similarity between a feature value stored in the database 260 and the feature value extracted by the feature extractor 230 is at or above a predetermined level, the classifier 240 may determine the class to which the stored feature value belongs as the class of the audio content of the current section.
- the classifier 240 may update the audio content feature values stored in the database 260 by using the feature value of the audio content used in class determination.
- the classifier 240 may store the feature value of the currently extracted audio content in the database 260 in association with the class to which the feature value belongs.
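- A minimal sketch of this database comparison and update, assuming cosine similarity and an in-memory dict as the "database"; the patent does not specify the similarity measure, so both choices are illustrative.

```python
import numpy as np

def classify_section(feature, prototypes, threshold=0.8):
    """prototypes: dict mapping class name -> list of stored feature vectors.
    Returns the class whose stored feature is most similar (cosine), or
    None when no similarity reaches the threshold."""
    best_class, best_sim = None, threshold
    for cls, vectors in prototypes.items():
        for v in vectors:
            sim = np.dot(feature, v) / (np.linalg.norm(feature) * np.linalg.norm(v))
            if sim >= best_sim:
                best_class, best_sim = cls, sim
    return best_class

def update_database(prototypes, cls, feature):
    """Store the feature under the class it was assigned to (database update)."""
    prototypes.setdefault(cls, []).append(feature)
```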
- voice, background sound, environmental noise, and the like are merely examples, and audio content may be classified into various classes.
- the section divider 250 may re-determine the class of the audio content over longer sections for the audio content of each section classified by the classifier 240. For example, the section divider 250 may reclassify, as one class over a 10-second section, a plurality of audio content sections classified in units of 1 to 5 seconds. If only one second of the 10-second section is classified as environmental noise and the remaining nine seconds are classified as voice, the section divider 250 may reclassify the environmental noise section as voice.
- In this way, errors that may occur when the audio content is classified in short sections can be corrected.
- The length of the longer section described above may be a preset value or a value that is variably determined according to the characteristics of the audio content.
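- The reclassification over longer sections can be sketched as a majority vote over fixed windows of short-section labels; the 10-second window mirrors the example above, while the voting rule itself is an assumption.

```python
from collections import Counter

def smooth_labels(labels, window=10):
    """labels: per-second class labels. Reassign each `window`-second block
    to its majority class, correcting isolated misclassifications."""
    smoothed = list(labels)
    for i in range(0, len(labels), window):
        block = labels[i:i + window]
        majority = Counter(block).most_common(1)[0][0]
        smoothed[i:i + window] = [majority] * len(block)
    return smoothed

# e.g. 9 seconds of "voice" and 1 second of "noise" become 10 s of "voice":
# smooth_labels(["voice"] * 6 + ["noise"] + ["voice"] * 3)  ->  ["voice"] * 10
```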
- the section divider 250 may output class information of the audio content for the section corresponding to the start point and the end point.
- the class information may include a start point, an end point, and labeling information.
- the start point and end point mean the location of the reclassified long section described above.
- the class information for consecutive content sections classified into the same class may be output as one class information so that the same labeling information does not continuously exist.
- the labeling information is information on a class to which the audio content belongs, and may include, for example, one of voice, background sound, and environmental noise.
- FIG. 3 is a block diagram illustrating an internal structure of a feature extractor according to an exemplary embodiment.
- the feature extractor 230 shown in FIG. 3 extracts instantaneous features by extracting intrinsic features of the audio content according to the second method described above. According to the second method, a basis function, which is an intrinsic characteristic of the content, can be extracted as an instantaneous feature.
- the feature extractor 230 may include a signal decomposition unit 231, a dominant basis function selector 232, a basis function generator 233, and a statistical feature determiner 234.
- the signal decomposition unit 231 may decompose the audio content signal using preset basis functions.
- the signal decomposition unit 231 may decompose the audio content signal by using a sparse coding method. As a result of the decomposition, the audio content signal may be decomposed into at least one basis function.
- the dominant basis function selector 232 may determine a dominant basis function for the audio content signal decomposed by the signal decomposition unit 231.
- the dominant basis function selector 232 may select, as dominant basis functions, at least one of the at least one basis function resulting from the decomposition of the audio content described above.
- the basis function generator 233 may generate a feature basis function by synthesizing the basis functions selected as dominant.
- the basis function may be output as an instantaneous feature value of the audio content.
- the statistical feature determiner 234 may output statistical features obtained from a plurality of instantaneous feature values of the audio content. Extraction of statistical features may be performed by obtaining statistical values from a certain amount of instantaneous features.
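- For illustration, the sketch below uses scikit-learn's DictionaryLearning as one concrete sparse-coding method: the learned dictionary atoms stand in for basis functions, the most strongly activated atoms are treated as dominant, and their combination is returned as the section feature. The patent names sparse coding but not this library or this selection rule.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

def basis_function_features(frames, n_atoms=32, n_dominant=8):
    """frames: (n_frames, frame_len) windowed audio samples of one section.
    Learns a dictionary of basis functions by sparse coding, keeps the most
    strongly activated (dominant) atoms, and returns their combination."""
    dico = DictionaryLearning(n_components=n_atoms,
                              transform_algorithm='lasso_lars',
                              random_state=0)
    codes = dico.fit_transform(frames)             # sparse activations
    atoms = dico.components_                       # learned basis functions
    # dominant atoms = largest mean absolute activation across the section
    dominant = np.argsort(np.abs(codes).mean(axis=0))[-n_dominant:]
    return atoms[dominant].sum(axis=0)             # synthesized section feature
```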
- FIG. 4 is a block diagram illustrating an internal structure of the secondary analyzer 400 according to an embodiment.
- the secondary analyzer 400 of FIG. 4 corresponds to the secondary analyzer 112 of FIG. 1.
- the secondary analyzer 400 may perform secondary analysis using class information and audio content output by the primary analyzer 111.
- the secondary analyzer 400 may select an audio content section to be analyzed using class information and perform a secondary analysis on the selected audio content section.
- the secondary analyzer 400 may include a section selector 410, a signal analyzer 420, and a tagging unit 430.
- the section selector 410 may select a section of audio content to be analyzed based on the class information. For example, when extracting a keyword and analyzing audio content based on the extracted keyword, it is preferable to analyze an audio content section including a voice signal from which the keyword can be extracted. Therefore, the section selector 410 may select the audio content section classified as voice.
- the signal analyzer 420 may analyze the audio content section selected by the section selector 410. For example, the signal analyzer 420 may extract a keyword included in the selected section by performing voice recognition on the selected section. The extracted keyword may be output as analysis information of the content.
- the tagging unit 430 may tag the analysis information output by the signal analyzer 420 with respect to the section of the corresponding audio content. For example, the tagging unit 430 may output the keyword information extracted by the signal analysis unit 420 as analysis information on a corresponding audio content section.
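- A minimal sketch of the section selector and tagging unit described above, assuming primary-analysis output as (start, end, label) tuples and a hypothetical recognize_speech function; keyword extraction against a reserved-word list follows the example in the text.

```python
def secondary_analysis(sections, samples, sr, recognize_speech, reserved_words):
    """sections: (start_sec, end_sec, label) tuples from the primary analysis.
    Runs speech recognition only on sections labeled "voice" and tags each
    with the reserved-word keywords found in it. `recognize_speech` is a
    hypothetical recognizer returning a transcript string."""
    tags = []
    for start, end, label in sections:
        if label != "voice":                      # section selector
            continue
        chunk = samples[int(start * sr):int(end * sr)]
        words = recognize_speech(chunk, sr).lower().split()
        keywords = [w for w in words if w in reserved_words]
        if keywords:                              # tagging unit output
            tags.append({"start": start, "end": end, "keywords": keywords})
    return tags
```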
- FIG. 5 is a block diagram illustrating an internal structure of a content summary unit according to an exemplary embodiment.
- the content summary unit 500 of FIG. 5 corresponds to the content summary unit 140 of FIG. 1.
- the content summary unit 500 may generate and output content information about the audio content using at least one analysis information tagged for each section of the audio content.
- the content summary unit 500 may include a scene detector 510, a scene divider 520, a statistics acquirer 530, and a content information generator 540.
- the scene according to an embodiment is a section including one event occurring at a continuous time in one space, and the scene may be divided according to a contextual meaning.
- the scene detector 510 may detect a scene of the content. For example, the scene detector 510 may detect a scene based on the similarity of video signals between two consecutive frames. Also, the scene detector 510 may detect that the scene is switched to a new scene by detecting an unvoiced section from the audio signal. The scene detector 510 may detect a scene by determining a scene change point, which is a point at which the scene is switched. In the case of multimedia content, the scene detector 510 may detect a scene change point in consideration of audio content and video content. For example, the scene detector 510 may finally detect the scene change point of the multimedia content by comparing the scene change point detected with respect to the audio content and the scene change point detected with respect to the video content.
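- As an illustration of scene-change detection from video-frame similarity, the sketch below compares color histograms of consecutive frames with OpenCV; the threshold and histogram settings are assumptions, and the audio-side (unvoiced-section) detection is omitted.

```python
import cv2

def video_scene_changes(path, threshold=0.6):
    """Flag a candidate scene change when the color-histogram correlation
    between consecutive frames drops below `threshold`."""
    cap, prev_hist, changes, idx = cv2.VideoCapture(path), None, [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < threshold:
                changes.append(idx)               # scene change point (frame index)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return changes
```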
- the content summary unit 500 may detect a scene included in the content and perform content summary on a content section of each scene.
- the scene dividing unit 520 may determine the boundary of the final scene by merging an isolated shot, which does not belong to any scene as a result of the scene detection, into the preceding or following scene.
- a shot means a section captured at one time without a screen change, and a scene may be composed of a plurality of shots.
- the scene dividing unit 520 may detect a similarity between the isolated shot and shots belonging to the preceding or following scene, and may include the isolated shot in a scene detected by the scene detector 510 according to the detected similarity.
- the statistics acquisition unit 530 may obtain content analysis information corresponding to the scene section finally determined by the scene dividing unit 520, and obtain statistical information for generating content summary information from the analysis information.
- the content analysis information refers to information output by the determination module 130 described above.
- the statistics acquirer 530 may obtain the keywords included in at least one piece of analysis information corresponding to one scene section, and may obtain the appearance frequency of each keyword.
- the statistics acquirer 530 may determine whether the keywords acquired in at least one scene section correspond to words registered in advance as reserved words, and may obtain the appearance frequency of each reserved word. A word pre-registered as a reserved word may be stored together with information on the topic that may be determined according to its appearance frequency.
- the content information generator 540 may generate content summary information according to the statistical information obtained by the statistics acquirer 530. For example, the content information generator 540 may finally determine the topic related to each keyword or reserved word according to the frequency of the keyword or reserved word.
- For example, when reserved words related to weather appear frequently, the content information generator 540 may determine the topic as a weather forecast.
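- A minimal sketch of topic determination from reserved-word frequencies; the reserved-word-to-topic table and the vote threshold here are hypothetical.

```python
from collections import Counter

# Hypothetical reserved-word -> topic table; the actual mapping would be
# pre-registered together with the reserved words, as described above.
RESERVED_TOPICS = {"temperature": "weather forecast", "rain": "weather forecast",
                   "shower": "weather forecast", "goal": "sports"}

def determine_topic(scene_keywords, min_count=3):
    """Count topic votes from the keywords of one scene and return the most
    frequent topic, provided it appears often enough."""
    votes = Counter(RESERVED_TOPICS[w] for w in scene_keywords
                    if w in RESERVED_TOPICS)
    if votes and votes.most_common(1)[0][1] >= min_count:
        return votes.most_common(1)[0][0]
    return None
```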
- FIG. 6 is a flowchart illustrating a method of analyzing audio content according to an exemplary embodiment.
- the device 100 may extract a feature value from audio content to be analyzed.
- the device 100 may extract the feature value of the audio content by extracting the instantaneous feature value and extracting a statistical feature value for a predetermined section from the instantaneous feature value.
- the device 100 may classify the audio content for each section based on the feature value of the audio content extracted in operation S601.
- the device 100 may determine a class to which the audio content belongs by comparing the feature value of the audio content stored in the database 260 with the feature value of the audio content extracted in step S601.
- the class may be determined according to the feature value of the audio content. For example, the audio content may be classified into one of voice, background sound, and environmental noise.
- the device 100 may obtain a feature value of the audio content that is most similar to the feature value of the extracted audio content among values stored in the database 260.
- the device 100 may determine a class to which the feature value of the database 260 determined to be the most similar belongs as a class of audio content of the current section.
- the device 100 may output class information of audio content of each section as a result of the primary analysis.
- the class information may include location information of each section and information about a class to which each section belongs.
- the device 100 may select a section for analyzing the audio content based on a class to which the audio content of each section belongs.
- the device 100 may select a section for analyzing audio content based on class information of each section as a result of the first analysis in order to perform the second analysis.
- the device 100 may select a section according to a method of analyzing content. For example, when the device 100 wants to extract a keyword from content, the device 100 may select a content section belonging to a voice class including content capable of speech recognition.
- the device 100 may improve analysis performance of audio content by selectively analyzing the audio content according to the characteristics of the audio content.
- In speech recognition technology for keyword extraction, the algorithm is written on the premise that the input signal is a speech signal; thus, speech recognition may be performed optimally on audio content that includes a speech signal.
- the device 100 may analyze the audio content of the section selected in operation S605.
- the device 100 may generate content analysis information by performing secondary analysis on the audio content of the selected section.
- the device 100 may extract the keyword by performing voice recognition on an audio content section classified into a voice class.
- the device 100 may detect a word or sentence included in the audio content, and extract a keyword by extracting words included in a pre-stored reserved word list from the voice recognized result.
- the device 100 may perform speaker recognition on the audio content section classified into the voice class.
- Referring to FIGS. 7 to 10, methods of analyzing content for each scenario in the secondary analysis will be described in more detail.
- the content analysis methods illustrated in FIGS. 7 to 10 relate to the secondary analysis of the device 100, and it is assumed that the primary analysis has already been performed.
- As secondary analysis methods according to an exemplary embodiment, a topic determination method, a sports highlight information generation method, and a viewing rating information generation method will be described in more detail with reference to FIGS. 7 to 10.
- FIG. 7 is a flowchart illustrating a method of determining a topic of audio content according to an exemplary embodiment.
- the device 100 may select an audio content section in which audio content includes voice.
- the device 100 may select the audio content section using the information about the class determined for each section as a result of the first analysis. Since the device 100 may determine the topic of the audio content through analysis of the voice section, the device 100 may select the audio content section classified into the voice class.
- the device 100 may perform at least one of speech recognition and speaker recognition on the audio content of the section selected in operation S701.
- Speech recognition is for recognizing keywords included in the audio content
- speaker recognition is for recognizing the speaker of the voice included in the audio content.
- Topics may be generated based on recognized keywords and speakers.
- the device 100 may tag keyword information that is a result of speech recognition in step S703 and speaker recognition information that is a result of speaker recognition for each section of the audio content.
- the device 100 may tag keyword information and speaker recognition information of each section of the audio content to the audio content.
- the device 100 may finally determine content analysis information including keyword information and speaker recognition information to be tagged using the content analysis information determined by the VME 120.
- the device 100 may tag using the start and end time information of the content section, keyword information, and speaker recognition information.
- the device 100 may detect a scene of the multimedia content including the audio content to determine a topic. Scenes can be distinguished in a contextual sense, so new topics are more likely to start at the beginning of a new scene. Accordingly, the device 100 may detect a scene and determine a topic on a scene basis.
- the device 100 may acquire a detection frequency of the keyword and the recognized speaker using the tagged information on the content corresponding to the predetermined section of the scene detected in operation S707.
- the topic may be finally determined based on the frequency detected in operation S709.
- the device 100 may obtain a topic corresponding to the pre-registered reserved words detected in the predetermined section.
- the device 100 may determine a topic including recognized speaker information in a predetermined section.
- FIG. 8 is a flowchart illustrating a method of generating sports highlight information according to an exemplary embodiment.
- the device 100 may continuously analyze the sports program and generate content analysis information while the sports program is being broadcasted.
- the device 100 may generate sports highlight information based on content analysis information generated according to a user input. Even if the user does not watch the sports program, the user may later watch the main scenes of the corresponding content by using the sports highlight information generated by the device 100.
- the device 100 may select an audio content section including a voice and detect an excited speech with respect to the selected audio content section.
- the device 100 may detect an audio content section including the excited voice.
- the device 100 may select the audio content section using the information about the class determined for each section as a result of the first analysis.
- In a dramatic scene or a scoring scene, the announcer may speak with an excited voice. Accordingly, the device 100 may detect the excited voice in the audio content classified into the voice class and use the corresponding dramatic or scoring scene to generate the sports highlight information.
- An excited voice tends to be louder or higher in frequency than a normal voice. Accordingly, the device 100 may detect audio content sections including an excited voice by using the fact that an excited voice has different voice signal characteristics compared with a normal voice.
- the device 100 may use feature information regarding the excited voice to detect the audio content section including the excited voice among the audio content sections including the voice.
- Feature information about the excited voice may be stored in advance in another storage space.
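- For illustration, an excited-speech detector can threshold loudness (RMS energy) and pitch (median F0), matching the signal characteristics described above; the thresholds here are placeholders standing in for the stored feature information about excited voices.

```python
import numpy as np
import librosa

def is_excited_speech(y, sr, rms_thresh=0.1, f0_thresh=220.0):
    """Heuristic excited-speech detector: excited commentary tends to be
    both louder (higher RMS energy) and higher pitched (higher median F0)
    than normal speech. Thresholds are illustrative placeholders."""
    rms = librosa.feature.rms(y=y).mean()
    f0 = librosa.yin(y, fmin=65, fmax=500, sr=sr)  # fundamental frequency track
    return rms > rms_thresh and np.nanmedian(f0) > f0_thresh
```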
- the device 100 may select an audio content section including environmental noise and detect an acoustic event with respect to the selected audio content section.
- Acoustic events may include sounds other than voice or music, such as crowd shouting, whistles, and ball-hit sounds associated with dramatic or scoring scenes.
- the device 100 may detect an audio content section including an acoustic event.
- the device 100 may select an audio content section including environmental noise by using the information about the class determined for each section as a result of the first analysis.
- the device 100 may select an audio content section including an acoustic event among the selected audio content sections.
- the device 100 may generate sports highlight information using an audio content section including an acoustic event.
- the device 100 may use feature information about the acoustic event to detect an audio content section including a preset acoustic event among audio content sections including environmental noise.
- the characteristic information about the acoustic event may be stored in advance in another storage space.
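- A hedged sketch of acoustic event detection in environmental-noise sections by matching frame features against stored event templates; cosine similarity and the threshold are assumptions, since the patent only states that feature information about acoustic events is stored in advance.

```python
import numpy as np

def detect_acoustic_events(section_features, event_templates, threshold=0.85):
    """section_features: list of per-frame feature vectors for a section
    classified as environmental noise. event_templates: dict mapping event
    names ("whistle", "crowd_cheer", ...) to stored feature templates.
    Returns the events whose template matches any frame closely enough."""
    found = set()
    for frame in section_features:
        for name, tmpl in event_templates.items():
            sim = np.dot(frame, tmpl) / (np.linalg.norm(frame) * np.linalg.norm(tmpl))
            if sim >= threshold:
                found.add(name)
    return found
```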
- the device 100 may tag the audio content section including the excited voice and acoustic event detected in operations S801 and S803.
- the taggable information may include start and end time information of the audio content section, information about an excited voice and an acoustic event.
- the device 100 may generate sports highlight information using a content section including at least one of an excited voice and an acoustic event using the tagged information.
- the device 100 may generate sports highlight information by generating a clip image including a content section including at least one of an excited voice and an acoustic event.
- FIG. 9 is an exemplary diagram illustrating an example of sports highlight information according to an embodiment.
- sports highlight scenes 911 and 912 are shown.
- the sports highlight scene may be generated based on a content section including at least one of an excited voice and an acoustic event.
- FIG. 10 is a flowchart illustrating a method of generating viewing rating information of content according to an exemplary embodiment.
- the device 100 may generate viewing rating information including information on the degree of sensationalism or violence of the content currently or previously viewed.
- the user may check the degree of sensationalism or violence of the corresponding content by referring to its viewing rating information.
- the device 100 may detect slang words by selecting an audio content section including a voice and performing voice recognition on the selected audio content section.
- the device 100 may detect the profanity included in the selected audio content section by using the pre-stored information about the profanity.
- the device 100 may select an audio content section including environmental noise and detect an acoustic event from the selected audio content section.
- Acoustic events may include sounds other than voice or music, such as gunshots, bombings, and screams, which are related to sensationalism and violence.
- the device 100 may use feature information about the acoustic event to detect an audio content section including a preset acoustic event among audio content sections including environmental noise.
- the characteristic information about the acoustic event may be stored in advance in another storage space.
- the device 100 may perform tagging on an audio content section including slang and acoustic events detected in operations S1001 and S1003.
- the taggable information may include start and end time information of the audio content section, slang and information about an acoustic event.
- the device 100 may generate viewing rating information by using a content section including at least one of slang and acoustic event using the tagged information.
- the device 100 may detect a scene of the multimedia content including the audio content to generate the viewing rating information. Scenes are distinguished by contextual meaning, so new content is likely to begin at the start of a new scene. Accordingly, the device 100 may detect scenes and generate viewing rating information in units of scenes.
- the device 100 may acquire detection frequencies of slang words and acoustic events using information tagged for content corresponding to a predetermined section of the scene detected in operation S1007.
- the device 100 may obtain statistics on detection frequencies of profanity and acoustic events.
- the viewing rating information may be generated based on the detection frequencies obtained above.
- the device 100 may determine a section where slang or an acoustic event is detected as a sensational section or a violent section.
- the device 100 may set different weights for the degree of sensationalism or violence of each slang word and acoustic event.
- the device 100 may determine the degree of sensationalism or violence of each section according to the detection count and weight of each slang word and acoustic event.
- the weight may be a predetermined value.
- the device 100 may generate viewing rating information by obtaining the ratio of sensational or violent sections to the entire content or to the corresponding section, and the degree of sensationalism or violence of each section.
- the device 100 may generate viewing rating information indicating the degree of sensationalism and violence of the content in a two-dimensional space, for example with sensationalism on the x-axis and violence on the y-axis.
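- As an illustration of the weighted rating computation described above, the sketch below scores tagged sections on two axes (sensationalism, violence) using hypothetical per-event weights and reports the flagged-time ratio.

```python
# Hypothetical per-event (sensationalism, violence) weights; the patent only
# says weights may be preset per slang word and acoustic event.
WEIGHTS = {"profanity": (0.3, 0.2), "scream": (0.2, 0.6),
           "gunshot": (0.0, 1.0), "explosion": (0.0, 0.9)}

def rate_sections(tagged_sections, total_duration):
    """tagged_sections: list of (start, end, [event names]). Returns per-
    section (sensationalism, violence) scores plus the flagged-time ratio."""
    scores, flagged = [], 0.0
    for start, end, events in tagged_sections:
        s = sum(WEIGHTS.get(e, (0, 0))[0] for e in events)
        v = sum(WEIGHTS.get(e, (0, 0))[1] for e in events)
        scores.append({"start": start, "end": end,
                       "sensationalism": s, "violence": v})
        if s > 0 or v > 0:
            flagged += end - start                # time in flagged sections
    return scores, flagged / total_duration
```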
- FIG. 11 is a block diagram illustrating an internal structure of a device for analyzing content according to an exemplary embodiment.
- the device 1100 may include a receiver 1110 and a controller 1120.
- the device 1100 of FIG. 11 may correspond to the device 100 of FIG. 1.
- the receiver 1110 may receive content to be analyzed. In addition, the receiver 1110 may obtain feature values of audio content for each class, reserved word information for keyword extraction, and the like, which may be used in content analysis.
- the controller 1120 may analyze the content received by the receiver 1110.
- the controller 1120 may extract feature values of the received audio content and classify the audio content for each section based on the extracted feature values.
- the controller 1120 may select at least one section for analyzing the audio content based on the class to which the audio content of each section belongs, and analyze the audio content of the selected section.
- the controller 1120 may analyze the content by selecting an audio content section classified into a voice class and performing voice recognition and speaker recognition on the selected audio content section.
- FIG. 12 is a block diagram illustrating an internal structure of a device 1200 according to an embodiment.
- the device 1200 may be, for example, a mobile phone, a smartphone, a tablet PC, a PDA, an MP3 player, a kiosk, an electronic picture frame, a navigation device, a digital TV, a smart TV, or a wrist watch, and may also be any of various other devices usable by a user, such as a wearable device, for example a head-mounted display (HMD).
- the device 1200 may correspond to the devices 100 and 1100 of FIGS. 1 and 11 described above, and may analyze the received audio content and output content summary information.
- the device 1200 may include a display unit 1210, a controller 1270, a memory 1220, a GPS chip 1225, a communication unit 1230, a video processor 1235, an audio processor 1240, a user input unit 1245, a microphone unit 1250, an imaging unit 1255, a speaker unit 1260, and a motion detector 1265.
- the display unit 1210 may include a display panel 1211 and a controller (not shown) for controlling the display panel 1211.
- the display panel 1211 may be implemented as one of various types of displays such as a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, an active-matrix organic light-emitting diode (AMOLED) display, or a plasma display panel (PDP).
- the display panel 1211 may be implemented to be flexible, transparent, or wearable.
- the display unit 1210 may be combined with the touch panel 1247 of the user input unit 1245 and provided as a touch screen.
- the touch screen may include an integrated module in which the display panel 1211 and the touch panel 1247 are combined in a stacked structure.
- the display unit 1210 may display a result of analyzing audio content and summary information of the audio content under the control of the controller 1270.
- the memory 1220 may include at least one of an internal memory (not shown) and an external memory (not shown).
- the built-in memory may include, for example, at least one of a volatile memory (e.g., dynamic RAM (DRAM), static RAM (SRAM), or synchronous dynamic RAM (SDRAM)), a nonvolatile memory (e.g., one-time programmable ROM (OTPROM), programmable ROM (PROM), erasable and programmable ROM (EPROM), electrically erasable and programmable ROM (EEPROM), mask ROM, or flash ROM), a hard disk drive (HDD), and a solid state drive (SSD).
- the controller 1270 may load a command or data received from at least one of the nonvolatile memory or another component into the volatile memory and process the same.
- the controller 1270 may store data received or generated from another component in the nonvolatile memory.
- the external memory may include at least one of Compact Flash (CF), Secure Digital (SD), Micro Secure Digital (Micro-SD), Mini Secure Digital (Mini-SD), Extreme Digital (xD), and a Memory Stick. It may include.
- the memory 1220 may store various programs and data used for the operation of the device 1200.
- the memory 1220 may temporarily or semi-permanently store at least one of feature values of audio content belonging to each class, reserved words, feature information of an excited voice, and feature information of an acoustic event.
- the controller 1270 may control the display unit 1210 such that part of the information stored in the memory 1220 is displayed on the display unit 1210. In other words, the controller 1270 may display multimedia content and content summary information stored in the memory 1220 on the display unit 1210. Alternatively, when a user gesture is made in one area of the display unit 1210, the controller 1270 may perform a control operation corresponding to the user's gesture.
- the control unit 1270 may include at least one of a RAM 1271, a ROM 1272, a CPU 1273, a Graphic Processing Unit (GPU) 1274, and a bus 1275.
- the RAM 1271, the ROM 1272, the CPU 1273, the GPU 1274, and the like may be connected to each other through the bus 1275.
- the CPU 1273 accesses the memory 1220 and performs booting using an operating system (O/S) stored in the memory 1220, and performs various operations using the programs, content, and data stored in the memory 1220.
- the ROM 1272 stores a command set for system booting. For example, when a turn-on command is input and power is supplied, the device 1200 copies the O/S stored in the memory 1220 to the RAM 1271 according to the command stored in the ROM 1272, and boots the system by executing the O/S. When booting is completed, the CPU 1273 copies the various programs stored in the memory 1220 to the RAM 1271 and executes them to perform various operations. When booting of the device 1200 is completed, the GPU 1274 displays a UI screen on an area of the display unit 1210. In detail, the GPU 1274 may generate a screen displaying an electronic document including various objects such as content, icons, and menus.
- the GPU 1274 calculates attribute values such as coordinate values, shapes, sizes, colors, and the like in which objects are displayed according to the layout of the screen.
- the GPU 1274 may generate a screen of various layouts including the object based on the calculated attribute value.
- the screen generated by the GPU 1274 may be provided to the display unit 1210 and displayed on an area of the display unit 1210.
- the GPS chip 1225 may receive a GPS signal from a Global Positioning System (GPS) satellite to calculate a current position of the device 1200.
- the controller 1270 may calculate the user location using the GPS chip 1225 when using the navigation program or when the current location of the user is required.
- the communication unit 1230 may communicate with various types of external devices according to various types of communication methods.
- the communication unit 1230 may include at least one of a Wi-Fi chip 1231, a Bluetooth chip 1232, a wireless communication chip 1233, and an NFC chip 1234.
- the controller 1270 may communicate with various external devices using the communication unit 1230. For example, the controller 1270 may receive a control signal from an external device through the communication unit 1230 and transmit a result according to the control signal to the external device.
- the Wi-Fi chip 1231 and the Bluetooth chip 1232 may perform communication using the WiFi method and the Bluetooth method, respectively.
- When the Wi-Fi chip 1231 or the Bluetooth chip 1232 is used, various connection information such as an SSID and a session key may be transmitted and received first, and then various information may be transmitted and received using the established connection.
- the wireless communication chip 1233 refers to a chip that performs communication according to various communication standards such as IEEE, Zigbee, 3rd Generation (3G), 3rd Generation Partnership Project (3GPP), and Long Term Evolution (LTE).
- the NFC chip 1234 refers to a chip operating in a near field communication (NFC) method using a 13.56 MHz band among various RF-ID frequency bands such as 135 kHz, 13.56 MHz, 433 MHz, 860-960 MHz, 2.45 GHz, and the like.
- the video processor 1235 may process multimedia content received through the communication unit 1230 or video content included in the multimedia content stored in the memory 1220.
- the video processor 1235 may perform various image processing such as decoding, scaling, noise filtering, frame rate conversion, resolution conversion, and the like on the video content.
- the audio processor 1240 may process the multimedia content received through the communication unit 1230 or the audio content included in the multimedia content stored in the memory 1220.
- the audio processor 1240 may perform various processing such as decoding, amplification, noise filtering, etc. in order to play or analyze audio content.
- the controller 1270 may drive the video processor 1235 and the audio processor 1240 to play the multimedia content.
- the speaker unit 1260 may output audio content generated by the audio processor 1240.
- the user input unit 1245 may receive various commands from the user.
- the user input unit 1245 may include at least one of a key 1246, a touch panel 1247, and a pen recognition panel 1248.
- the device 1200 may display various contents or a user interface according to a user input received from at least one of the key 1246, the touch panel 1247, and the pen recognition panel 1248.
- the key 1246 may include various types of keys, such as mechanical buttons, wheels, and the like, formed in various areas such as a front portion, a side portion, a back portion, etc. of the main body exterior of the device 1200.
- the touch panel 1247 may detect a user's touch input and output a touch event value corresponding to the detected touch signal.
- the touch panel 1247 may be implemented with various types of touch sensors, such as capacitive, pressure-sensitive (resistive), and piezoelectric sensors.
- the capacitive type uses a dielectric coated on the surface of the touch screen to sense the minute electrical charge induced by the user's body when a part of the body touches the surface, and calculates the touch coordinates from the sensed charge.
- the pressure-sensitive (resistive) type includes two electrode plates embedded in the touch screen; when the user touches the screen, the upper and lower plates come into contact at the touched point and a current flows, and the touch coordinates are calculated by detecting this contact.
- the touch event occurring on the touch screen may be generated mainly by a human finger, but may also be generated by an object made of a conductive material that can cause a change in capacitance.
- the pen recognition panel 1248 may detect a proximity input or a touch input of a user's touch pen (e.g., a stylus pen or a digitizer pen) and output a pen proximity event or a pen touch event corresponding to the detected input.
- the pen recognition panel 1248 may be implemented using, for example, an electromagnetic resonance (EMR) method, and may detect a touch or proximity input according to a change in the intensity of an electromagnetic field caused by the approach or touch of the pen.
- the pen recognition panel 1248 may include an electromagnetic induction coil sensor (not shown) having a grid structure and an electronic signal processor (not shown) that sequentially provides an AC signal of a predetermined frequency to each loop coil of the coil sensor.
- when a pen having a built-in resonant circuit approaches a loop coil of the pen recognition panel 1248, the magnetic field transmitted from the loop coil induces a current in the resonant circuit of the pen by mutual electromagnetic induction. Based on this current, an induction magnetic field is generated from the coil constituting the resonant circuit in the pen, and the pen recognition panel 1248 detects this induction magnetic field at the loop coils that are in a signal-receiving state, thereby detecting the approach position or the touch position of the pen.
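- As general circuit background (an addition, not part of the source), the pen's resonant circuit is an LC loop whose resonant frequency is matched to the panel's AC drive frequency, and the coupling is governed by mutual induction:

$$ f_0 = \frac{1}{2\pi\sqrt{LC}}, \qquad \mathcal{E}_{\text{pen}} = -M\,\frac{dI_{\text{coil}}}{dt} $$

where $L$ and $C$ are the inductance and capacitance of the pen's circuit, $M$ is the mutual inductance between a loop coil and the pen's coil, and $I_{\text{coil}}$ is the coil drive current; driving the loop coil at $f_0$ maximizes the energy coupled into the pen and thus the induction field the panel later detects.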
- the pen recognition panel 1248 may be provided under the display panel 1211 so as to cover a predetermined area, for example, the display area of the display panel 1211.
- the microphone unit 1250 may convert a user voice or other sound into audio data.
- the controller 1270 may use the user's voice input through the microphone unit 1250 in a call operation or convert the user's voice into audio data and store it in the memory 1220.
- the imaging unit 1255 may capture a still image or a moving image under the control of the user.
- the imaging unit 1255 may be implemented as a plurality of cameras, such as a front camera and a rear camera.
- the controller 1270 may perform a control operation according to a user voice input through the microphone unit 1250 or a user motion recognized by the imaging unit 1255.
- the device 1200 may operate in a motion control mode or a voice control mode.
- the controller 1270 may activate the imaging unit 1255 to capture images of the user, track changes in the user's motion, and perform a control operation corresponding to the tracked motion.
- the controller 1270 may generate and output summary information of content currently being viewed according to the motion input of the user sensed by the imaging unit 1255.
- the controller 1270 may operate in a voice recognition mode that analyzes a user voice input through the microphone unit 1250 and performs a control operation according to the analyzed user voice.
- the motion detector 1265 may detect movement of the main body of the device 1200.
- the device 1200 may be rotated or tilted in various directions.
- the motion detector 1265 may detect a movement characteristic such as a rotation direction, an angle, and an inclination by using at least one of various sensors such as a geomagnetic sensor, a gyro sensor, an acceleration sensor, and the like.
- the motion detector 1265 may receive a user input by detecting a body motion of the device 1200, and may perform a control operation according to the received input.
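- As a hedged illustration of such tilt sensing (the formulas are standard accelerometer geometry and the function name is made up; this is not the patent's method), pitch and roll can be derived from a 3-axis accelerometer reading when the device is approximately stationary:

```python
import math

def pitch_roll_from_accel(ax, ay, az):
    """Compute pitch and roll (degrees) from accelerometer axes, assuming
    the device is near-stationary so gravity dominates the reading."""
    pitch = math.degrees(math.atan2(-ax, math.sqrt(ay * ay + az * az)))
    roll = math.degrees(math.atan2(ay, az))
    return pitch, roll

# device lying flat: gravity is entirely on the z axis
print(pitch_roll_from_accel(0.0, 0.0, 9.81))  # -> (0.0, 0.0)
```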
- in addition, according to an embodiment, the device 1200 may include a USB port to which a USB connector can be connected, various external input ports for connecting various external terminals such as a headset, a mouse, and a LAN, a digital multimedia broadcasting (DMB) chip that receives and processes DMB signals, and various other sensors.
- the names of the components of the device 1200 described above may vary.
- the device 1200 according to the present disclosure may include at least one of the above-described components; some components may be omitted, and other additional components may be further included.
- the content analysis performance may be improved by selectively analyzing the content according to the feature value of the audio content for each section, as the sketch below illustrates.
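- As an illustration of this selective analysis (a hypothetical sketch: the class prototypes, the nearest-prototype labeling, the analyzer registry, and all names below are assumptions, not the disclosed implementation):

```python
import numpy as np

def classify_sections(section_features, class_prototypes):
    """Label each section with the class whose prototype feature vector
    is nearest (a stand-in for the patent's classifier)."""
    labels = []
    for feat in section_features:
        dists = {name: np.linalg.norm(feat - proto)
                 for name, proto in class_prototypes.items()}
        labels.append(min(dists, key=dists.get))
    return labels

def analyze_selectively(sections, labels, analyzers):
    """Run only the analyzer registered for each section's class,
    skipping classes with no analyzer (e.g. music or silence)."""
    results = []
    for section, label in zip(sections, labels):
        if label in analyzers:
            results.append((label, analyzers[label](section)))
    return results

# hypothetical prototypes, features, and per-class analyzers
protos = {"speech": np.array([1.0, 0.0]), "noise": np.array([0.0, 1.0])}
feats = [np.array([0.9, 0.1]), np.array([0.2, 0.8])]
sections = ["section-0-audio", "section-1-audio"]
analyzers = {"speech": lambda s: f"speech/speaker recognition on {s}",
             "noise": lambda s: f"acoustic event detection on {s}"}
labels = classify_sections(feats, protos)
print(analyze_selectively(sections, labels, analyzers))
```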
- the method according to some embodiments may be embodied in the form of program instructions that may be executed by various computer means and recorded on a computer readable medium.
- the computer readable medium may include program instructions, data files, data structures, etc. alone or in combination.
- Program instructions recorded on the media may be those specially designed and constructed for the purposes of the present invention, or they may be of the kind well-known and available to those having skill in the computer software arts.
- Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; and magneto-optical media such as floptical disks.
- Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.
Landscapes
- Engineering & Computer Science (AREA)
- Signal Processing (AREA)
- Multimedia (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Claims (15)
- A method of analyzing audio content, the method comprising: extracting a feature value of the audio content; classifying the audio content by section, based on the extracted feature value of the audio content; and selecting at least one section for analyzing the audio content, based on a class to which the audio content of each section belongs, and performing analysis on the audio content of the selected section.
- The method of claim 1, wherein the classifying comprises classifying the audio content by comparing the feature value of the audio content with feature values of a database that contains information about feature values of at least one piece of audio content belonging to each class.
- The method of claim 1, wherein the extracting of the feature value comprises: decomposing the audio content into at least one elementary function; selecting, for each section of the decomposed audio content, at least one of the elementary functions as a dominant elementary function; and extracting, for each section, a basis function as the feature value of the audio content by using the selected dominant elementary function.
- The method of claim 1, wherein the extracting of the feature value comprises: extracting at least one instantaneous feature value in a predetermined section of the audio content; and extracting a statistical feature value from the at least one instantaneous feature value belonging to the predetermined section.
- The method of claim 1, wherein the performing of the analysis comprises: selecting a section of the audio content that belongs to a speech class; and performing at least one of speech recognition and speaker recognition on the audio content of the selected section.
- The method of claim 5, wherein the performing of the analysis comprises determining a topic of the audio content of a predetermined section by using a result of the speech recognition or the speaker recognition.
- The method of claim 1, wherein the performing of the analysis comprises: selecting a section of the audio content that belongs to an environmental noise class; and detecting an acoustic event included in the audio content for each selected section.
- The method of claim 1, wherein the performing of the analysis comprises: performing analysis on video content corresponding to the selected section; and correcting an analysis result of the audio content by using an analysis result of the video content.
- A device comprising: a receiver configured to receive audio content; and a controller configured to extract a feature value of the audio content, classify the audio content by section based on the extracted feature value of the audio content, select at least one section for analyzing the audio content based on a class to which the audio content of each section belongs, and perform analysis on the audio content of the selected section.
- The device of claim 9, wherein the controller classifies the audio content by comparing the feature value of the audio content with feature values of a database that contains information about feature values of at least one piece of audio content belonging to each class.
- The device of claim 9, wherein the controller decomposes the audio content into at least one elementary function, selects, for each section of the decomposed audio content, at least one of the elementary functions as a dominant elementary function, and extracts a basis function for each section by using the selected dominant elementary function.
- The device of claim 9, wherein the controller extracts at least one instantaneous feature value in a predetermined section of the audio content and extracts a statistical feature value from the at least one instantaneous feature value belonging to the predetermined section.
- The device of claim 9, wherein the controller selects a section of the audio content that belongs to a speech class and performs at least one of speech recognition and speaker recognition on the audio content of the selected section.
- The device of claim 13, wherein the controller determines a topic of the audio content of a predetermined section by using a result of the speech recognition or the speaker recognition.
- The device of claim 9, wherein the controller selects a section of the audio content that belongs to an environmental noise class and detects an acoustic event included in the audio content for each selected section.
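To make the feature-extraction claims concrete, the following hedged sketch uses SVD as one plausible stand-in for the elementary-function decomposition of claims 3 and 11, and per-frame RMS energy as a stand-in instantaneous feature for claims 4 and 12; every name and parameter here is an assumption, not the claimed method:

```python
import numpy as np

def dominant_basis(section, n_frames=32, n_dominant=2):
    """Split a section into frames, decompose the frame matrix with SVD,
    and keep the leading singular vectors as a 'dominant basis'
    (one plausible reading of claims 3 and 11)."""
    frame_len = len(section) // n_frames
    frames = section[:frame_len * n_frames].reshape(n_frames, frame_len)
    u, s, vt = np.linalg.svd(frames, full_matrices=False)
    return vt[:n_dominant]  # dominant elementary functions / basis

def statistical_features(section, frame_len=512):
    """Per-frame RMS energy as an instantaneous feature, summarized by mean
    and standard deviation (one plausible reading of claims 4 and 12)."""
    n = len(section) // frame_len
    frames = section[:n * frame_len].reshape(n, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))  # instantaneous features
    return np.array([rms.mean(), rms.std()])   # statistical features

section = np.random.randn(16000)  # one second of synthetic audio at 16 kHz
print(dominant_basis(section).shape)   # (2, 500)
print(statistical_features(section))
```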
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/122,344 US10014008B2 (en) | 2014-03-03 | 2015-03-03 | Contents analysis method and device |
KR1020167021907A KR101844516B1 (ko) | 2014-03-03 | 2015-03-03 | Content analysis method and device |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201461946977P | 2014-03-03 | 2014-03-03 | |
US61/946,977 | 2014-03-03 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2015133782A1 true WO2015133782A1 (ko) | 2015-09-11 |
Family
ID=54055531
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/KR2015/002014 WO2015133782A1 (ko) | 2015-03-03 | Content analysis method and device |
Country Status (3)
Country | Link |
---|---|
US (1) | US10014008B2 (ko) |
KR (1) | KR101844516B1 (ko) |
WO (1) | WO2015133782A1 (ko) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113900617A (zh) * | 2021-08-03 | 2022-01-07 | 钰太芯微电子科技(上海)有限公司 | Microphone array system with acoustic-line interface, and electronic device |
Families Citing this family (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9992593B2 (en) * | 2014-09-09 | 2018-06-05 | Dell Products L.P. | Acoustic characterization based on sensor profiling |
US11868354B2 (en) | 2015-09-23 | 2024-01-09 | Motorola Solutions, Inc. | Apparatus, system, and method for responding to a user-initiated query with a context-based response |
CA3036778C (en) * | 2016-09-21 | 2022-02-01 | Motorola Solutions, Inc. | Method and system for optimizing voice recognition and information searching based on talkgroup activities |
GB2563953A (en) | 2017-06-28 | 2019-01-02 | Cirrus Logic Int Semiconductor Ltd | Detection of replay attack |
GB201801527D0 (en) | 2017-07-07 | 2018-03-14 | Cirrus Logic Int Semiconductor Ltd | Method, apparatus and systems for biometric processes |
GB201801532D0 (en) | 2017-07-07 | 2018-03-14 | Cirrus Logic Int Semiconductor Ltd | Methods, apparatus and systems for audio playback |
GB2567503A (en) | 2017-10-13 | 2019-04-17 | Cirrus Logic Int Semiconductor Ltd | Analysing speech signals |
GB201804843D0 (en) | 2017-11-14 | 2018-05-09 | Cirrus Logic Int Semiconductor Ltd | Detection of replay attack |
GB201801664D0 (en) | 2017-10-13 | 2018-03-21 | Cirrus Logic Int Semiconductor Ltd | Detection of liveness |
US10798446B2 (en) * | 2018-01-04 | 2020-10-06 | International Business Machines Corporation | Content narrowing of a live feed based on cognitive profiling |
US11475899B2 (en) * | 2018-01-23 | 2022-10-18 | Cirrus Logic, Inc. | Speaker identification |
US11264037B2 (en) | 2018-01-23 | 2022-03-01 | Cirrus Logic, Inc. | Speaker identification |
US11735189B2 (en) | 2018-01-23 | 2023-08-22 | Cirrus Logic, Inc. | Speaker identification |
US20190296844A1 (en) * | 2018-03-23 | 2019-09-26 | Social Media Labs, Inc. | Augmented interactivity for broadcast programs |
US20200037022A1 (en) * | 2018-07-30 | 2020-01-30 | Thuuz, Inc. | Audio processing for extraction of variable length disjoint segments from audiovisual content |
US11025985B2 (en) | 2018-06-05 | 2021-06-01 | Stats Llc | Audio processing for detecting occurrences of crowd noise in sporting event television programming |
US11264048B1 (en) | 2018-06-05 | 2022-03-01 | Stats Llc | Audio processing for detecting occurrences of loud sound characterized by brief audio bursts |
US10692490B2 (en) | 2018-07-31 | 2020-06-23 | Cirrus Logic, Inc. | Detection of replay attack |
US10915614B2 (en) | 2018-08-31 | 2021-02-09 | Cirrus Logic, Inc. | Biometric authentication |
US10880604B2 (en) * | 2018-09-20 | 2020-12-29 | International Business Machines Corporation | Filter and prevent sharing of videos |
KR102131751B1 (ko) | 2018-11-22 | 2020-07-08 | 에스케이텔레콤 주식회사 | Method for processing section classification information using recognition meta information, and service apparatus supporting the same |
CN113132753A (zh) | 2019-12-30 | 2021-07-16 | 阿里巴巴集团控股有限公司 | Data processing method and apparatus, and video cover generation method and apparatus |
US11328721B2 (en) * | 2020-02-04 | 2022-05-10 | Soundhound, Inc. | Wake suppression for audio playing and listening devices |
CN111757195B (zh) * | 2020-06-09 | 2022-10-21 | 北京未来居科技有限公司 | Information output method and apparatus based on a smart speaker |
US20210407493A1 (en) * | 2020-06-30 | 2021-12-30 | Plantronics, Inc. | Audio Anomaly Detection in a Speech Signal |
US11922967B2 (en) | 2020-10-08 | 2024-03-05 | Gracenote, Inc. | System and method for podcast repetitive content detection |
CN113948085B (zh) * | 2021-12-22 | 2022-03-25 | 中国科学院自动化研究所 | Speech recognition method and system, electronic device, and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100792016B1 (ko) * | 2006-07-25 | 2008-01-04 | 한국항공대학교산학협력단 | Character-based video summarization apparatus using audio and video information, and method therefor |
JP2012090337A (ja) * | 2012-01-13 | 2012-05-10 | Toshiba Corp | Electronic apparatus and display processing method |
KR20120098211A (ko) * | 2011-02-28 | 2012-09-05 | 삼성전자주식회사 | Speech recognition method and speech recognition apparatus therefor |
JP5243365B2 (ja) * | 2009-08-10 | 2013-07-24 | 日本電信電話株式会社 | Content generation device, content generation method, and content generation program |
KR20130090570A (ko) * | 2012-02-06 | 2013-08-14 | 한국과학기술원 | Method and apparatus for voice-based tagging of multimedia content with settable sections |
Family Cites Families (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7751596B2 (en) * | 1996-11-12 | 2010-07-06 | Digimarc Corporation | Methods and arrangements employing digital content items |
US6550057B1 (en) * | 1999-08-31 | 2003-04-15 | Accenture Llp | Piecemeal retrieval in an information services patterns environment |
US6970860B1 (en) | 2000-10-30 | 2005-11-29 | Microsoft Corporation | Semi-automatic annotation of multimedia objects |
US7478125B2 (en) | 2001-09-13 | 2009-01-13 | Intel Corporation | Automatic annotation of audio and/or visual data |
US20050238238A1 (en) * | 2002-07-19 | 2005-10-27 | Li-Qun Xu | Method and system for classification of semantic content of audio/video data |
US8055503B2 (en) | 2002-10-18 | 2011-11-08 | Siemens Enterprise Communications, Inc. | Methods and apparatus for audio data analysis and data mining using speech recognition |
DE20321797U1 (de) * | 2002-12-17 | 2010-06-10 | Sony France S.A. | Device for automatically generating a general extraction function that is computable from an input signal, e.g. an audio signal, in order to generate therefrom a predetermined global characteristic value of its content, e.g. a descriptor |
US7379875B2 (en) | 2003-10-24 | 2008-05-27 | Microsoft Corporation | Systems and methods for generating audio thumbnails |
US8838452B2 (en) | 2004-06-09 | 2014-09-16 | Canon Kabushiki Kaisha | Effective audio segmentation and classification |
US8521529B2 (en) | 2004-10-18 | 2013-08-27 | Creative Technology Ltd | Method for segmenting audio signals |
US7788575B2 (en) | 2005-01-31 | 2010-08-31 | Hewlett-Packard Development Company, L.P. | Automated image annotation |
US8694317B2 (en) | 2005-02-05 | 2014-04-08 | Aurix Limited | Methods and apparatus relating to searching of spoken audio data |
CN1889172A (zh) | 2005-06-28 | 2007-01-03 | 松下电器产业株式会社 | Sound classification system and method capable of adding and correcting sound classes |
KR100704631B1 (ko) | 2005-08-10 | 2007-04-10 | 삼성전자주식회사 | Apparatus and method for generating voice annotation |
US20070124293A1 (en) | 2005-11-01 | 2007-05-31 | Ohigo, Inc. | Audio search system |
KR101128521B1 (ko) | 2005-11-10 | 2012-03-27 | 삼성전자주식회사 | Method and apparatus for detecting an event using audio data |
EP1796080B1 (en) * | 2005-12-12 | 2009-11-18 | Gregory John Gadbois | Multi-voice speech recognition |
US20080091719A1 (en) | 2006-10-13 | 2008-04-17 | Robert Thomas Arenburg | Audio tags |
US7796984B2 (en) | 2007-01-11 | 2010-09-14 | At&T Mobility Ii Llc | Automated tagging of targeted media resources |
JP4909854B2 (ja) | 2007-09-27 | 2012-04-04 | 株式会社東芝 | Electronic apparatus and display processing method |
US20090265165A1 (en) | 2008-04-21 | 2009-10-22 | Sony Ericsson Mobile Communications Ab | Automatic meta-data tagging pictures and video records |
US8428949B2 (en) * | 2008-06-30 | 2013-04-23 | Waves Audio Ltd. | Apparatus and method for classification and segmentation of audio content, based on the audio signal |
US20110044512A1 (en) | 2009-03-31 | 2011-02-24 | Myspace Inc. | Automatic Image Tagging |
US20110013810A1 (en) | 2009-07-17 | 2011-01-20 | Engstroem Jimmy | System and method for automatic tagging of a digital image |
US20110029108A1 (en) | 2009-08-03 | 2011-02-03 | Jeehyong Lee | Music genre classification method and apparatus |
US9031243B2 (en) | 2009-09-28 | 2015-05-12 | iZotope, Inc. | Automatic labeling and control of audio algorithms by audio recognition |
US20110096135A1 (en) | 2009-10-23 | 2011-04-28 | Microsoft Corporation | Automatic labeling of a video session |
KR101298740B1 (ko) | 2010-04-02 | 2013-08-21 | 에스케이플래닛 주식회사 | Method and apparatus for keyword re-search using word association in a keyword spotting scheme |
US9183560B2 (en) * | 2010-05-28 | 2015-11-10 | Daniel H. Abelow | Reality alternate |
US20120197648A1 (en) | 2011-01-27 | 2012-08-02 | David Moloney | Audio annotation |
US9129605B2 (en) | 2012-03-30 | 2015-09-08 | Src, Inc. | Automated voice and speech labeling |
US9244924B2 (en) | 2012-04-23 | 2016-01-26 | Sri International | Classification, search, and retrieval of complex video events |
US8843952B2 (en) | 2012-06-28 | 2014-09-23 | Google Inc. | Determining TV program information based on analysis of audio fingerprints |
- 2015
- 2015-03-03 WO PCT/KR2015/002014 patent/WO2015133782A1/ko active Application Filing
- 2015-03-03 US US15/122,344 patent/US10014008B2/en active Active
- 2015-03-03 KR KR1020167021907A patent/KR101844516B1/ko active IP Right Grant
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100792016B1 (ko) * | 2006-07-25 | 2008-01-04 | 한국항공대학교산학협력단 | Character-based video summarization apparatus using audio and video information, and method therefor |
JP5243365B2 (ja) * | 2009-08-10 | 2013-07-24 | 日本電信電話株式会社 | Content generation device, content generation method, and content generation program |
KR20120098211A (ko) * | 2011-02-28 | 2012-09-05 | 삼성전자주식회사 | Speech recognition method and speech recognition apparatus therefor |
JP2012090337A (ja) * | 2012-01-13 | 2012-05-10 | Toshiba Corp | Electronic apparatus and display processing method |
KR20130090570A (ko) * | 2012-02-06 | 2013-08-14 | 한국과학기술원 | Method and apparatus for voice-based tagging of multimedia content with settable sections |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113900617A (zh) * | 2021-08-03 | 2022-01-07 | 钰太芯微电子科技(上海)有限公司 | Microphone array system with acoustic-line interface, and electronic device |
CN113900617B (zh) * | 2021-08-03 | 2023-12-01 | 钰太芯微电子科技(上海)有限公司 | Microphone array system with acoustic-line interface, and electronic device |
Also Published As
Publication number | Publication date |
---|---|
US10014008B2 (en) | 2018-07-03 |
KR101844516B1 (ko) | 2018-04-02 |
KR20160110433A (ko) | 2016-09-21 |
US20160372139A1 (en) | 2016-12-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2015133782A1 (ko) | Content analysis method and device | |
WO2017078361A1 (en) | Electronic device and method for recognizing speech | |
WO2017065494A1 (en) | Portable device and screen display method of portable device | |
WO2017090947A1 (en) | Question and answer processing method and electronic device for supporting the same | |
WO2015030488A1 (en) | Multi display method, storage medium, and electronic device | |
WO2019031707A1 (en) | MOBILE TERMINAL AND METHOD FOR CONTROLLING A MOBILE TERMINAL USING MACHINE APPRENTICESHIP | |
WO2016093518A1 (en) | Method and apparatus for arranging objects according to content of background image | |
WO2014157846A1 (en) | Portable terminal, hearing aid, and method of indicating positions of sound sources in the portable terminal | |
WO2014010974A1 (en) | User interface apparatus and method for user terminal | |
WO2014007545A1 (en) | Method and apparatus for connecting service between user devices using voice | |
WO2019112342A1 (en) | Voice recognition apparatus and operation method thereof cross-reference to related application | |
WO2014025185A1 (en) | Method and system for tagging information about image, apparatus and computer-readable recording medium thereof | |
WO2012169737A2 (en) | Display apparatus and method for executing link and method for recognizing voice thereof | |
WO2017164567A1 (en) | Intelligent electronic device and method of operating the same | |
WO2018194273A1 (en) | Image display apparatus and method | |
WO2020162709A1 (en) | Electronic device for providing graphic data based on voice and operating method thereof | |
WO2017209568A1 (ko) | Electronic device and operation method thereof | |
WO2016114432A1 (ko) | Method for processing sound on basis of image information, and corresponding device | |
WO2016182361A1 (en) | Gesture recognition method, computing device, and control device | |
WO2020091519A1 (en) | Electronic apparatus and controlling method thereof | |
WO2015005728A1 (ko) | Image display method and apparatus | |
WO2015199430A1 (en) | Method and apparatus for managing data | |
WO2013191408A1 (en) | Method for improving touch recognition and electronic device thereof | |
WO2018164534A1 (ko) | Portable device and screen control method of portable device | |
WO2021040180A1 (ko) | Display device and control method therefor | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 15758025; Country of ref document: EP; Kind code of ref document: A1 |
| ENP | Entry into the national phase | Ref document number: 20167021907; Country of ref document: KR; Kind code of ref document: A |
| WWE | Wipo information: entry into national phase | Ref document number: 15122344; Country of ref document: US |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 15758025; Country of ref document: EP; Kind code of ref document: A1 |