WO2018094723A1 - Automatically detecting contents expressing emotions from a video and enriching an image index - Google Patents

Automatically detecting contents expressing emotions from a video and enriching an image index

Info

Publication number
WO2018094723A1
Authority
WO
WIPO (PCT)
Prior art keywords
clip
clips
video
features associated
classification
Prior art date
Application number
PCT/CN2016/107456
Other languages
English (en)
Inventor
Yilong TANG
Bo Han
Original Assignee
Microsoft Technology Licensing, LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to US16/342,496 priority Critical patent/US11328159B2/en
Application filed by Microsoft Technology Licensing, LLC filed Critical Microsoft Technology Licensing, LLC
Priority to EP16922433.4A priority patent/EP3545428A4/fr
Priority to CN201680082647.0A priority patent/CN108701144B/zh
Priority to PCT/CN2016/107456 priority patent/WO2018094723A1/fr
Publication of WO2018094723A1 publication Critical patent/WO2018094723A1/fr

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F 16/7837 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G06F 16/784 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content the detected or recognised objects being people
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/51 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/53 Querying
    • G06F 16/532 Query formulation, e.g. graphical querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/71 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/73 Querying
    • G06F 16/738 Presentation of query results
    • G06F 16/739 Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Definitions

  • the present disclosure provides a method for detecting contents expressing emotions from a video.
  • the method may comprise: dividing the video into a plurality of clips; extracting, from a first clip and at least one second clip of the plurality of clips, features associated with the first clip; determining whether the first clip expresses emotions based on the features associated with the first clip; and building an index containing the first clip based on the features associated with the first clip if the first clip expresses emotions.
  • the present disclosure provides an apparatus for detecting contents expressing emotions from a video.
  • the apparatus may comprise: a dividing module configured to divide the video into a plurality of clips; an extracting module configured to extract, from a first clip and at least one second clip of the plurality of clips, features associated with the first clip; a determining module configured to determine whether the first clip expresses emotions based on the features associated with the first clip; and a building module configured to build an index containing the first clip based on the features associated with the first clip if the first clip expresses emotions.
  • the present disclosure provides a method for enriching an image index.
  • the method may comprise: dividing a video into a plurality of clips; extracting features from a first clip of the plurality of clips; determining whether there is at least one image from web pages which is similar to the first clip based on the features extracted from the first clip; and if there is the at least one image, enriching an image index containing the at least one image based on the features extracted from the first clip and the features extracted from at least one second clip of the plurality of clips.
  • the present disclosure provides an apparatus for enriching an image index.
  • the apparatus may comprise: a dividing module configured to divide a video into a plurality of clips; an extracting module configured to extract features from a first clip of the plurality of clips; a determining module configured to determine whether there is at least one image from web pages which is similar to the first clip based on the features extracted from the first clip; and an enriching module configured to, if there is the at least one image, enrich an image index containing the at least one image based on the features extracted from the first clip and the features extracted from at least one second clip of the plurality of clips.
  • the present disclosure provides a system for detecting contents expressing emotions from a video.
  • the system may comprise one or more processors and a memory.
  • the memory may store computer-executable instructions that, when executed, cause the one or more processors to perform any steps of the method for detecting contents expressing emotions from a video according to various aspects of the present disclosure.
  • the present disclosure provides a system for enriching an image index.
  • the system may comprise one or more processors and a memory.
  • the memory may store computer-executable instructions that, when executed, cause the one or more processors to perform any steps of the method for enriching an image index according to various aspects of the present disclosure.
  • FIG. 1 illustrates an exemplary environment in which the techniques disclosed herein may be implemented.
  • FIG. 2 illustrates a flow chart of a method for detecting contents expressing emotions from a video.
  • FIG. 3 illustrates a flow chart of a method for enriching an image index.
  • FIG. 4 illustrates an exemplary apparatus for detecting contents expressing emotions from a video.
  • FIG. 5 illustrates an exemplary apparatus for enriching an image index.
  • FIG. 6 illustrates an exemplary system for detecting contents expressing emotions from a video.
  • More and more people search for images, such as GIF images and memes, that can be used to express their emotions and be sent to or shared with others.
  • a traditional search engine needs to crawl all the web pages to find these contents that are created by people. Whether such a content expressing emotions can be found depends on how widely it has spread on the web. Someone must first create and share the content, and the content must spread across the web at a certain scale; otherwise the search engine will not recognize its importance. Furthermore, the information about the original video from which the content originates is lost in the second-creation content, and it is hard to retrieve such media with video-related searches.
  • a video may be divided into a plurality of clips. For each clip of the plurality of clips, features may be extracted from the clip. Then it may be determined whether one of the plurality of clips expresses emotions based on features associated with the clip including the features extracted from the clip and the features extracted from at least one other clip of the plurality of clips, such as at least one clip next to the clip.
  • the features extracted from the clip may be called content features associated with the clip, and the features extracted from the clip next to it may be called contextual features associated with the clip.
  • if one or more clips are determined to express emotions, an index containing the one or more clips, such as a reverse index, may be built based on the features associated with the one or more clips.
  • the contents expressing emotions may be found proactively from a video and the built index may contain rich information to help retrieval of the contents.
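  • For illustration only, the end-to-end flow just described can be sketched in a few lines of Python. Everything below is an assumption about one possible realization: the helper names, the fixed-length fallback segmentation, and the placeholder feature values are hypothetical stand-ins for the detectors and classifiers described in this disclosure, not the disclosed implementation.

        from dataclasses import dataclass, field
        from typing import Dict, List

        @dataclass
        class Clip:
            start: float                                   # start time in seconds
            end: float                                     # end time in seconds
            content: Dict[str, object] = field(default_factory=dict)  # this clip's features
            context: List[Dict] = field(default_factory=list)         # neighbours' features

        def divide_into_clips(duration: float, step: float = 3.0) -> List[Clip]:
            # Stand-in for the shot/sub-shot/speech/object-motion segmentation:
            # here we simply cut fixed-length clips of about 1-5 seconds.
            t, clips = 0.0, []
            while t < duration:
                clips.append(Clip(start=t, end=min(t + step, duration)))
                t += step
            return clips

        def extract_content_features(clip: Clip) -> Dict[str, object]:
            # Stand-in for the analyzers described below (speech recognition,
            # face/emotion/action/scene/sound classification, bullet-curtain stats).
            return {"transcript": "", "face_emotion": "neutral", "scene": "unknown"}

        def expresses_emotions(content: Dict, context: List[Dict]) -> bool:
            # Stand-in for the pre-trained classifier over combined features.
            return content.get("face_emotion") not in ("neutral", None)

        def detect_emotional_clips(duration: float) -> List[Clip]:
            clips = divide_into_clips(duration)
            for c in clips:
                c.content = extract_content_features(c)
            emotional = []
            for i, c in enumerate(clips):
                # contextual features come from the clips immediately before and after
                c.context = [clips[j].content for j in (i - 1, i + 1) if 0 <= j < len(clips)]
                if expresses_emotions(c.content, c.context):
                    emotional.append(c)
            return emotional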
  • FIG. 1 illustrates an exemplary environment 100 in which the techniques described in the present disclosure may be implemented. It should be understood that this and other arrangements described herein are set forth only as examples. Other arrangements and elements (e.g., machines, interfaces, functions and grouping of functions, etc. ) can be used in addition to or instead of those shown, and some elements may be omitted altogether. Further, many of the elements described herein are functional entities that may be implemented as discrete or distributed components or in conjunction with other components. Various functions described herein as being performed by one or more entities may be carried out by hardware, firmware, and/or software. For instance, various functions may be carried out by a processor executing instructions stored in memory.
  • the illustrated exemplary environment 100 may include a user device 110, a content creation server 120, and a search engine server 140.
  • Each of the components shown in FIG. 1 may be any type of computing device.
  • the components may communicate with each other via a network 130, which may include, without limitation, one or more local area networks (LANs) and/or wide area networks (WANs).
  • Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.
  • Any number of user devices, content creation servers and search engine servers may be employed within the environment 100 within the scope of the present disclosure.
  • Each may comprise a single device or multiple devices cooperating in a distributed way.
  • the content creation server 120 may comprise multiple devices arranged in a distributed way that collectively provide the functionality of the content creation server 120 described herein.
  • the user device 110 may be configured in a variety of ways.
  • the user device 110 may be configured as a traditional computer (such as a desktop personal computer, laptop personal computer and so on), an entertainment appliance, a smart phone, a netbook, a tablet and so on.
  • the user device 110 may range from a full resource device with substantial memory and processor resources (e.g., personal computers) to a low-resource device with limited memory and/or processing resources (e.g., hand-held game controls).
  • the content creation server 120 may be configured to create all types of contents and build an index for these contents.
  • the content creation server 120 may be configured to detect contents expressing emotions from one or more videos.
  • the content creation server 120 may include a video analyzing component 122 and an index building component 124.
  • the content creation server 120 may further include an index enriching component 126.
  • the video analyzing component 122 may be configured to divide a video, such as a hot video, into a plurality of clips and analyze the plurality of clips to detect one or more clips expressing emotions from the plurality of clips.
  • the analyzing may comprise performing, on each of the plurality of clips, speech recognition to know what the people in the clip are saying.
  • the analyzing may further comprise performing face recognition on the clip to know who the most important person is in the video and identify if the person is a celebrity.
  • the analyzing may further comprise performing face emotion classification and speech emotion classification on the clip to know the person’s emotion.
  • the analyzing may further comprise performing action detection on the clip to know what the person is doing in the clip.
  • the analyzing may further comprise performing human gesture classification on the clip to know the person’s gesture.
  • the analyzing may further comprise performing scene classification on the clip to know what the overall scene looks like.
  • the analyzing may further comprise performing sound (e.g., music) classification on the clip to obtain the sound category so as to determine a scene type from an audio.
  • the analyzing may further comprise performing bullet-curtain (i.e., scrolling on-screen comment) detection and classification on the clip to know the comment count and category for the clip. Those skilled in the art will understand that the analyzing is not limited to those as described.
  • the video analyzing component 122 may be configured to extract features from each of the plurality of clips based on analyzing the clip. Then the video analyzing component 122 may be configured to determine, for each of the plurality of clips, whether the clip expresses emotions based on the features associated with the clip, which may include the features extracted from the clip and the features extracted from at least one clip next to the clip of the plurality of clips. Thereafter, the video analyzing component 122 may determine whether there are one or more clips expressing emotions among the plurality of clips. If there are, the video analyzing component 122 may provide the one or more clips and the features associated with them to the index building component 124.
  • the index building component 124 may be configured to build an index containing clips expressing human emotions provided by the video analyzing component 122 based on the features associated with these clips.
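  • As an illustrative assumption about this step (the disclosure does not prescribe an index layout), the textual features of each emotional clip could populate a simple reverse (inverted) index mapping feature terms to posting lists of clips; indexing contextual terms from neighbouring clips alongside the clip's own terms is what makes the index richer than the clip alone. All field names below are hypothetical:

        from collections import defaultdict

        def build_reverse_index(emotional_clips):
            """Map each feature term to the set of clips it came from (posting lists)."""
            index = defaultdict(set)
            for clip_id, feats in emotional_clips.items():
                terms = feats.get("transcript", "").lower().split()
                terms += [feats.get("face_emotion", ""), feats.get("scene", "")]
                # contextual terms from neighbouring clips are indexed too
                for ctx in feats.get("context", []):
                    terms += ctx.get("transcript", "").lower().split()
                for term in filter(None, terms):
                    index[term].add(clip_id)
            return dict(index)

        # Example: a clip whose neighbour mentions "goal" becomes retrievable by "goal".
        clips = {
            "video1#00:12": {"transcript": "what a save", "face_emotion": "joy",
                             "scene": "stadium", "context": [{"transcript": "goal"}]},
        }
        index = build_reverse_index(clips)
        assert "video1#00:12" in index["goal"]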
  • the index enriching component 126 may be configured to enrich an existing image index and/or the index built by the index building component 124. For each clip, the index enriching component 126 may search web pages or a database containing images pre-retrieved from web pages for an image similar to the clip based on the features extracted from the clip. If there is at least one similar image, then the index enriching component 126 may be configured to enrich an image index containing the at least one similar image using the features extracted from the clip and the features extracted from at least one clip next to the clip. Furthermore, the index enriching component 126 may be configured to enrich the image index using metadata associated with the video. Moreover, the index enriching component 126 may be further configured to enrich the index built by the index building component 124 using an image index containing images from web pages similar to at least one clip contained in the built index.
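  • The similarity test between a clip and web images is not pinned down by the disclosure; one plausible stand-in, shown purely as an assumption, is cosine similarity between visual feature vectors of the clip and of the pre-retrieved images:

        import numpy as np

        def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

        def find_similar_images(clip_vec, image_vectors, threshold=0.9):
            """Return ids of pre-retrieved web images whose vectors match the clip."""
            return [img_id for img_id, vec in image_vectors.items()
                    if cosine_similarity(clip_vec, vec) >= threshold]

        # Hypothetical 4-dimensional visual features for illustration only.
        image_vectors = {"img_a": np.array([0.9, 0.1, 0.0, 0.4]),
                         "img_b": np.array([0.0, 1.0, 0.9, 0.0])}
        clip_vec = np.array([0.88, 0.12, 0.05, 0.41])
        print(find_similar_images(clip_vec, image_vectors))   # ['img_a']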
  • the search engine server 140 generally may operate to receive search queries from user devices, such as the user device 110, and provide search results in response to the search queries.
  • the search engine server 140 may include a user interface component 142, a matching component 144 and a ranking component 146.
  • the user interface component 142 may provide an interface to the user device 110, which allows users to submit search queries to the search engine server 140 and to receive search results from the search engine server 140.
  • the user device 110 may be any type of computing device employed by a user to submit search queries and receive search results.
  • the user device 110 may be a desktop computer, a laptop computer, a tablet computer, a mobile device, or other type of computing device.
  • the user device 110 may include an application that allows a user to enter a search query and submit the search query to the search engine server 140 to retrieve search results.
  • the user device 110 may include a web browser that includes a search input box or allows a user to access a search page to submit a search query.
  • Other mechanisms for submitting search queries to search engines are contemplated to be within the scope of the present disclosure.
  • the matching component 144 may understand the query intention and break the query into different terms and/or entities, and employ these terms/entities to query the index built by the index building component 124 and enriched by the index enriching component 126, and identify a set of matching clips.
  • the set of matching clips may be evaluated by the ranking component 146 to provide a set of ranked clips.
  • the ranking component 146 may indicate the set of ranked clips to the user interface component 142.
  • the user interface component 142 may then communicate search results including at least a portion of the set of ranked clips to the user device 110. For instance, the user interface component 142 may generate or otherwise provide a search engine results page listing search results based on the set of ranked clips.
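  • A toy sketch of the matching and ranking path, assuming the term-to-clips index from the earlier snippet; the overlap-count scoring rule is invented for illustration, whereas a production ranker would combine many signals:

        def search(query: str, index: dict) -> list:
            """Look up each query term's posting list and rank clips by term overlap."""
            scores = {}
            for term in query.lower().split():
                for clip_id in index.get(term, ()):
                    scores[clip_id] = scores.get(clip_id, 0) + 1
            # rank by number of matched query terms, highest first
            return sorted(scores, key=scores.get, reverse=True)

        index = {"goal": {"clip1", "clip2"}, "joy": {"clip1"}, "stadium": {"clip3"}}
        print(search("goal joy", index))   # ['clip1', 'clip2'] - clip1 matches both terms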
  • the present disclosure is not limited to this environment.
  • a user may download the built index to the user device 110.
  • relevant clips will be populated for the user to choose from. This can help users express their emotions by showing a representative GIF instead of text in SMS, email, SNS, etc.
  • Referring to FIGs. 2-3, methodologies that can be performed according to various aspects set forth herein are illustrated. While, for purposes of simplicity of explanation, the methodologies are shown as a series of acts, it is to be understood and appreciated that the methodologies are not limited by the order of acts, as some acts can, in accordance with one or more aspects, occur in different orders and/or concurrently with other acts shown and described herein.
  • FIG. 2 illustrates a flow chart of a method 200 for detecting contents expressing emotions from a video according to an embodiment of the present disclosure.
  • the method 200 may start at block 202.
  • the video may be divided into a plurality of clips.
  • each clip may be about 1-5 seconds long.
  • the start and end time of each clip may be determined by at least one of shot detection, sub-shot detection, speech detection, speech recognition and object motion detection. Clear clip boundaries may be obtained by composition of shot detection, sub-shot detection, speech detection, speech recognition and object motion detection.
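  • As a deliberately simplified example of one such detector, shot boundaries can be estimated by comparing colour histograms of consecutive frames with OpenCV, where a sharp drop in correlation suggests a cut. This is an illustrative assumption only; the disclosure composes several detectors rather than relying on this single heuristic:

        import cv2

        def shot_boundaries(video_path: str, threshold: float = 0.6):
            """Return timestamps (seconds) where histogram correlation drops sharply."""
            cap = cv2.VideoCapture(video_path)
            fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
            boundaries, prev_hist, frame_no = [], None, 0
            while True:
                ok, frame = cap.read()
                if not ok:
                    break
                # 8x8x8-bin colour histogram of the current frame
                hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                                    [0, 256, 0, 256, 0, 256])
                cv2.normalize(hist, hist)
                if prev_hist is not None:
                    corr = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
                    if corr < threshold:            # abrupt visual change -> likely cut
                        boundaries.append(frame_no / fps)
                prev_hist, frame_no = hist, frame_no + 1
            cap.release()
            return boundaries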
  • each of the plurality of clips may be analyzed. For example, speech recognition may be performed on each of the plurality of clips so as to know what a person is speaking. Face recognition/identification may be performed on each of the plurality of clips so as to know who the most important person is in the clip, and identify if the person is a celebrity. Face emotion classification and speech emotion classification may be performed on each of the plurality of clips so as to know the person’s emotion. Action detection may be performed on each of the plurality of clips so as to know what the person is doing in the clip. Human gesture classification may be performed on each of the plurality of clips so as to know the person’s gesture. Scene classification may be performed on each of the plurality of clips so as to know what the overall scene looks like.
  • Sound/music classification may be performed on each of the plurality of clips so as to obtain the scene type from an audio.
  • camera motion detection may be performed on each of the plurality of clips so as to know camera motion, such as pan, tilt, zoom etc.
  • Bullet-curtain detection and classification may also be performed on each of the plurality of clips so as to know comment count and category on the clip.
  • features may be extracted from each of the plurality of clips based on analyzing the clip.
  • speech transcripts may be extracted from a clip based on the speech recognition of the clip.
  • Human identity may be extracted from the clip based on the face recognition/identification of the clip.
  • Human emotions may be extracted from the clip based on the face emotion classification and speech emotion classification of the clip.
  • Action or event may be extracted from the clip based on the action detection of the clip.
  • Human gestures may be extracted from the clip based on the human gesture classification of the clip.
  • Scene categories may be extracted from the clip based on the scene classification of the clip.
  • Sound category may be extracted from the clip based on the sound classification of the clip.
  • for each of the plurality of clips, it may be determined whether the clip expresses emotions based on the features associated with the clip, including the features extracted from the clip and from at least one other clip of the plurality of clips.
  • the clip and the at least one other clip may be next to each other.
  • a pre-trained model which was trained with a training dataset may be used to determine whether the clip expresses emotions based on the features associated with the clip.
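  • Since the disclosure does not fix a model family, the sketch below assumes, for illustration, a binary logistic-regression classifier over a concatenation of numeric content and contextual feature vectors; the toy training data, dimensions and threshold are all hypothetical:

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        # Train once, offline, on a labelled dataset of clips (features are assumed
        # to be numeric encodings of transcript sentiment, face-emotion scores, etc.).
        rng = np.random.default_rng(0)
        X_train = rng.normal(size=(200, 12))        # 6 content + 6 contextual dims
        y_train = (X_train[:, 0] + X_train[:, 6] > 0).astype(int)  # toy labels
        model = LogisticRegression().fit(X_train, y_train)

        def clip_expresses_emotions(content_vec, context_vec, threshold=0.5):
            """Apply the pre-trained model to this clip's combined feature vector."""
            x = np.concatenate([content_vec, context_vec]).reshape(1, -1)
            return model.predict_proba(x)[0, 1] >= threshold

        print(clip_expresses_emotions(np.ones(6), np.ones(6)))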
  • if it is determined that none of the plurality of clips expresses emotions, the method 200 may go to block 220 to end.
  • an index containing the one or more clips may be built based on the features associated with the one or more clips. In the present disclosure, when building the index, not only the features extracted from the one or more clips but also the features extracted from the clips next to them are used, which will help retrieval of these contents.
  • Metadata associated with the video may also be used to build the index.
  • the metadata may include title, creator, description, actors, and characters and so on.
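  • Put together, a single index entry might carry the clip's own features, its contextual features, and the video-level metadata. All field names and values below are illustrative assumptions, not a disclosed schema:

        index_entry = {
            "clip": "some_video.mp4#00:41:12-00:41:15",     # hypothetical clip locator
            "content_features": {
                "transcript": "what a save",
                "person_identity": "celebrity_123",
                "person_emotion": "joy",
                "action": "celebrating",
                "scene_category": "stadium",
            },
            "contextual_features": [{"transcript": "goal"}],  # from the neighbouring clip
            "video_metadata": {
                "title": "Some Video",
                "creator": "Some Studio",
                "description": "…",
                "actors": ["Actor A"],
                "characters": ["Character B"],
            },
        }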
  • At diamond 216, it may be determined whether there is at least one image from web pages which is similar to at least one of the one or more clips. Web pages, or a database containing images pre-retrieved from web pages, may be searched for such a similar image based on the features extracted from the at least one clip.
  • If it is determined that there is no such image, the method 200 may go to block 220 to end. If it is determined that there is at least one image similar to at least one of the one or more clips at diamond 216, the method 200 may go to block 218. At block 218, an image index containing the at least one image may be enriched based on the features associated with the at least one clip. Then the method 200 may go to block 220 to end.
  • the method shown in FIG. 2 may be performed on a plurality of videos so as to proactively find clips expressing emotions without human intervention. Then an index containing all clips expressing emotions detected from the plurality of videos may be built based on the features associated with these clips. Moreover, the built index may be enriched based on the image index.
  • FIG. 3 illustrates a flow chart of a method 300 for enriching an image index according to an embodiment of the present disclosure.
  • the method 300 may start at block 302.
  • a video may be divided into a plurality of clips.
  • the start and end time of each clip may be determined by at least one of shot detection, sub-shot detection, speech detection, speech recognition and object motion detection.
  • Clear clip boundaries may be obtained by composition of shot detection, sub-shot detection, speech detection, speech recognition and object motion detection.
  • features may be extracted from each of the plurality of clips.
  • the features extracted from each of the plurality of clips may include at least one of speech transcripts, human identity, human emotions, action or event, human gesture, scene category, sound category, camera motion and comment count and category extracted from the clip.
  • one of the plurality of clips may be selected as a current clip to be processed.
  • the first clip of the plurality of clips may be selected as the current clip.
  • At diamond 310, it may be determined whether there is at least one image from web pages which is similar to the current clip, based on the features extracted from the current clip. If there is no such image, the method 300 may go to diamond 314. If it is determined that there is at least one similar image at diamond 310, the method 300 may go to block 312. At block 312, an image index containing the at least one similar image may be enriched based on the features extracted from the current clip and the features extracted from at least one clip next to the current clip. Moreover, the image index may be further enriched based on metadata associated with the video.
  • At diamond 314, it may be determined whether all of the plurality of clips have been processed. If so, the method 300 may go to block 316 to end. Otherwise, the method 300 may go to block 308, at which one of the unprocessed clips may be selected as the current clip, and diamond 310, block 312 and diamond 314 may be repeated.
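  • The loop of method 300 (blocks 308-316) can be summarized in a compact sketch, reusing the hypothetical find_similar_images helper from the earlier snippet; the data layout is an assumption for illustration:

        def enrich_image_index(clips, image_vectors, image_entries, video_metadata):
            # clips: list of {"vector": ..., "features": ...} records (assumed layout)
            # image_vectors: {img_id: vector} used for the diamond-310 similarity test
            # image_entries: {img_id: dict} image-index entries enriched at block 312
            for i, clip in enumerate(clips):                                  # block 308
                for img_id in find_similar_images(clip["vector"], image_vectors):  # diamond 310
                    entry = image_entries.setdefault(img_id, {"features": []})
                    entry["features"].append(clip["features"])                # block 312
                    if i + 1 < len(clips):                                    # contextual features
                        entry["features"].append(clips[i + 1]["features"])
                    entry["video_metadata"] = video_metadata
            return image_entries                                              # diamond 314 -> block 316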
  • FIG. 4 illustrates an exemplary apparatus 400 for detecting contents expressing emotions from a video.
  • the apparatus 400 may comprise a dividing module 410 configured to divide the video into a plurality of clips.
  • the apparatus 400 may further comprise an extracting module 420 configured to extract, from a first clip and at least one second clip of the plurality of clips, features associated with the first clip.
  • the first clip and the at least one second clip may be next to each other.
  • the apparatus 400 may further comprise a determining module 430 configured to determine whether the first clip expresses emotions based on the features associated with the first clip.
  • the apparatus 400 may further comprise a building module 440 configured to build an index containing the first clip based on the features associated with the first clip if the first clip expresses emotions.
  • the dividing module 410 may be further configured to divide the video into the plurality of clips by at least one of shot detection, sub-shot detection, speech detection, speech recognition and object motion detection.
  • By composition of shot detection, sub-shot detection, speech detection, speech recognition and object motion detection, clear clip boundaries may be obtained.
  • the apparatus 400 may further comprise an analyzing module configured to analyze the first clip and the at least one second clip.
  • the extracting module 420 may be further configured to extract, from the first clip and the at least one second clip, the features associated with the first clip based on the analysis.
  • the analyzing module may be further configured to perform at least one of speech recognition, face recognition, action detection, human gesture classification, face emotion classification, speech emotion classification, scene classification, sound classification, camera motion detection and bullet-curtain detection and classification on the first clip and the at least one second clip.
  • the features associated with the first clip may include at least one of speech transcripts, action or event, person gesture, person identity, person emotions, scene category, sound category, camera motion and comment count and category extracted from the first clip and the at least one second clip.
  • the determining module 430 may be further configured to, determine, by a pre-trained model, whether the first clip expresses emotions based on the features associated with the first clip.
  • the pre-trained model may be a classifier trained with a training dataset for identifying a clip expressing emotions.
  • the building module 440 may be further configured to build the index based on the features associated with the first clip and metadata associated with the video.
  • the metadata may include at least one of title, creator, description, actors, and characters of the video.
  • the apparatus 400 may further comprise an enriching module configured to enrich, based on the features associated with the first clip, an image index containing at least one image from web pages which is similar to the first clip.
  • the at least one image may be found from the image index based on feature similarity.
  • the apparatus 400 may also comprise any other modules configured for implementing functionalities corresponding to any steps of the method for detecting contents expressing emotions according to the present disclosure.
  • FIG. 5 illustrates an exemplary apparatus 500 for enriching an image index.
  • the apparatus 500 may comprise a dividing module 510 configured to divide a video into a plurality of clips.
  • the apparatus 500 may further comprise an extracting module 520 configured to extract features from a first clip of the plurality of clips.
  • the apparatus 500 may further comprise a determining module 530 configured to determine whether there is at least one image from web pages that is similar to the first clip based on the features extracted from the first clip.
  • the apparatus 500 may further comprise an enriching module 540 configured to, if there is the at least one image that is similar to the first clip, enrich an image index containing the at least one image based on the features extracted from the first clip and the features extracted from at least one second clip of the plurality of clips.
  • the at least one second clip may be next to the first clip.
  • the apparatus 500 may further comprise an analyzing module configured to analyze the first clip.
  • the extracting module 520 may be further configured to extract the features from the first clip based on the analysis.
  • the analyzing module may be configured to perform at least one of speech recognition, face recognition, face emotion classification, speech emotion classification, action detection, human gesture detection, scene classification, sound (e.g., music) classification, camera motion detection and bullet-curtain detection and classification on the first clip.
  • the extracting module 520 may be further configured to extract at least one of speech transcripts, action or event, person gesture, person emotions, person identity, scene category, sound category, camera motion, comment count and category and other features from the first clip.
  • the apparatus 500 may further comprise an obtaining module configured to obtain metadata associated with the video.
  • the metadata may include at least one of title, creator, description, actors, and characters of the video.
  • the enriching module 540 may be further configured to enrich the image index containing the at least one similar image based on the metadata.
  • FIG. 6 illustrates an exemplary system 600 for detecting contents expressing emotions.
  • the system 600 may comprise one or more processors 610.
  • the system 600 may further comprise a memory 620 that is coupled with the one or more processors.
  • the memory 620 may store computer-executable instructions that, when executed, cause the one or more processors to perform any steps of the method for detecting contents expressing emotions according to the present disclosure.
  • the system 600 shown in FIG. 6 may also be used for enriching an image index according to the present disclosure.
  • the memory 620 may store computer-executable instructions that, when executed, cause the one or more processors to perform any steps of the method for enriching an image index according to the present disclosure.
  • the solution of the present disclosure may be embodied in a non-transitory computer-readable medium.
  • the non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any steps of the method for detecting contents expressing emotions or the method for enriching an image index according to the present disclosure.
  • processors have been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend upon the particular application and overall design constraints imposed on the system.
  • a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with a microprocessor, microcontroller, digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described throughout the present disclosure.
  • the functionality of a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with software being executed by such processing components.
  • a computer-readable medium may include, by way of example, memory such as a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk, a smart card, a flash memory device, random access memory (RAM), read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a register, or a removable disk.
  • a method for detecting contents expressing emotions from a video may comprise: dividing the video into a plurality of clips; extracting, from a first clip and at least one second clip of the plurality of clips, features associated with the first clip; determining whether the first clip expresses emotions based on the features associated with the first clip; and building an index containing the first clip based on the features associated with the first clip if the first clip expresses emotions.
  • the at least one second clip may be next to the first clip.
  • the dividing may comprise: dividing the video into the plurality of clips by at least one of shot detection, sub-shot detection, speech detection, speech recognition and object motion detection.
  • the extracting may comprise: analyzing the first clip and the at least one second clip; and extracting, from the first clip and the at least one second clip, the features associated with the first clip based on the analyzing.
  • the analyzing may comprise: performing at least one of speech recognition, face recognition, action detection, human gesture classification, face emotion classification, speech emotion classification, scene classification, sound classification, camera motion detection and bullet-curtain detection and classification on the first clip and the at least one second clip.
  • the features associated with the first clip may include at least one of speech transcripts, action or event, person gesture, person identity, person emotions, scene category, sound category, camera motion and comment count and category extracted from the first clip and the at least one second clip.
  • the determining may comprise: determining, by a pre-trained model, whether the first clip expresses emotions based on the features associated with the first clip.
  • the building may be further based on metadata associated with the video.
  • the metadata may include at least one of title, creator, description, actors, and characters of the video.
  • the method may further comprise enriching, based on the features associated with the first clip, an image index containing at least one image from web pages which is similar to the first clip.
  • the method may further comprise: searching web pages or a database containing images pre-retrieved from web pages for the at least one image based on the features extracted from the first clip.
  • an apparatus for detecting contents expressing emotions from a video may comprise: a dividing module configured to divide the video into a plurality of clips; an extracting module configured to extract, from a first clip and at least one second clip of the plurality of clips, features associated with the first clip; a determining module configured to determine whether the first clip expresses emotions based on the features associated with the first clip; and a building module configured to build an index containing the first clip based on the features associated with the first clip if the first clip expresses emotions.
  • the dividing module may be further configured to: divide the video into the plurality of clips by at least one of shot detection, sub-shot detection, speech detection, speech recognition and object motion detection.
  • the apparatus may further comprise an analyzing module configured to analyze the first clip and the at least one second clip.
  • the extracting module may be further configured to extract, from the first clip and the at least one second clip, the features associated with the first clip based on the analysis.
  • the analyzing module may be further configured to perform at least one of speech recognition, face recognition, action detection, human gesture classification, face emotion classification, speech emotion classification, scene classification, sound classification, camera motion detection and bullet-curtain detection and classification on the first clip and the at least one second clip.
  • the features associated with the first clip may include at least one of speech transcripts, action or event, person gesture, person identity, person emotions, scene category, sound category, camera motion and comment count and category extracted from the first clip and the at least one second clip.
  • the determining module may be further configured to determine, by a pre-trained model, whether the first clip expresses emotions based on the features associated with the first clip.
  • the building module may be further configured to build the index based on the features associated with the first clip and metadata associated with the video.
  • the metadata may include at least one of title, creator, description, actors, and characters of the video.
  • the apparatus may further comprise an enriching module configured to enrich, based on the features associated with the first clip, an image index containing at least one image from web pages which is similar to the first clip.
  • a method for enriching an image index may comprise: dividing a video into a plurality of clips; extracting features from a first clip of the plurality of clips; determining whether there is at least one image from web pages that is similar to the first clip based on the features extracted from the first clip; and if there is the at least one image, enriching an image index containing the at least one image based on the features extracted from the first clip and the features extracted from at least one second clip of the plurality of clips.
  • the first clip and the at least one second clip may be next to each other.
  • the method may further comprise analyzing the first clip.
  • the extracting may comprise extracting features from the first clip based on the analysis.
  • the analyzing may comprise performing at least one of speech recognition, face recognition, action detection, human gesture classification, face emotion classification, speech emotion classification, scene classification, sound classification, camera motion detection and bullet-curtain detection and classification on the first clip.
  • the features extracted from the first clip may include at least one of speech transcripts, action or event, person gesture, person identity, person emotions, scene category, sound category, camera motion and comment count and category extracted from the first clip.
  • the method may further comprise obtaining metadata associated with the video.
  • the enriching may comprise enriching the image index based on the metadata.
  • the metadata may include at least one of title, creator, description, actors, and characters of the video.
  • an apparatus for enriching an image index may comprise: a dividing module configured to divide a video into a plurality of clips; an extracting module configured to extract features from a first clip of the plurality of clips; a determining module configured to determine whether there is at least one image from web pages that is similar to the first clip based on the features extracted from the first clip; and an enriching module configured to, if there is the at least one image, enrich an image index containing the at least one image based on the features extracted from the first clip and the features extracted from at least one second clip of the plurality of clips.
  • the first clip and the at least one second clip may be next to each other.
  • the apparatus may further comprise an analyzing module configured to analyze the first clip.
  • the extracting module may be further configured to extract features from the first clip based on the analysis.
  • the analyzing module may be configured to perform at least one of speech recognition, face recognition, action detection, human gesture classification, face emotion classification, speech emotion classification, scene classification, sound classification, camera motion detection and bullet-curtain detection and classification on the first clip.
  • the features extracted from the first clip may include at least one of speech transcripts, action or event, person gesture, person identity, person emotions, scene category, sound category, camera motion and comment count and category extracted from the first clip.
  • the apparatus may further comprise an obtaining module configured to obtain metadata associated with the video.
  • the enriching module may be further configured to enrich the image index based on the metadata.
  • the metadata may include at least one of title, creator, description, actors, and characters of the video.
  • a system may comprise: one or more processors; and a memory, storing computer-executable instructions that, when executed, cause the one or more processors to perform the method according to the present disclosure.


Abstract

The present disclosure provides a method, an apparatus and a system for detecting contents expressing emotions from a video. The method may comprise the following steps: dividing the video into a plurality of clips; extracting, from a first clip and at least one second clip of the plurality of clips, features associated with the first clip; determining whether the first clip expresses emotions based on the features associated with the first clip; and building an index containing the first clip based on the features associated with the first clip if the first clip expresses emotions.
PCT/CN2016/107456 2016-11-28 2016-11-28 Automatically detecting contents expressing emotions from a video and enriching an image index WO2018094723A1 (fr)

Priority Applications (4)

Application Number Priority Date Filing Date Title
US16/342,496 US11328159B2 (en) 2016-11-28 2016-10-28 Automatically detecting contents expressing emotions from a video and enriching an image index
EP16922433.4A EP3545428A4 (fr) 2016-11-28 2016-11-28 Automatically detecting contents expressing emotions from a video and enriching an image index
CN201680082647.0A CN108701144B (zh) 2016-11-28 2016-11-28 Automatically detecting contents expressing emotions from a video and enriching an image index
PCT/CN2016/107456 WO2018094723A1 (fr) 2016-11-28 2016-11-28 Automatically detecting contents expressing emotions from a video and enriching an image index

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/107456 WO2018094723A1 (fr) 2016-11-28 2016-11-28 Automatically detecting contents expressing emotions from a video and enriching an image index

Publications (1)

Publication Number Publication Date
WO2018094723A1 (fr) 2018-05-31

Family

ID=62194597

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/107456 WO2018094723A1 (fr) 2016-11-28 2016-11-28 Automatically detecting contents expressing emotions from a video and enriching an image index

Country Status (4)

Country Link
US (1) US11328159B2 (fr)
EP (1) EP3545428A4 (fr)
CN (1) CN108701144B (fr)
WO (1) WO2018094723A1 (fr)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10909400B2 (en) * 2008-07-21 2021-02-02 Facefirst, Inc. Managed notification system
KR102660124B1 (ko) * 2018-03-08 2024-04-23 Electronics and Telecommunications Research Institute Method for generating data for video emotion learning, method for determining video emotion, and video emotion determination apparatus using the same
CN110267113B (zh) * 2019-06-14 2021-10-15 Beijing ByteDance Network Technology Co., Ltd. Video file processing method, system, medium and electronic device
CN110675858A (zh) * 2019-08-29 2020-01-10 Ping An Technology (Shenzhen) Co., Ltd. Terminal control method and apparatus based on emotion recognition
CN111163366B (zh) * 2019-12-30 2022-01-18 Xiamen Meiya Pico Information Co., Ltd. Video processing method and terminal
US11386712B2 (en) * 2019-12-31 2022-07-12 Wipro Limited Method and system for multimodal analysis based emotion recognition
EP3895036A1 (fr) * 2020-02-21 2021-10-20 Google LLC Systems and methods for extracting temporal information from animated media content items using machine learning
CN111541910B (zh) * 2020-04-21 2021-04-20 Huazhong University of Science and Technology Deep learning-based method and system for automatically generating video bullet-screen comments
CN113849667B (zh) * 2021-11-29 2022-03-29 Beijing Minglue Zhaohui Technology Co., Ltd. Public opinion monitoring method and apparatus, electronic device and storage medium


Family Cites Families (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3780623B2 (ja) 1997-05-16 2006-05-31 Hitachi, Ltd. Method for describing moving images
KR20040018395A (ko) 1999-01-29 2004-03-03 Mitsubishi Denki Kabushiki Kaisha Video feature encoding method and video feature decoding method
US6585521B1 (en) * 2001-12-21 2003-07-01 Hewlett-Packard Development Company, L.P. Video indexing based on viewers' behavior and emotion feedback
US6928407B2 (en) * 2002-03-29 2005-08-09 International Business Machines Corporation System and method for the automatic discovery of salient segments in speech transcripts
JP2004112379A (ja) 2002-09-19 2004-04-08 Fuji Xerox Co Ltd Image retrieval system
US20050071746A1 (en) * 2003-09-25 2005-03-31 Hart Peter E. Networked printer with hardware and software interfaces for peripheral devices
US7917514B2 (en) 2006-06-28 2011-03-29 Microsoft Corporation Visual and multi-dimensional search
KR100889936B1 (ko) 2007-06-18 2009-03-20 Electronics and Telecommunications Research Institute Digital video feature point comparison method and digital video management system using the same
US20090094113A1 (en) 2007-09-07 2009-04-09 Digitalsmiths Corporation Systems and Methods For Using Video Metadata to Associate Advertisements Therewith
US10108620B2 (en) 2010-04-29 2018-10-23 Google Llc Associating still images and videos
CN102117313A (zh) 2010-12-29 2011-07-06 TVMining (Beijing) Media Technology Co., Ltd. Video retrieval method and system
CN102831891B (zh) 2011-06-13 2014-11-05 Fujitsu Limited Voice data processing method and system
US20140244252A1 (en) * 2011-06-20 2014-08-28 Koemei Sa Method for preparing a transcript of a conversion
US9098533B2 (en) 2011-10-03 2015-08-04 Microsoft Technology Licensing, Llc Voice directed context sensitive visual search
US20140250110A1 (en) 2011-11-25 2014-09-04 Linjun Yang Image attractiveness based indexing and searching
US10372758B2 (en) 2011-12-22 2019-08-06 Tivo Solutions Inc. User interface for viewing targeted segments of multimedia content based on time-based metadata search criteria
US9110501B2 (en) 2012-04-17 2015-08-18 Samsung Electronics Co., Ltd. Method and apparatus for detecting talking segments in a video sequence using visual cues
KR101317047B1 (ko) 2012-07-23 2013-10-11 Chungnam National University Industry-Academic Cooperation Foundation Emotion recognition apparatus using facial expression and control method thereof
US9405962B2 (en) 2012-08-14 2016-08-02 Samsung Electronics Co., Ltd. Method for on-the-fly learning of facial artifacts for facial emotion recognition
US9124856B2 (en) 2012-08-31 2015-09-01 Disney Enterprises, Inc. Method and system for video event detection for contextual annotation and synchronization
US9247225B2 (en) 2012-09-25 2016-01-26 Intel Corporation Video indexing with viewer reaction estimation and visual cue detection
US9639871B2 (en) * 2013-03-14 2017-05-02 Apperture Investments, Llc Methods and apparatuses for assigning moods to content and searching for moods to select content
IL225480A (en) * 2013-03-24 2015-04-30 Igal Nir A method and system for automatically adding captions to broadcast media content
WO2015038749A1 (fr) 2013-09-13 2015-03-19 Arris Enterprises, Inc. Segmentation de contenu vidéo basée sur un contenu
TWI603213B (zh) 2014-01-23 2017-10-21 National Chiao Tung University Music selection method, music selection system and electronic device based on facial recognition
US10191920B1 (en) * 2015-08-24 2019-01-29 Google Llc Graphical image retrieval based on emotional state of a user of a computing device
US20170371496A1 (en) * 2016-06-22 2017-12-28 Fuji Xerox Co., Ltd. Rapidly skimmable presentations of web meeting recordings
US10043062B2 (en) * 2016-07-13 2018-08-07 International Business Machines Corporation Generating auxiliary information for a media presentation
US10282599B2 (en) * 2016-07-20 2019-05-07 International Business Machines Corporation Video sentiment analysis tool for video messaging
WO2018049254A1 (fr) * 2016-09-09 2018-03-15 Cayke, Inc. Système et procédé de création, d'analyse et de catégorisation de média
US10546011B1 (en) * 2016-09-23 2020-01-28 Amazon Technologies, Inc. Time code to byte indexer for partial object retrieval

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1374097A1 (fr) 2001-03-29 2004-01-02 BRITISH TELECOMMUNICATIONS public limited company Image processing
CN101021904A (zh) * 2006-10-11 2007-08-22 Bao Dongshan Video content analysis system
US20090080853A1 (en) 2007-09-24 2009-03-26 Fuji Xerox Co., Ltd. System and method for video summarization
WO2011148149A1 (fr) * 2010-05-28 2011-12-01 British Broadcasting Corporation Processing audio-video data to produce metadata
CN102222227A (zh) * 2011-04-25 2011-10-19 China Hualu Group Co., Ltd. System for recognizing and extracting film images based on video identification
CN103200463A (zh) * 2013-03-27 2013-07-10 TVMining (Beijing) Media Technology Co., Ltd. Video summary generation method and apparatus
US20150110471A1 (en) * 2013-10-22 2015-04-23 Google Inc. Capturing Media Content in Accordance with a Viewer Expression
CN103761284A (zh) * 2014-01-13 2014-04-30 China Agricultural University Video retrieval method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3545428A4

Also Published As

Publication number Publication date
EP3545428A1 (fr) 2019-10-02
US11328159B2 (en) 2022-05-10
US20190266406A1 (en) 2019-08-29
CN108701144B (zh) 2023-04-28
CN108701144A (zh) 2018-10-23
EP3545428A4 (fr) 2020-04-29


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16922433

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2016922433

Country of ref document: EP