US20180160200A1

US20180160200A1 - Methods and systems for identifying, incorporating, streamlining viewer intent when consuming media

Info

Publication number: US20180160200A1
Application number: US15/786,077
Authority: US
Inventors: Vaibhav Goel; Sharath Manjunath; Vidhya T. V; Vinay T. S
Original assignee: Streamingo Solutions Private Ltd
Current assignee: Streamingo Solutions Private Ltd
Priority date: 2016-12-03
Filing date: 2017-10-17
Publication date: 2018-06-07
Also published as: US20190294668A1; US10911840B2

Abstract

The embodiments herein provide systems and methods for identifying viewer intent for a multimedia of an electronic device, wherein a method includes generating a textual summary from at least one multimedia. Further, the method includes determining and displaying at least one of one or more keywords and one or more key phrase associated with the textual summary. Further, the method includes generating one or more paragraphs from the extracted textual summary to generate one or more chapters based on the at least one of one or more keywords and one or more key phrases appeared in a time stamp associated with the textual summary. Further, the method includes generating one or more index tables for the generated one or more chapters to enable a user to search inside the multimedia.

Description

CROSS REFERENCE TO RELATED APPLICATION

This application is based on and derives the benefit of Indian Provisional Application 201641041399 filed on Dec. 3, 2016, the contents of which are incorporated herein by reference.

TECHNICAL FIELD

The embodiments herein relate to electronic devices and, more particularly to identify viewer's intent vis-à-vis the multimedia content, and both facilitate productive viewing/listening of the content and incorporating said intent.

BACKGROUND

Generally, videos are being used to share information and messages on the Internet. It is now easy to make videos and capture your thought processes. It is possible to reach large audiences using video as the medium of communication. This large proliferation has resulted in there being too many videos with duplicate messages on the Internet today. There is no intelligent system, which helps to choose the right video to view and, and the one that can be understood most effectively. Additionally, there are few other unsolved problems today in video space such as: selecting a video, which caters to the users viewing context (what is my situation) and viewing intent (what I want to achieve). The viewer's short attention span means that there should be a way to help the viewer to quickly get a gist of important concepts described in the video. The other challenge is that there is no way to validate whether the viewer understood what is described inside the video. Another relates to the fact that the viewing context is typically limited to the title of the video. There is currently no way to provide the viewer with additional relevant content that would help to broaden the viewer perspective on the main topics discussed in the video.
Too much online video content, has made accessing information a stressful effort. To establish a relevant context to watch a video is a time-consuming effort, which often leads to bubbling of unnecessary information and waste of time and effort. For example, there are several different online videos available on any given topic, but it is very difficult to analyze and understand using the visual and spoken context of the video. In this situation, the context can only be established after watching the video and there are very high chances that this context is different from what is desired by the viewer. Not surprising that content providers find a gap between the expected and actual viewership of their online video content.
For example, if a viewer wants to watch a video, the viewer has to identify a right video to watch. Let's assume that the viewer wants to watch a video, which is discussing about how to calculate Net Present Value (NPV) when cash flows are uneven. Thus, how to calculate Net Present Value (NPV) when cash flows are uneven acts as a viewing intent and this process is called “setting the viewing intent”. Furthermore, based on this viewing intent, the viewer will go through the process of searching videos online. This search results in the viewer being presented with several different videos titled with Finance and Accounting, Business Finance, Net Present Value or the like Now, the viewer can select that video to watch which has titles closely matching with the set viewing intent. It is likely that the viewer may start watching the video based on the matching title of Net Present Value. However, the viewer may not be sure whether the selected video does contain the relevant content, which satisfies the viewer intent. If the video does not contain the relevant information, the viewer might continue to watch the selected filtered videos until an exact match for the desired viewing intent is found. The entire process is time consuming and tedious. Sometimes, the viewer can use the manually generated tags associated with the videos to choose the video based on the desired viewing intent. These tags are usually inappropriate and or misleading to the viewer.
Existing solutions disclose methods to cater to the viewer's context by providing index of video content and providing transcripts of the speech spoken inside the video, but these methods are manual, tedious and are too limited to meet the scale of the online video content. They are clearly limited because of the technical barriers posed by the expectations of such a method which should automatically create the viewer's context when watching a video and should meet the scale of time and be significantly less than the effort of online video content creation.

BRIEF DESCRIPTION OF THE FIGURES

The embodiments disclosed herein will be better understood from the following detailed description with reference to the drawings, in which:

FIG. 1 is a block diagram illustrating various units of an electronic device for identifying viewer intent related to a multimedia, according to embodiments as disclosed herein;

FIG. 2 is a flow diagram illustrating a method for identifying viewer intent related to a multimedia, according to embodiments as disclosed herein;

FIG. 3 is a block diagram illustrating a method of generating contextual data elements, according to embodiments as disclosed herein;

FIG. 4 is an example diagram illustrating a method for extracting content from a video, according to embodiments as disclosed herein;

FIG. 5 is an example diagram illustrating a method for establishing a video context, according to embodiments as disclosed herein;

FIG. 6 is an example block diagram illustrating a text summarizer, according to embodiments as disclosed herein; and

FIG. 7 is an example diagram illustrating a mechanism to search inside a video, according to embodiments as disclosed herein.

DETAILED DESCRIPTION OF EMBODIMENTS

The embodiments herein and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments herein. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein may be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.
The embodiments herein provide systems and methods identifying viewer intent in a multimedia, wherein a method includes generating a textual summary from at least one multimedia. Further, the method includes determining and displaying at least one of one or more keywords and one or more key phrase associated with the textual summary. Further, the method includes generating one or more paragraphs from the extracted textual summary to generate one or more chapters based on the at least one of one or more keywords and one or more key phrases appeared in a time stamp associated with the textual summary. Further, the method includes generating one or more index tables for the generated one or more chapters to enable a user to search inside the multimedia.
FIG. 1 is a block diagram illustrating various units of an electronic device 100 for identifying viewer intent a multimedia, according to embodiments as disclosed herein.
In an embodiment, the electronic device 100 can be at least one of but not restricted to a mobile phone, a smartphone, tablet, a phablet, a personal digital assistant (PDA), a laptop, a computer, a wearable computing device, a smart TV, wearable device (for example, smart watch, smart band), or any other electronic device which has the capability of playing a multimedia content or accessing an application (such as a browser) which can access and display multimedia content. The electronic device includes a text summarizer 102, a keyword and key phase extractor 104, a paragraph generator 106, an index table generator 108, a communication interface unit 110 and a memory 112.
The text summarizer 102 can be configured to generate a textual summary from at least one multimedia content. This content can be for example, in the form of video, audio and video, textual content present in video format, animations with audio, text or the like. The keyword and key phase extractor 104 can be configured to determine and display at least one of one or more keyword and one or more key phrase associated with the textual summary. The paragraph generator 106 can be configured to generate one or more paragraphs from the extracted textual summary to generate one or more chapters based on at least one of one or more keywords and one or more key phrases appeared in a time stamp associated with the textual summary. The index table generator 108 can be configured to generate one or more index tables for the generated one or more chapters to enable a viewer to search inside the multimedia, while also used for creating titles of the said paragraphs and to estimate the duration of the chapters/paragraphs. Further, the method includes receiving an input from the viewer (wherein the input can comprise of at least one of keyword and key phrase) to play the multimedia associated with the at least one of keyword and key phrase. Further, the method includes receiving an input from the viewer on the generated at least one table of contents to play an interested portion of the multimedia. Further, the method includes receiving the viewer search intent as an input to identify a matching content in the multimedia corresponding to the viewer search intent. The viewer search intent can be an audio input, video/image input or textual input. For example, the viewer search intent can be at least one of a keyword, key phrase, sentence or the like.
The communication interface unit 110 can be configured to establish communication between the electronic device 100 and a network.
The memory 112 can be configured to store multimedia, textual summary, keywords and key phrases, paragraphs and chapters generated from the respective multimedia. The memory 112 may include one or more computer-readable storage media. The memory 108 may include non-volatile storage elements. Examples of such non-volatile storage elements may include magnetic hard discs, optical discs, floppy discs, flash memories, or forms of electrically programmable memories (EPROM) or electrically erasable and programmable (EEPROM) memories. In addition, the memory 112 may, in some examples, be considered a non-transitory storage medium. The term “non-transitory” may indicate that the storage medium is not embodied in a carrier wave or a propagated signal. However, the term “non-transitory” should not be interpreted to mean that the memory 112 is non-movable. In some examples, the memory 112 can be configured to store larger amounts of information than the memory. In certain examples, a non-transitory storage medium may store data that can, over time, change (e.g., in Random Access Memory (RAM) or cache).
FIG. 1 shows exemplary units of the electronic device 100, but it is to be understood that other embodiments are not limited thereon. In other embodiments, the electronic device 100 may include less or more number of units. Further, the labels or names of the units are used only for illustrative purpose and does not limit the scope of the embodiments herein. One or more units can be combined together to perform same or substantially similar function in the electronic device 100.
The embodiments herein provide an electronic device 100 configured to uses a combination of image processing, speech recognition, natural language processing, machine learning and neural networks to determine what is inside the multimedia (for example, video) by generating a summary of the video using a combination of video frame analysis and text summarization, wherein the text summarization includes extractive and abstractive summaries. Further, the electronic device 100 can be configured to use recurrent neural network techniques and domain classification on the text summarization to extract relevant keywords, keyphrases and table of contents. Further, the electronic device 100 can be configured to generate pointers to additional information in the form of text and visual content using a combination of the generate relevant key-words, key-phrases and table of contents and most recently captured internet browsing preferences of the viewer.
The embodiments herein extend the video viewing beyond the title of the video. This provides the viewer a method/way to know what is inside the video prior to watching the video. Reading the summary of the video content enables viewer to estimate how likely will the viewer's intent of viewing be covered by the video. The embodiments herein also enable the viewer to search the viewer intent of “information gathering” inside the video. Thus, the embodiments herein extended beyond the title of the video which helps the viewer to build and match the relevant intent of information gathering, resulting in a more enhanced and engaged experience. This can be implemented in various domains like online-education, video surveillance, online advertising, retail, product introduction, skill development programs, Judiciary, Banking, Broadcasting industry or the like.
The embodiments herein disclose methods and systems for identifying viewer intent in a multimedia of an electronic device. Referring now to the drawings, and more particularly to FIGS. 1 through 8, where similar reference characters denote corresponding features consistently throughout the figures, there are shown embodiments.
FIG. 2 is a flow diagram illustrating a method for identifying viewer intent in a multimedia, according to embodiments as disclosed herein.
At step 202, the method includes generating a textual summary from at least one multimedia content. The method allows the text summarizer 102 to generate the textual summary from at least one multimedia content.
At step 204, the method includes determining and displaying at least one of at least one keyword and at least one key phrase associated with the textual summary. The method allows the keyword and key phase extractor 104 to determining and displaying at least one of at least one keyword and at least one key phrase associated with the textual summary.
At step 206, the method includes generating at least one paragraph from the extracted textual summary to generate at least one chapter based on the at least one of at least one keyword and at least one key phrase appeared in a time stamp associated with the textual summary. The method allows the paragraph generator 106 to generate at least one paragraph from the extracted textual summary to generate at least one chapter based on the at least one of at least one keyword and at least one key phrase appeared in a time stamp associated with the textual summary.
At step 208, the method includes generating at least one index table for the generated at least one chapter to enable a user to search inside the multimedia. The method allows the Index table generator 108 to generate at least one index table for the generated at least one chapter to enable a user to search inside the multimedia
The various actions, acts, blocks, steps, or the like in the method and the flow diagram 200 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some of the actions, acts, blocks, steps, or the like may be omitted, added, modified, skipped, or the like without departing from the scope of the invention.
The embodiments herein can establish a contextual relation between content of the multimedia and a learning intent of the viewer/user. The contextual relation can be established with the help of textual summary, keyword(s) and key phrase(s), one or more paragraphs generated from the extracted textual summary, one or more chapters generated based on the keyword(s) and at key phrase(s) appeared in a time stamp associated with the textual summary and the generated index table(s).
FIG. 3 is a diagram illustrating a method of generating contextual data elements, according to embodiments as disclosed herein.
The multimedia may have audio and/or video, audio/speech content and visual content are extracted. Further, frames of visual content are extracted and classified for further processing of images comprising the frames and for identifying the context. The extracted/separated audio/speech content is analyzed and converted to a suitable textual summary/transcript through an automatic speech recognition (ASR) engine. Further necessary and pertinent post-processing is performed on the extracted transcript for cleaning up of errors and other corrections. This transcript is appropriately punctuated automatically and processed for identifying keywords which represent the context of the matter presented/spoken in the audio content. These elements, viz., the classification of the video and its frames, the identified keywords, the textual summary, gathering and collating information relevant to the content being processed with the help of the keywords, along with the interface to be able to search the video for the generated keywords and present the search results in the context of the video, comprise the overall information incorporated as a viewing/visual element in the context of the original multimedia.
The contextual data elements are information entities generated from the multimedia. The data elements represent the context of the multimedia and enable a viewer to match and validate intent of viewing with the content of that video. The contextual data elements help the viewer to identify the viewing intent beyond title of the video. The embodiments herein enables the electronic device 100 to generate the following contextual data elements:
Video Summary:
Video summary is an automatic summarization of video's audio and/or visual content into textual summary, which represents the most relevant context of the entire video in a condensed form. Video summary is quick to read and view. The video summary helps the viewer to get accurate information of what has been spoken or visualized inside the video. A viewer can use the video summary to validate the viewer intent of viewing the video. If there is a match between content and viewer's intent, the viewer can then continue to watch the video or else the viewer can decide to select some other video. This data element, allows the viewer to spend his time much more productively, while searching for relevant content to view.
Search Inside the Video:
The embodiments herein provides contextual data element. The contextual data elements are a unique enablement of viewer's ability to search inside the video and find a more relevant match to his or her intent of watching the video. This enables the user to find out a few things (what is inside the video, where in the video the content related to the viewer's intent occur. This answers the important question of whether this video has the content that matches the viewer intent.
Retrieving Other Relevant Information:
Information retrieval data elements are used to enhance the effectiveness of information gathering for the viewer by seamlessly enabling access to other related information. Other related information is assembled using a combination of Natural Language Processing (NLP) and Machine Learning (ML) techniques. These techniques are applied to a combination of inputs, which are in the context of the video and the analysis of recently captured online behavior traits of a viewer. Technical methods used to generate this data element, present additional information in the form of relevant text, and audio and video sources. These data element's presentation in the video, significantly adds to the viewer's intent to look for related information and learn more thereby broadening the task of information gathering.
Contextual Keywords:
The contextual data elements help to automate the process of creation of right descriptions and attributes. Additionally, contextual Keywords data elements help to classify the context of the information being searched for accurately. Using these keywords, the viewer can establish a match between his or her viewing intent and the content of the video. The contextual keywords data element extraction uses a combination of Natural Language Processing; Machine learning and Neural Networks techniques to establish the relevance of the keywords for a given context of the video. An accurate representation of the context of the video with relevant keywords, greatly improves search ability of the video itself.
Automatic Generation of Table of Contents:
A table of video content helps the viewer to navigate through the video in a structured manner. Embodiments herein disclose methods to generate the index automatically using the content of the video, which is synced with occurrence of the content on the time scale of the video.
FIG. 4 is an example diagram illustrating a method for extracting content from a video, according to embodiments as disclosed herein.
The first step of extracting video content involves crawling the online video and related information. The end result of the crawling can include a multi domain data corpus. The data corpus will contain documents, metadata, tagged audio and video content classified into different domains.
The content extraction from the video includes data extraction, noise elimination and data classification, wherein the data extraction includes extracting content from the selected online video. The content extraction engine extracts the content from the video, wherein the extracted content includes video frames, audio streams and video metadata. The content extraction also includes converting the extracted data into appropriate formats. For instance, audio speech data is converted into single channel, 16 bits and 16 kHz bandwidth format.
Further, the content extraction includes noise elimination. Once the extracted content is formatted to the appropriate format, the noise elimination unit filters out irrelevant data or the data, which cannot be utilized for further processing. For instance, identifying audio speech with multiple high and low frequencies compressed at very low bit rates will constitute an input to the elimination phase.
Further, the content extraction includes data classification, which comprises of filtering the irrelevant data from the extracted content; the data classification unit can be configured to classify the extracted data into different categories. For instance, the data classification unit extracts the video frames and groups them into different categories based on the images, changes in scenes and identification of text or the like. Further, the audio stream is categorized based on the spoken accent, dialects and gender. An appropriate categorization is done to metadata of the video as well. These categories together highlight features, which are uniquely tagged to help generate a useful context for the video during data processing.
FIG. 5 is an example diagram illustrating a method for establishing a video context, according to embodiments as disclosed herein.
The video context can be established using a matrix of machine learned contextual data models. The machine learnt contextual data models can be configured to develop new intelligent data from the extracted video content. As depicted in FIG. 5, establishing the video context results in building a matrix of data models and generating necessary inputs for the sequence of steps to generate new data that can be subsequently used to build the context of the video.
The matrix of data models can be built using inputs like: uniquely tagged audio, video and metadata content extracted from the online video using video content extraction and using separately built machine learnt multi domain data corpus. The multi domain data corpus represents different fields of information categorized into different domains. The domain categorization helps to impose relevant classification of context to the selected video.
The matrix represents the arrangement of different types of machine learnt data models carrying the context of the video, which can be used by different methods in their independent capacities to produce outputs in the form of a new data. This form of data captures the relevant context of the video and can be used to develop data/content elements, which can be used by the viewer to validate the viewing intent.
Speech to text decoder comprises of methods, which uniquely automate the process of selecting desired data models from the matrix such as speech acoustic models, domain language models, lexicon, and so on. Further, the selected data models can be configured to decode the audio content of the video into textual summary.
Further, the textual summary can be fed into a transcript generator to develop the textual summary into a structured text using different data models from the matrix. The structured text represents a clean, punctuated data, which is paraphrased using domain centric data models. This method also uses these data models to identify new contextual titles and subtitles from the structured text, which gets developed into index table for the contents of the video.
The structured text is taken as an input by a keyword generator, which uses a different contextual data model from the matrix and identifies keywords, which represent the context of the video. These keywords are used further by several other methods as a technique to preserve the video context and generate new data elements, which helps to extend the context beyond the title of the video.
FIG. 6 is an example block diagram illustrating a text summarizer, according to embodiments as disclosed herein.
The text summarizer 102 takes long structured text and video frames as inputs and generates information in a more condensed form, which provides a gist/summary of the video content to the viewer. Textual summary generated is presented both in text and video forms. This method uses both extractive and abstractive techniques to build the textual summary. While the extractive technique uses the text phrases from the original text, the abstractive technique generates new sentences with the help of data models. Sentences generated from both extractive and abstractive techniques receive scores using the scale of domain of the video content and then the ranking engine selects a set of sentences, which are presented as a summary data. Further the textual summary output is fed into a video frames handler, which aligns video frames with the summary text. Video summary re-ranker considers these aligned video frames and other video frames as inputs, which are classified and tagged using domain intelligence. With these inputs, video re-ranker generates a new video summary.
FIG. 7 is an example block diagram illustrating a mechanism to search inside a video, according to embodiments as disclosed herein.
The embodiments herein enable the viewer to find relevant sections inside the video. The electronic device 100 can be configured to receive a query phrase as an input in the form of text. The text query phrase is fed into a video frame parser and a text parser. The video frame parser determines relevant frames from the video, which matches with the description of the query phrase. Similarly, the text parser determines relevant sections inside the videos based on the textual summary. Outputs from both the parsers are fed into intelligent response builder. The intelligent response builder can be configured to use domain intelligence as one of the criterion to rank video frames and texts to identify relevant sections of the video. Finally these sections of the video as presented as a final response to the query from the viewer.
The embodiments herein provides a mechanism to summarize the multimedia, wherein the multimedia can includes at least one of but not limited to text, audio or video. The generated multimedia summarization enables the user to identify and categories the multimedia into different categories for example, mass media (for example, news or the like), sports content, film content, print matter, educational videos or the like. The embodiments herein also provide video thumbnails and provide a mechanism to perform a concept based search (like keyword search).
The embodiments herein help in generating viewership patterns and associated analyses for internal communications. The embodiments herein help in corporate trainings for absorbing training material and passing assessments.
The embodiments herein facilitates searching within and across all media, generates relationships and links between different categories—cross-search using keywords and key-phrases. The embodiments herein also help in collation and consolidation of material for easy consumption.
The embodiments herein help in law enforcement by generating a surveillance summary. Further, the embodiments herein aggregate data using search terms and relationships.
The embodiments disclosed herein can be implemented through at least one software program running on at least one hardware device and performing functions to control the at least one hardware device. The electronic device 100 shown in FIG. 1 includes blocks, which can be at least one of a hardware sub-module, or a combination of hardware sub-modules and software module.
The foregoing description of the specific embodiments will so fully reveal the general nature of the embodiments herein that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments herein have been described in terms of embodiments and examples, those skilled in the art will recognize that the embodiments herein can be practiced with modification within the spirit and scope of the claims as described herein.

Claims

We claim:

1. A method for identifying a viewer intent a multimedia of an electronic device (100), the method comprising:

generating, by the electronic device (100), a textual summary from at least one multimedia;

determining and displaying, by the electronic device (100), at least one of at least one keyword and at least one key phrase associated with the textual summary, wherein the at least one of at least one keyword and at least one key phrase is contextually associated with content of the multimedia;

generating, by the electronic device (100), at least one paragraph from the extracted textual summary to generate at least one chapter based on the at least one of at least one keyword and at least one key phrase appeared in a time stamp associated with the textual summary; and

generating, by the electronic device (100), at least one index table for the generated at least one chapter.

2. The method of claim 1, wherein the method further comprises receiving, by the electronic device (100), an input from a viewer on the displayed at least one of at least one keyword and key phrase to play the multimedia associated with the at least one of at least one keyword and key phrase.

3. The method of claim 1, wherein the method further comprises receiving, by the electronic device (100), an input from the viewer on the generated at least one table of contents to play an interested portion of the multimedia.

4. The method of claim 1, wherein the generated textual summary is at least one of an extractive summary and an abstractive summary.

5. The method of claim 1, wherein the method further includes receiving, by the electronic device (100), a viewer search intent as an input to identify a matching content in the multimedia corresponding to the viewer search intent.

6. An electronic device (100) for identifying a viewer intent a multimedia, the electronic device (100) comprises

a text summarizer (102) configured to generate a textual summary from at least one multimedia;

a keyword and key phrase extractor (104) configured to determine and display at least one of at least one keyword and at least one key phrase associated with the textual summary, wherein the at least one of at least one keyword and at least one key phrase are contextually associated with content of the multimedia;

a paragraph generator (106) configured to generate at least one paragraph from the extracted textual summary to generate at least one chapter based on the at least one of at least one keyword and at least one key phrase appeared in a time stamp associated with the textual summary; and

an index table generator (108) configured to generate at least one index table for the generated at least one chapter.

7. The electronic device (100) of claim 6, wherein the electronic device (100) is further configured to receive an input from a viewer on the displayed at least one of at least one keyword and key phrase to play the multimedia associated with the at least one of at least one keyword and key phrase.

8. The electronic device (100) of claim 6, wherein the electronic device (100) is further configured to receive an input from the viewer on the generated at least one index table to play an interested portion of the multimedia.

9. The electronic device (100) of claim 6, wherein the generated textual summary is at least one of an extractive summary and an abstractive summary.

10. The electronic device (100) of claim 6, wherein the electronic device (100) is further configured to receive a viewer search intent as an input to identify a matching content in the multimedia corresponding to the viewer search intent.