WO2022031283A1 - Video stream content - Google Patents

Video stream content

Info

Publication number
WO2022031283A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
region
slide
video stream
examples
Prior art date
Application number
PCT/US2020/045064
Other languages
English (en)
Inventor
Thomas da Silva PAULA
Juliano Cardoso Vacaro
Wagston Tassoni STAEHLER
Original Assignee
Hewlett-Packard Development Company, L.P.
Priority date
Filing date
Publication date
Application filed by Hewlett-Packard Development Company, L.P. filed Critical Hewlett-Packard Development Company, L.P.
Priority to PCT/US2020/045064 priority Critical patent/WO2022031283A1/fr
Publication of WO2022031283A1 publication Critical patent/WO2022031283A1/fr

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/8146Monomedia components thereof involving graphical data, e.g. 3D object, 2D graphics
    • H04N21/8153Monomedia components thereof involving graphical data, e.g. 3D object, 2D graphics comprising still images, e.g. texture, background image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Definitions

  • Figure 1 is a flow diagram illustrating an example of a method for video stream content extraction
  • Figure 2 is a block diagram illustrating examples of components and/or elements for video stream content extraction
  • Figure 3 is a block diagram of an example of an apparatus that may be used in video stream content extraction
  • Figure 4 is a block diagram illustrating an example of a computer-readable medium for extracting video stream content
  • Figure 5 is a diagram illustrating an example of a video stream.
  • a video stream is image data.
  • a video stream may include a sequence of images (e.g., frames of pixel data).
  • a video stream may be produced with captured images (e.g., screen captures, image sensor captures, etc.).
  • an apparatus (e.g., a computing device) may produce, store, and/or transmit a video stream.
  • a video stream may be stored and/or transmitted.
  • a video stream may be stored as a file for playback.
  • a video stream may be transmitted and/or received over a link (e.g., wired link, wireless link, network, etc.).
  • a video stream may be provided over the Internet for a virtual meeting and/or virtual presentation.
  • the video stream may be captured from a live presentation (e.g., in-class presentation, meeting room presentation, etc., where a slide or slides may be presented using a projector).
  • a slide is a page of a presentation.
  • a slide may be expressed as image data (e.g., pixels) in a video stream.
  • a frame or frames of a video stream may depict a slide.
  • source file information for the slide may not be provided with the video stream.
  • source file information for a slide or slides may be stored in a file format such as a PowerPoint (.ppt, .pptx) file, Keynote (.key) file, OpenOffice presentation (.odp), and/or another file format.
  • Source file information may provide code that indicates the content (e.g., text) of a slide or slides.
  • When a slide is depicted in a video stream, the slide may be expressed as pixel data without source file information. Accordingly, a slide in a video stream may not include source file information (e.g., text).
  • a source or sources may have additional information regarding the content (e.g., topic(s)) depicted in a video stream. For instance, it may be helpful to access wikis, research papers, online videos, and/or images related to the content.
  • Some of the techniques described herein may provide additional information related to content of a video stream (e.g., video in a local file, online video, online meeting video, etc.).
  • content of a video stream may be detected, parsed, and/or filtered. Additional information related to the content (e.g., additional information from a source or sources) may be determined and/or presented. For instance, additional information may be presented with (e.g., over) the video stream of a presentation.
  • Some examples of the techniques described herein may provide additional information (e.g., links to other videos, websites, internal papers, and/or other presentations, etc.), which may accelerate access to information.
  • Some examples of the techniques described herein may use the context of a presentation to determine (e.g., filter) the additional information to be presented, which may augment productivity in a meeting.
  • Some examples of the techniques described herein may be used with online video streams and/or offline video streams (e.g., recordings). For instance, some examples of the techniques may be performed offline, where the additional information may be provided with a recorded video stream (e.g., recorded meeting after the actual meeting).
  • Figure 1 is a flow diagram illustrating an example of a method 100 for video stream content extraction.
  • the method 100 and/or a method 100 element or elements may be performed by an apparatus (e.g., electronic device, computing device, server, etc.).
  • the method 100 may be performed by the apparatus 326 described in relation to Figure 3.
  • An apparatus may detect 102 a slide in a video stream. For example, the apparatus may detect whether the video stream depicts a slide, a location in the video stream where the slide is located, and/or dimensions of the slide in the video stream (e.g., in a frame of the video stream). In some examples, detecting the slide in the video stream may include detecting a region in the video stream that depicts the slide. For instance, the apparatus may detect whether a slide is presented on a screen. In some approaches, the apparatus may utilize a slide detector that determines a region (e.g., bounding box) in the video stream (e.g., frame) where the slide is depicted.
  • detecting the slide in the video stream may include detecting the region utilizing a machine learning model that is trained with training images that include images of slides.
  • a machine learning model may include an artificial neural network, convolutional neural network (CNN), region-based convolutional neural network (R-CNN), mask R- CNN, faster R-CNN, you only look once (YOLO) CNN, etc.
  • the machine learning model may be trained with training images that include images of slides. For instance, the machine learning model may be trained with training images that include images of slides with annotations that indicate the locations of the slides in the training images.
  • detecting the slide in the video stream may include detecting the region utilizing a segmentation technique or techniques.
  • a segmentation technique may indicate pixels of the video stream (e.g., frame, image) corresponding to a slide, for instance.
  • Some examples of segmentation techniques may utilize a machine learning model or models to segment the video stream (e.g., frame, image).
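  • For illustration, a minimal slide-detection sketch in Python follows; it assumes a torchvision Faster R-CNN fine-tuned on annotated slide images and saved to a hypothetical file "slide_detector.pt", with class index 1 standing for "slide".
```python
# Minimal sketch of slide detection (assumptions: a Faster R-CNN fine-tuned on
# annotated slide images, stored in the hypothetical file "slide_detector.pt";
# class index 1 = "slide").
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(num_classes=2)
model.load_state_dict(torch.load("slide_detector.pt", map_location="cpu"))
model.eval()

def detect_slide(frame_rgb, score_threshold=0.7):
    """Return (x1, y1, x2, y2) of the most confident slide box, or None."""
    with torch.no_grad():
        prediction = model([to_tensor(frame_rgb)])[0]
    # Detection outputs are sorted by score, so the first match is the most confident.
    for box, label, score in zip(prediction["boxes"],
                                 prediction["labels"],
                                 prediction["scores"]):
        if label.item() == 1 and score.item() >= score_threshold:
            return tuple(int(v) for v in box.tolist())
    return None
```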
  • the training images may include non-artificial images (e.g., images from video streams of online meetings) and/or artificial training images.
  • An artificial training image is an image generated automatically for training.
  • a program may be utilized to automatically generate slides to create training images.
  • a program may utilize a white background or sample images from a source or sources (e.g., database, online source, etc.) as slide backgrounds and may add text extracted from a source or sources as content on the slide.
  • the training images may be utilized to train the machine learning model (e.g., object detector(s)).
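  • For illustration, a minimal artificial training-image generator follows; it assumes Pillow, a hypothetical "backgrounds/" folder of sample images, and a hypothetical "corpus.txt" of text snippets, and writes a slide image plus a simple text bounding-box annotation per sample.
```python
# Minimal sketch of artificial training-image generation (assumptions: optional
# background images in "backgrounds/" and text snippets in "corpus.txt"; the
# annotation recorded here is the text bounding box in pixel coordinates).
import glob
import json
import random
from PIL import Image, ImageDraw, ImageFont

backgrounds = glob.glob("backgrounds/*.jpg") or [None]   # None -> plain white slide
lines = [l.strip() for l in open("corpus.txt", encoding="utf-8") if l.strip()]

def make_training_slide(index, size=(1280, 720)):
    path = random.choice(backgrounds)
    slide = (Image.open(path).convert("RGB").resize(size)
             if path else Image.new("RGB", size, "white"))
    draw = ImageDraw.Draw(slide)
    font = ImageFont.load_default()
    text = random.choice(lines)
    x, y = random.randint(40, 200), random.randint(40, 200)
    draw.text((x, y), text, fill="black", font=font)
    box = draw.textbbox((x, y), text, font=font)         # (x1, y1, x2, y2)
    name = f"train_{index:05d}.png"
    slide.save(name)
    return {"image": name, "text_box": list(box)}

annotations = [make_training_slide(i) for i in range(1000)]
json.dump(annotations, open("annotations.json", "w"))
```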
  • the machine learning model may be selected based on speed and/or accuracy considerations.
  • the object detector may be agnostic to a meeting application being utilized.
  • the slide detection may function without regard to a meeting application (e.g., Zoom, Teams, GoToMeeting, etc.) used.
  • the apparatus may crop the slide from the video stream (e.g., frame of the video stream) and/or may indicate the location and/or dimensions of the slide in the video stream (e.g., frame of the video stream).
  • the apparatus may extract 104 content from the slide.
  • Content is text or image information.
  • a slide may include text and/or an image or images.
  • extracting 104 content from the slide may include detecting a sub-slide region in the video stream.
  • a sub-slide region is a subregion of a slide.
  • the region (e.g., image region, frame region) in the video stream that depicts a slide may include a sub-slide region or sub-slide regions (e.g., portions(s), subset(s), etc.) that depict text and/or images.
  • different types of sub-slides may include different types of information.
  • a text region may include text
  • an image region may include an image or images.
  • detecting a sub-slide region in the video stream may include detecting a text region or an image region within the region in the video stream that depicts the slide.
  • the apparatus may utilize a content detector or detectors that detect(s) a text region or regions and/or an image region or regions.
  • the content detector may indicate the text region(s) and/or image region(s) (e.g., bounding box(es) of region(s) where text and/or images (or other relevant content) were detected).
  • the content detector or detectors may include a machine learning model or models trained to detect a sub-slide or sub-slides.
  • the content detector or detectors may include a machine learning model or models trained with training images of text and/or images.
  • the apparatus may utilize an artificial neural network, convolutional neural network (CNN), R-CNN, mask R-CNN, faster R-CNN, YOLO CNN, etc., to detect text region(s) and/or image region(s).
  • the apparatus may utilize an object detector to detect 102 a slide and to detect content (e.g., text region(s) and/or image region(s)).
  • one detector may be utilized as a slide detector and a content detector.
  • separate detectors may be utilized to detect 102 the slide and content.
  • a content detector may be more computationally complex (e.g., may include more nodes, layers, connections, and/or weights) than the slide detector.
  • the content detector may be utilized in response to a positive slide detection from the slide detector.
  • the slide detector may utilize a different machine learning model relative to the content detector(s).
  • extracting 104 the content from the slide may include cropping a sub-slide region or regions (e.g., text region(s) and/or image region(s)).
  • extracting 104 content from the slide may include cropping a text region from the video stream (e.g., from a video frame).
  • a sub-slide region may be a part or portion of an image of a slide.
  • a sub-slide region may be a region of an image (e.g., frame) of a video stream and/or a portion of data of the video stream.
  • a sub-slide region may be a rectangular subset (e.g., tile, etc.) of the region of the slide in the video stream (e.g., frame of the video stream).
  • a sub-slide region may have a location and/or dimensions (e.g., height and width in pixels, inches, centimeters, etc.) that include text and/or image(s).
  • extracting 104 content from the slide may include performing character recognition on a text region or regions to produce text.
  • For instance, the apparatus may perform optical character recognition (OCR) on the text region(s) (e.g., bounding boxes).
  • Performing character recognition may produce text (e.g., text corresponding to each of the text regions and/or bounding boxes of text).
  • performing character recognition may produce a confidence value or values.
  • a confidence value may indicate a degree of confidence associated with recognized text (e.g., character(s), word(s), phrase(s), language, etc.).
  • the confidence value(s) may be utilized to decide whether to search for content. For example, if a first character recognition engine detects text with a low confidence value (e.g., a confidence value below a threshold), a second character recognition engine may be utilized.
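  • For illustration, a minimal character-recognition sketch with a confidence-based fallback follows; pytesseract and EasyOCR are assumed as example first and second engines (the disclosure does not name specific engines), `frame` is an RGB numpy array, and `box` is a (x1, y1, x2, y2) text region.
```python
# Minimal sketch of character recognition with a confidence-based fallback
# (assumptions: pytesseract as the first engine, EasyOCR as the second).
import numpy as np
import pytesseract
from pytesseract import Output

def recognize_text(frame, box, min_confidence=60):
    x1, y1, x2, y2 = box
    crop = frame[y1:y2, x1:x2]                        # crop the text region
    data = pytesseract.image_to_data(crop, output_type=Output.DICT)
    words, confs = [], []
    for word, conf in zip(data["text"], data["conf"]):
        if word.strip():
            words.append(word)
            confs.append(float(conf))
    mean_conf = float(np.mean(confs)) if confs else 0.0
    if mean_conf >= min_confidence:
        return " ".join(words), mean_conf
    # Low confidence: fall back to a second engine.
    import easyocr
    reader = easyocr.Reader(["en"], gpu=False)
    results = reader.readtext(crop)                   # [(bbox, text, conf), ...]
    text = " ".join(r[1] for r in results)
    conf = float(np.mean([r[2] for r in results])) if results else 0.0
    return text, conf
```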
  • extracting 104 content from the slide may include parsing text to categorize a text subset.
  • a text subset is a subset of text. Examples of a text subset may include a word or words, term, phrase, sentence, etc.
  • parsing text may include determining a text subset or text subsets and/or categorizing (e.g., labeling, classifying, etc.) the text subset(s). For instance, based on the text provided by character recognition, the apparatus may perform text processing to determine useful and/or relevant information.
  • parsing the text may include performing named entity recognition (NER). For instance, NER may determine text subset(s) and/or corresponding types from the text.
  • NER may determine text subsets of “Monte Carlo” (with a “person” type), “MDP” (with an “org” type), and “Reinforcement Learning” (with an “org” type).
  • parsing the text to categorize the text subset(s) may include categorizing a text subset or subsets.
  • the apparatus may utilize a trained text classifier, which may label a text subset to a category. Labeling a text subset may provide context.
  • the text may include the term “Monte Carlo” as one of the determined text subsets. In a general sense, “Monte Carlo” may refer to the administrative area of the Principality of Monaco. However, the context of the text may indicate Monte Carlo methods.
  • the text classifier may label the text subset of “Monte Carlo” with a label or labels such as computational method and/or random sampling.
  • the text classifier may be trained based on a relatively large text corpus.
  • parsing the text may include other approaches to categorize (e.g., label) text (e.g., text subset(s)) and/or extract information from the text (e.g., text subset(s)).
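  • For illustration, a minimal text-parsing sketch with named entity recognition follows; it assumes the spaCy model "en_core_web_sm" is installed, and the labels in the comment are only indicative of what such a model might return.
```python
# Minimal sketch of text parsing with named entity recognition (assumption:
# the small English spaCy model "en_core_web_sm" is available).
import spacy

nlp = spacy.load("en_core_web_sm")

def parse_text(text):
    doc = nlp(text)
    # Text subsets (entities) with the types assigned by the NER model.
    subsets = [(ent.text, ent.label_) for ent in doc.ents]
    # Noun chunks can serve as additional candidate terms for querying sources.
    terms = [chunk.text for chunk in doc.noun_chunks]
    return subsets, terms

subsets, terms = parse_text("Monte Carlo methods for Reinforcement Learning and MDPs")
print(subsets)   # e.g., [('Monte Carlo', 'PERSON'), ('MDPs', 'ORG')], depending on the model
```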
  • extracting 104 content from the slide may include recognizing an object or objects in an image region or regions.
  • the apparatus may execute an image processing pipeline or pipelines to extract more information from the image region(s).
  • the image processing pipeline may include a model or models (e.g., machine learning model(s)) to recognize object(s) and/or scenario(s).
  • the apparatus may utilize a model (e.g., machine learning model) trained for object recognition, which may recognize an object or objects in the image region(s).
  • the model may be trained using a general-purpose dataset and/or a specific-purpose dataset.
  • extracting 104 content from the slide may include extracting a feature or features from an image region or regions.
  • the apparatus may perform feature extraction to enable retrieval of image(s) that are similar to the image(s) in the image region(s).
  • the feature(s) may represent an aspect or aspects (e.g., characteristic(s)) of the image(s) in the image region(s).
  • the feature(s) may be utilized to determine and/or retrieve image(s) with similar feature(s).
  • another approach or approaches may be utilized to determine label(s) and/or information based on the image region(s).
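  • For illustration, a minimal feature-extraction sketch follows; it assumes an ImageNet-pretrained ResNet-18 (torchvision ≥ 0.13 weights API) with its classification head removed as a stand-in general-purpose feature extractor.
```python
# Minimal sketch of feature extraction from an image region (assumption: an
# ImageNet-pretrained ResNet-18 is an acceptable general-purpose extractor).
import torch
import torchvision
from torchvision import transforms

backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()      # keep the 512-d penultimate features
backbone.eval()

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(image_region_rgb):
    """image_region_rgb: HxWx3 uint8 numpy array cropped from the slide."""
    with torch.no_grad():
        features = backbone(preprocess(image_region_rgb).unsqueeze(0))
    return features.squeeze(0).numpy()
```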
  • the apparatus may extract relevant data from a source or sources based on the content (e.g., extracted content, text, text subset, categorized text subset, text subset type, classified text subset, labeled text subset, image, and/or image feature(s), etc.).
  • Relevant data is data that is related to other data.
  • relevant data may be data that is related to the content.
  • a source is a source of information. Examples of sources may include databases, websites, data repositories, search engines, stored information, etc.
  • a source or sources may be local and/or internal to the apparatus (e.g., stored and/or accessed on the apparatus) and/or may be external and/or accessible via a network or networks (e.g., local area network (LAN), the Internet, etc.).
  • the apparatus may extract relevant data from a source or sources by requesting (e.g., querying) and/or receiving the relevant data from the source(s). For instance, the apparatus may extract relevant data from a search engine by sending a request to the search engine that is based on the content (e.g., extracted content, text, text subset, categorized text subset, text subset type, classified text subset, labeled text subset, image, and/or image feature(s), etc.) from the slide. The source(s) may return the relevant data.
  • the apparatus may format the extracted content into a request or requests (e.g., query(ies)) for an internal and/or external source or sources. Different numbers and/or types of sources may be utilized in different examples.
  • the apparatus may utilize a program or programs, application programming interface(s) (API(s)), etc., to request relevant data based on the content.
  • the apparatus may query an internal database or databases, a database or databases on a LAN, and/or a website or websites (e.g., Google, Google Scholar, Institute of Electrical and Electronics Engineers (IEEE) Xplore, and/or LexisNexis, etc.).
  • the apparatus may utilize Scholarly, which is a Python package that allows querying Google Scholar.
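  • For illustration, a minimal Google Scholar query sketch using the Scholarly package follows; the result fields used ("bib", "title", "pub_url") are assumptions that may vary across package versions.
```python
# Minimal sketch of querying Google Scholar with the Scholarly package
# (assumption: result fields such as "bib" and "pub_url" may vary by version).
from scholarly import scholarly

def query_scholar(term, max_results=3):
    results = []
    search = scholarly.search_pubs(term)
    for _ in range(max_results):
        try:
            pub = next(search)
        except StopIteration:
            break
        bib = pub.get("bib", {})
        results.append({"title": bib.get("title"), "url": pub.get("pub_url")})
    return results

print(query_scholar("Monte Carlo reinforcement learning"))
```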
  • the apparatus may format the extracted content (e.g., text, text subset, category, etc.) into a query format (e.g., SPARQL Protocol and Resource Description Framework (RDF) Query Language (SPARQL)) to extract relevant data from a source or sources.
  • DBpedia may include data (e.g., data from Wikipedia) organized according to an ontology.
  • the formatted query may be utilized to extract the relevant data from the source(s).
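  • For illustration, a minimal SPARQL sketch against the public DBpedia endpoint follows; it assumes the SPARQLWrapper client and a naive mapping from a term to a DBpedia resource name.
```python
# Minimal sketch of extracting relevant data from DBpedia with SPARQL
# (assumptions: the public endpoint https://dbpedia.org/sparql; a naive
# term-to-resource mapping that replaces spaces with underscores).
from SPARQLWrapper import SPARQLWrapper, JSON

def query_dbpedia_abstract(term, language="en"):
    resource = term.strip().replace(" ", "_")     # e.g., "Monte Carlo method"
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery(f"""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?abstract WHERE {{
            <http://dbpedia.org/resource/{resource}> dbo:abstract ?abstract .
            FILTER (lang(?abstract) = "{language}")
        }}
    """)
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    return bindings[0]["abstract"]["value"] if bindings else None

print(query_dbpedia_abstract("Monte Carlo method"))
```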
  • features extracted from the image region(s) may be utilized to search a feature database or databases with images from different sources and respective features.
  • a search mechanism (e.g., K-Nearest Neighbor (KNN) search, Approximate Nearest Neighbor search, etc.) may be utilized to retrieve images with features similar to the features extracted from the image region(s).
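  • For illustration, a minimal similarity-search sketch follows; it assumes a precomputed feature database stored in hypothetical files "feature_db.npy" (an N x D array) and "feature_paths.json" (the corresponding source-image locations), searched with scikit-learn's NearestNeighbors.
```python
# Minimal sketch of a similarity search over a feature database (assumptions:
# "feature_db.npy" holds an N x D array of features from source images and
# "feature_paths.json" the corresponding image locations).
import json
import numpy as np
from sklearn.neighbors import NearestNeighbors

db_features = np.load("feature_db.npy")
db_paths = json.load(open("feature_paths.json"))

index = NearestNeighbors(n_neighbors=5, metric="cosine").fit(db_features)

def find_similar_images(query_features):
    distances, indices = index.kneighbors(query_features.reshape(1, -1))
    return [(db_paths[i], float(d)) for i, d in zip(indices[0], distances[0])]
```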
  • the apparatus may present 106 relevant data retrieved based on the content. For example, the apparatus may create a visualization or visualizations based on the relevant data. For instance, the apparatus may obtain and/or utilize the relevant data from the source(s) to build a visualization or visualizations that may be presented (e.g., presented on a display).
  • a visualization is a graphic.
  • a visualization may be a visual representation (e.g., tooltip, callout, text box, window, pop-up, text, image, etc.) of the relevant data.
  • presenting 106 the relevant data may include generating a graphic (e.g., highlighted bounding box, sign, symbol, callout, underline, overlay, superscript, subscript, and/or color marking, etc.) for text (e.g., a text subset) or an image region and presenting the relevant data in response to detecting an input corresponding to the graphic.
  • presenting 106 the relevant data may include generating a highlighted bounding box for text (e.g., a text subset) or an image region and presenting the relevant data in a tooltip in response to detecting a cursor in a range of the highlighted bounding box.
  • the apparatus may draw a bounding box or boxes around text (e.g., text subset) and/or an image or images (e.g., image region(s)).
  • the apparatus may open a tooltip, window, pop-up, graphic, etc., that presents the relevant data corresponding to the content in the highlighted bounding box.
  • the apparatus may create the visualization(s) taking into account the size of the screen and/or window being used, and/or where the slide is being displayed in the screen.
  • the visualization may be created and/or presented depending on a type of source (e.g., whether the relevant data provided is text or an image, whether the source is Wikipedia or an internal database, etc.).
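  • For illustration, a minimal presentation sketch follows; it uses OpenCV to draw highlighted bounding boxes over a frame and show a tooltip-style overlay when the cursor is inside a box; the `annotations` mapping is hypothetical.
```python
# Minimal sketch of presenting relevant data over the video stream (assumptions:
# OpenCV for display; `annotations` maps (x1, y1, x2, y2) boxes to tooltip text).
import cv2

annotations = {
    (100, 120, 260, 150): "Monte Carlo: a class of computational methods based on random sampling.",
}
hover_box = None

def on_mouse(event, x, y, flags, param):
    global hover_box
    hover_box = None
    for (x1, y1, x2, y2) in annotations:
        if x1 <= x <= x2 and y1 <= y <= y2:
            hover_box = (x1, y1, x2, y2)

cv2.namedWindow("presentation")
cv2.setMouseCallback("presentation", on_mouse)

def render(frame):
    out = frame.copy()
    for (x1, y1, x2, y2) in annotations:
        cv2.rectangle(out, (x1, y1), (x2, y2), (0, 255, 255), 2)   # highlighted box
    if hover_box is not None:
        x1, y1, x2, y2 = hover_box
        cv2.rectangle(out, (x1, y2 + 8), (x2 + 360, y2 + 40), (40, 40, 40), -1)
        cv2.putText(out, annotations[hover_box], (x1 + 4, y2 + 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.45, (255, 255, 255), 1)
    return out

# In a display loop: cv2.imshow("presentation", render(frame)); cv2.waitKey(1)
```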
  • the method 100 may include detecting a slide change.
  • the method 100 may include detecting when a slide is discontinued from being shown in the video stream and/or when another slide is shown in the video stream.
  • the apparatus may compare pixel values in the region (e.g., slide region) with corresponding pixel values in a subsequent frame and/or image.
  • if a difference (e.g., average pixel value difference magnitude or another difference measure) satisfies a threshold, the apparatus may determine that a slide change has occurred.
  • the apparatus may discontinue presenting 106 the relevant data for the slide and/or may repeat the method 100 (for another detected slide, for instance).
  • slide detection 102 may be repeated for another image and/or frame.
  • content may be extracted 104 from the subsequent slide and/or relevant data may be presented for the subsequent slide.
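  • For illustration, a minimal slide-change check follows; it compares pixel values in the slide region across frames, and the mean-absolute-difference threshold is an assumed value.
```python
# Minimal sketch of slide-change detection by comparing pixel values in the
# slide region across frames (assumption: a mean absolute difference above
# `threshold` is treated as a slide change).
import numpy as np

def slide_changed(previous_frame, current_frame, region, threshold=12.0):
    x1, y1, x2, y2 = region
    prev = previous_frame[y1:y2, x1:x2].astype(np.float32)
    curr = current_frame[y1:y2, x1:x2].astype(np.float32)
    difference = float(np.mean(np.abs(prev - curr)))
    return difference > threshold
```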
  • an element or elements of the method 100 may be omitted or combined.
  • the method 100 may include one, some, or all of the operations, elements, etc., described in relation to any of Figures 1-5.
  • Figure 2 is a block diagram illustrating examples of components and/or elements 200 for video stream content extraction.
  • the components and/or elements 200 may perform an aspect or aspects of the method 100 described in relation to Figure 1.
  • the components and/or elements 200 may be included in and/or performed by an apparatus (e.g., electronic device, computing device, server, smartphone, laptop computer, tablet device, etc.).
  • the components and/or elements 200 may be included in and/or performed by the apparatus 326 described in relation to Figure 3.
  • one, some, or all of the components and/or elements 200 may be structured in circuitry and/or in a processor(s) with instructions.
  • the components and/or elements 200 may perform one, some, or all of the operations described in relation to Figures 1-5.
  • a video stream 202 may be provided to a slide detector 204.
  • the slide detector 204 may receive the video stream 202 from a remote device (e.g., networked device, external device linked to the apparatus, remote image sensor, etc.) and/or may obtain the video stream locally (e.g., from a file stored in local memory, from an integrated image sensor, etc.).
  • the slide detector 204 may detect a slide or slides in the video stream 202.
  • detecting the slide(s) may be performed as described in relation to Figure 1.
  • the slide detector 204 may detect a region 206 (e.g., bounding box) in the video stream 202 that depicts a slide.
  • the slide detector 204 may detect the slide in the video stream 202 by detecting the region 206 utilizing a machine learning model that is trained with training images that include images of slides.
  • the region 206 may indicate a location and/or dimensions of a slide in the video stream.
  • the region 206 may be provided to a sub-slide detector 208.
  • the slide detector 204 may detect multiple regions 206 corresponding to multiple slides in the video stream 202.
  • the sub-slide detector 208 may detect a sub-slide region or regions in the video stream 202. For example, the sub-slide detector 208 may detect a sub-slide region or regions from the slide region 206. In some examples, the sub-slide detector 208 may detect a text region or regions 210 and/or an image region or regions 218 within the region 206 in the video stream 202 that depicts a slide. In some examples, detecting a sub-slide region(s) may be performed as described in relation to Figure 1. For example, the sub-slide detector 208 may determine a bounding box or bounding boxes for a text region or regions 210 and/or for an image region or regions 218.
  • the sub-slide detector 208 may crop the text region(s) 210 and/or the image region(s) 218 from the video stream 202 and/or slide region 206. In some examples, the sub-slide detector 208 may indicate a location and/or dimensions for text region(s) 210 and/or image region(s) 218 in the video stream 202 and/or slide region 206.
  • In some examples, the sub-slide detector 208 may provide the text region(s) 210 to a character recognition engine 212. The character recognition engine 212 may perform character recognition on the text region(s) 210 to produce text 214. In some examples, performing character recognition may be performed as described in relation to Figure 1. For instance, the character recognition engine 212 may perform optical character recognition on the text region(s) 210 to produce text 214. In some examples, the text 214 may be provided to a text parser 216.
  • the text parser 216 may parse the text 214. In some examples, parsing the text 214 may be performed as described in relation to Figure 1. For instance, the text parser may determine a text subset or subsets and/or may categorize the text (e.g., text subset(s)). The text subset(s) and/or category(ies) may be provided to a source data extractor 222. In some examples, the text parser 216 may extract text features of the text 214. For instance, the source data extractor 222 may utilize the text features to extract data (e.g., relevant data) from a source or sources and/or create data visualization(s) based on the extracted data.
  • the sub-slide detector 208 may provide the image region(s) 218 to an image parser 220.
  • parsing the image region(s) 218 may be performed as described in relation to Figure 1.
  • the image parser 220 may extract feature(s) from the image region(s) 218, may recognize an object or objects in the image region(s) 218, and/or may determine label(s) for the image region(s) 218 (e.g., for object(s) and/or feature(s) in the image region(s)).
  • the feature(s), object recognition information (e.g., recognized object(s)), and/or label(s) may be provided to the source data extractor 222.
  • the source data extractor 222 may extract data (e.g., relevant data) from a source or sources and/or create data visualization(s) based on the extracted data. In some examples, extracting the data and/or creating data visualization(s) 224 may be performed as described in relation to Figure 1.
  • the source data extractor 222 may include a formatter, a data extraction engine, and/or a visualization builder.
  • the formatter may format the extracted content (e.g., text parser 216 output(s), text subset(s), text subset type(s), text subset category(ies), text feature(s), image parser 220 output(s), feature(s), object recognition information, recognized object(s), and/or label(s), etc.).
  • the formatter may format the extracted content into a request or requests (e.g., query(ies)) for an internal and/or external source(s).
  • formatting the extracted content may include formatting the extracted content into keyword(s), text string(s), feature vector(s), etc., that may be utilized to extract relevant data from the source(s).
  • the request(s) may be provided to the data extraction engine.
  • the data extraction engine may provide the request(s) to a source or sources and/or may obtain (e.g., receive) data (e.g., relevant data) from the source(s). For instance, the data extraction engine may provide and/or send the request(s) to database(s) and/or web server(s) to obtain and/or receive the data.
  • the data may be provided to the visualization builder.
  • the visualization builder may create data visualization(s) 224 based on the data. For example, creating the data visualization(s) 224 may be performed as described in relation to Figure 1. For instance, the visualization builder may create a highlighted bounding box, tooltip, window, pop-up, graphic, etc., that includes the data (e.g., relevant data). In some examples, the visualization builder may create the data visualization(s) 224 based on a size of a screen for display, a size of a window, a location of the slide on the screen, etc.
  • the visualization builder may create a tooltip, window, pop-up, graphic, etc., that is within bounds of the screen (so the visualization would not be cut off outside the range of the screen, for instance), and/or to reduce or avoid overlapping with the text region(s) and/or image region(s) of the slide.
  • the data visualization(s) 224 may be provided to a screen, to a renderer (e.g., graphics chip(s), display controller(s), etc.) and/or to a communication interface to present the data visualization(s) 224.
  • presenting the data visualization(s) 224 may include providing and/or sending the data visualization(s) 224 to a screen, renderer, and/or communication interface for output (e.g., to display the data visualization(s) on a local screen and/or a linked device or screen).
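  • For illustration, a sketch of how the components of Figure 2 might be chained per frame follows; the component callables are injected (for example, the hypothetical sketches above), and this outlines the data flow only.
```python
# Minimal sketch of chaining the Figure 2 components per frame (assumption:
# the component callables are supplied by the caller; this is an outline of
# the data flow only).
def process_frame(frame, components):
    """components: dict with 'detect_slide', 'detect_regions', 'recognize_text',
    'parse_text', 'query_text', 'extract_features', 'query_features', and
    'build_visualizations' callables."""
    region = components["detect_slide"](frame)                              # slide detector
    if region is None:
        return frame
    text_boxes, image_boxes = components["detect_regions"](frame, region)   # sub-slide detector
    relevant = {}
    for box in text_boxes:
        text, _conf = components["recognize_text"](frame, box)              # character recognition engine
        for subset, _category in components["parse_text"](text)[0]:         # text parser
            data = components["query_text"](subset)                         # source data extractor
            if data:
                relevant[box] = data
    for (x1, y1, x2, y2) in image_boxes:
        features = components["extract_features"](frame[y1:y2, x1:x2])      # image parser
        matches = components["query_features"](features)
        if matches:
            relevant[(x1, y1, x2, y2)] = matches
    return components["build_visualizations"](frame, relevant)              # visualization builder
```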
  • FIG. 3 is a block diagram of an example of an apparatus 326 that may be used in video stream content extraction.
  • the apparatus 326 may be an electronic device, such as a personal computer, a server computer, a smartphone, a tablet computer, etc.
  • the apparatus 326 may include and/or may be coupled to a processor 328 and/or a memory 330.
  • the apparatus 326 may include additional components (not shown) and/or some of the components described herein may be removed and/or modified without departing from the scope of this disclosure.
  • the processor 328 may be any of a central processing unit (CPU), a digital signal processor (DSP), a semiconductor-based microprocessor, a graphics processing unit (GPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), and/or other hardware device suitable for retrieval and execution of instructions stored in the memory 330.
  • the processor 328 may fetch, decode, and/or execute instructions stored in the memory 330.
  • the processor 328 may include an electronic circuit or circuits that include electronic components for performing an operation or operations of the instructions.
  • the processor 328 may be implemented to perform one, some, or all of the aspects, operations, elements, etc., described in relation to one, some, or all of Figures 1-5.
  • the memory 330 is an electronic, magnetic, optical, and/or other physical storage device that contains or stores electronic information (e.g., instructions and/or data).
  • the memory 330 may be, for example, Random Access Memory (RAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), a storage device, an optical disc, and/or the like.
  • the memory 330 may be volatile and/or non-volatile memory, such as Dynamic Random Access Memory (DRAM), EEPROM, magnetoresistive random-access memory (MRAM), phase change RAM (PCRAM), memristor, flash memory, and/or the like.
  • the memory 330 may be a non-transitory tangible machine-readable or computer-readable storage medium, where the term “non-transitory” does not encompass transitory propagating signals.
  • the memory 330 may include multiple devices (e.g., a RAM card and a solid-state drive (SSD)).
  • the apparatus 326 may include a communication interface 348 through which the processor 328 may communicate with an external device or devices (e.g., networked device, server, smartphone, printer, etc.).
  • the apparatus 326 may be in communication with (e.g., coupled to, have a communication link with) a screen and/or an image sensor.
  • the apparatus 326 may include an integrated screen and/or an integrated image sensor.
  • the communication interface 348 may include hardware and/or machine-readable instructions to enable the processor 328 to communicate with the external device or devices.
  • the communication interface 348 may enable a wired and/or wireless connection to the external device or devices.
  • the communication interface 348 may include a network interface card and/or may also include hardware and/or machine-readable instructions to enable the processor 328 to communicate with various input and/or output devices. Examples of output devices include a screen (e.g., display), speakers, etc. Examples of input devices include a keyboard, a mouse, a touch screen, etc., through which a user may input instructions and/or data into the apparatus 326.
  • the communication interface 348 may send and/or receive a video stream. For instance, the communication interface may send and/or receive a video stream of a virtual meeting. In some examples, the video stream of the virtual meeting may depict a slide or slides.
  • the memory 330 of the apparatus 326 may store image data 332, supplemental data 334, region detection instructions 336, text processing instructions 338, image processing instructions 340, and/or supplementation instructions 342.
  • the image data 332 is data that indicates an image or images.
  • the image data 332 may include video stream data, video frame(s), still image(s), etc.
  • the supplemental data 334 is data that is supplemental to the image data 332.
  • the supplemental data 334 may indicate data that is relevant to the image data 332, bounding box data indicating the location and/or dimension(s) of a bounding box(es) of content in the image data 332, visualization data indicating a visualization(s) of data relevant to the image data 332, etc.
  • the region detection instructions 336 are instructions to detect a region in a video stream of a virtual meeting that depicts a slide.
  • the region detection instructions 336 may include executable code to detect a region in a video stream of a virtual meeting that depicts a slide.
  • the region detection instructions 336 are instructions to detect a text region or regions and/or an image region or regions in the region (e.g., in the detected slide region).
  • the processor 328 may execute the region detection instructions 336 to detect a region in a video stream of a virtual meeting that depicts a slide and/or to detect text region(s) and/or image region(s) in the region as described in relation to Figure 1 and/or Figure 2.
  • the text processing instructions 338 are instructions to recognize and/or parse text in a text region or regions.
  • the processor 328 may execute the text processing instructions 338 to perform text recognition and/or to parse text in the text region(s) as described in relation to Figure 1 and/or Figure 2. For instance, the processor 328 may parse text in the text region(s) to determine a text subset or subsets.
  • the processor 328 may execute the text processing instructions 338 to categorize the text subset(s) to determine a category or categories. In some examples, categorizing the text subset(s) may be performed as described in relation to Figure 1 and/or Figure 2. For instance, the processor 328 may utilize a text classifier to determine a category or categories for the respective text subset(s).
  • the image processing instructions 340 are instructions to recognize and/or parse image(s) in an image region or regions.
  • the processor 328 may execute the image processing instructions 340 to perform image recognition and/or to parse image(s) in the image region(s) as described in relation to Figure 1 and/or Figure 2.
  • the processor 328 may extract feature(s), perform object recognition, and/or perform labeling for image(s) in the image region(s).
  • the processor 328 may utilize a machine learning model to perform object recognition.
  • the supplementation instructions 342 are instructions to supplement the image data 332 (e.g., video stream(s)).
  • the processor 328 may execute the supplementation instructions 342 to determine supplemental data 334 as described in relation to Figure 1 and/or Figure 2. For instance, the processor 328 may determine data (e.g., relevant data) corresponding to the image data 332 (e.g., corresponding to a slide or slides depicted by the image data 332) and/or may determine visualization(s) for the data.
  • the processor 328 may execute the supplementation instructions 342 to query a source or sources based on text (e.g., text subset(s)) and/or image(s).
  • the processor 328 may query a source or sources based on a text subset and a category to produce queried data.
  • the queried data may be data that is supplemental to and/or relevant to the image data 332 (e.g., to slide(s) depicted in the image data 332).
  • the processor 328 may execute the supplementation instructions 342 to display a bounding box of a text subset of a text region. For instance, the processor 328 may create a highlighted bounding box corresponding to a text subset of the text region. For instance, the text processing instructions 338 may be executed to recognize text and/or to determine text subsets.
  • the processor 328 may determine a highlighted bounding box that includes a text subset by setting the sides of the highlighted bounding box as the minimum and maximum pixel locations (or minimum and maximum pixel locations with added padding or margin) of the characters of the text subset.
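  • For illustration, a minimal sketch of deriving a highlighted bounding box from per-character (or per-word) boxes follows; `char_boxes` are assumed to be (x1, y1, x2, y2) tuples in frame coordinates, e.g., from character recognition output.
```python
# Minimal sketch of a highlighted bounding box for a text subset, derived from
# the minimum and maximum pixel locations of its character (or word) boxes,
# with added padding (assumption: boxes are (x1, y1, x2, y2) tuples).
def subset_bounding_box(char_boxes, padding=4):
    x1 = min(box[0] for box in char_boxes) - padding
    y1 = min(box[1] for box in char_boxes) - padding
    x2 = max(box[2] for box in char_boxes) + padding
    y2 = max(box[3] for box in char_boxes) + padding
    return max(x1, 0), max(y1, 0), x2, y2
```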
  • the processor 328 may execute the supplementation instructions 342 to display queried data corresponding to text (e.g., text subset(s)) and/or image(s). For instance, the processor 328 may create a visualization that includes the queried data (or a subset of the queried data) for a text subset or image. For instance, the supplementation instructions 342 may be executed to create the visualization(s) and/or to display the queried data in the visualization(s). For instance, the processor 328 may send the queried data and/or visualization(s) including the queried data to a screen, renderer, and/or display controller to display the queried data corresponding to text (e.g., text subset(s)) and/or image(s).
  • In some examples, an element or elements of the apparatus 326 may be omitted or combined.
  • Some examples of the techniques described herein may be utilized to work with slides, online classes, presentations, and/or recorded presentations. For instance, some of the techniques may be utilized to augment and/or supplement a presentation (e.g., slide(s)) according to the content and/or context of the presentation.
  • remote presentations, online meetings, recorded presentations, live presentations being captured with an image sensor, and/or online classes may be supplemented, augmented, and/or enhanced.
  • supplemental information regarding people attending a meeting may be provided. For instance, a tooltip may be created that can display an organization chart, profile (e.g., LinkedIn profile), and so on, which may help a presenter to tailor his/her speech according to the audience.
  • Some examples of the techniques described herein may provide customization of the supplemental data (e.g., displayed information) based on user information (e.g., user profile, user interest data, etc.).
  • source data extraction may format request(s) to and/or filter queried data from a source or sources based on user profile(s).
  • a user may have an interest in topics like reinforcement learning and Bayesian inference, but may have less interest in clustering methods.
  • the apparatus may customize the request(s) to include related terms in which the user has an interest.
  • the apparatus may provide an interface for user input (e.g., text box, field, etc.) to receive input indicating user information (e.g., topics of interest, categories of interest, terms, etc.).
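  • For illustration, a minimal sketch of customizing requests and filtering queried data with a user profile follows; the profile is assumed to be a simple list of interest terms.
```python
# Minimal sketch of customizing requests and filtering queried data based on a
# user profile (assumption: the profile is a simple list of interest terms).
interests = ["reinforcement learning", "bayesian inference"]

def customize_request(text_subset, profile=interests):
    related = [term for term in profile if term.lower() != text_subset.lower()]
    return " ".join([text_subset] + related[:1])    # append one related interest term

def filter_results(results, profile=interests):
    # Keep results whose title mentions a term the user has an interest in.
    return [r for r in results
            if any(term in (r.get("title") or "").lower() for term in profile)]
```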
  • Figure 4 is a block diagram illustrating an example of a computer-readable medium 450 for extracting video stream content.
  • the computer-readable medium 450 is a non-transitory, tangible computer-readable medium.
  • the computer-readable medium 450 may be, for example, RAM, EEPROM, a storage device, an optical disc, and the like.
  • the computer-readable medium 450 may be volatile and/or non-volatile memory, such as DRAM, EEPROM, MRAM, PCRAM, memristor, flash memory, and the like.
  • the memory 330 described in relation to Figure 3 may be an example of the computer-readable medium 450 described in relation to Figure 4.
  • the computer-readable medium 450 may include code (e.g., data and/or instructions).
  • the computer-readable medium 450 may include slide portion detection instructions 452, region detection instructions 454, optical character recognition instructions 456, feature extraction instructions 458, information determination instructions 460, and/or visualization determination instructions 462.
  • the slide portion detection instructions 452 may include code to cause a processor to detect a portion of a video frame that depicts a presentation slide.
  • the portion may be detected as described in relation to Figure 1, Figure 2, and/or Figure 3.
  • the region detection instructions 454 may include code to cause the processor to detect a text region and an image region in the portion.
  • the text region and the image region may be detected as described in relation to Figure 1, Figure 2, and/or Figure 3.
  • the optical character recognition instructions 456 may include code to cause the processor to perform optical character recognition on the text region to determine text. In some examples, performing optical character recognition may be performed as described in relation to Figure 1, Figure 2, and/or Figure 3.
  • the feature extraction instructions 458 may include code to cause the processor to extract features from the image region. In some examples, extracting features from the image region may be performed as described in relation to Figure 1, Figure 2, and/or Figure 3.
  • the information determination instructions 460 may include code to cause the processor to determine first information based on the text and second information based on the features. In some examples, determining the first information based on the text and the second information based on the features may be performed as described in relation to Figure 1, Figure 2, and/or Figure 3. In some examples, the information determination instructions 460 may include code to cause the processor to search a feature database based on the features to determine the second information, where the second information includes an image or images. For instance, the second information may include an image or images that have a feature or features similar to the features from the image region.
  • the visualization determination instructions 462 may include code to cause the processor to determine a first visualization of the first information based on a first source and/or a second visualization of the second information based on a second source. In some examples, determining the first visualization and/or the second visualization may be performed as described in relation to Figure 1, Figure 2, and/or Figure 3. For instance, the first information may include information related to the text from a first source and/or the second information may include information related to the image(s) from a second source (e.g., different second source).
  • the first visualization may include the first information or a subset of the first information (e.g., a truncated portion of text) and/or the second visualization may include the second information or a subset of the second information (e.g., an image, a part of an image, an image thumbnail, etc.).
  • FIG. 5 is a diagram illustrating an example of a video stream 564.
  • the video stream 564 depicts a slide 572 from an online meeting.
  • an apparatus may detect the slide 572.
  • the apparatus may detect text subsets 566a-d in a text region of the slide 572 and may detect an image region 568 of the slide 572.
  • the apparatus has created highlighted bounding boxes corresponding to the text subsets 566a-d and the image region 568.
  • the apparatus has created visualizations 570a-c that include data relevant to text subset A 566a, text subset D 566d, and the image region 568.
  • information regarding Monte Carlo techniques may be extracted from a source (e.g., dictionary, wiki, etc.) and included in visualization A 570a, which is relevant to text subset A 566a.
  • Information regarding “RL” (reinforcement learning) techniques may be extracted from another source (e.g., academic database, etc.) and included in visualization B 570b, which is relevant to text subset D 566d.
  • citations from the source may be presented in visualization B 570b.
  • Images similar to the image in the image region 568 may be extracted from a source (e.g., image database, etc.) and included in visualization C 570c, which is relevant to the image region 568.
  • a corresponding visualization 570a, 570b, and/or 570c may appear when an input (e.g., mouse cursor, touch, tap, click, etc.) is detected over a highlighted bounding box.
  • the highlighted bounding boxes may be presented over the video stream 564, and presentation of a visualization 570a, 570b, 570c may be triggered based on detected interaction with a corresponding highlighted bounding box.
  • a tooltip visualization may be hidden until an input is detected over a highlighted bounding box. For instance, a tooltip may not appear for text subset B 566b (for Markov decision process (“MDP”)) until an input is detected over the highlighted bounding box.
  • a visualization may include a link or links. For instance, if an event (e.g., mouse click, tap, keyboard input, etc.) corresponding to a link (e.g., “See more” link) is detected, the corresponding visualization may be expanded to show more and/or a window (e.g., browser program) may be opened to provide more data.
  • a visualization may indicate a source (e.g., “From Internal Database,” “From Google Scholar,” “From Wikipedia,” etc.).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Graphics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

Examples of video stream content are described. In some examples, a method may include detecting a slide. The slide may be detected in a video stream. In some examples, the method may include extracting content from the slide. In some examples, the method may include presenting relevant data retrieved based on the content.
PCT/US2020/045064 2020-08-05 2020-08-05 Video stream content WO2022031283A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2020/045064 WO2022031283A1 (fr) 2020-08-05 2020-08-05 Video stream content

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2020/045064 WO2022031283A1 (fr) 2020-08-05 2020-08-05 Video stream content

Publications (1)

Publication Number Publication Date
WO2022031283A1 true WO2022031283A1 (fr) 2022-02-10

Family

ID=80118424

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2020/045064 WO2022031283A1 (fr) 2020-08-05 2020-08-05 Video stream content

Country Status (1)

Country Link
WO (1) WO2022031283A1 (fr)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150324848A1 (en) * 2004-10-01 2015-11-12 Ricoh Co., Ltd. Dynamic Presentation of Targeted Information in a Mixed Media Reality Recognition System
US20150296228A1 (en) * 2014-04-14 2015-10-15 David Mo Chen Systems and Methods for Performing Multi-Modal Video Datastream Segmentation
WO2018057530A1 (fr) * 2016-09-21 2018-03-29 GumGum, Inc. Machine learning models for identifying objects depicted in image or video data

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023235576A1 (fr) * 2022-06-04 2023-12-07 Zoom Video Communications, Inc. Extracting text content from a video of a communication session

Similar Documents

Publication Publication Date Title
US10108709B1 (en) Systems and methods for queryable graph representations of videos
JP5510167B2 (ja) Video search system and computer program therefor
US10671851B2 (en) Determining recommended object
CN110446063B (zh) Video cover generation method and apparatus, and electronic device
US8868609B2 (en) Tagging method and apparatus based on structured data set
CN115699109A (zh) Processing image-bearing electronic documents using a multimodal fusion framework
WO2022134701A1 (fr) Video processing method and apparatus
CN113661487A (zh) Encoder for generating dense embedding vectors using machine-trained term frequency weighting factors
US20160026858A1 (en) Image based search to identify objects in documents
CN112364204A (zh) Video search method and apparatus, computer device, and storage medium
CN111639228B (zh) Video retrieval method, apparatus, device, and storage medium
CN113806588B (zh) Method and apparatus for searching for videos
US11854285B2 (en) Neural network architecture for extracting information from documents
US10699112B1 (en) Identification of key segments in document images
WO2023087934A1 (fr) Voice control method, apparatus, device, and computer storage medium
US9866894B2 (en) Method for annotating an object in a multimedia asset
US11733823B2 (en) Synthetic media detection and management of trust notifications thereof
US20210326589A1 (en) Object detection and image classification based optical character recognition
WO2022031283A1 (fr) Video stream content
US20220301285A1 (en) Processing picture-text data
CN115098729A (zh) Video processing method, sample generation method, model training method, and apparatus
Chen Multimedia databases and data management: a survey
US11170171B2 (en) Media data classification, user interaction and processors for application integration
US10922476B1 (en) Resource-efficient generation of visual layout information associated with network-accessible documents
WO2021213339A1 (fr) Method and system for extracting and storing image metadata

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 20948588
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 20948588
    Country of ref document: EP
    Kind code of ref document: A1