WO2021060966A1 - A system and method for retrieving a presentation content - Google Patents

A system and method for retrieving a presentation content

Info

Publication number
WO2021060966A1
WO2021060966A1 (PCT Application No. PCT/MY2020/050055)
Authority
WO
WIPO (PCT)
Prior art keywords
presentation
text
speech
presenter
screen
Prior art date
Application number
PCT/MY2020/050055
Other languages
French (fr)
Inventor
Gin Xian KOK
Khong Neng Choong
Chrishanton Vethanayagam SEBASTIAMPILLAI
Original Assignee
Mimos Berhad
Priority date
Filing date
Publication date
Application filed by Mimos Berhad filed Critical Mimos Berhad
Publication of WO2021060966A1 publication Critical patent/WO2021060966A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/483Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Landscapes

  • Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a system (100) and method for retrieving a presentation content. The system (100) comprises a presenter device (101) configured to stream a live presentation video that includes the video stream and its audio stream; a wireless presentation box (102) which includes a frame reader (501), a speech recorder (502), and an image content type classifier (503); a post-processing server (103); a database server (105) connected to the post-processing server (103); and a viewer device (106) configured to allow the user to input a keyword for the search of a screen capture and retrieve the screen capture and its respective speech audio related to the keyword. The system (100) further comprises an application server (104) which includes a query engine module (511) that is configured to allow the viewer device (106) to search and retrieve a presentation content from the database server (105).

Description

A SYSTEM AND METHOD FOR RETRIEVING A PRESENTATION CONTENT
FIELD OF INVENTION
The present invention relates to a system and method for retrieving a presentation content.
BACKGROUND OF THE INVENTION
An information retrieval system is a system that handles the storage, retrieval, and maintenance of information. The information in this context can be composed of text, images, audio, video, and other multimedia content. Typically, a user enters a keyword to search for and retrieve the information. All text, audio, video, and other multimedia content tagged with a keyword similar to the queried keyword then appears as results.
An example of an information retrieval system and method is disclosed in China Patent No. CN106384108, which relates to a method for retrieving text content. The method includes image identification, text extraction, and voice identification, and provides text retrieval using keywords. A database interprets the word contents associated with the keywords after a query module receives the text content from a text extracting module and a speech recognition module.
Another example of an information retrieval system and method is disclosed in United States Patent No. 7099860 B1, which relates to an image retrieval system. The system performs both keyword-based and content-based image retrieval. Depending on the input query, the system finds images with keywords and/or images with similar low-level features such as colour, texture, and shape. A user interface allows the user to identify images that are more relevant to the query, as well as images that are less relevant. The system monitors the user feedback to refine searches and to train the system for future search queries.
In an online presentation session, a video of the presentation slides is streamed to the user together with its audio comprising the speech of the respective presenter explaining the presentation slides. Thus, the presentation content, which includes the video file of the presentation slides and the audio file of the presenter, may be stored in a database for retrieval. The presentation content can be retrieved later if each stored presentation content is tagged with a keyword. Thus, the existing information retrieval system can be used to retrieve the presentation content as a whole. However, the existing information retrieval system may not be used to pinpoint or search for a specific presentation slide together with the presenter's speech for that particular slide. This is because the presentation slide is packaged as part of the video file and the keyword relates only to the presentation content as a whole. The only way for the user to find a specific presentation slide with its speech is to browse through the whole video file and audio file.
Therefore, there is a need to provide a system and method for retrieving a presentation content that addresses the aforementioned drawback.
SUMMARY OF INVENTION
The present invention provides a system (100) and method for retrieving a presentation content. The system (100) comprises a presenter device (101), a wireless presentation box (102), a post-processing server (103), a database server (105), and a viewer device (106). The presenter device (101) is configured to stream a presentation video that includes the video stream and its audio stream of a presentation session. The wireless presentation box (102) includes a speech recorder (502) configured to record the audio stream from the presenter device (101). The post-processing server (103) includes an image to text extractor (504) and a speech to text extractor (505). The wireless presentation box (102) further includes a frame reader (501) and an image content type classifier (503). The frame reader (501) is configured to perform screen capturing of the video streaming from the presenter device (101), and the image content type classifier (503) is configured to detect the screen captures to be saved in the database server (105). The post-processing server (103) further includes an object detector (506), a related text generator (507), a text language classifier (508), and a text language translator (509). The object detector (506) is configured to detect common objects found in the screen captures. The related text generator (507) is configured to generate related texts from at least one given input text. The text language classifier (508) is configured to predict the language of at least one given text. The text language translator (509) is configured to translate at least one input text from one language to another language. The viewer device (106) is configured to allow the user to input a keyword for the search of a screen capture with its respective speech audio and retrieve the screen capture and its respective speech audio related to the keyword. The system (100) further includes an application server (104) connected to the database server (105) and the viewer device (106), wherein the application server (104) includes a query engine module (511) that is configured to allow the viewer device (106) to search and retrieve a presentation content from the database server (105).
Preferably, the image to text extractor (504) is configured to extract text from each screen capture by using Optical Character Recognition.
Preferably, the speech to text extractor (505) is configured to extract text by using a deep learning model.
Preferably, the post-processing server (103) includes a session and presenter link manager (510) configured to create links between different presentation sessions and/or presenters.
The present invention also provides a method for storing a presentation content. The method includes the steps of streaming a presentation video from a presenter device (101) to a wireless presentation box (102), receiving, decoding, and playing the presentation video stream by the wireless presentation box (102), performing screen capturing by a frame reader (501) and recording the presenter speech by a speech recorder (502), storing the screen captures and the recorded speech in a database server (105), and processing the stored screen captures and the recorded speech.
Preferably, the step of performing screen capturing by a frame reader (501) includes the sub-steps of receiving a video frame from the presenter device (101) by the frame reader (501) and capturing the video frame that contains the required presentation content by the image content type classifier (503).
Preferably, the step of processing the stored screen captures and the recorded speech includes the sub-steps of extracting a text from the screen captures by an image to text extractor (504), extracting a text from the presenter speech by a speech to text extractor (505), detecting objects from the presentation screen captures and converting object categories into text by the object detector (506), generating related words from the extracted texts, associating the screen captures with the extracted texts and related words, and creating links between presentation sessions and/or presenters by a session and presenter link manager (510).
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention.
FIG. 1 illustrates a block diagram of a system (100) for retrieving a presentation content according to an embodiment of the present invention.
FIG. 2 illustrates a flowchart of a method for storing a presentation content.
FIG. 3 illustrates a flowchart of sub-steps for processing stored screen captures and recorded speech of the method of FIG. 2.
FIG. 4 illustrates a flowchart of sub-steps for extracting text from the screen captures.
FIG. 5 illustrates a flowchart of sub-steps for obtaining text regions in the screen captures.
FIG. 6 illustrates a flowchart of sub-steps for extracting text from the presenter speech.
FIG. 7 illustrates a flowchart of sub-steps for detecting objects from the screen captures.
FIG. 8 illustrates a flowchart of sub-steps for creating links between presentation sessions and/or presenters.
FIG. 9 illustrates a flowchart of a method for retrieving a presentation content.
DESCRIPTION OF THE PREFERRED EMBODIMENT
A preferred embodiment of the present invention will be described herein below with reference to the accompanying drawings. In the following description, well-known functions or constructions are not described in detail since they would obscure the description with unnecessary detail.
Reference is made to FIG. 1 which illustrates a block diagram of a system for retrieving a presentation content according to an embodiment of the present invention. The presentation content refers to one or more screen captures in respect of the presentation slides, while a presentation session refers to a whole presentation comprising all the presentation slides. The system (100) is able to perform a search for specific presentation screen captures by entering keywords. The system (100) comprises a presenter device (101), a wireless presentation box (102), a post-processing server (103), an application server (104), a database server (105), and a viewer device (106).
The presenter device (101) is connected to the wireless presentation box (102) via a wireless connection such as Wi-Fi. The presenter device (101) is configured to stream a live presentation video which includes the video stream and its audio stream of a presentation session.
The wireless presentation box (102) includes a frame reader (501), a speech recorder (502), and an image content type classifier (503). The frame reader (501) is configured to perform screen capturing of the live presentation video during the streaming of the presentation video from the presenter device (101). The speech recorder (502) is configured to record the live audio stream of the presenter during the streaming of the live presentation video from the presenter device (101). In particular, the speech recorder (502) automatically records the continuous live audio stream using a microphone attached to the wireless presentation box (102) and saves the audio into the database server (105). The image content type classifier (503) is configured to detect which screen captures need to be saved for further processing, wherein the image content type classifier (503) is trained to determine the required screen captures. The image content type classifier (503) applies one or more image processing techniques, such as edge detection, corner detection, blob detection, and ridge detection, to the screen captures. The image processing techniques are applied to extract regions or features in the screen captures. Based on the extracted features, one or more classification models, for example a Bag of Words model, a Naive Bayes model, or a k-nearest neighbours model, are used to detect which screen captures need to be saved. The wireless presentation box (102) is further connected to the database server (105).
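The patent does not tie the image content type classifier (503) to any particular implementation. As a minimal sketch, assuming OpenCV and scikit-learn are available, the feature-plus-classifier pipeline described above might look like the following; the ORB keypoints standing in for corner/blob features, the visual-vocabulary size, and the example file names are all illustrative assumptions.
```python
# Illustrative sketch only: decide which screen captures should be saved.
# ORB keypoints quantised into a bag-of-visual-words histogram stand in for
# the edge/corner/blob features described in the patent.
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB

orb = cv2.ORB_create()

def descriptors(image_path):
    """Extract ORB keypoint descriptors (corner-like features) from a frame."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, desc = orb.detectAndCompute(gray, None)
    return desc if desc is not None else np.empty((0, 32), dtype=np.uint8)

def bow_histogram(desc, vocabulary):
    """Quantise descriptors against a visual vocabulary (Bag of Words)."""
    hist = np.zeros(vocabulary.n_clusters)
    if len(desc):
        for word in vocabulary.predict(desc.astype(np.float32)):
            hist[word] += 1
    return hist / max(hist.sum(), 1)

# Training data: labelled example frames (1 = slide content worth saving, 0 = not).
train_paths = ["slide1.png", "slide2.png", "desktop.png"]   # hypothetical files
train_labels = [1, 1, 0]
all_desc = np.vstack([descriptors(p) for p in train_paths]).astype(np.float32)
vocabulary = KMeans(n_clusters=32, n_init=10).fit(all_desc)
X = np.array([bow_histogram(descriptors(p), vocabulary) for p in train_paths])
classifier = GaussianNB().fit(X, train_labels)

def should_save(frame_path):
    """Return True when the classifier judges the capture worth storing."""
    feature = bow_histogram(descriptors(frame_path), vocabulary).reshape(1, -1)
    return bool(classifier.predict(feature)[0])
```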
The database server (105) is configured to store the related presentation content as captured and recorded by the frame reader (501) and the speech recorder (502) respectively. In addition, other content, including the identity of the presenter device (101), the timestamp of each screen capture, and the title of each presentation slide, is also stored in the database server (105).
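For illustration only, a minimal SQLite sketch of the kind of tables the database server (105) might hold is shown below; the table and column names are assumptions rather than details from the patent.
```python
# Minimal sketch of a storage schema for the database server (105).
# Table and column names are assumptions for illustration only.
import sqlite3

conn = sqlite3.connect("presentations.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS sessions (
    session_id   INTEGER PRIMARY KEY,
    presenter_id TEXT,              -- identity of the presenter device (101)
    title        TEXT,
    started_at   TEXT
);
CREATE TABLE IF NOT EXISTS screen_captures (
    capture_id   INTEGER PRIMARY KEY,
    session_id   INTEGER REFERENCES sessions(session_id),
    captured_at  TEXT,              -- timestamp of the screen capture
    slide_title  TEXT,
    image_path   TEXT,              -- file saved by the frame reader (501)
    audio_path   TEXT,              -- clip saved by the speech recorder (502)
    ocr_text     TEXT,              -- filled by the image to text extractor (504)
    speech_text  TEXT,              -- filled by the speech to text extractor (505)
    related_text TEXT               -- filled by the related text generator (507)
);
""")
conn.commit()
```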
The post-processing server (103) is connected to the database server (105). The post-processing server (103) comprises an image to text extractor (504), a speech to text extractor (505), an object detector (506), a related text generator (507), a text language classifier (508), a text language translator (509), and a session and presenter link manager (510).
The image to text extractor (504) is configured to extract text from each screen capture. The text is extracted using Optical Character Recognition (OCR) software, which converts images of typed or printed text into machine-encoded text.
The speech to text extractor (505) is configured to extract text from the recorded presenter speech. The text is extracted using a deep learning model whose input is the speech spectrogram. The deep learning model is trained using a plurality of example pairs of spectrograms and labelled sentences.
The object detector (506) is configured to detect common objects found in the screen captures. In particular, common object detection is a term from the image processing discipline that refers to the ability to detect instances of object categories within the screen capture images. The related text generator (507) is configured to generate related texts from given input texts. The words are generated using word stemming and synonyms. The text language classifier (508) is configured to predict the language of at least one given text. The text language classifier (508) is trained to detect the language based on the input texts. The text language classifier (508) utilises one or more language modelling approaches, such as a Frequent Words model, a Short Words model, or an N-gram based model, to detect words in the input texts. In particular, the language model learns the probability of word occurrence from the input texts and represents the texts in a numerical representation. Based on the numerical representation, a classification model such as a Naive Bayes model or a k-nearest neighbours model is used to predict the text language. For example, if a given text is French, the text language classifier (508) automatically recognises the language. The language information is used to translate all the presentation texts into a common language such as English. The text language translator (509) is configured to translate the input texts from one language to another. If a user searches using a different language, the user may still be directed to the appropriate presentation sessions, as the presentation sessions are translated from the original language to the language the user prefers, such as Malay. The session and presenter link manager (510) is configured to create links between different presentation sessions and/or presenters. In particular, the session and presenter link manager (510) keeps records of the presenters, the presentation sessions, and the screen captures relevant to each presentation session. If a user searches for common content delivered by two different presenters, a link between both presenters is formed. Furthermore, the session and presenter link manager (510) may recommend the presentation sessions from both presenters.
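As a hedged illustration of the text language classifier (508), the sketch below combines character n-gram counts (one possible reading of the N-gram based numerical representation) with a Naive Bayes model using scikit-learn; the tiny training sample and the language labels are invented for the example.
```python
# Illustrative sketch of the text language classifier (508): character
# n-gram counts fed to a Naive Bayes model. The training data is a tiny
# hypothetical sample; a real classifier would be trained on much more text.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

samples = [
    ("the quick brown fox jumps over the lazy dog", "en"),
    ("information retrieval systems store and search content", "en"),
    ("le renard brun saute par dessus le chien paresseux", "fr"),
    ("les systemes de recherche stockent le contenu", "fr"),
    ("sistem capaian maklumat menyimpan dan mencari kandungan", "ms"),
    ("penyampai menerangkan slaid pembentangan kepada penonton", "ms"),
]
texts, labels = zip(*samples)

# Character 1-3 gram counts serve as the numerical representation of the text.
language_classifier = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    MultinomialNB(),
).fit(texts, labels)

print(language_classifier.predict(["la presentation est enregistree"]))  # e.g. ['fr']
```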
The application server (104) is connected to the database server (105) and the viewer device (106) via the Internet. The application server (104) includes a query engine module (511) which is configured to allow the viewer device (106) to search for and retrieve the presentation content from the database server (105).
The viewer device (106) is configured to allow the user to input a keyword for the search of a screen capture with its respective speech audio and retrieve the screen capture and its respective speech audio related to the keyword.
FIG. 2 illustrates a flowchart of a method for storing a presentation content for retrieval of the presentation content. Initially, the presenter device (101) connects to the wireless presentation box (102) via Wi-Fi to stream a live presentation session as in step 1000.
Next, the wireless presentation box (102) receives, decodes, and plays back the live presentation video as in step 2000.
As the live presentation video streams through the wireless presentation box (102), the frame reader (501) performs screen capturing and the speech recorder (502) records the presenter speech as in step 3000. Specifically, the frame reader (501) receives a video frame from the presenter device (101) and the image content type classifier (503) determines whether the video frame contains the required presentation content. The image content type classifier (503) is trained to recognise which frames of the presentation slides are important to capture, while the speech recorder (502) is programmed to record the presenter speech automatically when the presenter begins the presentation session.
Once the frame reader (501) and the speech recorder (502) have performed the screen capturing and recorded the presenter speech, both contents are stored in the database server (105) as in step 4000.
Finally, the stored screen captures and the recorded speech are processed as in step 5000.
FIG. 3 illustrates a flowchart of sub-steps for processing the stored screen captures and recorded speech as in step 5000 of the method of FIG. 2.
In step 5100, the image to text extractor (504) extracts text from the presentation screen captures. The image to text extractor (504) uses OCR software, which converts images of typed or printed text into machine-encoded text.
In step 5200, the speech to text extractor (505) extracts the text from the presenter speech. In step 5300, the object detector (506) detects objects in the screen captures and converts the object categories into text. In particular, any non-text data may be considered as objects to be detected. The object detector (506) is trained to detect objects, as well as instances of object categories, within the screen captures.
In step 5400, the related text generator (507) generates related words from the extracted texts once the object detector (506) has converted the object categories into text.
In step 5500, the screen captures are associated with the extracted texts and related words.
In step 5600, the session and presenter link manager (510) creates links between presentation sessions and/or presenters.
FIG. 4 illustrates a flowchart of sub-steps for extracting text from the screen captures as in step 5100 of FIG. 3. Initially, text regions are obtained from the presentation screen captures as in step 5110. Once the text regions have been obtained, the image to text extractor (504) performs image processing on the text regions as in step 5120. The image processing is a combination of processes performed to preserve important elements in the text regions such as edges, colour, texture, and composition. The techniques applied include image denoising, image sharpening, low light enhancement, edge detection, and image cropping. They improve the quality of the text regions so that the text regions appear clearly and may be detected accurately in the next step.
Thereon, the text is extracted using OCR software, wherein the OCR software distinguishes printed text characters inside a digital image as in step 5130. Finally, the extracted text is stored in the database server (105) as in step 5140.
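A minimal sketch of steps 5120-5130, assuming OpenCV for the image processing and pytesseract as the OCR software, is shown below; the specific denoising, sharpening, and thresholding parameters are illustrative choices, not values taken from the patent.
```python
# Sketch of steps 5120-5130: clean up a text region, then run OCR on it.
# Assumes OpenCV and pytesseract (a Tesseract OCR wrapper) are installed.
import cv2
import numpy as np
import pytesseract

def extract_text_from_region(region_bgr):
    """Denoise, sharpen, and binarise a text region, then OCR it."""
    gray = cv2.cvtColor(region_bgr, cv2.COLOR_BGR2GRAY)
    denoised = cv2.fastNlMeansDenoising(gray, h=10)
    sharpen_kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])
    sharpened = cv2.filter2D(denoised, -1, sharpen_kernel)
    # Adaptive thresholding helps low-light slides before character recognition.
    binary = cv2.adaptiveThreshold(sharpened, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 31, 15)
    return pytesseract.image_to_string(binary)
```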
FIG. 5 illustrates a flowchart of sub-steps for obtaining text regions in the screen captures as in step 5110 of FIG. 4. Initially, a screen capture is supplied to a multi-output classifier as in step 5111. In particular, the screen capture is split into multiple regions; if the output value for a region is 0, the corresponding region contains no text, and vice versa.
Next, the partitions of the screen capture that contain texts are obtained as in step 5112. Finally, the connected regions are combined to become larger image regions as in step 5113.
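The multi-output classifier itself is not specified in the patent, so the sketch below replaces it with a simple edge-density heuristic purely to make the split-classify-merge flow of steps 5111-5113 concrete; the grid size and the threshold are assumptions.
```python
# Sketch of steps 5111-5113: split a screen capture into grid regions,
# mark which regions contain text, and merge adjacent text regions.
# The trained multi-output classifier is replaced by an edge-density
# heuristic as a stand-in; this is an assumption, not the patent's model.
import cv2
import numpy as np

def text_regions(image_path, rows=8, cols=8, edge_ratio=0.02):
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    h, w = gray.shape
    edges = cv2.Canny(gray, 100, 200)
    cell_h, cell_w = h // rows, w // cols
    mask = np.zeros((rows, cols), dtype=np.uint8)
    for r in range(rows):
        for c in range(cols):
            cell = edges[r * cell_h:(r + 1) * cell_h, c * cell_w:(c + 1) * cell_w]
            # Output 1 means "this region contains text", 0 means it does not.
            mask[r, c] = 1 if cell.mean() / 255.0 > edge_ratio else 0
    # Merge connected positive cells into larger rectangular image regions.
    n_labels, labels = cv2.connectedComponents(mask)
    regions = []
    for label in range(1, n_labels):
        rs, cs = np.where(labels == label)
        y0, y1 = rs.min() * cell_h, (rs.max() + 1) * cell_h
        x0, x1 = cs.min() * cell_w, (cs.max() + 1) * cell_w
        regions.append(gray[y0:y1, x0:x1])
    return regions
```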
FIG. 6 illustrates a flowchart of sub-steps for extracting text from the presenter speech as in step 5200 of FIG. 3. Initially, the presenter speech file is loaded as a time signal to synchronise the presentation sessions with the recorded speech as in step 5210. Next, a spectrogram of the loaded time-based speech signal is computed as in step 5220.
Thereon, the spectrogram of the speech signal is supplied to the deep learning sequence-to-sequence model as in step 5230. Specifically, the sequence-to-sequence model builds on top of a language model by adding an encoder step and a decoder step. In the encoder step, the model takes the speech spectrogram as input and encodes the character sequence into an encoded representation. In the decoder step, the output sequence from the encoder is decoded to produce the output sentence. Thus, the deep learning sequence-to-sequence model provides the text of the presenter speech as output. Finally, the output texts are obtained and stored in the database server (105) as in step 5240.
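The following sketch, assuming SciPy for the spectrogram and PyTorch for the network, makes the data flow of steps 5210-5230 concrete; the GRU-based encoder-decoder, its layer sizes, and the mono example file are assumptions, and the model is an untrained skeleton rather than the patent's trained model.
```python
# Sketch of steps 5210-5230: load the speech as a time signal, compute its
# spectrogram, and pass it through a minimal encoder-decoder network.
import numpy as np
import torch
import torch.nn as nn
from scipy.io import wavfile
from scipy.signal import spectrogram

sample_rate, speech = wavfile.read("presenter_speech.wav")   # hypothetical mono file
_, _, spec = spectrogram(speech.astype(np.float32), fs=sample_rate, nperseg=400)
# spec has shape (n_freq_bins, n_time_frames); the model consumes it frame by frame.
frames = torch.tensor(spec.T, dtype=torch.float32).unsqueeze(0)  # (1, time, freq)

class Seq2SeqASR(nn.Module):
    def __init__(self, n_freq, vocab_size, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(n_freq, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, vocab_size)   # per-frame character scores

    def forward(self, spectrogram_frames):
        encoded, state = self.encoder(spectrogram_frames)   # encoder step
        decoded, _ = self.decoder(encoded, state)            # decoder step
        return self.classifier(decoded)

model = Seq2SeqASR(n_freq=frames.shape[-1], vocab_size=30)
char_scores = model(frames)   # training would pair these with labelled sentences
```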
Reference is now made to FIG. 7 which illustrates a flowchart of sub-steps for detecting objects from the screen captures as in step 5300 of FIG. 3. Initially, the screen captures are passed to the object detector (506) as in step 5310. Next, the list of detected objects within the screen captures is initialised as an empty list as in step 5320, which means there is no object detected by the object detector (506).
Thereon, the output is received from the object detector (506) as in step 5330. Finally, the objects in the list of objects detected within the screen captures are converted to words and stored in the database server (105) as in step 5340.
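The patent does not name a detection model, so the sketch below uses a pretrained torchvision Faster R-CNN as a stand-in for the object detector (506) to show how detected category labels become words; the model choice and the 0.7 score threshold are assumptions.
```python
# Sketch of steps 5310-5340 with an off-the-shelf detector standing in for
# the object detector (506).
import torch
from PIL import Image
from torchvision.models.detection import (fasterrcnn_resnet50_fpn,
                                           FasterRCNN_ResNet50_FPN_Weights)

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
detector = fasterrcnn_resnet50_fpn(weights=weights).eval()
preprocess = weights.transforms()
categories = weights.meta["categories"]

def detect_object_words(capture_path, score_threshold=0.7):
    """Return the category words for objects detected in one screen capture."""
    image = preprocess(Image.open(capture_path).convert("RGB"))
    detected_words = []                       # step 5320: start with an empty list
    with torch.no_grad():
        output = detector([image])[0]         # step 5330: detector output
    for label, score in zip(output["labels"], output["scores"]):
        if score >= score_threshold:          # step 5340: keep confident categories as words
            detected_words.append(categories[int(label)])
    return detected_words
```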
FIG. 8 illustrates a flowchart of sub-steps for creating links between presentation sessions and/or presenters as in step 5600 of FIG. 3. Initially, the session and presenter link manager (510) receives presentation sessions and/or content as in step 5610.
Next, the session and presenter link manager (510) transforms the presentation content and computes similarities as in step 5620. Finally, a link is added between sessions and/or content whose similarity exceeds a threshold value as in step 5630. For example, if a user searches for particular content related to any presentation sessions and/or contents, the session and presenter link manager (510) may recommend the closest presentation sessions and/or content from different presenters. The threshold value may be implemented on a 0-100% scale, with 100% referring to an exact match or similarity. For example, a value of 50% means that if half of the content (selected keywords) is matched, a link between them will be created.
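As one reasonable reading of the transform-and-compute-similarities step, the sketch below represents each session's extracted text with TF-IDF and links sessions whose cosine similarity exceeds the threshold; the sample texts and the exact similarity measure are assumptions.
```python
# Sketch of steps 5620-5630: TF-IDF transform, pairwise similarity, and
# linking of sessions above a threshold.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

session_texts = {                      # hypothetical extracted texts per session
    "session_a": "neural networks for image retrieval and indexing",
    "session_b": "indexing presentation slides with neural networks",
    "session_c": "quarterly financial report and budget planning",
}
ids = list(session_texts)
matrix = TfidfVectorizer().fit_transform(session_texts.values())
similarity = cosine_similarity(matrix)

THRESHOLD = 0.5                        # 50% on the 0-100% scale described above
links = [(ids[i], ids[j])
         for i in range(len(ids)) for j in range(i + 1, len(ids))
         if similarity[i, j] >= THRESHOLD]
print(links)                           # e.g. [('session_a', 'session_b')]
```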
FIG. 9 illustrates a flowchart of a method for retrieving a presentation content. Initially, a user enters keywords to search for screen captures as in step 6100. Next, the system (100) returns a list of potential screen captures and the corresponding sessions as in step 6200.
Thereon, the user selects required screen captures as in step 6300. After the screen captures have been selected, the user is directed to the presentation sessions based on the selected screen captures as in step 6400.
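Reusing the hypothetical schema sketched earlier for the database server (105), the keyword search of steps 6100-6400 could be expressed as follows; the SQL LIKE matching is an illustrative simplification of the query engine module (511).
```python
# Sketch of steps 6100-6400 against the hypothetical schema sketched earlier.
import sqlite3

def search_screen_captures(keyword):
    """Steps 6100-6200: return captures whose extracted text matches the keyword."""
    conn = sqlite3.connect("presentations.db")
    pattern = f"%{keyword}%"
    rows = conn.execute(
        """SELECT c.capture_id, c.image_path, c.audio_path, s.session_id, s.title
           FROM screen_captures c JOIN sessions s ON c.session_id = s.session_id
           WHERE c.ocr_text LIKE ? OR c.speech_text LIKE ? OR c.related_text LIKE ?""",
        (pattern, pattern, pattern)).fetchall()
    conn.close()
    return rows

results = search_screen_captures("retrieval")
if results:
    capture_id, image_path, audio_path, session_id, title = results[0]
    # Steps 6300-6400: the user picks a capture and is directed to its session.
    print(f"Open session {session_id} ({title}) at capture {capture_id}")
```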
While embodiments of the invention have been illustrated and described, it is not intended that these embodiments illustrate and describe all possible forms of the invention. Rather, the words used in the specification are words of description rather than limitation, and various changes may be made without departing from the scope of the invention.

Claims

1 . A system (100) for retrieving a presentation content comprising: a) a presenter device (101) configured to stream a presentation video that includes the presentation video stream and its audio stream of a presentation session; b) a wireless presentation box (102) includes a speech recorder (502) configured to record the audio stream from the presenter device (101); c) a post-processing server (103) includes an image to text extractor (504) and a speech to text extractor (505); d) a database server (105) connected to the post-processing server (103); and e) a viewer device (106); wherein the system (100) is characterised by: f) the wireless presentation box (102) further includes: i. a frame reader (501) configured to perform screen capturing of the video streaming from the presenter device (101); and ii. an image content type classifier (503) configured to detect the screen captures to be saved in the database server (105); g) the post-processing server (103) further includes: i. an object detector (506) configured to detect common objects found in the screen captures; ii. a related text generator (507) configured to generate related texts from given input texts; iii. a text language classifier (508) configured to predict the language of at least one given text; and iv. a text language translator (509) configured to translate input texts from one language to another language; h) the viewer device (106) configured to allow a user to input a keyword for searching of a screen capture with its respective speech audio and retrieving the screen capture and its respective speech audio related to the keyword; and i) an application server (104) connected to the database server (105) and the viewer device (106), wherein the application server (104) includes a query engine module (511) that is configured to allow the viewer device (106) to search and retrieve a presentation content from the database server (105).
2. The system (100) as claimed in claim 1, wherein the image to text extractor (504) is configured to extract texts from each screen capture by using Optical Character Recognition.
3. The system (100) as claimed in claim 1, wherein the speech to text extractor (505) is configured to extract texts by using deep learning model.
4. The system (100) as claimed in claim 1, wherein the post-processing server (103) includes a session and presenter link manager (510) configured to create links between different presentation sessions and/or presenters.
5. A method for storing a presentation content for retrieval of the presentation content is characterised by the steps of: a) streaming a presentation video from a presenter device (101) to a wireless presentation box (102); b) receiving, decoding, and playing the presentation video stream by the wireless presentation box (102); c) performing screen capturing by a frame reader (501) and recording the presenter speech by a speech recorder (502); d) storing the screen captures and the recorded speech in a database server (105); and e) processing the stored screen captures and the recorded speech.
6. The method as claimed in claim 5, wherein the step of performing screen capturing by a frame reader (501) includes: a) receiving a video frame from the presenter device (101) by the frame reader (501); and b) capturing the video frame that contains the required presentation content by the image content type classifier (503).
7. The method as claimed in claim 5, wherein the step of processing the stored screen captures and the recorded speech includes: a) extracting a text from the screen captures by an image to text extractor (504); b) extracting a text from the presenter speech by a speech to text extractor (505); c) detecting objects from the presentation screen captures and converting object categories into text by the object detector (506); d) generating related words from the extracted texts; e) associating the screen captures with the extracted texts and related words; and f) creating links between presentation sessions and/or presenters by a session and presenter link manager (510).
PCT/MY2020/050055 2019-09-27 2020-07-22 A system and method for retrieving a presentation content WO2021060966A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
MYPI2019005725 2019-09-27
MYPI2019005725 2019-09-27

Publications (1)

Publication Number Publication Date
WO2021060966A1 true WO2021060966A1 (en) 2021-04-01

Family

ID=75165284

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/MY2020/050055 WO2021060966A1 (en) 2019-09-27 2020-07-22 A system and method for retrieving a presentation content

Country Status (1)

Country Link
WO (1) WO2021060966A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003242155A (en) * 2002-02-15 2003-08-29 Nippon Telegr & Teleph Corp <Ntt> Flex seminar service system
JP2006134051A (en) * 2004-11-05 2006-05-25 Fuji Xerox Co Ltd Translation device, translation method and program
US20080162469A1 (en) * 2006-12-27 2008-07-03 Hajime Terayoko Content register device, content register method and content register program
US20110081075A1 (en) * 2009-10-05 2011-04-07 John Adcock Systems and methods for indexing presentation videos
JP2015032905A (en) * 2013-07-31 2015-02-16 キヤノンマーケティングジャパン株式会社 Information processing device, information processing method, and program
JP2015125650A (en) * 2013-12-26 2015-07-06 日本放送協会 Topic extraction device and program
JP2019066928A (en) * 2017-09-28 2019-04-25 京セラドキュメントソリューションズ株式会社 Knowledge utilization promoting device and knowledge utilization promoting method

Similar Documents

Publication Publication Date Title
US11197036B2 (en) Multimedia stream analysis and retrieval
US9489577B2 (en) Visual similarity for video content
KR100828166B1 (en) Method of extracting metadata from result of speech recognition and character recognition in video, method of searching video using metadta and record medium thereof
US20180109843A1 (en) Methods and systems for aggregation and organization of multimedia data acquired from a plurality of sources
CN103761261B (en) A kind of media search method and device based on speech recognition
US7908141B2 (en) Extracting and utilizing metadata to improve accuracy in speech to text conversions
US10572528B2 (en) System and method for automatic detection and clustering of articles using multimedia information
KR102376201B1 (en) System and method for generating multimedia knowledge base
CN112738556B (en) Video processing method and device
CN111767461A (en) Data processing method and device
CN101778233A (en) Data processing apparatus, data processing method, and program
CN112382295B (en) Speech recognition method, device, equipment and readable storage medium
CN114357206A (en) Education video color subtitle generation method and system based on semantic analysis
CN108229285B (en) Object classification method, object classifier training method and device and electronic equipment
US20230401389A1 (en) Enhanced Natural Language Processing Search Engine for Media Content
Yang et al. Lecture video browsing using multimodal information resources
WO2021060966A1 (en) A system and method for retrieving a presentation content
EP3905060A1 (en) Artificial intelligence for content discovery
Tapu et al. TV news retrieval based on story segmentation and concept association
KR20220055648A (en) Method and apparatus for generating video script
JP2002014973A (en) Video retrieving system and method, and recording medium with video retrieving program recorded thereon
JP4305921B2 (en) Video topic splitting method
JP6858003B2 (en) Classification search system
Khollam et al. A Survey on Content Based Lecture Video Retrieval Using Speech and Video Text information
Varma et al. Video Indexing System Based on Multimodal Information Extraction Using Combination of ASR and OCR

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20869815

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20869815

Country of ref document: EP

Kind code of ref document: A1