WO2021060966A1 - A system and method for retrieving a presentation content - Google Patents

A system and method for retrieving a presentation content

Info

Publication number
WO2021060966A1
WO2021060966A1 (PCT Application No. PCT/MY2020/050055)
Authority
WO
WIPO (PCT)
Prior art keywords
presentation
text
speech
presenter
screen
Prior art date
Application number
PCT/MY2020/050055
Other languages
French (fr)
Inventor
Gin Xian KOK
Khong Neng Choong
Chrishanton Vethanayagam SEBASTIAMPILLAI
Original Assignee
Mimos Berhad
Priority date
Filing date
Publication date
Application filed by Mimos Berhad filed Critical Mimos Berhad
Publication of WO2021060966A1 publication Critical patent/WO2021060966A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/483Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Landscapes

  • Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a system (100) and method for retrieving a presentation content. The system (100) comprises a presenter device (101) configured to stream a live presentation video that includes the video stream and its audio stream; a wireless presentation box (102) which includes a frame reader (501), a speech recorder (502), and an image content type classifier (503); a post-processing server (103); a database server (105) connected to the post-processing server (103); and a viewer device (106) configured to allow the user to input a keyword for the search of a screen capture and retrieve the screen capture and its respective speech audio related to the keyword. The system (100) further comprises an application server (104) which includes a query engine module (511) that is configured to allow the viewer device (106) to search and retrieve a presentation content from the database server (105).

Description

A SYSTEM AND METHOD FOR RETRIEVING A PRESENTATION CONTENT
FIELD OF INVENTION
The present invention relates to a system and method for retrieving a presentation content.
BACKGROUND OF THE INVENTION
An information retrieval system is a system that handles the storage, retrieval, and maintenance of information. The information in this context can be composed of text, images, audio, video, and other multimedia content. Typically, a user enters a keyword to search for and retrieve the information. All text, audio, video, and other multimedia content tagged with a keyword similar to the queried keyword then appears as results.
An example of an information retrieval system and method is disclosed in China Patent No. CN106384108, which relates to a method for retrieving text content. The method includes image identification, text extraction, and voice identification, and provides text retrieval using keywords. A database interprets the word contents associated with the keywords after a query module receives the text content from a text extracting module and a speech recognition module.
Another example of an information retrieval system and method is disclosed in United States Patent No. 7099860 B1, which relates to an image retrieval system. The system performs both keyword-based and content-based image retrieval. Depending on the input query, the system finds images with keywords and/or images with similar low-level features such as colour, texture, and shape. A user interface allows the user to identify images that are more relevant to the query, as well as images that are less relevant. The system monitors the user feedback to refine searches and to train the system for future search queries.
In an online presentation session, a video of the presentation slides is streamed to the user together with its audio comprising the speech of the respective presenter explaining the presentation slides. Thus, the presentation content, which includes the video file of the presentation slides and the audio file of the presenter, may be stored in a database for retrieval. The presentation content can be retrieved later if each stored presentation content is tagged with a keyword. Thus, the existing information retrieval system can be used to retrieve the presentation content as a whole. However, the existing information retrieval system may not be used to pinpoint or search for a specific presentation slide together with the presenter's speech for that particular slide. This is because the presentation slide is packaged as part of the video file and the keyword relates only to the presentation content as a whole. The only way for the user to find a specific presentation slide with its speech is to browse through the whole video file and audio file.
Therefore, there is a need to provide a system and method for retrieving a presentation content that addresses the aforementioned drawback.
SUMMARY OF INVENTION
The present invention provides a system (100) and method for retrieving a presentation content. The system (100) comprises a presenter device (101), a wireless presentation box (102), a post-processing server (103), a database server (105), and a viewer device (106). The presenter device (101) is configured to stream a presentation video that includes the video stream and its audio stream of a presentation session. The wireless presentation box (102) includes a speech recorder (502) configured to record the audio stream from the presenter device (101). The post-processing server (103) includes an image to text extractor (504) and a speech to text extractor (505). The wireless presentation box (102) further includes a frame reader (501) and an image content type classifier (503). The frame reader (501) is configured to perform screen capturing of the video streaming from the presenter device (101), and the image content type classifier (503) is configured to detect the screen captures to be saved in the database server (105). The post-processing server (103) further includes an object detector (506), a related text generator (507), a text language classifier (508), and a text language translator (509). The object detector (506) is configured to detect common objects found in the screen captures. The related text generator (507) is configured to generate related texts from at least one given input text. The text language classifier (508) is configured to predict the language of at least one given text. The text language translator (509) is configured to translate at least one input text from one language to another language. The viewer device (106) is configured to allow the user to input a keyword for the search of a screen capture with its respective speech audio and retrieve the screen capture and its respective speech audio related to the keyword. The system (100) further includes an application server (104) connected to the database server (105) and the viewer device (106), wherein the application server (104) includes a query engine module (511) that is configured to allow the viewer device (106) to search and retrieve a presentation content from the database server (105).
Preferably, the image to text extractor (504) is configured to extract text from each screen capture by using Optical Character Recognition.
Preferably, the speech to text extractor (505) is configured to extract text by using a deep learning model.
Preferably, the post-processing server (103) includes a session and presenter link manager (510) configured to create links between different presentation sessions and/or presenters.
The present invention also provides a method for storing a presentation content. The method includes the steps of streaming a presentation video from a presenter device (101) to a wireless presentation box (102), receiving, decoding, and playing the presentation video stream by the wireless presentation box (102), performing screen capturing by a frame reader (501) and recording the presenter speech by a speech recorder (502), storing the screen captures and the recorded speech in a database server (105), and processing the stored screen captures and the recorded speech.
Preferably, the step of performing screen capturing by a frame reader (501) includes the sub-steps of receiving a video frame from the presenter device (101) by the frame reader (501) and capturing the video frame that contains the required presentation content by the image content type classifier (503).
Preferably, the step of processing the stored screen captures and the recorded speech includes the sub-steps of extracting a text from the screen captures by an image to text extractor (504), extracting a text from the presenter speech by a speech to text extractor (505), detecting objects from the presentation screen captures and converting object categories into text by the object detector (506), generating related words from the extracted texts, associating the screen captures with the extracted texts and related words, and creating links between presentation sessions and/or presenters by a session and presenter link manager (510).
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention.
FIG. 1 illustrates a block diagram of a system (100) for retrieving a presentation content according to an embodiment of the present invention.
FIG. 2 illustrates a flowchart of a method for storing a presentation content.
FIG. 3 illustrates a flowchart of sub-steps for processing stored screen captures and recorded speech of the method of FIG. 2.
FIG. 4 illustrates a flowchart of sub-steps for extracting text from the screen captures.
FIG. 5 illustrates a flowchart of sub-steps for obtaining text regions in the screen captures.
FIG. 6 illustrates a flowchart of sub-steps for extracting text from the presenter speech.
FIG. 7 illustrates a flowchart of sub-steps for detecting objects from the screen captures.
FIG. 8 illustrates a flowchart of sub-steps for creating links between presentation sessions and/or presenters.
FIG. 9 illustrates a flowchart of a method for retrieving a presentation content.
DESCRIPTION OF THE PREFERRED EMBODIMENT
A preferred embodiment of the present invention will be described herein below with reference to the accompanying drawings. In the following description, well-known functions or constructions are not described in detail since they would obscure the description with unnecessary detail.
Reference is made to FIG. 1 which illustrates a block diagram of a system for retrieving a presentation content according to an embodiment of the present invention. The presentation content refers to one or more screen captures in respect of the presentation slides, while a presentation session refers to a whole presentation comprising all the presentation slides. The system (100) is able to perform a search for specific presentation screen captures by entering keywords. The system (100) comprises a presenter device (101), a wireless presentation box (102), a post-processing server (103), an application server (104), a database server (105), and a viewer device (106).
The presenter device (101) is connected to the wireless presentation box (102) via a wireless connection such as Wi-Fi. The presenter device (101) is configured to stream a live presentation video which includes the video stream and its audio stream of a presentation session.
The wireless presentation box (102) includes a frame reader (501), a speech recorder (502), and an image content type classifier (503). The frame reader (501) is configured to perform screen capturing of the live presentation video during the streaming of the presentation video from the presenter device (101). The speech recorder (502) is configured to record the live audio stream of the presenter during the streaming of the live presentation video from the presenter device (101). In particular, the speech recorder (502) automatically records the continuous live audio stream using a microphone attached to the wireless presentation box (102) and saves the audio into the database server (105). The image content type classifier (503) is configured to detect which screen captures need to be saved for further processing, wherein the image content type classifier (503) is trained to determine the required screen captures. The image content type classifier (503) applies one or more image processing techniques, such as edge detection, corner detection, blob detection, and ridge detection, to the screen captures. The image processing techniques are applied to extract regions or features in the screen captures. Based on the extracted features, one or more classification models, for example a Bag of Words model, a Naive Bayes model, or a k-nearest neighbours model, are used to detect which screen captures need to be saved. The wireless presentation box (102) is further connected to the database server (105).
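The patent does not tie the image content type classifier (503) to any particular implementation. As a minimal sketch, assuming OpenCV and scikit-learn are available, the feature-plus-classifier pipeline described above might look like the following; the ORB keypoints standing in for corner/blob features, the visual-vocabulary size, and the example file names are all illustrative assumptions.
```python
# Illustrative sketch only: decide which screen captures should be saved.
# ORB keypoints quantised into a bag-of-visual-words histogram stand in for
# the edge/corner/blob features described in the patent.
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB

orb = cv2.ORB_create()

def descriptors(image_path):
    """Extract ORB keypoint descriptors (corner-like features) from a frame."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, desc = orb.detectAndCompute(gray, None)
    return desc if desc is not None else np.empty((0, 32), dtype=np.uint8)

def bow_histogram(desc, vocabulary):
    """Quantise descriptors against a visual vocabulary (Bag of Words)."""
    hist = np.zeros(vocabulary.n_clusters)
    if len(desc):
        for word in vocabulary.predict(desc.astype(np.float32)):
            hist[word] += 1
    return hist / max(hist.sum(), 1)

# Training data: labelled example frames (1 = slide content worth saving, 0 = not).
train_paths = ["slide1.png", "slide2.png", "desktop.png"]   # hypothetical files
train_labels = [1, 1, 0]
all_desc = np.vstack([descriptors(p) for p in train_paths]).astype(np.float32)
vocabulary = KMeans(n_clusters=32, n_init=10).fit(all_desc)
X = np.array([bow_histogram(descriptors(p), vocabulary) for p in train_paths])
classifier = GaussianNB().fit(X, train_labels)

def should_save(frame_path):
    """Return True when the classifier judges the capture worth storing."""
    feature = bow_histogram(descriptors(frame_path), vocabulary).reshape(1, -1)
    return bool(classifier.predict(feature)[0])
```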
The database server (105) is configured to store the related presentation content as captured and recorded by the frame reader (501) and the speech recorder (502) respectively. In addition, other content, including the identity of the presenter device (101), the timestamp of each screen capture, and the title of each presentation slide, is also stored in the database server (105).
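For illustration only, a minimal SQLite sketch of the kind of tables the database server (105) might hold is shown below; the table and column names are assumptions rather than details from the patent.
```python
# Minimal sketch of a storage schema for the database server (105).
# Table and column names are assumptions for illustration only.
import sqlite3

conn = sqlite3.connect("presentations.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS sessions (
    session_id   INTEGER PRIMARY KEY,
    presenter_id TEXT,              -- identity of the presenter device (101)
    title        TEXT,
    started_at   TEXT
);
CREATE TABLE IF NOT EXISTS screen_captures (
    capture_id   INTEGER PRIMARY KEY,
    session_id   INTEGER REFERENCES sessions(session_id),
    captured_at  TEXT,              -- timestamp of the screen capture
    slide_title  TEXT,
    image_path   TEXT,              -- file saved by the frame reader (501)
    audio_path   TEXT,              -- clip saved by the speech recorder (502)
    ocr_text     TEXT,              -- filled by the image to text extractor (504)
    speech_text  TEXT,              -- filled by the speech to text extractor (505)
    related_text TEXT               -- filled by the related text generator (507)
);
""")
conn.commit()
```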
The post-processing server (103) is connected to the database server (105). The post-processing server (103) comprises an image to text extractor (504), a speech to text extractor (505), an object detector (506), a related text generator (507), a text language classifier (508), a text language translator (509), and a session and presenter link manager (510).
The image to text extractor (504) is configured to extract text from each screen capture. The text is extracted using Optical Character Recognition (OCR) software, which converts images of typed or printed text into machine-encoded text.
The speech to text extractor (505) is configured to extract text from the recorded presenter speech. The text is extracted using a deep learning model whose input is the speech spectrogram. The deep learning model is trained using a plurality of example pairs of spectrograms and labelled sentences.
The object detector (506) is configured to detect common objects found in the screen captures. In particular, common object detection is a term from the image processing discipline that refers to the ability to detect instances of object categories within the screen capture images. The related text generator (507) is configured to generate related texts from given input texts. The words are generated using word stemming and synonyms. The text language classifier (508) is configured to predict the language of at least one given text. The text language classifier (508) is trained to detect the language based on the input texts. The text language classifier (508) utilises one or more language modelling approaches, such as a Frequent Words model, a Short Words model, or an N-gram based model, to detect words in the input texts. In particular, the language model learns the probability of word occurrence from the input texts and represents the texts in a numerical representation. Based on the numerical representation, a classification model such as a Naive Bayes model or a k-nearest neighbours model is used to predict the text language. For example, if a given text is French, the text language classifier (508) automatically recognises the language. The language information is used to translate all the presentation texts into a common language such as English. The text language translator (509) is configured to translate the input texts from one language to another. If a user searches using a different language, the user may still be directed to the appropriate presentation sessions, as the presentation sessions are translated from the original language to the language the user prefers, such as Malay. The session and presenter link manager (510) is configured to create links between different presentation sessions and/or presenters. In particular, the session and presenter link manager (510) keeps records of the presenters, the presentation sessions, and the screen captures relevant to each presentation session. If a user searches for common content delivered by two different presenters, a link between both presenters is formed. Furthermore, the session and presenter link manager (510) may recommend the presentation sessions from both presenters.
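As a hedged illustration of the text language classifier (508), the sketch below combines character n-gram counts (one possible reading of the N-gram based numerical representation) with a Naive Bayes model using scikit-learn; the tiny training sample and the language labels are invented for the example.
```python
# Illustrative sketch of the text language classifier (508): character
# n-gram counts fed to a Naive Bayes model. The training data is a tiny
# hypothetical sample; a real classifier would be trained on much more text.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

samples = [
    ("the quick brown fox jumps over the lazy dog", "en"),
    ("information retrieval systems store and search content", "en"),
    ("le renard brun saute par dessus le chien paresseux", "fr"),
    ("les systemes de recherche stockent le contenu", "fr"),
    ("sistem capaian maklumat menyimpan dan mencari kandungan", "ms"),
    ("penyampai menerangkan slaid pembentangan kepada penonton", "ms"),
]
texts, labels = zip(*samples)

# Character 1-3 gram counts serve as the numerical representation of the text.
language_classifier = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    MultinomialNB(),
).fit(texts, labels)

print(language_classifier.predict(["la presentation est enregistree"]))  # e.g. ['fr']
```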
The application server (104) is connected to the database server (105) and the viewer device (106) via the Internet. The application server (104) includes a query engine module (511) which is configured to allow the viewer device (106) to search for and retrieve the presentation content from the database server (105).
The viewer device (106) is configured to allow the user to input a keyword for the search of a screen capture with its respective speech audio and retrieve the screen capture and its respective speech audio related to the keyword.
FIG. 2 illustrates a flowchart of a method for storing a presentation content for retrieval of the presentation content. Initially, the presenter device (101) connects to the wireless presentation box (102) via Wi-Fi to stream a live presentation session as in step 1000.
Next, the wireless presentation box (102) receives, decodes, and plays back the live presentation video as in step 2000.
As the live presentation video streams through the wireless presentation box (102), the frame reader (501) performs screen capturing and the speech recorder (502) records the presenter speech as in step 3000. Specifically, the frame reader (501) receives a video frame from the presenter device (101) and the image content type classifier (503) determines whether the video frame contains the required presentation content. The image content type classifier (503) is trained to recognise which frames of the presentation slides are important to capture, while the speech recorder (502) is programmed to record the presenter speech automatically when the presenter begins the presentation session.
Once the frame reader (501) and the speech recorder (502) have performed the screen capturing and recorded the presenter speech, both contents are stored in the database server (105) as in step 4000.
Finally, the stored screen captures and the recorded speech are processed as in step 5000.
FIG. 3 illustrates a flowchart of sub-steps for processing the stored screen captures and recorded speech as in step 5000 of the method of FIG. 2.
In step 5100, the image to text extractor (504) extracts text from the presentation screen captures. The image to text extractor (504) uses OCR software, which converts images of typed or printed text into machine-encoded text.
In step 5200, the speech to text extractor (505) extracts the text from the presenter speech. In step 5300, the object detector (506) detects objects in the screen captures and converts the object categories into text. In particular, any non-text data may be considered as objects to be detected. The object detector (506) is trained to detect objects, as well as instances of object categories, within the screen captures.
In step 5400, the related text generator (507) generates related words from the extracted texts once the object detector (506) has converted the object categories into text.
In step 5500, the screen captures are associated with the extracted texts and related words.
In step 5600, the session and presenter link manager (510) creates links between presentation sessions and/or presenters.
FIG. 4 illustrates a flowchart of sub-steps for extracting text from the screen captures as in step 5100 of FIG. 3. Initially, text regions are obtained from the presentation screen captures as in step 5110. Once the text regions have been obtained, the image to text extractor (504) performs image processing on the text regions as in step 5120. The image processing is a combination of processes performed to preserve important elements in the text regions such as edges, colour, texture, and composition. The techniques applied include image denoising, image sharpening, low light enhancement, edge detection, and image cropping. They improve the quality of the text regions so that the text regions appear clearly and may be detected accurately in the next step.
Thereon, the text is extracted using OCR software, wherein the OCR software distinguishes printed text characters inside a digital image as in step 5130. Finally, the extracted text is stored in the database server (105) as in step 5140.
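A minimal sketch of steps 5120-5130, assuming OpenCV for the image processing and pytesseract as the OCR software, is shown below; the specific denoising, sharpening, and thresholding parameters are illustrative choices, not values taken from the patent.
```python
# Sketch of steps 5120-5130: clean up a text region, then run OCR on it.
# Assumes OpenCV and pytesseract (a Tesseract OCR wrapper) are installed.
import cv2
import numpy as np
import pytesseract

def extract_text_from_region(region_bgr):
    """Denoise, sharpen, and binarise a text region, then OCR it."""
    gray = cv2.cvtColor(region_bgr, cv2.COLOR_BGR2GRAY)
    denoised = cv2.fastNlMeansDenoising(gray, h=10)
    sharpen_kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])
    sharpened = cv2.filter2D(denoised, -1, sharpen_kernel)
    # Adaptive thresholding helps low-light slides before character recognition.
    binary = cv2.adaptiveThreshold(sharpened, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 31, 15)
    return pytesseract.image_to_string(binary)
```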
FIG. 5 illustrates a flowchart of sub-steps for obtaining text regions in the screen captures as in step 5110 of FIG. 4. Initially, a screen capture is supplied to a multi-output classifier as in step 5111. In particular, the screen capture is split into multiple regions; if the output value for a region is 0, the corresponding region contains no text, and vice versa.
Next, the partitions of the screen capture that contain texts are obtained as in step 5112. Finally, the connected regions are combined to become larger image regions as in step 5113.
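The multi-output classifier itself is not specified in the patent, so the sketch below replaces it with a simple edge-density heuristic purely to make the split-classify-merge flow of steps 5111-5113 concrete; the grid size and the threshold are assumptions.
```python
# Sketch of steps 5111-5113: split a screen capture into grid regions,
# mark which regions contain text, and merge adjacent text regions.
# The trained multi-output classifier is replaced by an edge-density
# heuristic as a stand-in; this is an assumption, not the patent's model.
import cv2
import numpy as np

def text_regions(image_path, rows=8, cols=8, edge_ratio=0.02):
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    h, w = gray.shape
    edges = cv2.Canny(gray, 100, 200)
    cell_h, cell_w = h // rows, w // cols
    mask = np.zeros((rows, cols), dtype=np.uint8)
    for r in range(rows):
        for c in range(cols):
            cell = edges[r * cell_h:(r + 1) * cell_h, c * cell_w:(c + 1) * cell_w]
            # Output 1 means "this region contains text", 0 means it does not.
            mask[r, c] = 1 if cell.mean() / 255.0 > edge_ratio else 0
    # Merge connected positive cells into larger rectangular image regions.
    n_labels, labels = cv2.connectedComponents(mask)
    regions = []
    for label in range(1, n_labels):
        rs, cs = np.where(labels == label)
        y0, y1 = rs.min() * cell_h, (rs.max() + 1) * cell_h
        x0, x1 = cs.min() * cell_w, (cs.max() + 1) * cell_w
        regions.append(gray[y0:y1, x0:x1])
    return regions
```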
FIG. 6 illustrates a flowchart of sub-steps for extracting text from the presenter speech as in step 5200 of FIG. 3. Initially, the presenter speech file is loaded as a time signal to synchronise the presentation sessions with the recorded speech as in step 5210. Next, a spectrogram of the loaded time-based speech signal is computed as in step 5220.
Thereon, the spectrogram of the speech signal is supplied to the deep learning sequence-to-sequence model as in step 5230. Specifically, the sequence-to-sequence model builds on top of a language model by adding an encoder step and a decoder step. In the encoder step, the model takes the speech spectrogram as input and encodes the character sequence into an encoded representation. In the decoder step, the output sequence from the encoder is decoded to produce the output sentence. Thus, the deep learning sequence-to-sequence model provides the text of the presenter speech as output. Finally, the output texts are obtained and stored in the database server (105) as in step 5240.
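The following sketch, assuming SciPy for the spectrogram and PyTorch for the network, makes the data flow of steps 5210-5230 concrete; the GRU-based encoder-decoder, its layer sizes, and the mono example file are assumptions, and the model is an untrained skeleton rather than the patent's trained model.
```python
# Sketch of steps 5210-5230: load the speech as a time signal, compute its
# spectrogram, and pass it through a minimal encoder-decoder network.
import numpy as np
import torch
import torch.nn as nn
from scipy.io import wavfile
from scipy.signal import spectrogram

sample_rate, speech = wavfile.read("presenter_speech.wav")   # hypothetical mono file
_, _, spec = spectrogram(speech.astype(np.float32), fs=sample_rate, nperseg=400)
# spec has shape (n_freq_bins, n_time_frames); the model consumes it frame by frame.
frames = torch.tensor(spec.T, dtype=torch.float32).unsqueeze(0)  # (1, time, freq)

class Seq2SeqASR(nn.Module):
    def __init__(self, n_freq, vocab_size, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(n_freq, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, vocab_size)   # per-frame character scores

    def forward(self, spectrogram_frames):
        encoded, state = self.encoder(spectrogram_frames)   # encoder step
        decoded, _ = self.decoder(encoded, state)            # decoder step
        return self.classifier(decoded)

model = Seq2SeqASR(n_freq=frames.shape[-1], vocab_size=30)
char_scores = model(frames)   # training would pair these with labelled sentences
```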
Reference is now made to FIG. 7 which illustrates a flowchart of sub-steps for detecting objects from the screen captures as in step 5300 of FIG. 3. Initially, the screen captures are passed to the object detector (506) as in step 5310. Next, the list of detected objects within the screen captures is initialised as an empty list as in step 5320, which means there is no object detected by the object detector (506).
Thereon, the output is received from the object detector (506) as in step 5330. Finally, the objects in the list of objects detected within the screen captures are converted to words and stored in the database server (105) as in step 5340.
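The patent does not name a detection model, so the sketch below uses a pretrained torchvision Faster R-CNN as a stand-in for the object detector (506) to show how detected category labels become words; the model choice and the 0.7 score threshold are assumptions.
```python
# Sketch of steps 5310-5340 with an off-the-shelf detector standing in for
# the object detector (506).
import torch
from PIL import Image
from torchvision.models.detection import (fasterrcnn_resnet50_fpn,
                                           FasterRCNN_ResNet50_FPN_Weights)

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
detector = fasterrcnn_resnet50_fpn(weights=weights).eval()
preprocess = weights.transforms()
categories = weights.meta["categories"]

def detect_object_words(capture_path, score_threshold=0.7):
    """Return the category words for objects detected in one screen capture."""
    image = preprocess(Image.open(capture_path).convert("RGB"))
    detected_words = []                       # step 5320: start with an empty list
    with torch.no_grad():
        output = detector([image])[0]         # step 5330: detector output
    for label, score in zip(output["labels"], output["scores"]):
        if score >= score_threshold:          # step 5340: keep confident categories as words
            detected_words.append(categories[int(label)])
    return detected_words
```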
FIG. 8 illustrates a flowchart of sub-steps for creating links between presentation sessions and/or presenters as in step 5600 of FIG. 3. Initially, the session and presenter link manager (510) receives presentation sessions and/or content as in step 5610.
Next, the session and presenter link manager (510) transforms the presentation content and computes similarities as in step 5620. Finally, a link is added between sessions and/or content whose similarity exceeds a threshold value as in step 5630. For example, if a user searches for particular content related to any presentation sessions and/or contents, the session and presenter link manager (510) may recommend the closest presentation sessions and/or content from different presenters. The threshold value may be implemented on a 0-100% scale, with 100% referring to an exact match or similarity. For example, a value of 50% means that if half of the content (selected keywords) is matched, a link between them will be created.
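As one reasonable reading of the transform-and-compute-similarities step, the sketch below represents each session's extracted text with TF-IDF and links sessions whose cosine similarity exceeds the threshold; the sample texts and the exact similarity measure are assumptions.
```python
# Sketch of steps 5620-5630: TF-IDF transform, pairwise similarity, and
# linking of sessions above a threshold.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

session_texts = {                      # hypothetical extracted texts per session
    "session_a": "neural networks for image retrieval and indexing",
    "session_b": "indexing presentation slides with neural networks",
    "session_c": "quarterly financial report and budget planning",
}
ids = list(session_texts)
matrix = TfidfVectorizer().fit_transform(session_texts.values())
similarity = cosine_similarity(matrix)

THRESHOLD = 0.5                        # 50% on the 0-100% scale described above
links = [(ids[i], ids[j])
         for i in range(len(ids)) for j in range(i + 1, len(ids))
         if similarity[i, j] >= THRESHOLD]
print(links)                           # e.g. [('session_a', 'session_b')]
```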
FIG. 9 illustrates a flowchart of a method for retrieving a presentation content. Initially, a user enters keywords to search for screen captures as in step 6100. Next, the system (100) returns a list of potential screen captures and the corresponding sessions as in step 6200.
Thereon, the user selects required screen captures as in step 6300. After the screen captures have been selected, the user is directed to the presentation sessions based on the selected screen captures as in step 6400.
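Reusing the hypothetical schema sketched earlier for the database server (105), the keyword search of steps 6100-6400 could be expressed as follows; the SQL LIKE matching is an illustrative simplification of the query engine module (511).
```python
# Sketch of steps 6100-6400 against the hypothetical schema sketched earlier.
import sqlite3

def search_screen_captures(keyword):
    """Steps 6100-6200: return captures whose extracted text matches the keyword."""
    conn = sqlite3.connect("presentations.db")
    pattern = f"%{keyword}%"
    rows = conn.execute(
        """SELECT c.capture_id, c.image_path, c.audio_path, s.session_id, s.title
           FROM screen_captures c JOIN sessions s ON c.session_id = s.session_id
           WHERE c.ocr_text LIKE ? OR c.speech_text LIKE ? OR c.related_text LIKE ?""",
        (pattern, pattern, pattern)).fetchall()
    conn.close()
    return rows

results = search_screen_captures("retrieval")
if results:
    capture_id, image_path, audio_path, session_id, title = results[0]
    # Steps 6300-6400: the user picks a capture and is directed to its session.
    print(f"Open session {session_id} ({title}) at capture {capture_id}")
```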
While embodiments of the invention have been illustrated and described, it is not intended that these embodiments illustrate and describe all possible forms of the invention. Rather, the words used in the specification are words of description rather than limitation, and various changes may be made without departing from the scope of the invention.

Claims

1 . A system (100) for retrieving a presentation content comprising: a) a presenter device (101) configured to stream a presentation video that includes the presentation video stream and its audio stream of a presentation session; b) a wireless presentation box (102) includes a speech recorder (502) configured to record the audio stream from the presenter device (101); c) a post-processing server (103) includes an image to text extractor (504) and a speech to text extractor (505); d) a database server (105) connected to the post-processing server (103); and e) a viewer device (106); wherein the system (100) is characterised by: f) the wireless presentation box (102) further includes: i. a frame reader (501) configured to perform screen capturing of the video streaming from the presenter device (101); and ii. an image content type classifier (503) configured to detect the screen captures to be saved in the database server (105); g) the post-processing server (103) further includes: i. an object detector (506) configured to detect common objects found in the screen captures; ii. a related text generator (507) configured to generate related texts from given input texts; iii. a text language classifier (508) configured to predict the language of at least one given text; and iv. a text language translator (509) configured to translate input texts from one language to another language; h) the viewer device (106) configured to allow a user to input a keyword for searching of a screen capture with its respective speech audio and retrieving the screen capture and its respective speech audio related to the keyword; and i) an application server (104) connected to the database server (105) and the viewer device (106), wherein the application server (104) includes a query engine module (511) that is configured to allow the viewer device (106) to search and retrieve a presentation content from the database server (105).
2. The system (100) as claimed in claim 1, wherein the image to text extractor (504) is configured to extract texts from each screen capture by using Optical Character Recognition.
3. The system (100) as claimed in claim 1, wherein the speech to text extractor (505) is configured to extract texts by using deep learning model.
4. The system (100) as claimed in claim 1, wherein the post-processing server (103) includes a session and presenter link manager (510) configured to create links between different presentation sessions and/or presenters.
5. A method for storing a presentation content for retrieval of the presentation content is characterised by the steps of: a) streaming a presentation video from a presenter device (101) to a wireless presentation box (102); b) receiving, decoding, and playing the presentation video stream by the wireless presentation box (102); c) performing screen capturing by a frame reader (501) and recording the presenter speech by a speech recorder (502); d) storing the screen captures and the recorded speech in a database server (105); and e) processing the stored screen captures and the recorded speech.
6. The method as claimed in claim 5, wherein the step of performing screen capturing by a frame reader (501) includes: a) receiving a video frame from the presenter device (101) by the frame reader (501); and b) capturing the video frame that contains the required presentation content by the image content type classifier (503).
7. The method as claimed in claim 5, wherein the step of processing the stored screen captures and the recorded speech includes: a) extracting a text from the screen captures by an image to text extractor (504); b) extracting a text from the presenter speech by a speech to text extractor (505); c) detecting objects from the presentation screen captures and converting object categories into text by the object detector (506); d) generating related words from the extracted texts; e) associating the screen captures with the extracted texts and related words; and f) creating links between presentation sessions and/or presenters by a session and presenter link manager (510).
PCT/MY2020/050055 2019-09-27 2020-07-22 A system and method for retrieving a presentation content WO2021060966A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
MYPI2019005725 2019-09-27
MYPI2019005725 2019-09-27

Publications (1)

Publication Number Publication Date
WO2021060966A1 true WO2021060966A1 (en) 2021-04-01

Family

ID=75165284

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/MY2020/050055 WO2021060966A1 (en) 2019-09-27 2020-07-22 A system and method for retrieving a presentation content

Country Status (1)

Country Link
WO (1) WO2021060966A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003242155A (en) * 2002-02-15 2003-08-29 Nippon Telegr & Teleph Corp <Ntt> Flex seminar service system
JP2006134051A (en) * 2004-11-05 2006-05-25 Fuji Xerox Co Ltd Translation device, translation method and program
US20080162469A1 (en) * 2006-12-27 2008-07-03 Hajime Terayoko Content register device, content register method and content register program
US20110081075A1 (en) * 2009-10-05 2011-04-07 John Adcock Systems and methods for indexing presentation videos
JP2015032905A (en) * 2013-07-31 2015-02-16 キヤノンマーケティングジャパン株式会社 Information processing device, information processing method, and program
JP2015125650A (en) * 2013-12-26 2015-07-06 日本放送協会 Topic extraction device and program
JP2019066928A (en) * 2017-09-28 2019-04-25 京セラドキュメントソリューションズ株式会社 Knowledge utilization promoting device and knowledge utilization promoting method

Similar Documents

Publication Publication Date Title
US11197036B2 (en) Multimedia stream analysis and retrieval
US9489577B2 (en) Visual similarity for video content
KR100828166B1 (en) Method of extracting metadata from result of speech recognition and character recognition in video, method of searching video using metadta and record medium thereof
US20180109843A1 (en) Methods and systems for aggregation and organization of multimedia data acquired from a plurality of sources
CN103761261B (en) A kind of media search method and device based on speech recognition
US7908141B2 (en) Extracting and utilizing metadata to improve accuracy in speech to text conversions
US10572528B2 (en) System and method for automatic detection and clustering of articles using multimedia information
KR102376201B1 (en) System and method for generating multimedia knowledge base
CN112738556B (en) Video processing method and device
CN111767461A (en) Data processing method and device
CN101778233A (en) Data processing apparatus, data processing method, and program
CN112382295B (en) Speech recognition method, device, equipment and readable storage medium
CN114357206A (en) Education video color subtitle generation method and system based on semantic analysis
CN108229285B (en) Object classification method, object classifier training method and device and electronic equipment
US20230401389A1 (en) Enhanced Natural Language Processing Search Engine for Media Content
Yang et al. Lecture video browsing using multimodal information resources
WO2021060966A1 (en) A system and method for retrieving a presentation content
EP3905060A1 (en) Artificial intelligence for content discovery
Tapu et al. TV news retrieval based on story segmentation and concept association
KR20220055648A (en) Method and apparatus for generating video script
JP2002014973A (en) Video retrieving system and method, and recording medium with video retrieving program recorded thereon
JP4305921B2 (en) Video topic splitting method
JP6858003B2 (en) Classification search system
Khollam et al. A Survey on Content Based Lecture Video Retrieval Using Speech and Video Text information
Varma et al. Video Indexing System Based on Multimodal Information Extraction Using Combination of ASR and OCR

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20869815

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20869815

Country of ref document: EP

Kind code of ref document: A1