WO2022047516A1 - Audio annotation system and method - Google Patents

Audio annotation system and method

Info

Publication number
WO2022047516A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
gaze
passage
user
audio
Prior art date
Application number
PCT/AU2020/050926
Other languages
English (en)
Inventor
Anam Ahmad KHAN
Eduardo VELLOSO
James Bailey
Original Assignee
The University Of Melbourne
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The University Of Melbourne filed Critical The University Of Melbourne
Priority to PCT/AU2020/050926 priority Critical patent/WO2022047516A1/fr
Publication of WO2022047516A1 publication Critical patent/WO2022047516A1/fr

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • G06F40/169Annotation, e.g. comment data or footnotes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/013Eye tracking input arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0481Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/167Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems

Definitions

  • the present invention relates to a system and method for audio annotation on a document.
  • Annotation or note-taking while reading a digital document is a useful tool for document management, comprehension and the like.
  • Annotation in a desktop device environment has its challenges and is made more difficult on mobile devices such as mobile phones and tablets.
  • the present invention provides a system for anchoring an audio annotation to a passage within an electronic document, the system including: a controller having a processor; a recording component operable by the controller, the recording component including a microphone and an eye-tracker component to capture the gaze of the user, wherein the processor carries out the steps of: in response to an audio input to the microphone, while the audio input is being received, evaluating via the eye-tracker component the user's gaze, thereby determining the passage in the document that the user's gaze is directed to; and mapping the audio input to the passage in the document.
  • the present invention provides an interactive gaze assisted audio annotation taking system which enables one or more users to implicitly annotate (which may include voice annotations or voice to text annotations) passages on a document in a seamless manner.
  • the passage in the document may be text, or may be non-text such as a figure, table, chart, picture or the like, or the text may refer to the text associated with a table or graphic (such as a label).
  • Implicit annotation refers to tagging or annotating passages based on natural user behaviours. In the present invention, the behaviour may correspond to the user's gaze activity and the temporal order in which they read the document.
  • the ability to implicitly tag or annotate does not require the user to deliberately perform any action to associate the annotation with the relevant passage.
  • the present invention may take the form of a PDF viewer application in which the present invention is embedded, with an eye-tracker component to evaluate a user's gaze and facilitate the annotation of audio with reference to passages, thereby enabling the user, accurately and without conscious thought, to make an annotation at a place in a document.
  • audio input is hands-free, making it suitable for interaction with mobile devices and users can also make the annotation without taking their attention away from the document.
  • Audio input may take any suitable form by way of a microphone and may include natural language processing to make translation between speech and text fast and accurate.
  • the present invention allows the task of audio annotation to be seamless, making it easy and convenient for users to interact with digital text on their device by utilising the user's gaze as a resource for anchoring the audio annotation to passages.
  • the evaluation includes determining position data associated with a slider on the electronic document.
  • the position data may include the position of the slider in the document and the page number of the document at the time the audio input was received.
  • the evaluation includes determining fixation gaze data, the fixation gaze data being data that is observed during a window spanning the audio annotation.
  • the fixation gaze data may include one or more gaze points observed in the window and grouped into fixations using a dispersion and/or duration threshold.
  • the evaluation includes a machine learning component trained on one or more of gaze and/or temporal features of one or more users that reflects the reading and annotation-taking patterns of the user.
  • Any suitable temporal feature may be recorded, but may include, for example, the duration the user has spent reading a passage and the temporal order within which the passage has been read before recording an annotation or the like.
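  • By way of illustration only, the following sketch shows how such per-passage gaze and temporal features might be assembled; the feature names (fixation count, dwell time, first-visit order) and data structures are assumptions for illustration, not the feature set specified herein.

```python
# Illustrative sketch only: the features below are assumptions consistent with
# the description (fixation count, dwell time, temporal reading order).
from dataclasses import dataclass
from typing import List

@dataclass
class Fixation:
    x: float          # page coordinates
    y: float
    duration_ms: float
    passage_id: int   # passage the fixation was assigned to

def passage_features(fixations: List[Fixation], n_passages: int):
    """Build one gaze/temporal feature dictionary per passage."""
    feats = [{"fixation_count": 0, "dwell_ms": 0.0, "first_visit_order": None}
             for _ in range(n_passages)]
    visit_order = 0
    last_passage = None
    for f in fixations:                      # fixations are in temporal order
        p = feats[f.passage_id]
        p["fixation_count"] += 1
        p["dwell_ms"] += f.duration_ms
        if f.passage_id != last_passage:
            visit_order += 1
            if p["first_visit_order"] is None:
                p["first_visit_order"] = visit_order   # temporal reading order
        last_passage = f.passage_id
    return feats
```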
  • indicia may be displayed for the audio mapped to the passage in the document.
  • a highlight may be provided in the relevant passage of the document upon user engagement with the indicia.
  • the present invention provides a method for anchoring an audio annotation to a passage within an electronic document, the method including: receiving an audio input to a microphone, and while the audio input is being received, evaluating via an eye-tracker component the user's gaze, thereby determining the passage in the document that the user's gaze is directed to, and mapping the audio input to the passage in the document.
  • Figure 1 is a schematic diagram of an example network that can be utilised to give effect to the system according to an embodiment of the invention
  • Figure 2 is a diagram illustrating devices that may be utilised with the system and method of the present invention
  • Figure 3 is a schematic diagram illustrating operation of the system and method of the present invention.
  • Figure 4 is a schematic diagram illustrating operation of the system and method of the present invention in use by a user.
  • Figure 5 is a flow diagram illustrating the process steps adopted by the system and method of the present invention.
  • the system 100 includes one or more servers 120 which include one or more databases 125 and one or more devices 110a, 110b, 110c (associated with a user for example) which may be communicatively coupled to a cloud computing environment 130, “the cloud” and interconnected via a network 115 such as the internet or a mobile communications network. It will also be appreciated that the system and method may reside on the one or more devices 110a, 110b, 110c.
  • Devices 110a, 110b, 110c may take any suitable form and may include for example smartphones, tablets, laptop computers, desktop computers, server computers, among other forms of computer systems. Each of the devices 110a, 110b, 110c includes a microphone and an eye-tracker component, which will be further described with reference to Figure 2.
  • while the term "cloud" has many connotations, according to embodiments described herein the term includes a set of network services that are capable of being used remotely over a network, and the method described herein may be implemented as a set of instructions stored in a memory and executed by a cloud computing platform.
  • the software application may provide a service to one or more servers 120, or support other software applications provided by third-party servers. Examples of services include a website, a database, software as a service, or other web services.
  • the transfer of information and/or data over the network 115 can be achieved using wired communications means or wireless communications means. It will be appreciated that embodiments of the invention may be realised over different networks, such as a MAN (metropolitan area network), WAN (wide area network) or LAN (local area network). Also, embodiments need not take place over a network, and the method steps could occur entirely on a client or server processing system.
  • FIG. 2 illustrates the devices 110A, 110B and 110C that are utilised with the system and method of the present invention.
  • each of the devices includes a microphone 210A, 210B or 210C and an eye-tracker component 215A, 215B or 215C respectively.
  • the eye-tracker component 215A, 215B or 215C captures the user's gaze on the document that is displayed on device 110A, 110B or 110C.
  • Microphone 210A, 210B or 210C is provided to capture audio input from the user in order to anchor their audio annotation to a particular passage within a document.
  • the passage in the document may be text, or may be non-text such as a figure, table, chart, picture or the like, or the text may refer to the text associated with a table or graphic (such as a label).
  • a camera associated with the device and the eye-tracker component 215A, 215B or 215C may be one and the same unit where possible (to save on space in the device or on the display of the device).
  • Any suitable eye-tracker component may be provided, such as a Tobii 4C, although it will be appreciated that any particular eye-tracker may be utilised.
  • Any suitable frequency or frame rate may be captured by the eye-tracker component dependent on system resources, and may be, for example, a frequency of 90 Hz or higher. A lower frequency or frame rate may also be possible (e.g. in the order of 30 Hz).
  • the microphone 210A, 210B or 210C may be an internal microphone associated with the device or may be an external microphone.
  • FIG. 3 is a schematic diagram illustrating the system 300 of the present invention in operation.
  • a user 305 is associated with an input component 310 which may consist of, for example, a device 110C.
  • the device 110C includes a microphone 210C for receiving audio from the user 305 as well as an eye-tracker component 215C for tracking the gaze of the user 305.
  • a gaze feature generation component 315 and a machine learning predictive model component 320 are also provided.
  • Also provided are a text extraction component 325 and an annotated document 330, which is being viewed by the user on the device 110C.
  • a user 305 can open up the digital document 330, which is typically a PDF file (but need not be), on their device 205C. Upon loading the digital document 330, passages from the PDF are extracted via computer vision or the like.
  • Extraction of passages from the PDF is carried out by converting the PDF pages to images.
  • the extracted images may then be passed to an optical character recognition engine, for example PyTessBaseAPI, a library which provides functions to segment images into text components.
  • the paragraphs are then extracted from the images by providing the paragraph level as input to the segmentation component.
  • the extracted paragraphs are then saved as bounding boxes for use by the system and method of the present invention in anchoring the user annotation to the appropriate passage in the document.
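  • As a non-authoritative sketch of this extraction step, the following code converts PDF pages to images and asks PyTessBaseAPI for paragraph-level bounding boxes; the use of the pdf2image library for the page-to-image conversion is an assumption, since the specification does not name the conversion tool.

```python
# Sketch of the passage-extraction step. PyTessBaseAPI is named above; pdf2image
# for the PDF-to-image conversion is an assumption for illustration.
from pdf2image import convert_from_path
from tesserocr import PyTessBaseAPI, RIL

def extract_passage_boxes(pdf_path):
    """Return, per page, the bounding boxes of paragraph-level text components."""
    pages = convert_from_path(pdf_path, dpi=150)    # one PIL image per PDF page
    boxes_per_page = []
    with PyTessBaseAPI() as api:
        for page_image in pages:
            api.SetImage(page_image)
            # RIL.PARA asks the OCR engine to segment at paragraph level
            components = api.GetComponentImages(RIL.PARA, True)
            boxes_per_page.append(
                [box for _, box, _, _ in components]   # box: {'x','y','w','h'}
            )
    return boxes_per_page
```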
  • the gaze coordinates are continuously recorded by the eye-tracker component 215C.
  • the user 305 would typically press a recording button provided on an interface associated with the device 215C, or the recording may be started by way of a voice command.
  • the gaze co-ordinates are mapped to the page co-ordinates to keep track of where the user 305 was looking on the page while reading and making the annotations.
  • the page co-ordinates are allocated to various extracted passages and region-based gaze and temporal features are calculated.
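  • A minimal sketch of this coordinate mapping and passage allocation is given below, assuming a simple vertical scroll model; the function names and the offset arithmetic are illustrative assumptions rather than the specified implementation.

```python
# Minimal sketch (assumed geometry): screen-space gaze samples are shifted by the
# current scroll offset to obtain page coordinates, then assigned to the extracted
# passage bounding boxes.
def to_page_coords(gaze_x, gaze_y, viewport_top, page_offset_y):
    """Convert a screen-space gaze sample to page coordinates."""
    return gaze_x, gaze_y + viewport_top - page_offset_y

def assign_to_passage(px, py, passage_boxes):
    """Return the index of the passage box containing the point, else None."""
    for i, box in enumerate(passage_boxes):        # box: {'x','y','w','h'}
        if box["x"] <= px <= box["x"] + box["w"] and \
           box["y"] <= py <= box["y"] + box["h"]:
            return i
    return None
```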
  • the audio input from the user may be raw audio data which may be processed to extract and provide the audio annotation associated with the document.
  • raw audio data from the microphone may be filtered to remove noise and/or frequency smoothing may be applied.
  • An audio signal threshold may further be applied (for example a signal at 26dB) in order to remove silent segments in the audio input.
  • audio segments within the audio input having a duration below a threshold amount of time may, for example, be discarded.
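  • The following sketch illustrates this clean-up step using the pydub library (an assumption; no library is named in the specification), with the 26 dB figure mapped to an assumed -26 dBFS silence threshold and an assumed minimum segment duration.

```python
# Sketch of the audio clean-up step: silent portions below a level threshold are
# dropped and very short segments are discarded. Library choice and threshold
# values are assumptions for illustration.
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

MIN_SEGMENT_MS = 300        # assumed minimum duration for a kept segment
SILENCE_THRESH_DBFS = -26   # assumed mapping of the 26 dB threshold to dBFS

def clean_annotation_audio(path):
    audio = AudioSegment.from_file(path)
    # detect_nonsilent returns [start_ms, end_ms] pairs of non-silent audio
    spans = detect_nonsilent(audio, min_silence_len=200,
                             silence_thresh=SILENCE_THRESH_DBFS)
    kept = [audio[s:e] for s, e in spans if (e - s) >= MIN_SEGMENT_MS]
    return kept   # list of retained audio segments
```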
  • a Region-of-Analysis (ROA) may be defined for each audio input, since in some situations gaze patterns while a user records audio may indicate that the audio is not directly related to the passages where the user was looking while they spoke, but may instead be related to the passages the user had read before recording the annotation.
  • the ROA may extend from the end time of a particular audio input to the end time of the successive audio input under consideration. For example, the ROA for each audio input may be the period from the end of the previous audio input until the end of the present audio input.
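  • A short sketch of this ROA computation follows; it simply pairs the end of each previous audio input with the end of the current one (the first ROA starts at the beginning of the session).

```python
# Sketch of the Region-of-Analysis (ROA) computation described above.
def regions_of_analysis(audio_spans):
    """audio_spans: list of (start_s, end_s) tuples in temporal order."""
    roas = []
    prev_end = 0.0
    for start, end in audio_spans:
        roas.append((prev_end, end))   # gaze within this window is analysed
        prev_end = end
    return roas
```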
  • computer vision may be utilised to parse what the user is looking at. This may be combined with ROA allowing the image data frames to be fetched within the ROA in order to better map the gaze data to pixels in the document.
  • the machine learning component 320 is preferably pre-trained on a feature vector consisting of gaze features extracted from one or more participants while they read and make audio annotations. This assists the system in predicting the text regions to which a user's audio annotations relate.
  • the pre-training of the machine learning component may be via data from a number of users, in which each user reads one or more documents and makes audio annotations while their gaze is recorded.
  • during this data collection, the users explicitly highlight the passages to which each audio annotation corresponds.
  • the features extracted from the gaze data serve as the model features and the passage highlighted by the user may serve as the ground truth for the machine learning model.
  • the text is extracted at component 325 and the predicted passages are highlighted in the document 330 with indicia such as a sound icon being anchored to the top of the predicted passage.
  • the user 305 can then retrieve the recorded audio annotation and visualise the anchored passage by tapping on the indicia displayed beside the passage. Tapping on the indicia plays the recorded audio annotation as well as highlighting the relevant passage.
  • the annotation may be provided in audio format and/or may be converted into text such that the user 305 has the option of listening to the audio annotation or viewing a speech-to-text translation of the audio annotation highlighted at the relevant point in the document.
  • the audio annotations are anchored to a passage implicitly in that there is no requirement for the user to manually highlight a passage for annotation.
  • the machine learning component 320 predicts the passages which the audio annotation is associated with.
  • the goal of the machine learning component 320 is to map a feature vector of the gaze and temporal features computed from the passages to classify whether an audio annotation was related to a specific passage or not.
  • the classifier may either predict “not annotated” (i.e. the audio annotation was not made with reference to that particular passage) or “annotated” (i.e. audio annotation was made with reference to that particular passage). To solve this binary classification problem for each passage, the classifier may be trained on the whole dataset.
  • a generic classifier is preferable (i.e. to avoid overfitting the classifier to the audio annotation behaviour of a particular user).
  • the classifier may be trained by way of a “leave-one-user-out” cross-validation.
  • the classifier may take any suitable form, for example a classifier in the form of a random forest classifier may be provided.
  • the classifier generally is used for predicting a category of a new observation based on the data observed during the training phase, with the goal of the classifier to classify whether an audio annotation was related to a specific passage or not. This can be done in a number of ways, but for example, by first training the classifier on data from a number of users (for example, 32 users) who are taking annotations while reading a document. A set of gaze and temporal features are extracted from the user’s reading and annotation-taking behaviour that are indicative of whether a read passage is to be mapped to an audio annotation. Once the classifier is trained and there is reasonable performance from the classifier, this classifier may be utilised to make a prediction for an audio annotation recorded by any user.
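  • The sketch below illustrates this training and evaluation regime with scikit-learn, using a random forest and leave-one-user-out cross-validation scored by AUC; the variable names (X, y, user_ids) and hyperparameters are placeholders standing in for the collected data set, not values given in the specification.

```python
# Hedged sketch of the classifier training described above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

def train_and_evaluate(X: np.ndarray, y: np.ndarray, user_ids: np.ndarray):
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    # Each fold holds out every passage/annotation pair from one user, so the
    # classifier is not fitted to that user's annotation behaviour.
    aucs = cross_val_score(clf, X, y, groups=user_ids,
                           cv=LeaveOneGroupOut(), scoring="roc_auc")
    clf.fit(X, y)                  # final model trained on the whole data set
    return clf, aucs.mean()
```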
  • mapping of audio annotations to passages within a document without any additional information from the user is carried out by analysing gaze behaviour during audio annotation. For example, it has been found that mapping is not as straightforward as merely selecting where the user is looking when speaking, but the machine learning component 320 may be provided to overcome this challenge based on a collected data set, such that anchoring audio annotations to the correct passage may be achieved with a reasonable level of performance, for example an AUC (Area Under Curve) equal to 0.89.
  • the machine learning component 320 may be replaced by a slider component 320 which, while not as effective, can also provide a suitable result.
  • the slider component 320 (which can replace the machine learning component 320), maps a user's recorded audio annotations to the reference text by way of a slider value on the electronic document which they are viewing.
  • a vertical slider value associated with the document may be retrieved together with a page number of the document at the time that the user started to record the audio annotations.
  • the slider value and the page number can then be used to retrieve the image frame which was being displayed on the display when the user started to record the annotations.
  • the audio annotation is then mapped to the paragraph which was displayed at the top of the image frame.
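  • A minimal sketch of this slider arrangement is shown below, assuming the slider value is normalised to the page height; the geometry and names are illustrative assumptions rather than the specified implementation.

```python
# Sketch of the slider arrangement: the slider value and page number recorded at
# the start of the annotation identify the frame on screen, and the annotation is
# mapped to the top-most passage in that frame. Names are assumptions.
def passage_from_slider(slider_value, page_number, page_height, passage_boxes):
    """passage_boxes: per-page lists of {'x','y','w','h'} boxes in page coords."""
    viewport_top = slider_value * page_height    # assumed slider 0..1 mapping
    visible = [b for b in passage_boxes[page_number] if b["y"] >= viewport_top]
    if not visible:
        return None
    # map the annotation to the paragraph displayed at the top of the frame
    return min(visible, key=lambda b: b["y"])
```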
  • the machine learning component 320 may be replaced with a gaze data component 320, which is gaze data observed and analysed during a window spanning the audio annotation.
  • the gaze points observed in this window may be clustered into "fixation" groups using dispersion and duration thresholds (by setting the dispersion and duration threshold parameters to, for example, 200 and 100 respectively).
  • the fixation points may be assigned to the nearest passage.
  • the fixation count may be used as one gaze feature for each passage. In this way the audio annotation is primarily mapped to the passage at which the user has looked the most whilst speaking.
  • Fixations may be generated by way of a Dispersion-Threshold Identification algorithm or the like.
  • the Dispersion-Threshold Identification algorithm produces accurate results by using only two parameters, dispersion, and duration threshold, which may be set to 20 and 100, respectively. It will be appreciated that not all fixations will be within passages, due to calibration offsets and tracking errors. In that regard each fixation outside the passages may be assigned to the nearest extracted passage by using hierarchical clustering.
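  • For illustration, a compact version of the Dispersion-Threshold Identification idea is sketched below; the dispersion metric (horizontal plus vertical extent) and the window-growth handling are common choices rather than details taken from the specification, with the default parameters mirroring those mentioned above.

```python
# Sketch of Dispersion-Threshold Identification (I-DT): gaze points form a
# fixation while their dispersion stays below a threshold and the window lasts
# at least the duration threshold.
def idt_fixations(samples, dispersion_thresh=20, duration_thresh=100):
    """samples: list of (t_ms, x, y); returns fixations as (start, end, cx, cy)."""
    fixations, i, n = [], 0, len(samples)
    while i < n:
        j = i
        # grow the window until it covers at least the duration threshold
        while j < n and samples[j][0] - samples[i][0] < duration_thresh:
            j += 1
        if j >= n:
            break
        window = samples[i:j + 1]
        xs, ys = [p[1] for p in window], [p[2] for p in window]
        if (max(xs) - min(xs)) + (max(ys) - min(ys)) <= dispersion_thresh:
            # keep extending while dispersion stays under the threshold
            while j + 1 < n:
                xs.append(samples[j + 1][1]); ys.append(samples[j + 1][2])
                if (max(xs) - min(xs)) + (max(ys) - min(ys)) > dispersion_thresh:
                    xs.pop(); ys.pop()
                    break
                j += 1
            start, end = samples[i][0], samples[j][0]
            fixations.append((start, end, sum(xs) / len(xs), sum(ys) / len(ys)))
            i = j + 1
        else:
            i += 1       # drop the first point and retry
    return fixations
```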
  • the slider and fixation arrangements are limited but useful arrangements.
  • a user must explicitly position the slider to the appropriate point in the document (which would require non-passive effort from the user).
  • the fixation arrangement requires the user to explicitly look and fixate (again requiring non-passive effort from the user) at the passage which is to be mapped with the recorded audio annotation.
  • the machine learning arrangement, when trained correctly on features that capture the broad reading and annotation-taking patterns of users, allows more efficient mapping of audio annotations to passages.
  • the slider arrangement or fixation arrangement may also be provided where for example a more simplified arrangement is desired. A slider would be an option when the eye-tracker component is not available or offline.
  • the fixation arrangement may be useful if the user does not want the algorithm to decide and the user wants to direct the system to map an audio annotation to a passage.
  • Figure 4 is a schematic diagram illustrating the steps carried out in operation from the point of view of a user using the system and method of the present invention.
  • software associated with the system 400 is running on the device 205C.
  • the user clicks the start button on the display of the device 205C, via a keyboard, or alternatively issues a speech-to-text command, with the document on the screen of the device 205C.
  • the user speaks and the system maps audio annotations to the passages in the document which in this case is a PDF file.
  • the audio annotations from the user, based on the predictions of the machine learning component, are then anchored to the reference passages which are inferred from their gaze behaviour.
  • the system 400 offers two features, notably recording and retrieval.
  • For example, to record an audio annotation while reading the document, the user either presses the recording button at the left side of the document viewer or, in a possible embodiment, issues a speech-to-text command, and speaks out loud their annotations in relation to the passage that they are looking at. A prediction is then made by the machine learning component regarding the reference passage based on the user's gaze activity.
  • the audio annotation is saved in a memory and an indicia, which may take the form of a sound icon, may then be provided and displayed beside the reference passage to provide an indication that an annotation has been made, as shown in 415. Clicking on the audio annotation indicia plays the recorded audio annotation and preferably also highlights the referenced passages when playback is occurring.
  • the reader presses the stop button to finish making a recording, or they may do so by voice-to-text command.
  • the relevant text portion is highlighted as the audio annotation is playing.
  • FIG. 5 is a flow diagram illustrating a method for anchoring an audio annotation to a passage within an electronic document.
  • the method 500 starts at step 505 where a user (say 305) associated with, for example, a device (i.e. 205C), which may be a mobile phone, tablet, desktop computer or the like, has a document open and a microphone and eye-tracker component in operation associated with the device.
  • audio input is received from the user 305 to a microphone 210C associated with the device while they are looking at the device.
  • Control then moves to step 510 in which, while the audio input is being received, the user's gaze is evaluated via the eye-tracker component.
  • Control then moves to step 515 in which the passage in the document that the user's gaze is directed to is determined.
  • the evaluation may occur in a number of different ways.
  • the evaluation may occur by way of the position of a slider associated with the electronic document.
  • the slider has a value ranging from 0, where 0 is the start of the page, to MAX, where MAX is the end of the page, and the location of the user's gaze is determined accordingly.
  • the evaluation may occur by way of fixation gaze data, which is gaze data observed and analysed during a window spanning the audio annotation.
  • the gaze points observed in this window may be clustered into "fixations" using dispersion and duration thresholds (by setting the dispersion and duration threshold parameters to, for example, 200 and 100 respectively).
  • the fixation points may be assigned to the nearest passage.
  • the fixation count may be used as one gaze feature for each passage. In this way the audio annotation is primarily mapped to the passage at which the user has looked the most whilst speaking.
  • a classifier may be utilised (which may, for example, be a logistic regression classifier), where the feature vector in this embodiment consists of the fixation count feature for each passage. Therefore, the system and method of the present invention may predict the passage at which the user has looked the most while recording the annotation, as indicated by the fixation count feature.
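  • A hedged sketch of this fixation-count arrangement follows, using a scikit-learn logistic regression whose single feature is the per-passage fixation count; the training data indicated in comments is a placeholder, not data from the specification.

```python
# Sketch of the fixation-count arrangement described above.
import numpy as np
from sklearn.linear_model import LogisticRegression

def predict_annotated_passage(model: LogisticRegression, fixation_counts):
    """fixation_counts: one count per passage for the current annotation."""
    X = np.array(fixation_counts, dtype=float).reshape(-1, 1)
    probs = model.predict_proba(X)[:, 1]      # P(passage is "annotated")
    return int(np.argmax(probs))              # passage looked at the most wins

# Training (placeholder data): per-passage counts and "annotated" labels
# collected as described above.
# counts = np.array([...]).reshape(-1, 1); labels = np.array([...])
# model = LogisticRegression().fit(counts, labels)
```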
  • the evaluation occurs by way of a machine learning component trained on one or more of gaze and/or temporal features that reflects the reading and annotation-taking patterns of the user.
  • the classifier is trained and then is fed a feature vector as described with reference to Figure 3.
  • the method may further include the step of providing an indicia for display on the document associated with the audio annotation, preferably in the form of an audio icon.
  • the method may further include the step of determining via the audio input the start and stop of an audio annotation. For example, this may be by way of the user dictating a voice command or the microphone sensing a voice command when the software is being used.
  • the method may further include the step of providing playback of the audio annotation to the user and highlighting the relevant passage while the audio annotation is being played.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Acoustics & Sound (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A system and method are provided for anchoring an audio annotation to a passage within an electronic document, including a controller having a processor and a recording component operable by the controller, the recording component including a microphone and an eye-tracker component for capturing the gaze of the user. The processor carries out the steps of: in response to an audio input to the microphone, while the audio input is being received, evaluating, via the eye-tracker component, the gaze of the user, thereby determining the passage in the document to which the user's gaze is directed; and mapping the audio input to the passage in the document.
PCT/AU2020/050926 2020-09-04 2020-09-04 Audio annotation system and method WO2022047516A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/AU2020/050926 WO2022047516A1 (fr) 2020-09-04 2020-09-04 Audio annotation system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/AU2020/050926 WO2022047516A1 (fr) 2020-09-04 2020-09-04 Audio annotation system and method

Publications (1)

Publication Number Publication Date
WO2022047516A1 true WO2022047516A1 (fr) 2022-03-10

Family

ID=80492318

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU2020/050926 WO2022047516A1 (fr) 2020-09-04 2020-09-04 Audio annotation system and method

Country Status (1)

Country Link
WO (1) WO2022047516A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230177258A1 (en) * 2021-12-02 2023-06-08 At&T Intellectual Property I, L.P. Shared annotation of media sub-content

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120310642A1 (en) * 2011-06-03 2012-12-06 Apple Inc. Automatically creating a mapping between text data and audio data
US20160283455A1 (en) * 2015-03-24 2016-09-29 Fuji Xerox Co., Ltd. Methods and Systems for Gaze Annotation
US10002311B1 (en) * 2017-02-10 2018-06-19 International Business Machines Corporation Generating an enriched knowledge base from annotated images

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120310642A1 (en) * 2011-06-03 2012-12-06 Apple Inc. Automatically creating a mapping between text data and audio data
US20160283455A1 (en) * 2015-03-24 2016-09-29 Fuji Xerox Co., Ltd. Methods and Systems for Gaze Annotation
US10002311B1 (en) * 2017-02-10 2018-06-19 International Business Machines Corporation Generating an enriched knowledge base from annotated images

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KHAN A.A.: "Gaze assisted voice note taking system", PROCEEDINGS OF THE 2019 ACM INTERNATIONAL SYMPOSIUM ON WEARABLE COMPUTERS, 9 September 2019 (2019-09-09), pages 367 - 371, XP058441089, DOI: https://doi.org/10.1145/3341162.3349308 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230177258A1 (en) * 2021-12-02 2023-06-08 At&T Intellectual Property I, L.P. Shared annotation of media sub-content

Similar Documents

Publication Publication Date Title
AU2020200239B2 (en) System and method for user-behavior based content recommendations
JP6946869B2 (ja) Method, program, and media analysis device for generating a summary of a media file comprising a plurality of media segments
US8385588B2 (en) Recording audio metadata for stored images
TWI412953B (zh) Controlling a document based on user behavioural signals detected from a captured three-dimensional image stream
US8385588B2 (en) Recording audio metadata for stored images
JP4882486B2 (ja) Slide image determination device and slide image determination program
CN110444198B (zh) Retrieval method and apparatus, computer device and storage medium
CN109087670B (zh) Emotion analysis method, system, server and storage medium
CN109783796B (zh) Predicting style breaches in text content
US20070174326A1 (en) Application of metadata to digital media
US11044282B1 (en) System and method for augmented reality video conferencing
JP7069802B2 (ja) System and method for user-oriented topic selection and browsing, method for displaying a plurality of content items, program, and computing device
CN109947971B (zh) Image retrieval method and apparatus, electronic device and storage medium
WO2022037600A1 (fr) Summary recording method and apparatus, computer device and storage medium
US20200403816A1 (en) Utilizing volume-based speaker attribution to associate meeting attendees with digital meeting content
RU2733816C1 (ru) Speech information processing method, device and storage medium
JP5464786B2 (ja) Information processing device, control method, and control program
WO2022047516A1 (fr) Audio annotation system and method
US11437038B2 (en) Recognition and restructuring of previously presented materials
Vinciarelli et al. Application of information retrieval technologies to presentation slides
US20220147703A1 (en) Voice activated clinical reporting systems and methods thereof
US20140297678A1 (en) Method for searching and sorting digital data
US20240135973A1 (en) Video segment selection and editing using transcript interactions
JP2005149329A (ja) Intention extraction support device, operability evaluation system using the same, and programs used therefor
JP2022179178A (ja) Concentration determination program, concentration determination method, and concentration determination device
JP2023061165A (ja) Information processing device and control method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20951838

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 110823)

122 Ep: pct application non-entry in european phase

Ref document number: 20951838

Country of ref document: EP

Kind code of ref document: A1